Word2vec is basically an auto-encoder

Translated from the original post: http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/

This tutorial is about training the word2vec skip-gram model. Rather than the usual high-level, abstract introduction to Word2Vec, I want to get into the details, in particular the network structure of skip-gram.

Model

People are often surprised by how simple the skip-gram model is. I think there are basically no detours or complicated model explanations needed here.

Let's start with the high-level idea. Word2Vec uses a little trick you may have seen in other machine learning problems. We train a simple network to perform a task, but we never actually use that network to make predictions on that task. Instead, the task is just a means of learning the weights W of the network. As we will see, these learned weights W are exactly the word vectors we want.

Fake task

Now let's describe the training task (the fake task) that we will use to build the network. Then we will discuss why this task lets us indirectly learn the word vectors we want.

Given a certain word in a sentence (the input word), randomly pick a word near it. The goal of the network is to predict, for every word in the vocabulary, the probability that it is the nearby word we picked.

The output probabilities reflect how likely each word is to appear near the input word. For example, if the input word is "Soviet", words like "Union" and "Russia" should get a high probability, while "watermelon" or "kangaroo" should not.

We train the network by feeding it word pairs found in the training text. The following example shows some of the word pairs taken from "The quick brown fox jumps over the lazy dog", using a window size of 2; the highlighted words are the input words.
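As a rough sketch of how such pairs could be generated (the variable names below are my own, not from the original post):

```python
# A minimal sketch of skip-gram pair generation with window size 2.
sentence = "the quick brown fox jumps over the lazy dog".split()
window_size = 2

pairs = []
for i, input_word in enumerate(sentence):
    # Look at up to window_size words on each side of the input word.
    lo = max(0, i - window_size)
    hi = min(len(sentence), i + window_size + 1)
    for j in range(lo, hi):
        if j != i:
            pairs.append((input_word, sentence[j]))

print(pairs[:5])
# [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown'), ('quick', 'fox')]
```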

The network learns these statistics from the number of times each pair appears. For example, it will see far more training samples of ("Soviet", "Union") than of ("Soviet", "Bigfoot"). So once training is complete and you feed in the word "Soviet", the network will predict "Union" with a much higher probability than "Bigfoot".

Model details

First, we can't feed a word into the network as a text string, so we need another representation. To do this, we first build a vocabulary from the training text. For example, if the text contains 10,000 different words, our vocabulary has 10,000 entries.

"Ants" can become a 10,000-dimensional vector. Each dimension of this vector represents one word. We set the dimension representing "ants" to 1 and the other dimensions to 0.

The output of the network is also a 10,000-dimensional vector, holding the predicted probability for each word in the vocabulary.
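As a small illustrative sketch (the toy vocabulary and helper function here are mine, standing in for the full 10,000-word dictionary):

```python
import numpy as np

# Toy vocabulary standing in for the 10,000-word dictionary.
vocab = ["ants", "car", "fox", "zebra"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Return a one-hot vector: 1 at the word's index, 0 everywhere else."""
    v = np.zeros(len(vocab))
    v[word_to_index[word]] = 1.0
    return v

print(one_hot("ants"))   # [1. 0. 0. 0.]
```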

The network structure is as follows:

There is no activation function on the hidden-layer neurons, and the output layer uses softmax. We will come back to this later.

When this network is trained on a word pair, the training input is a one-hot vector and the training target is also a one-hot vector. However, when you actually evaluate the trained network on an input word, the output is a probability distribution over the whole vocabulary.
(Translator's note: for example, "Soviet" appears with both "Union" and "Russia", so the output converges toward something like a 50%/50% split between them rather than a 100% one-hot distribution. Of course, at test time you could still keep only the word with the highest probability, which would give you a one-hot result again.)

Hidden layer

Say we want to learn 300-dimensional word vectors. The hidden layer is then represented by a weight matrix with 10,000 rows (the number of words in the vocabulary) and 300 columns (one per hidden-layer neuron).

If you look at this matrix row by row, the rows of the hidden-layer weight matrix W are exactly the word vectors you want.

So the ultimate goal is just to let the network learn these hidden-layer weights. Once training is complete, we can throw away the second half of the network, the output layer.
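In practice you would rarely code this training loop yourself. As a sketch of how it looks with the gensim library (assuming gensim 4.x, where sg=1 selects the skip-gram model), the learned hidden-layer weights are exposed directly as the word vectors:

```python
from gensim.models import Word2Vec

# A single toy sentence; a real corpus would be a list of many tokenized sentences.
sentences = [["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]]

# sg=1 selects skip-gram; vector_size is the hidden-layer width (300 in this tutorial).
model = Word2Vec(sentences, vector_size=300, window=2, sg=1, min_count=1)

# model.wv holds the learned hidden-layer weights, i.e. the word vectors themselves.
print(model.wv["fox"].shape)   # (300,)
```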

OK, let's go back to the model definition from earlier.

Now you may be wondering: a one-hot vector is almost all zeros, with just a single 1, so what does multiplying by it actually produce? If you multiply a 1 × 10,000 one-hot vector by a 10,000 × 300 matrix, the result is simply the row of the matrix that lines up with the 1. Here is a simple example:

This means that the hidden layer of this network really just acts as a lookup table: its 300-dimensional output is the word vector of the input word.
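A quick numpy sketch of this lookup behaviour (the weight matrix here is random, purely for illustration):

```python
import numpy as np

vocab_size, embed_dim = 10_000, 300

# Random stand-in for the trained 10,000 x 300 hidden-layer weight matrix.
W_hidden = np.random.rand(vocab_size, embed_dim)

# One-hot vector with a 1 at position 4 (the 5th word in the vocabulary).
x = np.zeros(vocab_size)
x[4] = 1.0

# Multiplying the one-hot vector by the matrix just selects row 4.
h = x @ W_hidden
print(np.allclose(h, W_hidden[4]))   # True
```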

Output layer

The 1 × 300 word vector for "ants" is then passed to the output layer. The output layer is a softmax classifier (translator's note: I think of it as classification rather than regression). Its main purpose is to squash every output into the range 0 to 1 so that the outputs sum to 1.

More specifically, the 1 × 300 word vector is multiplied by an output weight matrix of size 300 × 10,000, mapping it back to 10,000 dimensions, where each dimension scores how likely the corresponding word is to be a neighbor. These scores are then normalized by softmax (i.e., exponentiate with exp and divide by the sum). The following diagram illustrates this.
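Continuing the earlier sketch, the output layer is just one more matrix multiply followed by a softmax (again with random stand-in weights, only to show the shapes):

```python
import numpy as np

vocab_size, embed_dim = 10_000, 300

# Random stand-ins for the trained weight matrices.
W_hidden = np.random.rand(vocab_size, embed_dim)   # 10,000 x 300
W_output = np.random.rand(embed_dim, vocab_size)   # 300 x 10,000

h = W_hidden[4]              # 1 x 300 word vector of the input word
scores = h @ W_output        # 1 x 10,000 scores, one per vocabulary word

# Softmax: exponentiate and normalize so the outputs sum to 1.
probs = np.exp(scores - scores.max())
probs /= probs.sum()

print(probs.shape, round(probs.sum(), 6))   # (10000,) 1.0
```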

Intuition

Let's get an intuitive feel for this network.

If two different words appear in very similar contexts, our model needs to produce similar outputs for them. The most natural way for the network to do that is to make the word vectors of those two words similar. So if two words have similar contexts, we end up with word vectors that are close to each other.

So which words have similar contexts? For example, synonyms such as "intelligent" and "smart", and related words such as "engine" and "gear".

This also handles different word forms: "ant" and "ants" should have similar contexts, and therefore end up with similar vectors.
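One simple way to check this kind of closeness is the cosine similarity between two learned vectors (a sketch with random placeholder vectors; in a trained model these would be rows of the hidden-layer matrix):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two word vectors; close to 1 means very similar."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Placeholder vectors for "ant" and "ants"; with trained vectors,
# similar words would score close to 1 (random vectors here mean nothing).
v_ant = np.random.rand(300)
v_ants = np.random.rand(300)

print(cosine_similarity(v_ant, v_ants))
```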

Next Step

You may have noticed that the skip-gram network has a huge number of weights to learn. With 300-dimensional vectors and a 10,000-word vocabulary, that is 3 million weights × 2: one matrix for the hidden layer and one for the output layer.
Training on an even larger vocabulary would be worse, so the word2vec authors introduced some tricks to make training more tractable. These will be introduced in the next chapter. (I translated that one as well, welcome to read on ~)