This article is based on week 2 of the Sequence Models course on Coursera. In it, I try to summarise and explain the concepts of word representation and word embeddings.
Word Representation:
Generally, we represent a word in natural language processing through a vocabulary, where every word is represented by a one-hot encoded vector. Suppose we have a vocabulary (V) of 10,000 words.
V = [a, aaron, …, zulu, <UNK>]
Suppose the word ‘Man’ is at position 5391 in the vocabulary; then it can be represented by a one-hot encoded vector O₅₃₉₁. The position of the 1 in the sparse vector O₅₃₉₁ is the index of the word Man in the vocabulary.
O₅₃₉₁ = [0, 0, 0, 0, ..., 1, ..., 0, 0, 0]
In the same way, other words in the vocabulary are represented by one-hot encoded vectors: Woman (O₉₈₅₃), King (O₄₉₁₄), Queen (O₇₁₅₇), Apple (O₄₅₆), Orange (O₆₂₅₇).
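As a quick sketch, such a one-hot vector can be built like this; the vocabulary size and word positions follow the article's illustration:

```python
import numpy as np

VOCAB_SIZE = 10_000

def one_hot(index, size=VOCAB_SIZE):
    """Return a sparse vector of zeros with a single 1 at `index`."""
    vec = np.zeros(size)
    vec[index] = 1.0
    return vec

# The article's illustrative positions for Man and Orange.
o_man = one_hot(5391)
o_orange = one_hot(6257)
```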
But this representation is not effective for feeding our algorithms to learn sequence models, because the algorithm cannot capture the relationship between different examples.
Suppose we train our model for the sentence:
I want a glass of orange juice.
And want to predict the next word for the sentence:
I want a glass of apple _____.
Even though both examples are almost the same and our algorithm is well trained, it fails to predict the next word in the test example. The reason is that the inner product between any two distinct one-hot encoded vectors is 0, and the Euclidean distance between any two distinct one-hot vectors is the same (√2), so neither measure tells the algorithm anything about how similar two words are.
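A small sketch of why one-hot vectors carry no similarity information, using the article's illustrative vocabulary positions:

```python
import numpy as np

def one_hot(index, size=10_000):
    vec = np.zeros(size)
    vec[index] = 1.0
    return vec

# Illustrative vocabulary positions from the article.
o_apple = one_hot(456)
o_orange = one_hot(6257)

# The inner product of two distinct one-hot vectors is always 0,
# so the representation encodes no notion of word similarity.
print(np.dot(o_apple, o_orange))  # → 0.0
```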
We know that the next word should be juice in this example, but the algorithm cannot find any relationship between the words of the two sentences, so it fails to predict the word.
To solve this problem we use word embeddings, which are featurized representations of words. For each word in the vocabulary, we can learn a set of features and their values.
Instead of a sparse one-hot encoded vector, we take a dense feature vector for each word. We can take different properties of each word and assign a weight for how strongly the property applies to the word. For example, the Gender property applies strongly to man, woman, king, and queen but not to apple or orange, so a high weight is given to those four and a low weight to apple and orange. Thus we can establish a relationship between words that share the same properties. Now our algorithm is also capable of finding the relation between apple and orange, and can predict the next word accordingly.
In this featurized representation, 300 properties are taken for each word and combined into a vector, so the word Man is now represented by e₅₃₉₁:
e₅₃₉₁ = [-1, 0.01, 0.03, 0.04, ...]
Here e represents the embedded vector.
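A minimal sketch of how dense embeddings expose similarity. The 4-dimensional vectors below are invented illustrations (real embeddings have hundreds of dimensions); only e for man follows the article's example values:

```python
import numpy as np

# Hypothetical 4-D embeddings; "man" follows the article's example,
# the other values are invented for illustration.
emb = {
    "man":    np.array([-1.00, 0.01, 0.03, 0.04]),
    "apple":  np.array([ 0.00, -0.01, 0.95, 0.70]),
    "orange": np.array([ 0.01, 0.00, 0.97, 0.69]),
}

# Unlike one-hot vectors, related words now have a large inner product.
print(np.dot(emb["apple"], emb["orange"]))  # large: similar words
print(np.dot(emb["man"], emb["apple"]))     # small: unrelated words
```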
We can use the t-SNE (t-distributed stochastic neighbor embedding) machine-learning algorithm to visualize these words in a 2-D plot.
In such a plot, words that share the same properties appear as neighbors, while words that have no common properties are far apart.
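One possible way to produce such a projection, assuming scikit-learn is installed; the 4-D vectors below are illustrative stand-ins for real 300-D embeddings:

```python
import numpy as np
from sklearn.manifold import TSNE

words = ["man", "woman", "king", "queen", "orange"]
vectors = np.array([
    [-1.00, 0.04, 0.03, 0.09],
    [ 1.00, 0.02, 0.02, 0.01],
    [-0.95, 0.93, 0.70, 0.02],
    [ 0.97, 0.95, 0.69, 0.01],
    [ 0.01, 0.00, 0.02, 0.97],
])

# Project the 4-D vectors down to 2-D points for plotting; perplexity
# must be below the number of points for such a tiny example.
points = TSNE(n_components=2, perplexity=2, init="random",
              random_state=0).fit_transform(vectors)

for word, (x, y) in zip(words, points):
    print(f"{word}: ({x:.2f}, {y:.2f})")
```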
Using Word Embeddings for Named Entity Recognition:
Named entity recognition is the task of identifying a name in a given text. For example, in the text
Sally Johnson is an orange farmer.
We know that Sally Johnson is the name in the example above, and we train our model on such examples. Now we give a test example to our algorithm like
Robert Lin is an apple farmer.
Using word embeddings, the algorithm easily recognizes that apple farmer and orange farmer are related, so it can predict that Robert Lin is the name in the test example.
This is an easy task for our algorithm. Now we take another example:
Robert Lin is a durian cultivator.
Durian is a fruit popular in Singapore and a few other countries. This term is not available in the vocabulary of our training data because our training set is small, so there is a high chance that the algorithm fails to recognize the name in the sentence.
In such cases word embeddings are useful, because they can be learned from large unlabeled datasets, which are freely available online. The word embeddings learned from such a text corpus can then be transferred to our model. In this manner, word embeddings can be used through transfer learning, which is especially useful when our own text corpus is small.
In the example above, by learning embeddings from a large text corpus and transferring that knowledge to our task, the algorithm can identify that durian and orange share the same properties and that farmer is related to cultivator, so it can predict that the name in the sentence is Robert Lin.
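A sketch of what loading pretrained embeddings can look like. Many public embedding sets (e.g. GloVe) ship as a plain text file with a word followed by its vector components on each line; the two in-memory lines and their values below are invented stand-ins for such a file:

```python
import io
import numpy as np

# Stand-in for a large pretrained embedding file (one word plus its
# vector per line); real vectors have 50-300 dimensions.
pretrained_file = io.StringIO(
    "orange 0.01 0.00 0.97 0.69\n"
    "durian 0.00 0.02 0.96 0.71\n"
)

embeddings = {}
for line in pretrained_file:
    word, *values = line.split()
    embeddings[word] = np.array(values, dtype=float)

# "durian" never appeared in our small labeled set, but its pretrained
# vector sits close to "orange", so that knowledge transfers to our task.
diff = np.linalg.norm(embeddings["durian"] - embeddings["orange"])
print(diff)  # small distance: the two fruits share properties
```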
Properties Of Word Embeddings
Word embeddings also help with analogy reasoning. Analogy reasoning is not just a use case of word embeddings; it also helps us understand how word embeddings work and what we can do with them.
Consider the word embeddings for a set of words. Now we want to know: if Man is related to Woman in a certain manner, then King is related to whom in the same manner?
It can be represented as
Man ----> Woman
King ----> ?
Here we use only a 4-dimensional vector instead of 50, 100, or more dimensions. The embedded vectors for the words Man, Woman, King, and Queen can be written as:
eman = [-1, 0.04, 0.03, 0.09]
ewoman = [1, 0.02, 0.02, 0.01]
eking = [-0.95, 0.93, 0.70, 0.02]
equeen = [0.97, 0.95, 0.69, 0.01]
If we take the difference between the vectors eman and ewoman, it is approximately:

eman - ewoman ≈ [-2, 0, 0, 0]

And if we take the difference between the vectors eking and equeen, it is also approximately the same:

eking - equeen ≈ [-2, 0, 0, 0]
So our algorithm is able to figure out that if Man is related to Woman, then King is related to Queen in the same manner.
In practice, the algorithm does not simply take the difference between king and queen; it finds the word that maximizes the similarity in the equation given below:
eman - ewoman ≈ eking - ew
find word w: argmax_w sim(ew, eking - eman + ewoman)
Here w ranges over the words in the text corpus. We perform the vector arithmetic on eking, eman, and ewoman and check the similarity of the resulting vector with every other embedded vector. The embedded vector with the maximum similarity gives the output word.
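The analogy search can be sketched as follows. The four course vectors follow the illustration above, with queen's gender component positive (+0.97) to mirror woman's +1; the apple and orange values are invented distractors so the argmax has something to reject:

```python
import numpy as np

# Illustrative 4-D embeddings; apple/orange are invented distractors.
emb = {
    "man":    np.array([-1.00, 0.04, 0.03, 0.09]),
    "woman":  np.array([ 1.00, 0.02, 0.02, 0.01]),
    "king":   np.array([-0.95, 0.93, 0.70, 0.02]),
    "queen":  np.array([ 0.97, 0.95, 0.69, 0.01]),
    "apple":  np.array([ 0.00, -0.01, 0.95, 0.70]),
    "orange": np.array([ 0.01, 0.00, 0.97, 0.69]),
}

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Solve "man is to woman as king is to ?" by maximizing
# sim(e_w, e_king - e_man + e_woman) over the remaining words.
target = emb["king"] - emb["man"] + emb["woman"]
candidates = [w for w in emb if w not in {"man", "woman", "king"}]
best = max(candidates, key=lambda w: cosine(emb[w], target))
print(best)  # → queen
```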
To calculate the similarity we use the cosine similarity function, because when the angle between two vectors is 0 their cosine value is 1, which is the maximum, hence we get the most similar result. If two vectors are at an angle of 90° it gives 0, which means the two vectors are not related at all, and if the angle between the vectors is 180° it gives -1, which means the vectors are related but point in opposite directions. The formula for the cosine similarity between two vectors A and B is:

sim(A, B) = (A · B) / (‖A‖ ‖B‖)
We could use the normal Euclidean distance formula to calculate similarity, but cosine similarity gives more convenient results, so it is used most often. The main difference between Euclidean distance and cosine similarity is that cosine similarity normalizes the vectors, as shown in the formula, while Euclidean distance is generally a measure of dissimilarity.
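A small sketch contrasting the two measures: cosine similarity normalizes away vector length, while Euclidean distance does not:

```python
import numpy as np

def cosine_similarity(a, b):
    """Normalized inner product: 1 for same direction, -1 for opposite."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
b = 2 * a  # same direction, twice the length

# Cosine ignores magnitude, so a and b are maximally similar...
print(cosine_similarity(a, b))  # ≈ 1.0
# ...while Euclidean distance still reports them as far apart.
print(np.linalg.norm(a - b))
```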
This is how the word embedding is useful for analogy reasoning.
For sequence learning tasks in natural language processing, one-hot encoded representations are not useful; for such tasks we use word embeddings. Word embeddings turn sparse vectors into dense vectors by considering the properties of the words in the vocabulary. If our training text corpus is small, we can use transfer learning to make our algorithm more robust. Word embeddings can be understood through analogy reasoning, which gives precise insight into how they work. Cosine similarity is the most widely used measure to calculate the similarity between two vectors.
Thank you for reading this article. It is just an explanation of the lectures in the course Sequence Models. If you find any mistake or want to add some information, feel free to comment. Contact me on LinkedIn for further discussion.