Word Embedding : Text Analysis : NLP : Part-2 : Intuition behind Word2Vec
Advanced techniques for large unstructured text data
In my previous article Word Embedding : Text Analysis : NLP : Part-1 we discussed techniques like OneHotEncoding and TF-IDF to convert text data into numerical data. These methods make it easy to convert text into a vector format that machines can understand, but they come with some limitations.
- The semantic information of the words in a sentence cannot be stored, because these methods only capture the weight of each word in a document.
- The relevance of a word to a particular class cannot be determined, because these methods count the weight of a word based only on the documents it appears in.
For example, if we are working with speech, images or large sequences of text data, traditional methods like Bag of Words with one-hot encoded values or TF-IDF lead to large sparse matrices, which in turn can cause overfitting of an ML model. They also cannot store the semantic relationships between words. To solve these issues and work with long sequences, we will discuss more advanced word embedding methods like Word2Vec, GloVe and FastText, which are based on deep learning techniques. Let’s take a look at the embedding techniques mentioned above.
Word2Vec
Word2Vec was published by Google in 2013 to represent words in a dense vector form using a deep learning technique. It is a kind of unsupervised network that is trained on a vocabulary built from large textual data, and it generates an embedding for each word in a vector space. Here, the dimensionality of the vectors can be controlled, which was a problem with BOW and TF-IDF as they convert texts into high-dimensional matrices.
Word2Vec generates embeddings with two different deep learning architectures: Skip-Gram and Continuous Bag of Words (CBOW).
Continuous Bag of Words (CBOW) Model
This method takes the context of a word as input and tries to predict the word corresponding to that context. Let’s take the example “The boy is learning NLP”. We can predict the target_word “boy” with the help of a context_window. If we take a context window of 2, the context combination would be [the, is], and with the help of a neural network the target word can be predicted.
The architecture is a single-layer network in which the input and target are one-hot encoded. There are two weight matrices: one between the input and hidden layers, and one between the hidden and output layers. The input is multiplied by the input-hidden weights, and the hidden layer is multiplied by the hidden-output weights to get the output. The error between the target and the prediction is calculated, and the weights are updated during backpropagation. A linear activation function is used in the hidden layer.
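To make this forward and backward pass concrete, here is a minimal NumPy sketch of the CBOW step described above. It assumes a toy vocabulary of size V and an embedding dimension N; the variable names (W_in, W_out, context_ids, target_id) are illustrative and are not the Gensim internals.
import numpy as np

# Toy CBOW step: two weight matrices, linear hidden layer, softmax output
V, N = 5, 3                                    # vocabulary size, embedding dimension
rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(V, N))      # input-hidden weights
W_out = rng.normal(scale=0.1, size=(N, V))     # hidden-output weights

context_ids = [0, 2]                           # one-hot indices of the context words, e.g. [the, is]
target_id = 1                                  # index of the target word, e.g. boy

# Forward pass: average the context word vectors (linear activation in the hidden layer)
h = W_in[context_ids].mean(axis=0)             # shape (N,)
scores = h @ W_out                             # shape (V,)
probs = np.exp(scores) / np.exp(scores).sum()  # softmax over the vocabulary

# Error between the predicted distribution and the one-hot target
error = probs.copy()
error[target_id] -= 1.0

# Backpropagation: update both weight matrices with gradient descent
lr = 0.1
grad_h = W_out @ error                         # gradient flowing back into the hidden layer
W_out -= lr * np.outer(h, error)
for idx in context_ids:
    W_in[idx] -= lr * grad_h / len(context_ids)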
The Skip-gram Model
The Skip-Gram model works in the reverse direction of the CBOW model: it tries to predict the context words given a target word. The input vector of the Skip-Gram model is similar to that of the CBOW model. The difference is in the target variable, because we are predicting context words on both sides. An error vector is calculated separately for each context position, and the final error obtained from them is backpropagated to update the weights. After training, the weights between the input and the hidden layer are taken as the vector representations of the words.
This is the architecture of the Skip-Gram model. The input vector has dimension 1xV, the input-hidden weight matrix is VxN, and the output layer is of dimension 1xV. Using the input vector, the hidden layer weights and the hidden activation, we get an output for each of the 2 context words (as we are predicting 2 context words). This output is sent to a softmax layer to convert it into probabilities. The error is calculated by subtracting the actual one-hot context word from the softmax output. Finally, an element-wise sum is taken over all the error vectors to get the final error vector, which is propagated back to update the weights.
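Below is a similar NumPy sketch of the Skip-Gram forward pass and error computation described above, with a 1xV one-hot input, VxN input-hidden weights and C = 2 context words to predict; again, the names are illustrative only.
import numpy as np

# Toy Skip-Gram step: predict C context words from one target word
V, N, C = 5, 3, 2                              # vocabulary size, embedding dim, context words
rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(V, N))      # V x N input-hidden weights
W_out = rng.normal(scale=0.1, size=(N, V))     # N x V hidden-output weights

target_id = 1                                  # 1 x V one-hot input, represented by its index
context_ids = [0, 2]                           # the C context words we want to predict

h = W_in[target_id]                            # hidden layer, shape (N,)
scores = h @ W_out                             # one 1 x V score vector shared by all context positions
probs = np.exp(scores) / np.exp(scores).sum()  # softmax probabilities

# One error vector per context word, then an element-wise sum gives the final error
errors = [probs.copy() for _ in context_ids]
for e, cid in zip(errors, context_ids):
    e[cid] -= 1.0                              # softmax output minus the one-hot context word
final_error = np.sum(errors, axis=0)           # backpropagated to update W_out and W_in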
The Skip-Gram model is trained on skip-grams, which are n-grams that allow tokens to be skipped, as can be seen in the diagram below. The context of a word is represented through pairs of (target_word, context_word), where context_word appears in the neighboring context of target_word.
Consider the following sentence of 8 words:
The wide road shimmered in the hot sun.
The context can be defined by a window size for all 8 words of the sentence. The window size represents the span of words on either side of the target word that are considered context words. In the example below, a window size of 2 is used.
The Skip-Gram model also uses negative sampling, which can be seen in the figure above. In the first line, the pair [wide, shimmered] is a negative sample, i.e. a pair that was not generated from the context window of size 2.
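To make the pair generation concrete, here is a small pure-Python sketch (not the TensorFlow or Gensim implementation) that builds positive (target_word, context_word) pairs with a window size of 2 for the example sentence, and then draws a negative sample by pairing the target with a word that never appears among its context words.
import random

sentence = "The wide road shimmered in the hot sun".lower().split()
window_size = 2

# Positive skip-gram pairs: (target_word, context_word) taken from inside the window
positive_pairs = []
for i, target in enumerate(sentence):
    for j in range(max(0, i - window_size), min(len(sentence), i + window_size + 1)):
        if j != i:
            positive_pairs.append((target, sentence[j]))

print(positive_pairs[:4])
# [('the', 'wide'), ('the', 'road'), ('wide', 'the'), ('wide', 'road')]

# Negative sampling: pair the target with a random vocabulary word
# that does not appear among its observed context words
vocab = sorted(set(sentence))
positives = set(positive_pairs)

def sample_negative(target):
    candidates = [w for w in vocab if w != target and (target, w) not in positives]
    return (target, random.choice(candidates))

print(sample_negative('wide'))                 # e.g. ('wide', 'hot')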
Implement Word2Vec Model with Gensim
We can implement Word2Vec models with the help of PyTorch or TensorFlow, as well as with pretrained libraries. Here, we will implement Word2Vec using the Gensim library. Gensim is an efficient, scalable and free API for building Word2Vec models. We will take a small corpus, tokenize the data and build a Word2Vec model using the given parameters.
from gensim.models import word2vec
import nltk

raw_text = """Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data. The result is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.""".split('.')

# tokenize sentences in corpus
wpt = nltk.WordPunctTokenizer()
tokenized_corpus = [wpt.tokenize(document) for document in raw_text]

# Set values for various parameters
feature_size = 100    # Word vector dimensionality
window_context = 30   # Context window size
min_word_count = 1    # Minimum word count
sample = 1e-3         # Downsample setting for frequent words

w2v_model = word2vec.Word2Vec(tokenized_corpus, size=feature_size,
                              window=window_context, min_count=min_word_count,
                              sample=sample, iter=100)
# view similar words based on gensim's model
similar_words = {search_term: [item[0] for item in w2v_model.wv.most_similar([search_term], topn=5)]
for search_term in ['extract','information','insights']}
similar_words
## Output
{'extract': ['linguistics', '"', 'of', 'including', 'a'],
 'information': ['program', 'within', 'subfield', 'result', 'contained'],
 'insights': ['well', ')', 'how', 'categorize', 'particular']}
Here, we have used the following parameters (a Gensim 4.x equivalent is shown after this list):
- feature_size : dimensionality of the embedding vector
- window : context window size
- min_count : minimum word count
- sample : downsample setting for frequent words
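Note that the snippet above uses the Gensim 3.x keyword arguments. If you are running Gensim 4.0 or later, size and iter were renamed to vector_size and epochs, so the equivalent call would look roughly like this:
# Equivalent model construction on Gensim >= 4.0, where size -> vector_size
# and iter -> epochs (the other arguments keep their names)
w2v_model = word2vec.Word2Vec(tokenized_corpus,
                              vector_size=feature_size,
                              window=window_context,
                              min_count=min_word_count,
                              sample=sample,
                              epochs=100)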
After training the Word2Vec model we can inspect similar words. Since we have taken a small corpus, the results are not perfect. Let’s plot these similar words to get a better idea.
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

words = sum([[k] + v for k, v in similar_words.items()], [])
wvs = w2v_model.wv[words]

tsne = TSNE(n_components=2, random_state=0, n_iter=10000, perplexity=2)
np.set_printoptions(suppress=True)
T = tsne.fit_transform(wvs)
labels = words

plt.figure(figsize=(14, 8))
plt.scatter(T[:, 0], T[:, 1], c='orange', edgecolors='r')
for label, x, y in zip(labels, T[:, 0], T[:, 1]):
    plt.annotate(label, xy=(x+1, y+1), xytext=(0, 0), textcoords='offset points')
plt.show()
We can also check the vector representation of any word. Since we have chosen 100 as the dimensionality, we can see the word ‘Natural’ as a 100-dimensional vector.
w2v_model.wv['Natural']

## Output
array([-0.00181083, 0.00448997, 0.00354504, -0.00115317, -0.0043177 ,
0.00021277, 0.00230537, -0.00092127, 0.00368612, 0.00396339,
-0.00432571, 0.00493624, 0.00061124, -0.0022428 , -0.0034374 ,
0.00289598, -0.00166694, 0.00477126, -0.00183866, 0.00382926,
-0.00392748, -0.00271961, -0.0046588 , -0.00226145, 0.00373776,
-0.00416674, -0.00012754, -0.00381866, 0.00321343, -0.00240004,
0.00363028, -0.00328013, -0.00178409, -0.00235526, -0.0009552 ,
0.0038814 , 0.00266416, -0.00086921, 0.00155356, 0.00076463,
0.00114544, 0.00430724, -0.00231419, -0.00074246, 0.00282576,
0.00001543, 0.00400576, 0.00251022, 0.00396486, -0.00001084,
-0.00405879, 0.00177121, 0.00315285, -0.00270734, -0.00044774,
-0.00409922, 0.00101802, -0.00142405, 0.00274875, 0.00204634,
0.00324002, -0.00225011, -0.00113855, -0.00219695, 0.00319166,
0.00071592, -0.00315284, 0.00185551, 0.00097711, -0.00070271,
-0.00369302, -0.00356723, -0.00135671, 0.00045015, -0.00433108,
0.00392874, 0.00140838, -0.00117412, 0.00413575, 0.00170295,
-0.00394475, 0.00001962, -0.00357439, -0.00057754, -0.00097144,
0.0035095 , -0.00366997, -0.00320746, 0.00418892, -0.0023406 ,
0.00005771, -0.00132968, 0.00197431, -0.00197416, 0.00487064,
-0.00363371, -0.00282901, -0.00457178, 0.00202645, -0.00152074],
dtype=float32)
Conclusion:
In this blog we have seen Word2Vec in depth. Next, we will learn about other advanced methods like GloVe and FastText.
Suggestions are highly appreciated.