Word embeddings have made significant improvements in many NLP tasks possible. Their ability to capture word semantics and to represent texts of different lengths as fixed-length vectors has made them very popular for many complex NLP tasks. Most machine learning algorithms can be applied directly to word embeddings for classification and regression tasks because the vector length is fixed. In this blog, we will look at the packages that help us implement Word2Vec using two popular methods, CBOW and Skip-Gram. We will also look at some properties and visualisations of embeddings.
Training CBOW and Skip-Gram
We can take the short paragraph above as the text for learning word embeddings. We will see how to write code that represents the words of this text in a dense vector space.
As explained in earlier blogs, we first tokenise the text using NLTK and then use Word2Vec from the gensim library. The parameter “sg” specifies the training algorithm: CBOW (0) or Skip-Gram (1).
Here we can clearly see the dense vector representation of the word “the”. It is a 50-dimensional vector; the dimensionality is also given as a parameter to the Word2Vec function of the gensim library.
The most common visualisation method is to project the high-dimensional vector of a word (50 dimensions in our example) down into 2 dimensions.
Dimensionality reduction techniques like PCA and t-SNE can be applied to the dense vectors to create 2- or 3-dimensional vectors. Let’s discuss the concept of t-SNE briefly and also understand why it is popular for visualising word embeddings.
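To make the projection step concrete, here is a minimal PCA sketch using only NumPy. The helper name pca_project and the random matrix standing in for trained word vectors are assumptions made for illustration; with a real model, the rows would be the learned embeddings.

```python
import numpy as np


def pca_project(vectors, n_components=2):
    """Project row vectors onto their top principal components."""
    centered = vectors - vectors.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


rng = np.random.default_rng(0)
embeddings = rng.normal(size=(20, 50))  # placeholder for 20 word vectors of 50 dimensions
points_2d = pca_project(embeddings)
print(points_2d.shape)
```

Each row of points_2d can then be drawn as a point on a scatter plot and labelled with its word.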
TSNE (t-distributed Stochastic Neighbor Embedding)
t-Distributed Stochastic Neighbor Embedding (t-SNE) is primarily used for data exploration and for visualising high-dimensional data. It helps us reduce high-dimensional data to 2 or 3 dimensions, which makes it easy to plot and to get an intuition for these high-dimensional data points.
The t-SNE algorithm calculates a similarity measure between pairs of instances in the high-dimensional space and in the low-dimensional space, and it tries to preserve the similarities when moving from the higher to the lower dimensional space. But how do we quantify similarity in the two spaces? We need a scale-invariant measure of similarity, so that similarities in the higher and lower dimensional spaces can be compared directly.
Similarity Measure in Higher Dimensions (Joint Probability): For each data point we centre a Gaussian distribution over that point. Then we measure the density of all other points under that Gaussian. The affinities in the original space are represented by Gaussian joint probabilities.
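The Gaussian affinities can be sketched as follows. This is a simplified illustration: it uses one fixed bandwidth sigma for every point, whereas t-SNE implementations tune a separate bandwidth per point from a perplexity setting. The function name and the random input are assumptions.

```python
import numpy as np


def gaussian_affinities(X, sigma=1.0):
    """Symmetric joint probabilities p_ij from Gaussians centred on each point."""
    n = X.shape[0]
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    weights = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(weights, 0.0)  # a point is not its own neighbour
    conditional = weights / weights.sum(axis=1, keepdims=True)  # p_{j|i}
    return (conditional + conditional.T) / (2.0 * n)  # p_ij


rng = np.random.default_rng(0)
X = rng.normal(size=(6, 5))  # placeholder high-dimensional points
P = gaussian_affinities(X)
```

The entries of P sum to 1, so P can be treated as a joint probability distribution over pairs of points.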
Similarity Measure in Lower Dimensions (Joint Probability): Instead of a Gaussian distribution, we use a Student t-distribution with one degree of freedom. So the affinities in the embedded space are represented by joint probabilities based on the Student t-distribution.
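A matching sketch for the low-dimensional side uses the kernel 1 / (1 + squared distance), which is the Student t-distribution with one degree of freedom up to a constant factor. The function name and the three example points are illustrative assumptions.

```python
import numpy as np


def student_t_affinities(Y):
    """Joint probabilities q_ij from a Student t kernel with one degree of freedom."""
    sq_dists = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    weights = 1.0 / (1.0 + sq_dists)
    np.fill_diagonal(weights, 0.0)  # exclude self-pairs
    return weights / weights.sum()  # normalise over all pairs


Y = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 0.0]])  # made-up 2-D map points
Q = student_t_affinities(Y)
```

The heavy tails of this kernel let dissimilar points sit far apart in the map without being penalised too strongly.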
Cost Function: In order to preserve the similarity measure from the higher dimensions to the lower dimensions, we need a metric/cost function that measures the difference between the two joint probability distributions.
Kullback-Leibler (KL) divergence is our choice, since it is a very popular measure of how one probability distribution differs from another. We can use gradient descent to minimise our KL cost function.
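The cost itself is short to write down. The sketch below computes KL(P || Q) for two small hand-made joint distributions; the matrices, the function name and the eps guard are assumptions for illustration, and the gradient descent step is not shown.

```python
import numpy as np


def kl_divergence(P, Q, eps=1e-12):
    """KL(P || Q) = sum of p_ij * log(p_ij / q_ij), skipping entries where p_ij is 0."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / np.maximum(Q[mask], eps))))


# Made-up symmetric joint distributions over pairs of 3 points (each sums to 1).
P = np.array([[0.0, 0.3, 0.2], [0.3, 0.0, 0.0], [0.2, 0.0, 0.0]])
Q = np.array([[0.0, 0.1, 0.4], [0.1, 0.0, 0.0], [0.4, 0.0, 0.0]])

print(kl_divergence(P, Q), kl_divergence(Q, P))
```

The two printed values differ, which shows that KL divergence is not symmetric and therefore not a distance in the strict sense.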
t-SNE is a popular technique for visualising word embeddings because of its ability to preserve small pairwise distances, or local similarities, unlike other dimensionality reduction techniques such as PCA, which are concerned with preserving large pairwise distances to maximise variance.
Hierarchical Clustering Visualisation
Another popular visualisation method is to use a clustering algorithm to show a hierarchical representation of which words are similar to others in the embedding space.
This hierarchy of clusters is represented as a tree (or dendrogram). The root of the tree is the unique cluster that gathers all the samples, the leaves being the clusters with only one sample.
The code for generating the above plot:
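A minimal sketch of how such a dendrogram could be produced, assuming SciPy is available. The word list and the random vectors are placeholders for a trained vocabulary and its embeddings, and the choice of average linkage with cosine distance is an assumption rather than a setting from this post.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

words = ["king", "queen", "apple", "orange", "car", "truck"]  # hypothetical vocabulary
rng = np.random.default_rng(0)
vectors = rng.normal(size=(len(words), 50))  # placeholder for trained word vectors

# Agglomerative clustering: repeatedly merge the two closest clusters.
merges = linkage(vectors, method="average", metric="cosine")

# no_plot=True returns the tree layout without needing matplotlib;
# remove it and call matplotlib.pyplot.show() to draw the dendrogram.
tree = dendrogram(merges, labels=words, no_plot=True)
print(tree["ivl"])  # leaf labels in left-to-right order
```

Each row of merges records one merge step, so n words always produce n - 1 rows.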
Allocation harm: Embedding analogies also exhibit gender stereotypes and other biases that are implicit in the text. For example, the profession “doctor” is close to “man” and the profession “nurse” is close to “woman”. An NLP engineer must keep this inherent bias of embeddings in mind when modelling with them. Debiasing of embeddings deals with removing such bias from the embeddings.
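One simple idea behind debiasing is to remove the component of a word vector that lies along a bias direction. The sketch below shows only that projection step on made-up 3-dimensional vectors; the numbers are invented for illustration, and practical debiasing methods involve more than this single step.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def neutralize(vector, direction):
    """Remove the component of `vector` that lies along `direction`."""
    unit = direction / np.linalg.norm(direction)
    return vector - (vector @ unit) * unit


# Made-up vectors, for illustration only (not taken from a trained model).
man = np.array([1.0, 0.2, 0.1])
woman = np.array([-1.0, 0.2, 0.1])
doctor = np.array([0.4, 0.9, 0.3])

gender_direction = man - woman
debiased = neutralize(doctor, gender_direction)

print(cosine(doctor, man), cosine(doctor, woman))      # unequal before
print(cosine(debiased, man), cosine(debiased, woman))  # equal after
```

After the projection, the toy “doctor” vector is equally similar to “man” and “woman”.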
First- and second-order co-occurrence: Two words are said to have first-order co-occurrence if they typically appear near each other. Two words have second-order co-occurrence if they have similar neighbours.
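The difference can be seen on a toy corpus. In the sketch below, first-order co-occurrence is read directly from window counts, while second-order co-occurrence is measured as the cosine similarity between two words' neighbour counts. The sentences, the window size and the helper names are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from math import sqrt


def cooccurrence_counts(sentences, window=1):
    """Map each word to a Counter of words seen within `window` positions of it."""
    counts = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][tokens[j]] += 1
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Made-up toy corpus, already tokenised.
sentences = [
    "she wrote a poem".split(),
    "she said a word".split(),
    "he wrote a book".split(),
    "he said a poem".split(),
]
counts = cooccurrence_counts(sentences)

print(counts["wrote"]["a"])      # first-order: "wrote" appears next to "a"
print(counts["wrote"]["said"])   # "wrote" and "said" never appear together
print(cosine(counts["wrote"], counts["said"]))  # second-order: shared neighbours
```

Here “wrote” and “said” never occur next to each other, yet they share the same neighbours, so they are second-order associates.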