Improving word embeddings using a Kernel PCA skip gram model

Vishwani Gupta
Jan 24, 2018


Natural Language Processing problems are not trivial; they are often called "AI-hard" problems in computer science. To understand a sentence, one must understand not only the words used in it but also the context in which those words are used, since a word can have very different meanings in different contexts. This makes language understanding ambiguous.
A common starting point is to focus on the "atomic units" of a language: words. To obtain precise results, we need a numeric vector representation for words. These word representations help in extracting relations from human language data, whether text or speech.

There are many representations proposed over time:

  1. One-hot vectors
  2. Skip gram model and CBOW model by Tomas Mikolov
  3. fastText by Facebook

These approaches do not produce good word vectors when we have a relatively small data set, since they depend on learning from the context in which words are used and do not consider the morphological information embedded in the words. fastText uses subword information, but it still does not produce high-quality word vectors on smaller datasets.

We propose an alternative approach which not only makes use of subword information but also performs well on smaller datasets. It can be applied to both morphologically rich and morphologically poor languages. The main challenges we address when generating word embeddings are:

  1. The need to train on a huge text corpus to extract relationships between words.
  2. Handling out-of-vocabulary words: n-gram based approaches can do this, but at the cost of a significant increase in vocabulary size.
  3. Slow training, even after applying optimization techniques such as hierarchical softmax and negative sampling.

Below, we first discuss the nuts and bolts of the approach and then compare our model, the Kernel PCA (KPCA) skip gram model, with the word2vec skip gram model trained using the same parameters.

Nuts and bolts of the approach:

Kernel Function:

The kernel trick is a mathematical tool that can be applied to any algorithm which relies on dot products between vectors. It states that the dot product of two vectors in a higher-dimensional space can be computed directly from the lower-dimensional vectors using a kernel function, so every dot product in the algorithm can be replaced by a kernel evaluation. A kernel function can thus be defined as a function that takes vectors in the original space as input and returns the dot product of their images in the feature space. The feature space is usually a higher-dimensional space where the data becomes linearly separable.

Transformation from one space to another using a kernel function φ (Mooney [2007])
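To make the idea concrete, here is a tiny illustration (not from the original post): for the polynomial kernel k(x, y) = (x·y)², evaluating the kernel on the original 2-dimensional vectors returns exactly the dot product of their images under an explicit 3-dimensional feature map φ.

```python
# A minimal illustration of the kernel trick (illustrative code, not from the post):
# the polynomial kernel k(x, y) = (x . y)^2 evaluated in the original 2-d space
# equals the dot product in the 3-d feature space phi(x) = (x1^2, sqrt(2) x1 x2, x2^2).
import numpy as np

def poly_kernel(x, y):
    return np.dot(x, y) ** 2

def phi(x):
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
y = np.array([3.0, 0.5])

print(poly_kernel(x, y))         # kernel computed in the original space: 16.0
print(np.dot(phi(x), phi(y)))    # identical dot product in the feature space: 16.0
```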

In order for a function to be a valid kernel function, it should satisfy Mercer's theorem. According to Mercer's theorem, every positive definite symmetric function can be used as a kernel function. Here, a positive definite symmetric function corresponds to a positive definite symmetric Gram matrix, i.e. a matrix whose eigenvalues are all positive. Therefore, a function which fulfills these conditions can be referred to as a kernel function.

We usually prefer to work with the Gaussian kernel:

The Gaussian RBF kernel is very popular and makes a good default kernel as the feature space of this kernel has an infinite number of dimensions.
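As a quick sanity check (a minimal numpy sketch, not code from the post), the Gaussian kernel k(x, y) = exp(−||x − y||² / (2σ²)) produces a Gram matrix on sample data that is symmetric with non-negative eigenvalues, which is exactly the Mercer condition described above.

```python
# A short numpy sketch (assumption: sigma = 1.0 chosen arbitrarily for illustration).
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    # k(x, y) = exp(-||x - y||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

X = np.random.randn(8, 4)                       # 8 sample vectors of dimension 4
K = np.array([[gaussian_kernel(a, b) for b in X] for a in X])

print(np.allclose(K, K.T))                      # the Gram matrix is symmetric
print(np.linalg.eigvalsh(K).min() >= -1e-10)    # no negative eigenvalues (up to round-off)
```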

Principal Component Analysis:

Principal component analysis (PCA) is a very important tool in data analysis. It reduces a complex data set to a lower-dimensional space, revealing hidden dynamics, and it helps find the most meaningful basis to re-express a noisy data set.
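For reference, here is a minimal PCA sketch in plain numpy (an illustration, not the code used in this work): center the data, eigendecompose the covariance matrix, and project onto the leading components.

```python
# A minimal PCA sketch (illustrative code, not from the post).
import numpy as np

def pca(X, n_components):
    X_centered = X - X.mean(axis=0)                 # center the data
    cov = np.cov(X_centered, rowvar=False)          # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)          # eigenvalues in ascending order
    top = eigvecs[:, ::-1][:, :n_components]        # leading principal directions
    return X_centered @ top                         # data re-expressed in the new basis

X = np.random.randn(100, 10)
print(pca(X, n_components=2).shape)                 # (100, 2)
```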

Kernel PCA:

Kernel Principal Component Analysis is the nonlinear form of PCA: it generalizes standard PCA to nonlinear dimensionality reduction. In other words, it is a non-linear dimensionality reduction technique based on a kernel function. Schölkopf et al. [1997] have shown that PCA can be generalized to non-linear data via a kernel function.

Kernel PCA: a nonlinear kernel function κ is used instead of the standard dot product. It implicitly performs PCA in a possibly high-dimensional space F which is nonlinearly related to the input space. (Schölkopf et al. [1997])
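The following is a small illustrative reimplementation of Kernel PCA along those lines (not the authors' code): build the kernel matrix, center it in feature space, eigendecompose it, and project the data onto the leading kernel principal components.

```python
# A minimal Kernel PCA sketch in the spirit of Schölkopf et al. [1997]
# (illustrative code; kernel and data are arbitrary choices for the example).
import numpy as np

def kernel_pca(X, kernel, n_components):
    n = X.shape[0]
    K = np.array([[kernel(a, b) for b in X] for a in X])

    # Centering the kernel matrix is equivalent to centering phi(x) in feature space.
    one_n = np.full((n, n), 1.0 / n)
    K_centered = K - one_n @ K - K @ one_n + one_n @ K @ one_n

    eigvals, eigvecs = np.linalg.eigh(K_centered)   # ascending order
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

    # Project the training points onto the top kernel principal components.
    alphas = eigvecs[:, :n_components]
    lambdas = eigvals[:n_components]
    return K_centered @ alphas / np.sqrt(lambdas)

rbf = lambda x, y, sigma=1.0: np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))
X = np.random.randn(50, 5)
print(kernel_pca(X, rbf, n_components=3).shape)     # (50, 3)
```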

Skip gram model:

Skip gram is a model proposed by Mikolov et al. which tries to maximize the classification of a word based on the words in the same sentence. In other words, the model predicts the context words given their center word.

Skip-Gram Model (Rong [2014])

Here the maximum distance of the context words from the input word, represented by C, also comes into play. This distance is important because more distant words are usually less related to the center word, so one can assign less weight to distant context words and thereby obtain better training examples.
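As a sketch of how such training pairs are produced (illustrative code, not the original implementation), word2vec down-weights distant words by drawing the effective window size uniformly from 1..C for every center word, so nearby context words are sampled more often:

```python
# Generating (center, context) skip gram training pairs (illustrative sketch).
import random

def skipgram_pairs(tokens, C=5):
    pairs = []
    for i, center in enumerate(tokens):
        window = random.randint(1, C)    # shrink the window at random: distant words appear less often
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))   # each pair is one classification example
    return pairs

sentence = "the quick brown fox jumps over the lazy dog".split()
print(skipgram_pairs(sentence, C=2)[:8])
```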

Approach:

  1. Pre-processing a dataset such as Text8, 20NewsGroup, or a Wikipedia dump.
  2. Computing Kernel PCA embeddings for the words in the vocabulary.
  3. Computing Kernel PCA embeddings for out-of-vocabulary words.
  4. Training a word2vec (skip gram) model initialized with the Kernel PCA embeddings in TensorFlow (a sketch of this pipeline follows below).
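Below is a hedged end-to-end sketch of this pipeline. The character n-gram Jaccard similarity used as the word kernel is an assumption chosen for illustration (the post only states that Kernel PCA is applied to a defined similarity function over the words), and the vocabulary and dimensions are toy-sized; the resulting KPCA vectors then initialize the skip gram embedding matrix instead of the usual random initialization.

```python
# Illustrative pipeline sketch: word-similarity kernel -> Kernel PCA -> initial
# embedding matrix for the skip gram model. The similarity function below is an
# assumption for illustration, not necessarily the one used in the post.
import numpy as np
import tensorflow as tf
from sklearn.decomposition import KernelPCA

def ngrams(word, n=3):
    w = f"<{word}>"
    return {w[i:i + n] for i in range(len(w) - n + 1)}

def word_similarity(w1, w2):
    """Jaccard overlap of character n-grams -- a stand-in similarity kernel."""
    a, b = ngrams(w1), ngrams(w2)
    return len(a & b) / len(a | b)

vocab = ["nation", "national", "nationality", "city", "cities", "run", "running"]
K = np.array([[word_similarity(a, b) for b in vocab] for a in vocab])

# Kernel PCA on the precomputed word-similarity matrix.
dim = 4   # 300 in the experiments; kept small here so the toy example runs
kpca = KernelPCA(n_components=dim, kernel="precomputed")
kpca_vectors = kpca.fit_transform(K)

# Use the KPCA embeddings as the starting point of the skip gram embedding matrix.
embeddings = tf.Variable(kpca_vectors.astype(np.float32), name="embeddings")
print(embeddings.shape)   # (len(vocab), dim)
```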

Comparison:

We have evaluated the performance of our Kernel PCA skip gram model by comparing it against the basic skip gram model trained using the same parameters. Since the model follows an unsupervised approach, evaluation can only be done using "intrinsic evaluation tasks", which assess how well the vectors capture word meaning and the relationships between words. This is done by evaluating the models on the following tasks:

  1. Word similarity tasks, which include finding a word's nearest neighbors.
  2. Word analogy tasks, which test the semantic and syntactic relationships between words (both tasks are sketched below).
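Both tasks boil down to cosine-similarity queries over the learned vectors. A toy sketch (illustrative code with hypothetical `vocab` and `embeddings` variables, not the evaluation scripts used for the results below):

```python
# Intrinsic evaluation sketch: nearest neighbors and analogies via cosine similarity.
import numpy as np

def cosine_sim(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def nearest_neighbors(word, vocab, embeddings, k=4):
    """Word similarity: the k words whose vectors are closest to `word`."""
    target = embeddings[vocab[word]]
    scores = {w: cosine_sim(target, embeddings[i]) for w, i in vocab.items() if w != word}
    return sorted(scores, key=scores.get, reverse=True)[:k]

def analogy(a, b, c, vocab, embeddings):
    """Word analogy: a is to b as c is to ?, answered by the vector b - a + c."""
    query = embeddings[vocab[b]] - embeddings[vocab[a]] + embeddings[vocab[c]]
    scores = {w: cosine_sim(query, embeddings[i]) for w, i in vocab.items() if w not in (a, b, c)}
    return max(scores, key=scores.get)

# Toy usage with random vectors (hypothetical vocabulary):
vocab = {w: i for i, w in enumerate(["king", "queen", "man", "woman", "berlin", "germany"])}
embeddings = np.random.randn(len(vocab), 300)
print(nearest_neighbors("king", vocab, embeddings))
print(analogy("man", "king", "woman", vocab, embeddings))   # ideally "queen" with trained vectors
```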

Word similarity tasks:

The k = 4 nearest neighbors generated for 300-dimensional vectors on the 20NewsGroup dataset of 17 million words (a non-morphological language: English).
The k = 4 nearest neighbors for 300-dimensional vectors generated by training the models on a subset of the news.2013.de dataset of 52 million words, for the morphologically rich language German.
Two-dimensional PCA projection of the 300-dimensional skip gram vectors of countries and their capital cities, trained on English for 100,000 iterations on a subset of the Text8 dataset. Countries (red labels) and capitals (green labels).
Two-dimensional PCA projection of the 300-dimensional skip gram vectors of countries and their capital cities, trained on German for 100,000 iterations on a subset of the news.2013.de.shuffled.corpus dataset. Countries (red labels) and capitals (green labels).
Comparison of the word analogy accuracy of the two models trained for 1 epoch using the same parameters with 300-dimensional word vectors. NOTE: for a vocabulary size of 118,000 with KPCA embeddings, we compute the KPCA for 20,000 words and derive the vectors for the additional (out-of-vocabulary) words.
Word analogy accuracy of both models across epochs on the semantic-syntactic data set.
Comparison of the architectures of the two models trained for 1 epoch using the same parameters with 300-dimensional word vectors on an English Wikipedia dump of 1 billion words, with a vocabulary size of 30,000.

t-SNE plots:

t-SNE visualization of the KPCA skip gram embeddings with a vocabulary size of 118,000 words, trained for 100,000 iterations on the Text8 dataset of 17 million words. On close inspection, we find clusters of nationalities, nations, first names, etc. Another important observation is that the nationality and nation clusters lie close to each other.
t-SNE visualization of the basic skip gram model with a vocabulary size of 118,000, trained for 100,000 iterations on the Text8 dataset. In this plot we cannot find any prominent clusters. The poor performance can be attributed to the fact that the model is trained on a relatively small dataset for only 1 epoch; performance ought to improve when the model is trained for more epochs.

From the above results, we can conclude that the KPCA skip gram model performs far better than the basic skip gram model. This can be attributed to the fact that the vectors are already trained for morphological similarity, so the model only needs to learn semantic and syntactic similarity. Since the basic skip gram model learns morphological similarity implicitly, initializing it with these vectors speeds up convergence. The gain is even larger when the language is rich in morphemes, such as German.

Conclusion:

This morphological information is encoded in the Kernel PCA embeddings, which are obtained by applying Kernel PCA to a defined similarity function over the words. When we initialized the skip gram model with these morphology-based embeddings, we found that our model achieves fairly good accuracies in fewer iterations. We also evaluated the embeddings with the intrinsic methods of word similarity and word analogy; in every task, our model performed better than the basic skip gram model. From the evaluation results, we understood that a skip gram model can learn similarity between words based on context only when it has seen a sufficient number of examples. This can be achieved either by training the skip gram model on a huge dataset or by training it on a relatively small dataset for many epochs. When we feed the skip gram model with morphological embeddings, however, it already has a good initial point from which it can improve these embeddings. This works especially well when training on a morphologically rich language, since semantic and morphological similarities converge in such languages; feeding the network with Kernel PCA embeddings therefore gives it a boost, enabling it to learn meaningful word vectors even faster than for non-morphological languages.

We also observe that it is possible to obtain fair-quality word vectors with the Kernel PCA skip gram model in just 1 epoch of training, even on a smaller data set. Since the skip gram model does not perform well when trained on a relatively small data set, our approach can be used in situations where we do not have a huge data set. One such case arises with very recent data, such as news articles. Because current news events change very often, it is difficult to get good word embeddings for new words that become associated with some context. For example, when Donald Trump won the presidential election, newly trained word embeddings for the word 'president' still lay closer to 'Obama' than to 'Trump', because the skip gram model had not seen enough examples where the word 'president' is used in the context of 'Trump'. In such situations, our approach can do better, as we can train on smaller datasets.

For more details, please visit:

