Improving word embedding using Kernel PCA Skip gram model
The Natural Language Processing problems are not trivial, making them “AI-hard problem” in the field of computer science. In order to understand a sentence, one should not only understand the words used in it but also the context in which those words are used since a word can have very different meaning when used in different contexts. This makes language understanding ambiguous.
The solution comes by focusing on the “atomic units” of a language, “words”. To get the precise results, there is a dire need of a numeric vector representation for the words. These word representations help in extracting relations between various human languages, text or speech data.
There are many representations proposed over time:
- One hot vectors
- Skip gram model and CBOW model by Tomas Mikolov
- fastText by Facebook
These approaches does not produce good word vectors when we have a relatively smaller data set since they depend on learning from the context in which the words are used and do not consider the morphological information embedded in the words. FastText has tried to use sub word information but don’t produce high quality word vectors with smaller datasets.
We propose an alternative approach which not only make use of the sub word information but also performs well on smaller datasets. This approach can be applied on both morphological and non-morphological languages. The main challenges we are trying to solve for generating word embeddings are:
- The need to train on a huge text corpus to extract relationship between words.
- Word embeddings for out-of-vocabulary words using n-grams with significant increase in vocabulary size.
- Slow rate of training even after application of optimization techniques such as softmax and negative sampling.
Below, we first discuss nuts and bolts of the algorithm and then compare our model, Kernel PCA (KPCA) skip gram model with word2vec skip gram model trained using same parameters.
Nuts and bolts of the approach:
Kernel Function:
The kernel trick is a mathematical tool which can be applied to any algorithm which consists of a dot product between two vectors. Kernel trick states that to compute the dot products of vectors in the higher dimensional space, a kernel function can be used directly using the lower dimensional vectors. Therefore every dot product can be replaced by a kernel function. A Kernel function, can thus be defined as a function that takes as inputs vectors in the original space and returns the dot product of the vectors in the feature space. The feature space is usually a higher dimensional space where data can be linearly separable.
[2007])
In order for a function to be a valid Kernel function, it should fulfill the “Mercer’s theorem”. According to the Mercers’ theorem, every positive definite symmetric function can be seen as a kernel function. Here a positive definite symmetric functions correspond to a “positive definite symmetric Gram matrix” which can be understood as a matrix which has all positive eigenvalues. Therefore a function which fulfill these conditions can be refereed as kernel function.
We usually, prefer to work with Gaussian Kernel:
The Gaussian RBF kernel is very popular and makes a good default kernel as the feature space of this kernel has an infinite number of dimensions.
Principle Component Analysis:
Principal component analysis(PCA) is a very important tool in data analysis. It helps in reducing a complex data set to a lower dimensional space thus revealing the hidden, dynamics. It helps in getting the most meaningful basis to re-express any noisy data set.
Kernel PCA:
Kernel Principle Component Analysis is the nonlinear form of PCA, which allows to generalize standard PCA to nonlinear dimensionality reduction. In other words, it can be referred as a non-linear dimensionality reduction technique through the use of kernel function. Schölkopf et al. [1997] have shown that the generalization of PCA for non-linear data can be done via a kernel function.
dot product. It implicitly perform PCA in a possibly high-dimensional space F which is
nonlinearly related to input space. (Schölkopf et al. [1997])
Skip gram model:
Skip gram is a model proposed by Mikolov et al., which tries to maximize classification of a word based on the words in the same sentence. In other words, the model predicts the context words given their center word.
Here distance of words to the input word, represented by C, also comes in play. This distance is important since the more distant words are less related to the center word. So one can assign less weight to distant context words thus helps to attain better context words.
Approach:
- Pre-processing the dataset like Text8, 20News set, wiki dump.
- Implementing Kernel PCA on words in a vocabulary.
- Computing Kernel PCA of out-of-vocabulary words.
- Training a word2vec model(Skip gram) with Kernel PCA Em-
beddings in Tensor flow.
Comparison:
We have evaluated the performance of our Kernel PCA skip gram model by comparing it against the performance of the basic skip gram model trained using the same parameters. Since the model follows an unsupervised approach, evaluation of the model can only be done using “intrinsic evaluation tasks”, which assess how well the vectors capture the word meaning and the relationships between those words. This is achieved using evaluating the models on the following tasks:
- word similarity tasks, that include finding a word’s nearest neighbors.
- word analogy tasks, that include calculating the semantic and the syntactic similarity between the words.
Word similarity tasks:
dataset 20NewsGroup of size 17 Million words (Non morphological Language)
tsne plots:
From the above results, we can summarize that KPCA Skip gram Model performs far better than the basic skip gram model. This can be attributed to the fact that we already have trained the vectors for morphological similarities. The model just needs to learn the semantic and syntactic similarity. Since basic skip gram model learns morphological similarity implicitly so initializing the model with those helps in speed up the process of convergence. The performance is even better when the language is rich in morphemes such as German.
Conclusion:
This morphological information is encoded in the Kernel PCA embeddings which are obtained by applying Kernel PCA on some defined similarity function of the words. Therefore when we initialized the skip gram model by these morphological similarity words, we found that our model achieve fairly good accuracies in fewer iterations. We have also evaluated the embeddings by intrinsic methods of “word similarity” and “word analogy”. In every task, our model has performed better than the basic skip gram model. From the evaluation results, we understood that, a skip gram model can learn similarity between words based on context only when it has seen sufficient number of examples. This can be achieved by either training the skip gram model with a huge dataset or training it with a relatively smaller dataset but with numerous epochs. But when we fed the skip gram model with morphological embeddings, it already has a good initial point from where it can improve these embeddings. This works, especially well, when we are training on a morphologically rich language since semantic and morphological similarities converges in these languages therefore feeding the network with Kernel PCA embeddings, gives a boost to the network enabling it to learn meaningful word vectors even faster than the non morphological languages.
We also observe that it is possible to obtain fair quality word vectors using Kernel PCA skip gram model in just 1 epoch of training even on a smaller data set. Since skip gram model does not perform well when trained on a relatively smaller data set, our approach can be used in situations where we don’t have a huge data set. One such case arises when we have a very new data set such as NEWS articles. Because the current news events changes very often, it is difficult to get good word embeddings for new words which get associated with some context. For example: When Donald Trump won the presidential elections, the newly trained word embeddings for the word ‘president’ were still lie closer to ‘Obama’ and not close to ‘Trump’. This is because the skip gram model has not seen enough examples where word ‘president’ is used in context of ‘trump’. In such situations, our approach can do better as we can train on smaller datasets.
For more details, please visit: