Word Embedding: Understanding the Fundamentals
In the "Zero to AI" series, we have already covered vectors, a fundamental building block of machine learning. Today, we will dive into another critical component of Natural Language Processing (NLP): word embeddings.
Why We Need Word Embeddings
- Lower-dimensional representation: Words have complex relationships that are hard to capture with traditional one-hot encoding, which gives every word its own dimension. Word embeddings provide a lower-dimensional representation that can capture the relationships between words.
- Comprehensible representation: One-hot encoded vectors are sparse and provide no insight into the relationships between words. Word embeddings, on the other hand, are continuous and dense, so similarity and distance between words can be measured directly (see the sketch after this list).
- Improved model performance: Most NLP models require numerical inputs, and one-hot encoded vectors are unsuitable for these models. Word embeddings, on the other hand, provide a better representation for NLP models, which can lead to improved performance in tasks such as text classification and sentiment analysis.
- Transfer learning: Word embeddings trained on a large corpus can be used as pre-trained representations for other NLP tasks, reducing the amount of training data required for a new task.
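As a quick illustration of the similarity point above, here is a minimal sketch that compares toy embeddings with cosine similarity; the vectors are invented for illustration, not taken from a trained model.

```python
import numpy as np

def cosine_similarity(a, b):
    # 1.0 means the vectors point in the same direction,
    # 0.0 means they are orthogonal (unrelated in embedding space).
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy 4-dimensional embeddings (illustrative values only).
king = np.array([0.8, 0.3, 0.9, 0.1])
queen = np.array([0.7, 0.4, 0.9, 0.2])
apple = np.array([0.1, 0.9, 0.0, 0.8])

print(cosine_similarity(king, queen))  # high: related words
print(cosine_similarity(king, apple))  # low: less related words
```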
Word Embedding
Word embedding represents words as dense vectors in a continuous vector space, making it easier to perform mathematical operations and similarity analysis. The goal of word embeddings is to represent words in a form that models and algorithms can work with. There are different approaches to generating word embeddings, each with pros and cons.
The first approach is one-hot encoding, where each word is represented as a vector with a 1 in the position corresponding to that word and 0s everywhere else. This approach is simple but has several drawbacks: the vectors grow with the size of the vocabulary, every word is equally distant from every other word, and the encoding gives no information about what the words mean.
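A minimal sketch of one-hot encoding over a toy vocabulary (both the vocabulary and the helper function are invented for illustration):

```python
import numpy as np

# Toy vocabulary; in practice this would be built from the corpus.
vocab = ["the", "cat", "sat", "on", "mat"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    # A vector of zeros with a single 1 at the word's index.
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1
    return vec

print(one_hot("cat"))  # [0. 1. 0. 0. 0.]
print(one_hot("mat"))  # [0. 0. 0. 0. 1.]
# The dot product of any two distinct one-hot vectors is 0,
# so the encoding says nothing about how related two words are.
```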
Another approach is hand-engineered rules, where each word is assigned numbers based on predefined categories and weights. This works for a limited vocabulary but does not scale, since it requires prior knowledge about every word you want to encode.
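For instance, a tiny hand-engineered scheme might score words along a few hand-picked dimensions; the categories and values below are invented purely for illustration:

```python
# Hand-picked feature dimensions: [is_animal, is_royalty, positivity]
# Every value has to be assigned manually, which is why this does not scale.
hand_engineered = {
    "cat":   [1.0, 0.0, 0.5],
    "king":  [0.0, 1.0, 0.6],
    "queen": [0.0, 1.0, 0.6],
}
```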
The next approach uses raw term frequency, where each entry in a document's vector is the raw count of how many times a word appears in that document. This may help distinguish between documents, but very common words like "the" and "is" dominate the counts, while meaningful words that appear only rarely get drowned out.
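A minimal sketch of raw term-frequency vectors, here using scikit-learn's CountVectorizer (assuming scikit-learn is available; the two documents are toy examples):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
print(counts.toarray())
# "the" gets the largest count in both documents, even though it says
# little about what the documents are actually about.
```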
A variation of this approach is TF-IDF, or weighted term frequency, where instead of the raw count, each word's weight is scaled down by how frequently the word appears across all documents. This helps distinguish between documents, but it still provides no information about the meaning of the words.
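The same toy documents with TF-IDF weighting, again assuming scikit-learn:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
]

vectorizer = TfidfVectorizer()
weights = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
print(weights.toarray().round(2))
# Words that appear in every document ("the", "cat") receive a lower idf
# multiplier, so document-specific words ("mat", "dog") gain weight
# relative to them.
```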
Another method is to use context windows, which look at the words around a specific word to understand its meaning. By analyzing co-occurrence, or how often words appear in the same context, a matrix can be built that represents the relationships between words. However, this method requires a large corpus of documents and results in a large, sparse matrix. To handle this, the matrix can be compressed with dimensionality reduction techniques such as singular value decomposition (SVD).
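A minimal sketch of this idea on a toy corpus, with the co-occurrence matrix compressed by truncated SVD (numpy and scikit-learn assumed; the window size and number of components are arbitrary choices):

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

corpus = [["the", "cat", "sat", "on", "the", "mat"],
          ["the", "dog", "sat", "on", "the", "rug"]]
vocab = sorted({word for sentence in corpus for word in sentence})
index = {word: i for i, word in enumerate(vocab)}

# Count how often each pair of words appears within a +/-2 word window.
window = 2
cooc = np.zeros((len(vocab), len(vocab)))
for sentence in corpus:
    for i, word in enumerate(sentence):
        for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
            if i != j:
                cooc[index[word], index[sentence[j]]] += 1

# Reduce the sparse co-occurrence matrix to dense 3-dimensional vectors.
svd = TruncatedSVD(n_components=3, random_state=0)
embeddings = svd.fit_transform(cooc)
print(dict(zip(vocab, embeddings.round(2))))
```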
Finally, we can utilize the power of neural networks to generate word embeddings. The two most common models are CBOW (Continuous Bag of Words) and Skip-gram. CBOW predicts a word given its context window, while Skip-gram predicts the surrounding context words given a word. Skip-gram is the more widely used of the two and is the variant most often associated with the popular word2vec embeddings.
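Both variants are available in libraries such as gensim; the sketch below assumes gensim is installed and uses a toy corpus, so the hyperparameters are illustrative only:

```python
from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of tokens.
sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "sat", "on", "the", "rug"]]

# sg=1 selects Skip-gram; sg=0 would select CBOW.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(model.wv["cat"])                       # the 50-dimensional embedding for "cat"
print(model.wv.most_similar("cat", topn=3))  # nearest neighbours in embedding space
```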
In Skip-gram, a word is fed into the network as a one-hot encoded vector. This vector is multiplied by the weights of a single hidden layer. The model's training objective is to predict, for every word in the vocabulary, the probability that it appears in the context of the input word. The output layer is not what we are ultimately interested in; it is the hidden layer's weight matrix that contains the word embeddings.
The hidden layer is a weight matrix with one row per word in the vocabulary and one column per neuron. The values in this matrix result from forward and backward propagation and gradient descent, which learn the correct weights for each word. Each row is the condensed representation of a word that preserves its context.
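To make the "hidden layer as a lookup table" point concrete, here is a minimal numpy sketch with made-up dimensions (a 5-word vocabulary and 3-dimensional embeddings); in a real model the weights would come from training, not random initialization:

```python
import numpy as np

vocab_size, embedding_dim = 5, 3

# In a trained model this matrix is learned by gradient descent;
# here it is random just to show the mechanics.
W_hidden = np.random.rand(vocab_size, embedding_dim)

# One-hot vector for the word at index 2.
x = np.zeros(vocab_size)
x[2] = 1

# Multiplying a one-hot vector by the weight matrix simply selects
# row 2 of the matrix: that row is the word's embedding.
embedding = x @ W_hidden
print(np.allclose(embedding, W_hidden[2]))  # True
```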
Later in this series, we will discuss CBOW and Skip-gram in more detail.
In conclusion, word embeddings are a powerful tool for representing words in a form that a variety of natural language processing tasks can use. Their main advantage is that they capture the semantic meaning of words; their drawbacks include the computational cost of training and a potential bias towards the corpus they were trained on.