Word Embedding in NLP

Farheenshaukat
5 min read · May 10, 2023


Word embedding is a technique used in natural language processing (NLP) to represent words as dense vectors in a continuous vector space, typically with a few hundred dimensions. These word vectors capture semantic and syntactic relationships between words and are used as inputs to many NLP tasks such as text classification, sentiment analysis, and machine translation.

The traditional approach to representing words in NLP is one-hot encoding, where each word is represented as a sparse vector with a single non-zero element. However, this approach has several limitations, such as the inability to capture semantic similarity between words and the high dimensionality of the resulting feature space.
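
To make this concrete, here is a minimal sketch of one-hot encoding over a toy five-word vocabulary (the vocabulary and words are illustrative assumptions): each vector is as long as the vocabulary, and every pair of distinct words is orthogonal, so the representation carries no notion of similarity.

```python
# A minimal sketch of one-hot encoding with NumPy (toy vocabulary for illustration).
import numpy as np

vocab = ["king", "queen", "man", "woman", "apple"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a sparse vector with a single 1 at the word's index."""
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

king, queen = one_hot("king"), one_hot("queen")
print(king)                 # [1. 0. 0. 0. 0.]  -- length equals vocabulary size
# Distinct words are orthogonal, so dot products carry no similarity information:
# "king" is exactly as unrelated to "queen" as it is to "apple".
print(np.dot(king, queen))  # 0.0
```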

Word embedding techniques, such as Word2Vec, GloVe, and fastText, overcome these limitations by representing each word as a dense vector with a fixed length. These vectors are learned from large corpora of text using neural networks, and are designed to capture the context in which words appear in text.

The resulting word embeddings can be used to perform a variety of NLP tasks, such as finding words with similar meanings, completing analogies, and even generating new text. Word embeddings have become a popular and effective tool in NLP, and have been shown to improve the performance of many NLP models.
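
As a quick illustration of these uses, the sketch below queries a small set of pretrained vectors through gensim's downloader; the specific model name ("glove-wiki-gigaword-50") and the example words are choices made here for illustration, and the vectors are downloaded on first use.

```python
# Sketch: querying pretrained word vectors with gensim's downloader API.
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-50")  # 50-dimensional pretrained GloVe vectors

# Words with similar meanings end up close together in the vector space.
print(wv.most_similar("france", topn=3))

# Analogy completion: king - man + woman ≈ queen (the classic example).
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```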

Word2Vec:

Word2Vec is a family of models that learns distributed representations of words from large corpora of text using neural networks. The basic idea behind Word2Vec is to predict the probability of a word given its context (i.e., the surrounding words) or the probability of a context given a word. The neural network learns to represent words as dense vectors in a continuous vector space such that words that appear in similar contexts have similar vector representations. There are two main architectures in Word2Vec: Continuous Bag-of-Words (CBOW) and Skip-gram. CBOW learns to predict a word given its context, while Skip-gram learns to predict the context given a word. Word2Vec has been shown to be effective at capturing syntactic and semantic relationships between words.
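
As a rough sketch of how this looks in code, the snippet below trains a tiny Word2Vec model with the gensim library (assuming gensim 4.x); the toy corpus and hyperparameter values are illustrative assumptions, and a useful model would need a far larger corpus.

```python
# Sketch: training Word2Vec with gensim on a toy corpus (illustrative only;
# real embeddings require millions of sentences).
from gensim.models import Word2Vec

corpus = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
]

model = Word2Vec(
    sentences=corpus,
    vector_size=50,   # dimensionality of the word vectors
    window=2,         # context window size on each side of the target word
    min_count=1,      # keep every word, even if it appears only once
)

# Each word in the vocabulary is now a dense 50-dimensional vector.
print(model.wv["king"].shape)              # (50,)
print(model.wv.most_similar("king", topn=3))
```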

CBOW and Skip-gram:

CBOW (Continuous Bag-of-Words) and Skip-gram are two architectures for training Word2Vec models, which are neural network models that learn distributed representations of words.

CBOW is a model that predicts the target word from its surrounding context words. The input layer of the CBOW model consists of the one-hot encoded vectors of the context words, whose projections are averaged in the hidden layer, and the output layer is a softmax that predicts a probability distribution over the vocabulary for the target word. The weights learned between the input and hidden layers form the dense vector representations of the words. The CBOW model is several times faster to train than the Skip-gram model and tends to give slightly better representations for frequent words.

Skip-gram, on the other hand, predicts the surrounding context words given the target word. The input layer of the Skip-gram model consists of the one-hot encoded vector of the target word, and the output layer is a softmax that predicts a probability distribution over the vocabulary for each context position. As in CBOW, the weights of the hidden layer form the dense vector representations of the words. The Skip-gram model is slower to train than the CBOW model, but it represents rare words better and works well even with relatively small amounts of training data.
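
To make the two directions concrete, here is a minimal, untrained NumPy sketch of a single CBOW forward pass under simplified assumptions (tiny vocabulary, random weights, full softmax with no negative sampling): the context words are looked up in an input embedding matrix, averaged, projected onto vocabulary scores, and passed through a softmax. Skip-gram reverses the direction, feeding the target word in and scoring the vocabulary for each context position.

```python
# Minimal, untrained CBOW forward pass in NumPy (illustrative sketch only).
import numpy as np

vocab = ["the", "king", "rules", "kingdom", "queen"]
word_to_index = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 8             # vocabulary size, embedding dimension

rng = np.random.default_rng(0)
W_in = rng.normal(size=(V, D))   # input embeddings (one row per word)
W_out = rng.normal(size=(D, V))  # output projection to vocabulary scores

def cbow_forward(context_words):
    """Predict a distribution over the vocabulary from the context words."""
    ids = [word_to_index[w] for w in context_words]
    hidden = W_in[ids].mean(axis=0)        # average the context embeddings
    scores = hidden @ W_out                # one score per vocabulary word
    probs = np.exp(scores - scores.max())  # numerically stable softmax
    return probs / probs.sum()

# After training, the context "the ... rules" should put high probability on "king".
probs = cbow_forward(["the", "rules"])
print(dict(zip(vocab, probs.round(3))))
```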

In practice, the choice of architecture depends on the specific task and the nature of the training data. CBOW is more efficient and is a reasonable default when the corpus is large and dominated by frequent words, while Skip-gram is slower but generally produces better representations for rare words and for smaller corpora.
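
In gensim (4.x), this choice comes down to a single flag, `sg`; the sketch below trains both architectures on the same illustrative toy corpus so their nearest neighbours can be compared (the corpus and hyperparameters are again assumptions, not meaningful training data).

```python
# Sketch: the same toy data trained with both architectures via the sg flag.
from gensim.models import Word2Vec

corpus = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
]

cbow = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=0)      # CBOW
skipgram = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1)  # Skip-gram

# On a real corpus the two models give noticeably different neighbours,
# especially for rare words; on this toy data the differences are mostly noise.
print("CBOW:     ", cbow.wv.most_similar("king", topn=2))
print("Skip-gram:", skipgram.wv.most_similar("king", topn=2))
```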

GloVe:

GloVe (Global Vectors for Word Representation) is another word embedding technique that learns vector representations of words from co-occurrence statistics in a corpus of text. The basic idea behind GloVe is to construct a matrix of word co-occurrence counts and then factorize this matrix to obtain low-dimensional vector representations of words. GloVe is based on the observation that ratios of word co-occurrence probabilities contain useful information about the relationships between words. GloVe has been shown to be effective at capturing both semantic and syntactic relationships between words, and it has been used in a variety of NLP tasks, such as sentiment analysis, machine translation, and question answering.
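
GloVe models are usually trained with the original Stanford implementation rather than written from scratch, but the co-occurrence counts it factorizes are easy to sketch; the toy corpus and the symmetric window of two words below are illustrative assumptions.

```python
# Sketch: building the word co-occurrence counts that GloVe factorizes
# (toy corpus and a symmetric window of 2 are illustrative assumptions).
from collections import defaultdict

corpus = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
]
window = 2

cooccur = defaultdict(float)
for sentence in corpus:
    for i, word in enumerate(sentence):
        for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
            if i != j:
                # GloVe weights nearby co-occurrences more heavily (1 / distance).
                cooccur[(word, sentence[j])] += 1.0 / abs(i - j)

print(cooccur[("king", "rules")])  # "king" and "rules" co-occur within the window
```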

Conclusion:

Both Word2Vec and GloVe have their own strengths and weaknesses, and which one to use depends on the specific task and data at hand. Overall, both techniques have transformed the field of NLP by enabling us to represent words as dense vectors in a continuous vector space and to use these vectors as input to a wide range of NLP models.

Thanks
