Learning The Relationship Between Words in NLP: Power Of Word Embeddings

Mohaddeseh Tabrizian
4 min read · Jul 10, 2023


1. Introduction

If you are interested in NLP tasks such as machine translation, sentiment analysis, and text classification, reading about word embeddings is a great way to start. The development of word embeddings has been a breakthrough in NLP, allowing models to better understand the meaning of words. Word embeddings offer a valuable and widely adopted approach for representing words in various NLP tasks due to their simplicity, efficiency, and compatibility with a wide range of models.

2. What are word embeddings?

Word embeddings are vector representations of words that we feed into a model so it can learn the meaning of, and the relationships between, those words. These embeddings capture the relationships between words. For instance, words like “Man” and “Woman” are both associated with the concept of gender, so they have large values along a gender-related dimension, and those values have opposite signs to reflect their contrasting genders. It is important to note that while dimensions such as gender and age are named here for clarity, a trained model learns more nuanced and abstract concepts associated with words, rather than these specific, human-readable features.
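To make this concrete, here is a minimal sketch with invented embedding values. The dimension labels and numbers are made up purely for illustration (and it assumes NumPy is installed); it only shows how the geometry of the vectors can reflect relationships between words:

```python
import numpy as np

# Toy embedding table with made-up values, purely for illustration.
# Each column loosely corresponds to a concept such as gender or royalty,
# although a trained model learns abstract dimensions rather than named ones.
embeddings = {
    #            gender  royal    age   food
    "man":   np.array([-1.00,  0.01, 0.03, 0.02]),
    "woman": np.array([ 1.00,  0.02, 0.02, 0.01]),
    "king":  np.array([-0.95,  0.93, 0.70, 0.02]),
    "queen": np.array([ 0.97,  0.95, 0.69, 0.01]),
    "apple": np.array([ 0.00, -0.01, 0.03, 0.95]),
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine_similarity(embeddings["king"], embeddings["queen"]))  # similar on royalty/age, opposite on gender
print(cosine_similarity(embeddings["apple"], embeddings["king"]))  # low: unrelated concepts

# The famous analogy also falls out of this geometry (with these toy numbers):
analogy = embeddings["king"] - embeddings["man"] + embeddings["woman"]
print(cosine_similarity(analogy, embeddings["queen"]))  # close to 1.0
```

In a real model the dimensions are not labeled, but similar vector arithmetic still tends to hold, which is where analogies like king − man + woman ≈ queen come from.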

3. What problem have word embeddings solved?

Before word embeddings, approaches like one-hot encoding and bag-of-words were commonly used to represent words in NLP tasks. However, these approaches suffered from high dimensionality, and they treated every pair of words as equally unrelated. Word embeddings addressed these problems by representing words as dense vectors in a lower-dimensional space. These embeddings capture semantic information, so similar words end up with similar representations, leading to more efficient and more informative representations for NLP tasks.

In one-hot encoding, each word in a vocabulary is represented as a binary vector in which only one element is 1 and the rest are 0. The length of this vector is equal to the size of the vocabulary, resulting in a high-dimensional space. For example, if the vocabulary has 100,000 words, each word would be represented by a vector of length 100,000 (⊙ _ ⊙ ).
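As a quick illustration, here is a minimal one-hot encoding sketch over a tiny, made-up vocabulary:

```python
# A minimal sketch of one-hot encoding over a tiny, made-up vocabulary.
vocab = ["cat", "dog", "cute", "is", "this"]          # a real vocabulary might have 100,000 entries
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a binary vector with a single 1 at the word's index."""
    vector = [0] * len(vocab)
    vector[word_to_index[word]] = 1
    return vector

print(one_hot("cat"))   # [1, 0, 0, 0, 0]
print(one_hot("cute"))  # [0, 0, 1, 0, 0]
# Every pair of one-hot vectors is orthogonal, so "cat" is no closer to "dog" than to "cute".
```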

Similarly, the bag-of-words model treats a document as an unordered “bag” of its words and represents it as a sparse vector with one dimension per vocabulary word. The value of each dimension is the frequency (or presence) of the corresponding word in the document, which again results in high-dimensional, sparse feature vectors.
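And a correspondingly minimal bag-of-words sketch, reusing the same toy vocabulary:

```python
from collections import Counter

# A minimal bag-of-words sketch: each document becomes a sparse count vector
# over the same made-up vocabulary used above. Word order is discarded.
vocab = ["cat", "dog", "cute", "is", "this"]

def bag_of_words(document):
    """Count how often each vocabulary word appears in the document."""
    counts = Counter(document.lower().split())
    return [counts[word] for word in vocab]

print(bag_of_words("This cat is cute"))        # [1, 0, 1, 1, 1]
print(bag_of_words("This dog is a cute dog"))  # [0, 2, 1, 1, 1]
```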

4. Two Models for Learning Word Embeddings

4.1. Word2vec

There are two variants of word2vec: skip-gram and CBOW (Continuous Bag of Words).

Skip-grams:

With context and target words, we create a supervised learning problem. For example, in the sentence “This cat is cute”, the word “cat” is the target (center) word, and with a window of +/- 2 words, “this”, “is”, and “cute” are its context words. After selecting these pairs, we train the model to predict the context words that surround a given target word. We choose a central word and a window size; the window size determines how far from the central word a context word may lie, whereas a traditional bigram model could only consider consecutive pairs of words like “This cat”, “cat is”, and “is cute”. This ability to skip over neighbouring words is what gives skip-grams their name and adds flexibility to the model.
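Here is a minimal sketch of how (target, context) training pairs could be generated for skip-grams. The sentence and tokenization are toy examples, not a full training pipeline:

```python
# A minimal sketch of generating skip-gram training pairs from a sentence.
# For every target (center) word, the model learns to predict each word
# that falls within the window around it.
def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, target in enumerate(tokens):
        # Look up to `window` words to the left and right of the target.
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

print(skipgram_pairs("this cat is cute".split(), window=2))
# With "cat" as the target, the pairs include ("cat", "this"), ("cat", "is"), ("cat", "cute").
```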

CBOW:

In the Continuous Bag of Words (CBOW) model, the same idea of (context, target) pairs remains, but the direction of prediction is reversed: the central word of a chosen window is the target, the remaining words within the window are the context, and the model predicts the target from its context. For example, with a window size of 4, we select 4 consecutive words from the text, choose the middle one as the target word, and treat the other 3 words as context words.
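In practice, both variants are usually trained with an existing library rather than from scratch. Here is a hedged usage sketch (assuming gensim 4.x is installed; the corpus is a toy example, so the resulting vectors are not meaningful) where the `sg` flag switches between the two variants:

```python
from gensim.models import Word2Vec

# A tiny toy corpus: a list of tokenized sentences.
sentences = [
    "this cat is cute".split(),
    "this dog is cute".split(),
    "the cat chased the dog".split(),
]

skipgram_model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)  # sg=1 -> skip-gram
cbow_model     = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)  # sg=0 -> CBOW (default)

print(skipgram_model.wv["cat"][:5])       # first 5 dimensions of the learned vector for "cat"
print(cbow_model.wv.most_similar("cat"))  # nearest neighbours under the CBOW embeddings
```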

4.2. GloVe (Global Vectors for Word Representation)

Unlike word2vec, GloVe uses both local context and global co-occurrence statistics, resulting in embeddings that capture semantic and syntactic relationships both locally and globally👏🏻. The GloVe model produces word vectors in which each word is represented as a point in a dense vector space. These vectors can be used in various natural language processing tasks, such as word similarity calculations, text classification, and language generation. At a high level, training proceeds as follows:

  • The first step is to build a co-occurrence matrix that counts how often pairs of words appear near each other in the text (see the sketch after this list).
  • Next, a word-word association matrix is derived from the co-occurrence matrix. This matrix contains information about the probabilities of word co-occurrences.
  • Randomly initialize a word vector for each word in the vocabulary.
  • Iteratively optimize the word vectors so that the dot product of two word vectors approximates the logarithm of their co-occurrence count.
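Here is a minimal sketch of that first step, building a windowed co-occurrence matrix over a toy corpus. The 1/distance weighting shown is one common choice, not the only one:

```python
from collections import defaultdict

# A minimal sketch of the first GloVe step: counting how often word pairs
# co-occur within a fixed window across a (toy) corpus.
corpus = [
    "this cat is cute".split(),
    "this dog is cute".split(),
]

cooccurrence = defaultdict(float)
window = 2
for tokens in corpus:
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                # Nearby words are often weighted more heavily (1 / distance).
                cooccurrence[(word, tokens[j])] += 1.0 / abs(i - j)

print(cooccurrence[("cat", "cute")])  # non-zero: they appear within the same window
print(cooccurrence[("cat", "dog")])   # zero: they never co-occur in this toy corpus
```

The remaining steps then fit the randomly initialized word vectors to these counts, which is what ties the local window statistics to the global structure of the corpus.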

5. Conclusion

Word embedding models such as GloVe and Word2Vec are indeed powerful tools in NLP. However, once we have developed a good understanding of these models, it is worthwhile to explore more powerful and widely used models like ELMo, GPT, and BERT😇. These models have significantly advanced the field of NLP and deepened machines' understanding of natural language.
