Understanding Word Embeddings and Their Role in Natural Language Processing


If you are reading this article, then you are probably, just like me, astonished by the incredible results achieved by modern NLP engines, most famously ChatGPT. If these words are new to you, allow me to explain them: NLP, or Natural Language Processing, is a field of Artificial Intelligence whose aim is to work with ordinary human language. Think about the voice assistant on your smartphone that you probably do not care about, or the powerful and already mentioned ChatGPT, an impressive AI capable of human-level communication (you are invited to engage in a conversation with it here). With these concepts in mind, the goal of this article is to highlight a very clever technique that encodes natural human language in a way that enables many NLP projects to make sense of it.

First, a few words about the problem of working with human language: the theory behind NLP relies heavily on quantitative analysis, such as statistics and machine learning methods, to process and analyze large amounts of data. This means that the linguistic domain of our problem somehow needs to be represented numerically so that a machine can work with it. Here lies the challenge: consider the mesmerizing complexity of human language and it becomes clear how tough a task it is to model it as a numerical representation.

Let’s try to address the natural language representation problem with a very simple approach: suppose the vocabulary we will be working with has size m. We can then represent each word by a sparse vector with m dimensions, containing m - 1 zeros and a single 1. Each word is defined by the index at which the 1 lies. An example of this encoding technique, known as one-hot encoding, can be seen below:

Figure 1: One Hot Encoding
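
To make the idea concrete, here is a minimal sketch of one-hot encoding in Python; the tiny vocabulary and the indexing helper are made up purely for illustration.

```python
# A minimal sketch of one-hot encoding over a tiny, made-up vocabulary.
import numpy as np

vocabulary = ["king", "queen", "man", "woman", "boy"]           # m = 5 words
word_to_index = {word: i for i, word in enumerate(vocabulary)}  # word -> index

def one_hot(word: str) -> np.ndarray:
    """Return an m-dimensional vector of zeros with a single 1 at the word's index."""
    vector = np.zeros(len(vocabulary))
    vector[word_to_index[word]] = 1.0
    return vector

print(one_hot("queen"))  # [0. 1. 0. 0. 0.]
```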

While this approach successfully encodes each word of our vocabulary into a unique vector, its limitations are clear: not only are the vectors high-dimensional and sparse, which makes it harder for a model to learn from them, but the representation also does not encode any semantic meaning or relationship between words. As the Austrian philosopher and linguist Ludwig Wittgenstein put it, “the meaning of a word is its use in the language”. Therefore, a decent language encoding must consider how words relate to each other, and that is where word embeddings come in. With this technique, we assign each word a dense, lower-dimensional vector in such a way that the relationships between words are preserved in the continuous vector space. Although this article won’t go into the details of how words are assigned to vectors, you can read more about how the word2vec algorithm handles that here.
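
If you want to experiment with training embeddings yourself, one option is the gensim library’s Word2Vec implementation. The sketch below only illustrates the API under the assumption that gensim is installed; the toy corpus is far too small to produce meaningful vectors.

```python
# A hedged sketch of training word2vec-style embeddings with gensim (pip install gensim).
from gensim.models import Word2Vec

# A toy corpus: each sentence is a list of tokens. Real training needs far more text.
corpus = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "boy", "plays", "with", "the", "man"],
]

# vector_size controls the embedding dimension; window is the context size.
model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, epochs=20)

print(model.wv["king"].shape)  # (50,) -- a dense, low-dimensional vector
```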

Although a single word embedding seen in isolation might look like a vector filled with random numbers, if we analyze a broader set of embeddings we will notice some interesting patterns. A fascinating property of a set of word embeddings is its capacity to encode similarity between words. Empirically, we can see that similar words have similar vectors, and their similarity can be measured with the Euclidean distance: lower distances indicate higher similarity, and higher distances indicate the opposite. For instance, abusing terminology by referring to a word and its embedding interchangeably, ‘boy’ is closer to ‘man’ than to ‘woman’ in the vector space. Visually, we can see that similar words are close together in the following example of a 2D word embedding space.

Figure 2: Vector Space
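
As a small illustration of measuring similarity, the sketch below uses invented 2D embeddings (nothing here comes from a trained model) and the Euclidean distance.

```python
# Measuring word similarity with the Euclidean distance over made-up 2D embeddings.
import numpy as np

embeddings = {
    "boy":   np.array([0.9, 0.8]),
    "man":   np.array([1.0, 0.9]),
    "woman": np.array([0.2, 0.9]),
}

def euclidean(a: str, b: str) -> float:
    """Smaller distance means the two words are more similar."""
    return float(np.linalg.norm(embeddings[a] - embeddings[b]))

print(euclidean("boy", "man"))    # small distance -> similar
print(euclidean("boy", "woman"))  # larger distance -> less similar
```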

Amazing, right? But we can go further: linear operations between vectors can represent word relationships that go beyond semantic similarity. One classic example is the following: the difference between ‘man’ and ‘woman’ ends up being close to the difference between ‘king’ and ‘queen’. Since what differentiates the words in each of these pairs is the concept of gender, our embeddings may be clever enough to figure that out if trained well. This property enables us to perform analogy reasoning: suppose you forgot the name of the capital of France and could only count on the embeddings to find it. One way to do this is to subtract a pair of vectors that captures the capital-to-country relation, such as ‘Rome’ and ‘Italy’, and add the result to ‘France’. There is a good chance that the resulting vector is close to the embedding of the word you are looking for, in this case ‘Paris’. Again, it is important to remember that these properties only hold if the embeddings are properly trained, which includes choosing appropriate hyperparameters and feeding the model a large text dataset.

Figure 3: Linear Relationships between embeddings
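
Here is a minimal sketch of that analogy reasoning with invented 2D vectors, purely to show the arithmetic; in a real setting you would use trained embeddings and typically exclude the query words from the nearest-neighbour search.

```python
# Analogy reasoning with toy vectors: 'Rome' - 'Italy' + 'France' should land near 'Paris'.
import numpy as np

embeddings = {
    "Rome":   np.array([0.8, 0.3]),
    "Italy":  np.array([0.7, 0.1]),
    "France": np.array([0.2, 0.1]),
    "Paris":  np.array([0.3, 0.3]),
    "Berlin": np.array([0.5, 0.4]),
}

# Subtract the capital-to-country pair and add it to 'France'.
query = embeddings["Rome"] - embeddings["Italy"] + embeddings["France"]

# Pick the word whose embedding is closest to the query vector.
closest = min(embeddings, key=lambda word: np.linalg.norm(embeddings[word] - query))
print(closest)  # 'Paris' with these made-up vectors
```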

Yes, these results are mind-blowing, but keep in mind that this technique is far from perfect. One of its main flaws is that the embeddings are based on the relationships between words in a particular dataset, so the generated embeddings may not generalize well to other datasets. Not only that, but we are also unable to encode the full meaning of a word: our embeddings lack the capacity to capture the full context and connotations a word has; we only encode certain word relationships. Nevertheless, there are more modern encoding techniques that try to overcome these flaws. A really cool one that I encourage you to read more about is BERT (Bidirectional Encoder Representations from Transformers).

Well, I will cut it short here. I hope you are as fascinated as I was when I first read about word embeddings. Feel encouraged to engage in your own NLP research and experimentation, and see you in the next blog post!

João Pedro Ferreira Pereira

References:

[1] Introduction to Word Embedding and Word2Vec, available at <https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa>

[2] Efficient Estimation of Word Representations in Vector Space, available at <https://arxiv.org/pdf/1301.3781.pdf>

[3] How Does BERT NLP Optimization Work?, available at <https://www.turing.com/kb/how-bert-nlp-optimization-model-works>

[4] An Introduction to Word Embeddings for Text Analysis, available at <https://www.shanelynn.ie/get-busy-with-word-embeddings-introduction/>

[5] Word2Vec Explained, available at <https://towardsdatascience.com/word2vec-explained-49c52b4ccb71>
