‘words are numbers’ — A quick introduction to Word Embeddings

Saksham
5 min read · Jul 19, 2023

--

“Hey ChatGPT, how do I [secrets of the universe]?” Sounds familiar, right? But have you ever wondered how the hell a chat-bot understands what you are saying? I don’t think I can do justice to that question in this one article, but I can at least leave you with some of the techniques used in Natural Language Processing that will give you a better idea of this wild and amazing world of language processing. Today we start with word embeddings: what they are and how they are computed.

I’ll let you in on a lil secret: computers don’t actually understand a word you say :( They only understand numbers, and to be honest, I don’t understand numbers, and neither does anyone else [for the most part]. That leaves a big divide between us and computers. Luckily, people are smart, and smart people have come up with ingenious ways to close that gap. One of those techniques is word embedding.

Before word embedding, language analysis used techniques like One-Hot Representation.

It will be easier to explain this with an example. Suppose you have the sentence “It rains here, yet again”. We can represent “rains” as [0, 1, 0, 0, 0], placing a 1 at the position of the word in the sentence. Now, this representation is something, and something is always better than nothing. It and the representations that followed, including TF and TF-IDF (more on both of them some other day, I promise), got us moving forward, but only so far, because all of these numeric representations of words suffer from one very critical problem: context. Context drives language; it tells you whether the bark was from a tree or a dog. Words are numbers, sure, but they are not just digits: the numbers have to carry meaning. So our quest to convert words into numbers demands a way to carry the context of a word into its numeric representation. This problem was eventually solved (almost completely) by researchers at Google in their 2013 paper Efficient Estimation of Word Representations in Vector Space, which introduces a new model for vector representation of words called Word2Vec.
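If you like seeing ideas in code, here is a minimal sketch of the one-hot idea from the example sentence above. The vocabulary is built straight from the sentence just for illustration; a real pipeline would build it from a whole corpus.

```python
# A minimal sketch of one-hot representation for the example sentence above.
import numpy as np

sentence = "it rains here yet again".split()
vocab = {word: idx for idx, word in enumerate(sentence)}

def one_hot(word, vocab):
    """Return a vector of zeros with a single 1 at the word's index."""
    vec = np.zeros(len(vocab))
    vec[vocab[word]] = 1
    return vec

print(one_hot("rains", vocab))  # [0. 1. 0. 0. 0.]
```

Notice that every vector is equally far from every other vector, which is exactly why this representation carries no context.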

(The image above gives us an idea about context: see how king and man are close to each other, as well as woman to queen. This is no coincidence; that is the power of embeddings.) [By Singerep — Own work, CC BY-SA 4.0]

Word2vec can utilize either of two model architectures to produce these distributed representations of words: continuous bag-of-words (CBOW) or continuous skip-gram. In both architectures, word2vec considers both individual words and a sliding window of context words surrounding them as it iterates over the entire corpus. Wait, that’s a lot to take in, so let us take a step back.

Word2vec creates a context window around every word in the corpus, one word at a time, and uses those surrounding words to form a vector representation of each word by calculating conditional probabilities of the word with respect to its context words, which help ‘predict’ the occurrence of the word. The maths gets complicated and we won’t get into it in this blog, but the sketch below should give you a basic idea. (Note: during training, each word is treated as both a context word and a center word.)
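To make that sliding window concrete, here is a rough sketch of how (center, context) training pairs could be generated. The window size of 2 and the toy sentence are arbitrary choices for illustration, not the exact setup from the paper.

```python
# A rough sketch of generating (center, context) training pairs with a
# sliding window of 2 words on each side of the center word.
def context_pairs(tokens, window=2):
    pairs = []
    for i, center in enumerate(tokens):
        # take up to `window` words on each side of the center word
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = "the cat chased the mouse".split()
print(context_pairs(tokens))
# includes pairs like ('chased', 'cat'), ('chased', 'the'), ('chased', 'mouse')
```

Every word gets its turn as the center word, which is exactly the note above about each word playing both roles.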

As mentioned above, Word2vec uses either of two models for this vector creation. We will first talk about CBOW, or continuous bag-of-words.

CBOW —

The continuous bag-of-words (CBOW) model is a neural network architecture used in natural language processing applications like language translation and text classification. Its primary objective is to predict a central word from the contextual information provided by the surrounding words. Given a window of neighboring words as input, the CBOW model tries to predict the target word at the center of the window. Through training on large text datasets, the model learns to make accurate predictions by picking up on patterns in the input data.

“The cat chased the [?]”: most of us would expect the word mouse to fill the [?]. This is pretty much what CBOW aims to do, although there is much more going on under the hood.
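If you want to play with CBOW without writing the maths yourself, the gensim library ships a Word2Vec implementation where sg=0 selects the CBOW architecture. The tiny corpus below is made up purely to show the API shape; it is far too small to learn meaningful embeddings.

```python
# A quick CBOW sketch using gensim (sg=0 selects CBOW).
from gensim.models import Word2Vec

corpus = [
    "the cat chased the mouse".split(),
    "the dog chased the cat".split(),
    "the mouse ran from the cat".split(),
]

model = Word2Vec(
    sentences=corpus,
    vector_size=50,   # dimensionality of the word vectors
    window=2,         # context words on each side of the center word
    min_count=1,      # keep every word, even ones seen only once
    sg=0,             # 0 = CBOW, 1 = skip-gram
)

print(model.wv["cat"][:5])           # first few dimensions of the "cat" vector
print(model.wv.most_similar("cat"))  # nearest words by cosine similarity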

Next, we will be talking about continuous skip-grams.

Skip-gram

Skip-gram models take the exact opposite approach to CBOW: instead of predicting a word from its context, the skip-gram model focuses on figuring out the words around a particular word. It does this by looking at a lot of sentences and trying to guess the nearby words for each target word. By doing this over and over again across the corpus, the skip-gram model learns connections between words and produces vectors that capture the meaning and relationships of words in a language.

Let’s see: if I ask you to think of a color, and then ask your friend to predict what you will say, they MIGHT get it (there is a high chance they won’t), but we can be pretty sure they will come up with some color and not, say, a shape. You see, some words have a higher probability of being the correct answer than others, and this is in essence what skip-gram exploits.
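Switching the gensim sketch above to skip-gram is a single flag. Again, the toy corpus is only an assumption for demonstration; comparing the neighbours each mode finds on a real corpus is a nice way to feel the difference.

```python
# Same toy setup as the CBOW sketch, but with sg=1 to train skip-gram:
# now each center word is used to predict its surrounding context words.
from gensim.models import Word2Vec

corpus = [
    "the cat chased the mouse".split(),
    "the dog chased the cat".split(),
    "the mouse ran from the cat".split(),
]

skipgram = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1, sg=1)

# words the model considers most "context-like" for a given center word
print(skipgram.wv.most_similar("cat", topn=3))
```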

Although there are many more techniques, methods and ideas used in NLP, the idea of representing words as vectors carrying semantic information has revolutionized the field of language understanding and processing. Word embeddings have enabled significant advancements in various language-related tasks and offer insights into word relationships and biases. As research continues, they hold great promise for enhancing human-computer interaction, information retrieval, and language understanding in diverse domains. Their potential to unlock the depths of linguistic understanding makes them a key component in the future of natural language processing.

--

Saksham

pretty interested in ML, natural sciences and the world around me