Techniques to represent text as numbers for NLP

Michael (Misha) Smirnov
4 min read · Jan 12, 2022


Word Embeddings Visualized

We can’t just take a paragraph and feed it directly into a mathematical algorithm; the characters, words, or phrases need to be converted into numbers. There are a number of approaches for doing this, and each has its own upsides and downsides depending on your needs.

One-Hot encoding
This one’s simple: each word or character gets a number, and each number is represented as a vector with a single one and zeros everywhere else. For example, the phrase below has three words:

Inga goes hiking

Each word can be converted to a number (1, 2, 3), and each number to a one-hot vector [1 0 0], [0 1 0], [0 0 1]. We have a vocabulary of 3, so no other words can be encoded with these vectors. Now, we can make a new sentence:

Hiking Inga - [0 0 1], [1 0 0]

It’s simple, but it’s bad for large vocabularies since the vectors get huge.
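As a minimal sketch in Python (assuming the three-word vocabulary above; the helper function is just for illustration):

```python
# One-hot encoding with a tiny, fixed vocabulary.
vocab = {"inga": 0, "goes": 1, "hiking": 2}

def one_hot(word):
    vec = [0] * len(vocab)        # one slot per vocabulary word
    vec[vocab[word.lower()]] = 1  # flip the slot for this word to 1
    return vec

print(one_hot("Inga"))    # [1, 0, 0]
print(one_hot("hiking"))  # [0, 0, 1]
```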

Bag of Words
Common for text classification problems, bag of words represents the text under consideration as a collection of words while ignoring order and context. Each word gets a unique integer, and each document is represented by a vector of numbers representing word counts. Thus, the phrase “Inga goes hiking” could be represented as [1 1 1], and the phrase “Inga hiking goes hiking hiking” could be [1 1 3].

Bag of words is simple to understand and implement, and it works well for comparing text documents for common features (are these articles both about sports?). However, the vector sizes get large with a big vocabulary, the numbers don’t capture similarity between word meanings, and the relationships between words are lost (“man eats dog” is treated the same as “dog eats man”).
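Here’s a rough sketch using scikit-learn’s CountVectorizer (note that it builds its vocabulary in alphabetical order, so the columns come out as goes, hiking, inga rather than the word order above):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Inga goes hiking", "Inga hiking goes hiking hiking"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)    # one count vector per document

print(vectorizer.get_feature_names_out())  # ['goes' 'hiking' 'inga']
print(counts.toarray())                    # [[1 1 1]
                                           #  [1 3 1]]
```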

Bag of N-grams
Instead of breaking text up into words, bag of n-grams breaks up text into phrases. Each phrase, or chunk of n contiguous words, is referred to as an n-gram. The vocabulary becomes a collection of n-grams, and like bag of words, the vector for each document holds the number of times each n-gram in the vocabulary appears.

N-grams capture some context where bag of words does not. However, the dimensionality and sparsity of the dataset increases rapidly as the n of the n-gram increases, and there is no good way of addressing new data outside of the vocabulary.
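The same scikit-learn sketch extends naturally to n-grams; here ngram_range=(1, 2) keeps both single words and two-word phrases (the example sentences are made up):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Inga goes hiking", "Inga goes home"]

vectorizer = CountVectorizer(ngram_range=(1, 2))  # unigrams and bigrams
counts = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
# ['goes' 'goes hiking' 'goes home' 'hiking' 'home' 'inga' 'inga goes']
print(counts.toarray())
# [[1 1 0 1 0 1 1]
#  [1 0 1 0 1 1 1]]
```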

TF-IDF
Term frequency-inverse document frequency. TF-IDF treats some words as more important than others. It quantifies the importance of a given word relative to other words in the document and in the corpus. Words that appear often in one document and rarely in others are considered important to the meaning of that document, and thus get assigned a higher score. Words that have similar frequencies across documents are given lower scores.

Term frequency, TF, is how often a word appears in a document.
Inverse document frequency, IDF, is a measure of how rare a word is across the corpus; one common form is log(N / n), where N is the total number of documents and n is the number of documents containing the word. The product of TF and IDF gives us a TF-IDF score for a word, and each document is encoded as a vector of the TF-IDF scores for all the words in the document.

TF-IDF scores are great for calculating similarity between documents, and are often used for information retrieval and text classification. Like the previous techniques, though, it produces oversized vectors and cannot handle words outside of the given vocabulary.
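A quick sketch with scikit-learn’s TfidfVectorizer, using cosine similarity to compare the resulting document vectors (the three documents are just made-up examples):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "Inga goes hiking in the mountains",
    "Inga goes shopping in the city",
    "the mountains are great for hiking",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)  # one TF-IDF vector per document

# Pairwise similarity matrix: higher values mean more shared, distinctive words.
print(cosine_similarity(tfidf))
```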

Word Embeddings
The embedding of a word is meant to best capture its meaning in a low-dimensional space. Word2vec, introduced in 2013, essentially brought word embeddings into mainstream NLP, and allowed for capturing word analogy relationships such as king − man + woman ≈ queen.

Word2vec can capture relational information between words

The algorithm creates vectors for each word in a corpus to quantify its relationship to other words around it, as well as its general meaning. If two words appear in a similar context, such as boy and girl, they likely have similar meanings.
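As a sketch of what this looks like in practice, here’s how you might poke at pretrained vectors with gensim’s downloader (GloVe vectors, a closely related embedding method, are used here only because they’re a small download; the exact neighbors you get back will vary by model):

```python
import gensim.downloader as api

# Load a small set of pretrained GloVe vectors via gensim's downloader.
vectors = api.load("glove-wiki-gigaword-50")

# Nearest neighbors in embedding space.
print(vectors.most_similar("hiking", topn=3))

# The classic analogy: king - man + woman ≈ queen
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```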

Word embeddings tend to be pretrained, and can be downloaded without wasting resources training them yourself on something like all of Wikipedia. Embeddings are robust and practical. However, they still run into the out-of-vocabulary problem: newly introduced words don’t mesh with the model. This can be remedied by using character and word n-grams together, so new words are built from existing parts. This works well for portmanteaus (words made from other words); Facebook’s fastText is one such character n-gram algorithm (and a clumsy portmanteau itself!).
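A minimal sketch of that idea using gensim’s FastText implementation (the toy sentences and training parameters are just for illustration):

```python
from gensim.models import FastText

sentences = [
    ["inga", "goes", "hiking"],
    ["inga", "likes", "the", "mountains"],
]

# Tiny toy model; real models are trained on far more text.
model = FastText(sentences, vector_size=32, window=3, min_count=1, epochs=50)

# "hikes" never appears in the training data, but its character n-grams
# overlap with "hiking", so fastText can still build a vector for it.
print(model.wv["hikes"][:5])
```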

Even so, word embeddings on their own do not take word order into account when summarizing a document.

Document Embeddings
Tools like Doc2vec build on word embeddings to create embeddings of entire paragraphs, thus retaining word context. This method has been applied in text classification, document tagging, text recommendation systems, and simple chatbots (the crap ones that do little more than read back an FAQ).
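Here’s a rough sketch with gensim’s Doc2Vec (again, toy data and arbitrary parameters, just to show the shape of the API):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [
    TaggedDocument(words=["inga", "goes", "hiking"], tags=["doc0"]),
    TaggedDocument(words=["inga", "goes", "shopping", "downtown"], tags=["doc1"]),
]

model = Doc2Vec(corpus, vector_size=32, min_count=1, epochs=40)

# Infer an embedding for a new, unseen paragraph.
new_vec = model.infer_vector(["hiking", "in", "the", "mountains"])
print(new_vec[:5])
```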

Embeddings are a powerful tool, but they still come with the same problem as many other machine learning techniques: they rely heavily on the data they’re pretrained on. For example, depending on what you trained with, apple could be related more to microsoft, or to an orange. Still, embeddings are the tool that many of the latest and greatest models are built on. Now go and code some up!

  • and don’t forget to follow me on twitter @SaladZombie
