Text vectorization algorithms in NLP

Algorithms for converting text sequences to numerical vectors

Mehul Gupta
Data Science in your pocket

--


As we all know, we can’t feed text directly into ML models for any NLP problem. So, as a usual practice, we convert these text sequences to numerical arrays in some way or the other during pre-processing. The vectorization method being used can have a great impact on your final model performance, so it’s an important step that requires some attention. In this post, we will discuss a few such methods and how they can help boost the final results.

My debut book “LangChain in your Pocket” is out now

What to expect from this post

One Hot Encoding

BoW & BoN

TF-IDF

Word2Vec (CBoW & Skip Gram)

fastText

GloVe

Doc2Vec (Distributed Memory & Distributed BoW)

Some jargon

Distributional similarity: It relates to the concept where the meaning of a word is interpreted from the context in which it is used. For example: ‘He went to a fair’ & ‘This wasn’t fair to him’. Here, the two occurrences of ‘fair’ have different meanings given the different sentences.

Distributional hypothesis: It hypothesizes that words used in similar contexts are similar. For example, Both ‘pineapple’ & ‘banana’ when used in a similar context (maybe around the activity of eating), should be considered similar objects (as at the root level, they belong to the same family i.e. something that can be eaten). Hence, their vector representations should also be similar.

Distributional representation: Word representations (vectors) we get based on the context (by context, we mean vicinity/neighboring words) are called Distributional representations. A few common features of such representations:

  • They are usually based on some sort of statistic calculated over the context
  • Easily interpretable, i.e. not a black box
  • Usually of very high dimension
  • Sparse in nature

Distributed representations: Word representations derived from Distributional representations that usually have the following features:

  • Dense in nature
  • Hard to interpret, i.e. you won’t be able to tell why a certain word gets a particular numerical representation; they are a black box
  • Low dimensional in nature
  • No single value in the array carries any specific meaning; the meaning is distributed throughout the vector

There is a long-running debate on the internet around the correct explanation of Distributional & Distributed representations. You might find some definitions that contradict the explanations above. I couldn’t find a single authoritative source, hence tried summarizing everything !!

Distributional Representations

One Hot Encoding (OHE)

In this method, each word in the vocabulary V is assigned an integer index i (from 0 to V-1) & the vector representation for each word is of length V, with all 0s except a 1 at the ith index for the corresponding word. It is a word-level representation

How to get an OHE? Though sklearn has an inbuilt function, sklearn.preprocessing.OneHotEncoder, we shall code this small function ourselves:

import numpy as np

def OHE(text):
    # map each unique token to an integer index
    tokens = set(text.lower().split())
    length = len(tokens)
    index_map = {x: index for x, index in zip(tokens, range(length))}
    ohe_matrix = []

    # one vector of length |vocab| per token, with a 1 at that token's index
    for token in tokens:
        ohe = np.zeros(length)
        ohe[index_map[token]] = 1
        print(token, ohe)
        ohe_matrix.append(ohe)
    return ohe_matrix

Let’s run this code for a sample text
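For instance, running it on a short made-up sentence prints one one-hot vector per unique token (the exact ordering depends on set() iteration order):

ohe_matrix = OHE("the boy went to the fair")
# prints one line per unique token, something like: boy [0. 1. 0. 0. 0.]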

As evident from this, we have a few concerns

  • The vector representation doesn’t hold any semantic meaning. For example: if we get boy, boys & jail as 3 tokens, we know the vector representations for boy & boys should be closer (small Euclidean distance or cosine score around 1), but in the case of OHE every token is at the same distance from every other token. So boy is as similar to boys as it is to jail according to OHE. Hence, OHE is unable to capture the meaning of tokens
  • Sparse representation (most values 0 in the representation) can lead to computational inefficiencies. As most corpora (text resources) have huge vocabularies (think of the English dictionary), this can be a big issue.
  • Assume we train our OHE on some text T1 & wish to get representations for a text T2. This leads to problems as there is no representation for tokens that weren’t seen in T1 but are present in T2. This is called the OOV (out of vocabulary) problem.
  • We don’t get a fixed-length vector representation per sentence. If a sentence has 10 tokens & another has 9, their representations won’t have the same size, as each token contributes a vector of length V. This is a problem while feeding training samples to a model
  • Very high dimension, as each word has a vector of length equal to the total vocab. Remember the curse of dimensionality !!

Bag of Words (BoW)

As the name suggests, Bag of Words is indeed just a bag of words, ignoring 2 crucial things in its vector representations: 1) the order in which tokens appear in the sentence/sequence and 2) the semantic meaning of the tokens. Bag of Words is based purely on the frequency of tokens present in a sentence & nothing else. It is a sentence-level representation. So what we do is

  • Each word in the vocabulary V is assigned an integer index i (from 0 to V-1), similar to OHE
  • A vector of length V with all 0s is initialized for each sentence & the frequency of each token is assigned to the corresponding index from step one.

The below example will help
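For a quick, runnable version of the same idea, scikit-learn’s CountVectorizer does exactly this counting; the two sentences below are made up purely for illustration:

from sklearn.feature_extraction.text import CountVectorizer

sentences = ["the boy went to the fair", "the fair was not fair to the boy"]  # hypothetical sentences
bow = CountVectorizer()
vectors = bow.fit_transform(sentences)
print(bow.get_feature_names_out())  # the vocabulary, one column per word
print(vectors.toarray())            # one fixed-length frequency vector per sentence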

If we look closely, we can observe a few things

  • The representation is now of fixed length irrespective of the sentence length
  • The representation dimension has reduced drastically compared to OHE, where a vector of this size represented just one token/word. Though, as the vocab can be huge, the representation can still be sparse & large.
  • The vector representation doesn’t store any information about the word order within the sentence. It also misses out on the semantic meaning of the words.
  • Any sentences with the same words will have the same representation. Though, slight variations can make the representation drastically different: ‘I run’ & ‘they ran’ will have completely different representations
  • OOV problem is still a challenge

Bag of N-grams (BoN)

BoN can be taken as an upgraded version of BoW where, instead of counting the frequency of individual tokens, we create groups of N consecutive tokens & count their frequencies. Each group is called an N-gram. It is a sentence-level representation

But what’s the intuition behind this?

In the previous 2 versions, the major drawback we observed was that the methods miss out on the context in which words are used. By grouping consecutive words together, we may be able to capture some of that meaning.

The BoW can be considered as BoN with n=1. Let’s see an example for clarification with n=3
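The same sketch as before works for BoN by switching CountVectorizer to count word trigrams (ngram_range=(3, 3)); again, the sentences are made up for illustration:

from sklearn.feature_extraction.text import CountVectorizer

sentences = ["the boy went to the fair", "the boy went to the market"]  # hypothetical sentences
bon = CountVectorizer(ngram_range=(3, 3))
vectors = bon.fit_transform(sentences)
print(bon.get_feature_names_out())  # trigrams like 'the boy went', 'boy went to', ...
print(vectors.toarray())            # frequency of each trigram per sentence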

As discussed above, this representation does improve semantic understanding of the sentence & sentences with the same phrase/group of words will have a similar representation. Though we still have a long way to go as

  • We haven’t found a solution to OOVs
  • Sparse representation poses a big challenge
  • As vocab increases, the dimensionality of representation increases

TF-IDF

TF-IDF is a single float value per word that solves a very particular problem which comes in handy a lot in text classification, i.e. word importance. In all 3 versions we read above, there exists no concept of word importance, i.e. how important a particular word is in a document/sentence. It is a sentence-level representation

It works on the intuition that if a certain word is present in every other sentence, chances are it is not important, for example: is, a, the, etc., while words that are rarely found across sentences may be of higher importance.

It uses a simple formula:

TF(x) = (frequency of word ‘x’ in sentence s) / (total tokens in sentence s)

IDF(x) = log(total sentences / total sentences containing word ‘x’)

TF-IDF(x) = TF(x) * IDF(x)

For terms like ‘is’, ‘a’, ‘the’, the TF-IDF score would be comparatively lower than for words like ‘Google’, ‘Samsung’, etc. in tech articles.

Understanding the output vector

  • Any word not present in a sentence has TF-IDF = 0 (basketball = 0 here) as its frequency in the given document is 0
  • It is able to assign the highest TF-IDF score to ‘he’ (0.64)
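To try it yourself, here is a small sketch using scikit-learn’s TfidfVectorizer on made-up sentences. Note that sklearn uses a smoothed IDF & L2-normalises each row, so the scores won’t exactly match the plain formula above:

from sklearn.feature_extraction.text import TfidfVectorizer

sentences = ["he is a good boy", "he is playing basketball", "the boy is tall"]  # hypothetical sentences
tfidf = TfidfVectorizer()
vectors = tfidf.fit_transform(sentences)
print(tfidf.get_feature_names_out())
print(vectors.toarray())  # frequent words like 'is' get lower scores than rarer ones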

Like all previous methods, TF-IDF also struggles with the same problems of handling OOVs, sparsity & big dimensions, but it does capture a notion of word importance that the earlier methods miss.

Done with the basic methods, let’s make things a bit more complicated !!


Distributed Representations

One key thing that was missing in the above methods is the ability to capture the semantic meaning of words. This one addition to the representations can be a great boost, as similar words become mathematically similar as well (e.g. a low Euclidean distance between the vectors of similar words).

Word2Vec

A very popular name in the word embedding space, Word2Vec is a neural network-based model for learning word embeddings. It addresses the 2 most critical problems we faced earlier, giving lower dimension representations & capturing the meaning of words using their context (vicinity words). It is a word-level representation.

Continuous Bag Of Words (CBoW)

No, it has no similarity to the BoW we studied earlier. The idea behind CBoW is to train a NN that takes context words (vicinity words) as input & predicts the target word. What we do is

  • Choose a window size ‘m’. Now for a target word, we will consider its ‘m’ neighboring words on either side (left & right) & prepare the training data set (a small pair-generation sketch follows this list). Have a look below
https://www.oreilly.com/library/view/practical-natural-language/9781492054047/

Here, given the source text, for each word we prepare training samples where the vicinity words act as features & the blue highlighted one is the target

  • Next, we train our 2 layered NN where we add up the values coming from different tokens in the 1st layer (figure added below)
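Here is a small pure-Python sketch of that pair-generation step (window size m assumed to be 2; gensim handles all of this internally when training Word2Vec):

def cbow_pairs(tokens, m=2):
    # for every position, the m words on each side are the features (context)
    # and the word at that position is the target
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - m):i] + tokens[i + 1:i + 1 + m]
        pairs.append((context, target))
    return pairs

print(cbow_pairs("i am selling these fine leather jackets".split()))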

Skip Gram

The exact opposite of CBoW, Skip Gram tries to predict the context words given a single word as input. How is the training set prepared? Pretty similar to CBoW, but reversed

  • Select ‘m’ similar to CBoW. We will be considering m words on both sides of every word for prediction but as separate samples. The blue token becomes the input.
https://www.oreilly.com/library/view/practical-natural-language/9781492054047/
  • Next, similar to CBoW, we have a 2-layered NN, but here the centre word is fed as input & each vicinity word is predicted against it as a separate sample (a small pair-generation sketch for this follows).
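And the reversed version for Skip Gram, where each (centre word, one neighbour) pair becomes a separate training sample:

def skipgram_pairs(tokens, m=2):
    # the centre word is the input; every word within m positions of it is a separate target
    pairs = []
    for i, centre in enumerate(tokens):
        for j in range(max(0, i - m), min(len(tokens), i + m + 1)):
            if j != i:
                pairs.append((centre, tokens[j]))
    return pairs

print(skipgram_pairs("i am selling these fine leather jackets".split()))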

The below image visually shows how the two models differ for the sentence ‘I am selling these fine leather jackets’

Word representations · fastText

Interestingly, what we actually need are the weights of the hidden layer, which act as the word embeddings. The trained model itself can be dumped !!

So, which one out of CBoW or Skip Gram is better?

There is no clear winner here. Depending on the dataset & problem, either one can be picked. According to the original paper, Skip Gram is great when you have a small dataset & a special focus on rare words, while CBoW is a good choice when the dataset is huge & the focus is on frequent words. Also, CBoW trains faster !!

Before ending, let’s see how Word2Vec can be implemented using gensim

from gensim.models import Word2Vec

text = '''Santiago is a Shepherd who has a recurring dream which is supposedly prophetic. Inspired on learning this, he undertakes a journey to Egypt to discover the meaning of life and fulfill his destiny. During the course of his travels, he learns of his true purpose and meets many characters, including an “Alchemist”, that teach him valuable lessons about achieving his dreams. Santiago sets his sights on obtaining a certain kind of “treasure” for which he travels to Egypt. The key message is, “when you want something, all the universe conspires in helping you to achieve it.” Towards the final arc, Santiago gets robbed by bandits who end up revealing that the “treasure” he was looking for is buried in the place where his journey began. The end.'''
text = [x.lower().split() for x in text.split('.')]
#sg hyperparameter =0 means CBoW
cbow = Word2Vec(text, vector_size=100, window=2, sg=0, min_count=1)
#sg=1 means skip_gram
skip_gram = Word2Vec(text, vector_size=100, window=2, sg=1,min_count=1)

The above models are trained on this summary of ‘The Alchemist’. Let’s observe which words are closest to ‘Santiago’
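The lookup itself is a single call on each trained model (a sketch; note the vocabulary was lower-cased during the split above, so the key is ‘santiago’):

print(cbow.wv.most_similar('santiago', topn=5))
print(skip_gram.wv.most_similar('santiago', topn=5))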

As can be observed, both models perform almost the same on this dataset, with minor differences. The results might look a bit off as the dataset is really small; for a reasonable amount of data, it should work well

Global Vectors (GloVe)

As we observed in the case of Word2Vec, only vicinity words are used to derive the embedding for a given word. This may be restrictive in the long run as the embedding will have a limited context. Can there be a way to use the entire corpus to derive the embedding for a word? Enter GloVe, a word-level vector representation scheme

GloVe uses word-word co-occurrence probabilities to incorporate the essence of the entire corpus. As it is a global statistic (derived using the entire dataset), the embeddings generated are called GloVe or Global Vectors. But first of all, what the heck is a co-occurrence matrix?

A word-word co-occurrence matrix is nothing but a 2D array that stores the frequency of every possible pair of words in the corpus. It can be imagined as a correlation-matrix-like structure where, instead of correlation values, we have the frequency with which the 2 words occur together within a specified window. The below example should clarify it:

For the above given 3 sentences, we have the following word-word co-occurrence matrix.

Next, we will calculate word-word cooccurrence probabilities using the above matrix. It's pretty easy

P(A|B) = Freq(A∩B) / Freq(B)

So, in the above example, if we wish to calculate P(a|is), it comes out to 2/3 = 0.66
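A rough sketch of how such a matrix & the conditional probabilities can be computed (the three sentences from the original figure aren’t reproduced here, so these ones are made up; a window of 1 word on each side is assumed):

from collections import Counter, defaultdict

sentences = ["this is a pen", "is this a pen", "this is a book"]  # hypothetical sentences
window = 1  # one word on each side

word_freq = Counter()
cooc = defaultdict(Counter)  # cooc[b][a] = how often 'a' appears within the window of 'b'
for s in sentences:
    tokens = s.split()
    word_freq.update(tokens)
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                cooc[w][tokens[j]] += 1

# P(A|B) = Freq(A co-occurring with B) / Freq(B)
def cooc_prob(a, b):
    return cooc[b][a] / word_freq[b]

print(cooc_prob('a', 'is'))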

Below is the word-word co-occurrence probability matrix from the GloVe paper calculated over a humongous dataset.

This one is interesting. Assume we wish to know how the words ‘ice’ & ‘steam’ are related. For this, let’s take a set of probe words [‘solid’, ’gas’, ’water’, ’fashion’] & see how each relates to the two. Before jumping into the mathematics, we know

Solid is more related to ice than steam

Gas is more related to steam than ice

Water is equally related to both

Fashion isn’t related to any of them

Can this be interpreted using the co-occurrence probability matrix? Yes

If you notice the last row of the matrix, we have got P(k|ice)/P(k|steam) where k=probe words

Now, if

P(k|ice)/P(k|steam) >> 1: k is more associated with ice, as in the case of P(solid|ice)/P(solid|steam) = 8.9

P(k|ice)/P(k|steam) << 1: k is more related to steam, as P(gas|ice)/P(gas|steam) = 0.085

P(k|ice)/P(k|steam) ≈ 1: k is either related to both or unrelated to both, as in the case of water & fashion.

Word-word co-occurrence probability matrices do work !!

So, after the above intuition, what GloVe tries to do is

F(i, j, k) = P(k | i )/ P(k | j)

Hence, we wish to find a function F that takes 3 word vectors as input, for i, j & k, where i=ice, j=steam & k=the probe words in our case. After a mathematical derivation (present in the paper), we finally get to

Wᵢᵀ Wₖ+ bᵢ + bₖ = log(1+Xᵢₖ)

Where Wᵢ & Wₖ are the word vectors corresponding to i & k, b represents the bias terms & X the word-word co-occurrence frequency.

So, what we are trying here is fitting a model such that

WᵢᵀWₖ + bᵢ + bₖ − log(1+Xᵢₖ) ≈ 0. In other words, log(1+Xᵢₖ) can be taken as the target variable, & the weight matrix W we are trying to learn eventually becomes our word embedding matrix, similar to CBoW/Skip Gram. In the paper, a weighting function is also applied to each term of the loss, but that can be left for now.
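Just to make the fitting idea concrete, below is a toy numpy sketch of that objective: plain gradient descent, random co-occurrence counts & no weighting function, so it is nothing like a production implementation:

import numpy as np

rng = np.random.default_rng(0)
V, d = 20, 8                             # toy vocabulary size & embedding dimension
X = rng.integers(0, 5, (V, V))
X = X + X.T                              # fake symmetric word-word co-occurrence counts
Y = np.log(1 + X)                        # target: log(1 + X_ik)

W = rng.normal(scale=0.1, size=(V, d))   # word vectors (the embeddings we want)
b = np.zeros(V)                          # bias terms
lr = 0.05
for _ in range(200):
    # error for every (i, k) pair: Wi.Wk + bi + bk - log(1 + Xik)
    err = W @ W.T + b[:, None] + b[None, :] - Y
    W -= lr * 2 * (err @ W + err.T @ W) / V**2
    b -= lr * 2 * (err.sum(axis=1) + err.sum(axis=0)) / V**2

print(np.abs(err).mean())  # the mean absolute error shrinks as W & b fit the statistics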

Note: GloVe can be easily implemented using the glove_python library

fastText

So, we have seen in word2vec how, to get word embeddings, we consider the k neighbors of a word & either

  • Use these words together to predict the target word (CBoW)
  • Use vicinity words one at a time to predict the target(Skip Gram)

fastText can be taken as an extension of the Skip Gram model, with subtle changes

Instead of using the word as a whole, the word is split into characters, n-grams over these characters are formed & used to train the Skip Gram model. Before this splitting, special boundary characters are added at the beginning & end of each word. So ‘titan’ becomes ‘<titan>’ & is then broken into character n-grams like ‘<t’, ‘it’, ‘ta’ … ‘n>’, ‘<ti’, ‘tit’ … ‘an>’, etc. (a small sketch of this follows)
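A tiny sketch of that splitting step (here only n=2 & n=3 character n-grams are generated, with the ‘<’ & ‘>’ boundary markers; fastText itself does this internally over a configurable range of n):

def char_ngrams(word, ns=(2, 3)):
    # wrap the word with boundary markers, then slide a window of each size n over it
    word = f"<{word}>"
    grams = []
    for n in ns:
        grams += [word[i:i + n] for i in range(len(word) - n + 1)]
    return grams

print(char_ngrams("titan"))  # ['<t', 'ti', 'it', 'ta', 'an', 'n>', '<ti', 'tit', 'ita', 'tan', 'an>']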

These character-level n-grams help in learning subword embeddings as well. Hence, different versions of the same word like drives, driving, drove, etc. can be related by the system. This may also resolve the OOV problem to some extent, because we might still have subword embeddings available even if not the entire word

Let’s have a look at the below example

m=1 & n=3 for forming n-grams

Here, we have the sentence ‘You are naughty but kind’, with m=1 (m similar to the window in CBoW or Skip Gram discussed earlier) & n=3 (n similar to BoN) for n-grams at the character level.

Time for fastText implementation using gensim (using the same text as in Word2Vec)

from gensim.models import FastText
#text = same as word2vec
fasttext = FastText(vector_size=100, window=2, min_count=1)
fasttext.build_vocab(text)
fasttext.train(text, total_examples=len(text), epochs=10)

Let’s again see the top similar words to ‘he’ using fastText. The embeddings look a bit off as, again, the dataset is pretty miniature, just for demo purposes
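As before, the lookup is a single call (a sketch):

print(fasttext.wv.most_similar('he', topn=5))
# thanks to the subword n-grams, even a word never seen during training still gets a vector
print(fasttext.wv['santiagoo'][:5])  # 'santiagoo' is a deliberate misspelling, i.e. an OOV token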

We have talked enough about generating embeddings for single words in a sentence or at max single sentences. Is there a way we can generate an embedding for an entire paragraph? or page?

Doc2Vec

An extension of Word2Vec, Doc2Vec is a useful model to get document/para/sentence embedding in a fixed-length vector. Similar to Word2Vec, it also has 2 versions:

Distributed Memory

It can be taken as an analogue of CBoW for document vectorization, where we feed in context words and predict target words as we did in CBoW. The only difference is that a vector called the Paragraph ID (unique for each paragraph; assumed to be some sort of vector, though not mentioned anywhere specifically) is also fed alongside the context words, as shown below. The training data preparation remains similar to CBoW.

https://www.oreilly.com/library/view/practical-natural-language/9781492054047/

Distributed Bag Of Words

An analogue of Skip Gram, it aims at predicting multiple words given the paragraph ID instead of a single word/term. The training data preparation remains similar to Skip Gram.

https://www.oreilly.com/library/view/practical-natural-language/9781492054047/

Let’s look at how this can be done in python using, again, gensim

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
# text = the same raw summary string used in the Word2Vec example
text = [x.lower().split() for x in text.split('.')]
documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(text)]
doc2vec = Doc2Vec(documents, vector_size=100, window=2, min_count=1)
#representation for new text
doc2vec.infer_vector('he was beaten'.split())
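And to check which of the training paragraphs the new sentence lands closest to, a sketch using the gensim 4.x API (the trained paragraph vectors live under .dv):

new_vec = doc2vec.infer_vector('he was beaten'.split())
print(doc2vec.dv.most_similar([new_vec], topn=3))  # (paragraph tag, similarity) pairs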

Note: We must remember that in all the models trained above, we are never interested in the model itself but in the weights of the hidden layer, which act as the embeddings, be it at the document level or the word level

The era of transformers

The most complex embeddings of all & the most popular ones right now. The idea here is to learn a language model that is able to generate an embedding for a word given the full sequence it appears in. This is usually done in 2 steps

  • Pretraining: Training a heavy NN (like BERT, GPT) to learn the language model using some big, common corpus
  • Fine-tuning: Depending on the task for which the embeddings are required, further training the pretrained model using transfer learning to adjust the weights accordingly (a minimal embedding sketch follows this list).
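As a minimal sketch of using such a pretrained model to get contextual embeddings (bert-base-uncased via the Hugging Face transformers library assumed here; the fine-tuning step is omitted):

from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("He went to a fair", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
# one contextual vector per token: the same word gets different vectors in different sentences
print(outputs.last_hidden_state.shape)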

Though even such sophisticated methods aren’t free of issues, the major ones being that the models are humongous in size & often biased, given the datasets on which they are trained.

Note: As BERT & Transformers have already been discussed at length previously, I am skipping them this time !!

Before wrapping, a huge shoutout to the writers of the book: https://www.oreilly.com/library/view/practical-natural-language/9781492054047/ for writing such a gem !!
