Chapter 9.1 : NLP - Word vectors.

Madhu Sanjeevi ( Mady )
Deep Math Machine learning.ai
6 min read · Oct 21, 2017

In the last story we talked about the basic fundamentals of natural language processing and data preprocessing. In this story we talk about how documents are converted into vectors of values.

Let’s get going!!!!

We already talked about count vectorization in the last story, and about how documents are converted into vectors with the help of a count vectorizer.

This is how the documents are converted into vectors.

Just taking counts doesn't help much with real data, because a word can appear many times in one document while appearing rarely or not at all in other documents.

Then we might get high count vectors for one document and low count vectors for others. This can be a problem, so to solve it we use a technique called TF-IDF (Term Frequency - Inverse Document Frequency).

TF-IDF (Term Frequency - Inverse Document Frequency)

TF-IDF is a weighting factor that is used to extract the important features from the documents in a corpus.

It tells us how important a word is to a document in a corpus. The importance of a word increases proportionally with the number of times the word appears in an individual document; this part is called Term Frequency (TF).

Ex: Document 1:

Mady loves programming. He programs all day, he will be a world class programmer one day

If we apply tokenization, stemming and stop-word removal (which we discussed in the last story) to this document, we get features with counts like → program(3), day(2), love(1), etc.

TF = (number of times the word appears in the document) / (total number of words in the document)

Here "program" is the most frequent term in the document.

So "program" is a good feature if we consider TF.
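A minimal sketch of this TF calculation in Python (the token list below is my assumption of what tokenization, stemming and stop-word removal would leave for document 1):

```python
# Tokens assumed to remain after tokenization, stemming and stop-word removal (document 1)
tokens = ["mady", "love", "program", "program", "day",
          "world", "class", "program", "one", "day"]

# TF = (times the word appears in the document) / (total number of words in the document)
tf = {word: tokens.count(word) / len(tokens) for word in set(tokens)}

print(tf["program"])  # 0.3 -> highest TF in this document
print(tf["day"])      # 0.2
```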

However, if many documents in our corpus contain the word "program" many times, then we might say…

it's a frequent word in all the other documents too, so it does not add much distinguishing meaning and it probably is not an important feature.

To adjust for this we use IDF.

The inverse document frequency is a measure of how much information the word provides, that is, whether the term is common or rare across all documents.

IDF = log(total number of documents / number of documents with the term t in it).

So TF-IDF = TF * IDF.
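Here is a rough sketch of the full TF-IDF calculation over a toy three-document corpus (the documents are made up purely for illustration):

```python
import math

# Toy corpus: each document is already reduced to a list of tokens (made-up example)
docs = [
    ["program", "program", "program", "day", "day", "love"],
    ["program", "movie", "good"],
    ["movie", "day", "good"],
]

def tf(term, doc):
    # TF = count of the term in the document / total words in the document
    return doc.count(term) / len(doc)

def idf(term, docs):
    # IDF = log(total number of documents / number of documents containing the term)
    n_containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tf_idf("program", docs[0], docs))  # appears in 2 of 3 docs -> IDF shrinks its weight
print(tf_idf("love", docs[0], docs))     # appears in only 1 doc -> boosted by IDF
```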

You can read more about it on Wikipedia.

So finally, by using TF-IDF, we get the most important features from the documents in the corpus, along with their weights.

Here is the comparison between the count vectorizer and the TF-IDF vectorizer, based on the discussion and data from the last story.

We can use scikit-learn to do this job.
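For example, a minimal sketch with scikit-learn's CountVectorizer and TfidfVectorizer (the mini corpus is my own, not the data from the last story, and get_feature_names_out assumes a recent scikit-learn version):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up mini corpus just to illustrate the two vectorizers
corpus = [
    "Mady loves programming",
    "He programs all day",
    "He will be a world class programmer one day",
]

count_vec = CountVectorizer()
X_counts = count_vec.fit_transform(corpus)   # raw counts
print(count_vec.get_feature_names_out())
print(X_counts.toarray())

tfidf_vec = TfidfVectorizer()
X_tfidf = tfidf_vec.fit_transform(corpus)    # TF-IDF weights
print(X_tfidf.toarray().round(2))
```

Note that TfidfVectorizer applies a smoothed IDF and L2 normalization by default, so its weights differ slightly from the plain formula above.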

If we take one word as a feature, that's called a unigram. If we take two words at a time as a feature, that's called a bigram. Three words at a time as a feature is called a trigram.

If we take n words at a time as a feature, that's called an n-gram.

With bigrams:
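Here is a quick sketch of how such bigram features can be produced, assuming scikit-learn's ngram_range parameter (the sentences are made up):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["Mady loves programming", "He programs all day"]  # made-up sentences

# ngram_range=(2, 2) keeps only bigrams; (1, 2) would keep unigrams and bigrams
bigram_vec = TfidfVectorizer(ngram_range=(2, 2))
X = bigram_vec.fit_transform(corpus)

print(bigram_vec.get_feature_names_out())  # ['all day', 'he programs', 'loves programming', ...]
print(X.toarray().round(2))
```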

These values carry more meaning than plain count values.

This table is called a sparse matrix (it contains a lot of zeros). We can now pass it as X_train to any machine learning model for training.
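For instance, a minimal sketch that feeds the TF-IDF matrix to a classifier (the sentences and sentiment labels are made up purely for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

corpus = ["mady loves programming", "he hates bugs",
          "programming is fun", "bugs are annoying"]
labels = [1, 0, 1, 0]  # made-up sentiment labels

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(corpus)   # the sparse TF-IDF matrix

model = LogisticRegression()
model.fit(X_train, labels)
print(model.predict(vectorizer.transform(["programming is lovely"])))
```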

Let's come back.

The count vectorizer and the TF-IDF vectorizer focus only on counts; they only care about how many times a word appears in the corpus and they don't care about word order. For example:

“ this movie is good “

“good is the movie”

If we use the count vectorizer or the TF-IDF vectorizer, we get the same vectors for both. For example, let's say "good" and "movie" are the only features and we use the count vectorizer.

These vectorizers don't understand the meaning, so we don't get useful representations.
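A minimal sketch to see this, assuming "good" and "movie" really are the only two features (a fixed vocabulary):

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["this movie is good", "good is the movie"]

# Restrict the vocabulary to the two features from the example above
count_vec = CountVectorizer(vocabulary=["good", "movie"])
X = count_vec.fit_transform(sentences)

print(X.toarray())  # [[1 1]
                    #  [1 1]]  -> identical vectors, word order is lost
```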

To get meaningful numerical representations, we use one of the important models in natural language processing: word2vec.

Let's forget about TF-IDF for a while and start with a fresh mind. As we know, if we want to feed words into machine learning models, we need to convert the words into some set of numeric vectors.

A simple way of doing this would be to use a "one-hot" method: convert each word into a sparse representation with only one element of the vector set to 1 and the rest set to zero.

But this would be extremely costly as the vocabulary grows, and it misses the meaning of the words.

For example, if the vocabulary size is 10,000, then we have a 10,000-sized vector for every word in every document of the corpus.
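A small sketch of one-hot encoding with NumPy (a made-up 6-word vocabulary stands in for the 10,000-word one):

```python
import numpy as np

# Tiny vocabulary standing in for a real 10,000-word one
vocab = ["mady", "goes", "crazy", "about", "machine", "learning"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    # Vector of zeros with a single 1 at the word's index
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

print(one_hot("goes"))  # [0. 1. 0. 0. 0. 0.] -- with 10,000 words this would be 10,000 long
```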

Word2vec takes care of two things:

  1. It converts this high-dimensional vector (10,000-sized) into a low-dimensional vector (let's say 200-sized).

The conversion of the 10,000-column matrix into a 200-column matrix is called a word embedding.

2. It maintains the word context (meaning).

The word context/meaning can be captured using two simple algorithms:

  1. Continuous Bag-of-Words model (CBOW)

It predicts one word based on the surrounding words (it takes either an entire context or a window-sized context as one observation).

Ex: Text = "Mady goes crazy about machine learning" and the window size is 3.

It takes 3 words at a time and predicts the center word based on the surrounding words → [["Mady", "crazy"], "goes"] → "goes" is the target word, and the other two are the inputs. (A small sketch that generates these training pairs for both models appears after the Skip-Gram model below.)

2. Skip-Gram model

It takes one word as input and tries to predict the surrounding (neighboring) words.

[["goes", "Mady"], ["goes", "crazy"]] → "goes" is the input word, and "Mady" and "crazy" are the surrounding words (the output probabilities).
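As promised above, a small sketch that generates these training pairs for both models from the example sentence (the simple sliding window of 3 words is my reading of the example):

```python
text = "Mady goes crazy about machine learning".split()

cbow_pairs = []       # (context words, target word)
skipgram_pairs = []   # (input word, neighboring word)

# Slide a 3-word window: one center word plus one neighbor on each side
for i in range(1, len(text) - 1):
    context = [text[i - 1], text[i + 1]]
    target = text[i]
    cbow_pairs.append((context, target))
    for neighbor in context:
        skipgram_pairs.append((target, neighbor))

print(cbow_pairs[0])       # (['Mady', 'crazy'], 'goes')
print(skipgram_pairs[:2])  # [('goes', 'Mady'), ('goes', 'crazy')]
```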

How does word2vec work?

We can use either of these models to get the word vectors. Let's use the skip-gram model, as it generally performs better. Here is how word2vec works:

  1. Take a 3-layer neural network with 1 input layer, 1 hidden layer and 1 output layer.

Assume we have 10,000 unique words in the dictionary, so for one word, "goes", we get a 10,000-sized one-hot vector as input. We then take 200 hidden neurons and do the neural network training for all the words using the skip-gram model (I have already talked about neural network training here).

Once the training is complete, we get the final weights for the hidden layer and the output layer.

2. Ignore the last layer (the output layer) and keep the input and hidden layers.

So we get 200-sized weights (scores) for every word.

Now, input a word from within the vocabulary. The output given at the hidden layer is the word embedding of the input word.
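A rough sketch of that lookup, assuming training has already produced a hidden-layer weight matrix W (random numbers stand in for the trained weights here):

```python
import numpy as np

vocab_size, embedding_size = 10000, 200

# Stand-in for the trained input-to-hidden weight matrix (real values come from training)
W = np.random.rand(vocab_size, embedding_size)

goes_index = 42                  # hypothetical index of the word "goes" in the vocabulary
one_hot = np.zeros(vocab_size)
one_hot[goes_index] = 1.0

# The one-hot input simply selects one row of W: that row is the 200-sized word embedding
embedding = one_hot @ W
assert np.allclose(embedding, W[goes_index])
print(embedding.shape)  # (200,)
```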

Put simply, word2vec:

→ is a neural network trained on all the words in our dictionary to get the weights (vectors)

→ gives us a word embedding for every word in the dictionary

What's the ultimate result of word2vec?

Well, we get similar vectors for similar words. For example:

we get similar vectors for the words "India", "China", "America", "Germany", etc. from a big corpus,

even though we never label them or tell the model that those are country names.

So if I give a text like "I live in ____", we can get predictions like "India", "China", "America", "Germany", etc.
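As a small preview of the code we will write next time, here is a hedged sketch using the gensim library (my choice; the original does not name a library, and the parameter names assume gensim 4.x):

```python
from gensim.models import Word2Vec

# Tiny made-up corpus; a real run needs a big corpus to get meaningful neighbors
sentences = [
    ["i", "live", "in", "india"],
    ["i", "live", "in", "china"],
    ["i", "live", "in", "america"],
    ["mady", "goes", "crazy", "about", "machine", "learning"],
]

# sg=1 selects the skip-gram model; vector_size is the embedding size (200 in the story)
model = Word2Vec(sentences, vector_size=200, window=3, sg=1, min_count=1)

print(model.wv["india"].shape)         # (200,) word embedding
print(model.wv.most_similar("india"))  # nearest words by cosine similarity
```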

That's the power of word2vec.

That’s it for this story. Hope you got an idea about how words can be represented numerically.

In the next story we will discuss how to write the code for word2vec and the neural network training step by step, with some more examples.

Until then See Ya!!!
