Featurization of Text data

BOW, TF-IDF, Word2Vec, TF-IDF Weighted Word2Vec

Rana singh
Analytics Vidhya
5 min read · Sep 12, 2019


Bag of Words (BOW)

BOW first constructs a dictionary of all the unique words in the text corpus. Each document is then represented as a vector over this vocabulary, stored as a sparse matrix.

For each document (row), each unique word is a separate dimension, and each cell holds the number of times that word occurs in that document.


The dimensionality d (the size of the vocabulary) will be very large, and most of the cells will be zero. This is why a sparse matrix representation is used.

If two documents are very similar, their vectors will be close to each other.

The distance between two document vectors v1 and v2 is the Euclidean norm ‖v1 − v2‖; for a binary BOW this works out to the square root of the number of words on which the two documents differ.

Code:
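Below is a minimal sketch using scikit-learn's CountVectorizer; the three toy reviews are made up for illustration.

from sklearn.feature_extraction.text import CountVectorizer

# toy corpus, one review per row
reviews = ["the pasta was tasty", "the pasta was delicious", "the service was slow"]
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(reviews)       # sparse document-term matrix
print(vectorizer.get_feature_names_out())     # learned vocabulary (the dimensions)
print(bow.toarray())                          # counts of each word in each review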

Drawback:

BOW does not take semantic meaning into account. For example, "tasty" and "delicious" have the same meaning, but BOW treats them as two separate, unrelated dimensions.

bi-gram, tri-gram, and n-gram

http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

Stop words like "not" should not be removed before building n-grams, because negation changes the meaning of a phrase (e.g. "not tasty"). Note that n-grams add many more dimensions, so the resulting vectors are even sparser and higher-dimensional than plain BOW.

ngram_range : tuple (min_n, max_n)

The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used.
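As a small sketch of what ngram_range does (the sentence is illustrative):

from sklearn.feature_extraction.text import CountVectorizer

# keep "not": removing it would turn "not tasty" into "tasty"
vectorizer = CountVectorizer(ngram_range=(1, 2))   # unigrams and bigrams
bow = vectorizer.fit_transform(["the pasta was not tasty"])
print(vectorizer.get_feature_names_out())
# ['not' 'not tasty' 'pasta' 'pasta was' 'tasty' 'the' 'the pasta' 'was' 'was not']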

TF-IDF (term frequency-inverse document frequency)

TF (term frequency): the number of times the word t occurs in document d, divided by the total number of words in document d; in other words, the probability of finding the word in document d:

TF(t, d) = (count of t in d) / (total number of words in d)

IDF (inverse document frequency): the log of the total number of documents N divided by the number of documents containing t:

IDF(t) = log(N / n_t)

If a word occurs in more documents, its IDF decreases. Each cell value is TF × IDF, so a word gets more importance if it is rare across the corpus and, at the same time, frequent within a given document/review.

code:
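A minimal sketch with scikit-learn's TfidfVectorizer, again on a made-up toy corpus:

from sklearn.feature_extraction.text import TfidfVectorizer

reviews = ["the pasta was tasty", "the pasta was delicious", "the service was slow"]
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(reviews)   # sparse TF-IDF matrix
print(tfidf.get_feature_names_out())
print(tfidf_matrix.toarray())                 # dense view of the TF-IDF values

Note that scikit-learn uses a smoothed IDF and L2-normalizes each row, so the values differ slightly from the textbook TF × IDF formula above.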

To append the dense output of TF-IDF vectorization to a pandas DataFrame, see: https://stackoverflow.com/questions/48429367/appending-2-dimensional-list-dense-output-of-tfidf-result-into-pandas-datafram

Drawback:

It still does not capture the semantic meaning of words.

Word2Vec

Word2vec places each word in a feature space such that its location is determined by its meaning: words with similar meanings are clustered together, and the distances between words reflect semantic relationships, so similar pairs of words sit at similar distances.

Cosine Similarity

Let's first understand cosine similarity, because word2vec uses it to find the most similar words. Cosine similarity not only tells us how similar two vectors are, it also tests for orthogonality. It is given by the formula:

cos(θ) = (A · B) / (‖A‖ ‖B‖)

If the angle θ is close to zero, the vectors are very similar; if θ is 90°, the vectors are orthogonal (orthogonal vectors are unrelated to each other); and if θ is 180°, the vectors point in opposite directions.
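As a tiny illustration, here is the formula in NumPy (the vectors are made up):

import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (||a|| * ||b||)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])
print(cosine_similarity(a, b))                            # 1.0 -> same direction, theta = 0
print(cosine_similarity(a, np.array([-2.0, 1.0, 0.0])))   # 0.0 -> orthogonal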


Case 1: You want to train your own Word2Vec

# http://kavita-ganesan.com/gensim-word2vec-tutorial-starter-code/#.W17SRFAzZPY
You can comment out this whole cell or change these variables according to your needs.
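A minimal sketch with gensim (4.x API); list_of_sentences is a made-up tokenized corpus standing in for the real one:

from gensim.models import Word2Vec

# each document is a list of tokens
list_of_sentences = [["pasta", "was", "tasty"], ["pasta", "was", "delicious"]]
w2v_model = Word2Vec(list_of_sentences, vector_size=50, min_count=1, workers=4)
w2v_words = list(w2v_model.wv.index_to_key)    # vocabulary the model learned
print(w2v_model.wv.most_similar("tasty"))      # nearest words by cosine similarity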

Case 2: You want to use Google's Word2Vec, pretrained on Google News

# in this project we are using a pretrained model by Google
# it's a 3.3GB file; once you load it into memory
# it occupies ~9GB, so please do this step only if you have >12GB of RAM
# we will provide a pickle file which contains a dict
# with all our corpus words as keys and model[word] as values
# To use this code snippet, download "GoogleNews-vectors-negative300.bin"
# from https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit
# it's 1.9GB in size.
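A sketch of how the pretrained vectors can be loaded with gensim's KeyedVectors once the file above is downloaded:

from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)
print(model["computer"].shape)    # (300,) -- every word maps to a 300-d vector
print(model.most_similar("tasty"))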

To check the number of times a word occurs:
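For example, with collections.Counter over the tokenized corpus from the training snippet (assuming list_of_sentences from above):

from collections import Counter

# flatten the tokenized corpus into a single list of words
all_words = [w for sentence in list_of_sentences for w in sentence]
word_counts = Counter(all_words)
print(word_counts["pasta"])    # how many times "pasta" occurs in the corpus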

Average Word2Vec

We need to give it a large text corpus; for every word in the corpus it creates a vector. It tries to learn the relationships between words automatically from the raw text. The larger the dimensionality of the vectors, the richer the information they can capture.

properties:

  1. If words w1 and w2 are similar, then their vectors v1 and v2 will be close to each other.
  2. It automatically learns the relationships between words/vectors.

Looking at the male-female graph, we observe that the distance between "man" and "woman" is about the same as the distance between "king" and "queen". The same holds within a gender: the distance between "queen" and "woman" matches the distance between "king" and "man", since both pairs represent the same relationship and hence should be equally far apart.

How do we convert each document into a vector?

Suppose a document (row) contains the words w1, w2, …, wn. To convert it into a single vector, add up the word2vec vectors of all its words and divide the sum by the number of words in the document:

avg-w2v(d) = (v(w1) + v(w2) + … + v(wn)) / n

code:
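A minimal sketch, assuming the w2v_model and list_of_sentences defined in the earlier snippets:

import numpy as np

def avg_word2vec(tokens, model):
    # average the vectors of the words the model knows
    vectors = [model.wv[w] for w in tokens if w in model.wv]
    if not vectors:
        return np.zeros(model.vector_size)
    return np.mean(vectors, axis=0)

doc_vectors = [avg_word2vec(doc, w2v_model) for doc in list_of_sentences]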

TFIDF weighted Word2Vec

In this method we first calculate the TF-IDF value of each word. We then follow the same approach as in the previous section: multiply each word's vector by its TF-IDF value, sum them up, and divide by the sum of the TF-IDF values.

code:

# TF-IDF weighted Word2Vec
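A minimal sketch, reusing w2v_model and list_of_sentences from above; the IDF values are taken from scikit-learn's TfidfVectorizer:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
tfidf.fit([" ".join(doc) for doc in list_of_sentences])
idf_map = dict(zip(tfidf.get_feature_names_out(), tfidf.idf_))   # word -> IDF

def tfidf_weighted_w2v(tokens, model):
    weighted_sum = np.zeros(model.vector_size)
    weight_total = 0.0
    for w in tokens:
        if w in model.wv and w in idf_map:
            tf = tokens.count(w) / len(tokens)    # term frequency in this doc
            weight = tf * idf_map[w]              # TF-IDF weight of the word
            weighted_sum += weight * model.wv[w]
            weight_total += weight
    return weighted_sum / weight_total if weight_total else weighted_sum

doc_vectors = [tfidf_weighted_w2v(doc, w2v_model) for doc in list_of_sentences]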

=======================================

The code can be found in the Amazon_Fine_Food_Reviews_Analysis.ipynb notebook at the given link: https://github.com/ranasingh-gkp/Applied_AI_O/tree/master/Assignments_AFR_2018

=====================================

Reference:

  1. Google Images
  2. Applied AI
  3. Kaggle: Amazon Fine Food Reviews
