Featurization

Yassar
3 min read · Nov 4, 2019

What is Featurization?

Featurization is a way to convert some form of data (text data, graph data, time-series data, …) into a numerical vector.

Featurization is different from feature engineering. Feature engineering transforms features that are already numerical so that machine learning models work well. In featurization, by contrast, the data does not need to start out as a numerical vector.

Why Featurization?

Machine learning models cannot work with raw text data directly. In the end, they work with numerical (categorical, real, …) features. So it is important to convert other types of data into numerical vectors, which lets us leverage the whole power of linear algebra (e.g., drawing decision boundaries between data points) and statistical tools on those data types as well.

In this blog, I am discussing text data only. Text data is just a sequence of words, and it is central to many applications (language identification, text classification, chatbots, summarization systems, sentiment analysis, headline generation, review analysis, …). We need to featurize text data before a machine learning model can make predictions from it.

Techniques to featurize text data!

  1. Bag of Words (BoW)
  2. Tfidf Vectorizer
  3. Weighted Word2Vec
  4. Tfidf Word2Vec

1. Bag of Words (BoW):

BoW is a simple and flexible technique for converting text data into numerical vectors. It is based on counting the occurrences of each word in a document (sentence, review).

Algorithm

  1. First, it creates the dictionary of the corpus (the set of all distinct words in the corpus). These dictionary words are used as the features of the corpus/dataset.
  2. Then it converts each document into a row vector by recording, for each dictionary word, whether (and how often) it appears in the document.

Example

Let's say we have a corpus of 4 documents (sentences, reviews) for which we have to create numeric vectors.

corpus = [
'This is the first document.',
'This document is the second document.',
'And this is the third one.',
'Is this the first document?'
]

The first step of the BoW technique is to build the dictionary (vocabulary or bag) of all distinct words in the corpus. For this, we will use scikit-learn's CountVectorizer.

Code
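
A minimal sketch of this step, assuming scikit-learn's CountVectorizer with its default tokenization:

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?'
]

# Fit the vectorizer on the corpus to build the dictionary (vocabulary)
vectorizer = CountVectorizer()
vectorizer.fit(corpus)

# Note: get_feature_names() was renamed get_feature_names_out()
# in newer scikit-learn versions
print('Dictionary: ', vectorizer.get_feature_names())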

Output

Dictionary:  ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']

CountVectorizer automatically sorts the dictionary words in alphabetical order. Punctuation is removed by default, and stop words can be handled according to our needs. For full documentation of CountVectorizer, see the scikit-learn docs.

The second step of BoW is to transform the corpus documents into numeric vectors.

Code
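
Continuing from the previous snippet, a sketch of the transform step:

# Transform each document into its count vector
bow_matrix = vectorizer.transform(corpus)

# transform() returns a sparse matrix; .toarray() shows it as a dense array
for i, row in enumerate(bow_matrix.toarray()):
    print('Document-Vector', i + 1, ':', row)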

Output

Document-Vector 1 : [0 1 1 1 0 0 1 0 1]
Document-Vector 2 : [0 2 0 1 0 1 1 0 1]
Document-Vector 3 : [1 0 0 1 1 0 1 1 1]
Document-Vector 4 : [0 1 1 1 0 0 1 0 1]

BoW counts how many times each word occurs in a document and puts that count at the word's position in the document vector. The length of each document vector is equal to the size of the dictionary. If a word is present in the dictionary but not in the document, a 0 is put at that word's position. CountVectorizer creates sparse vectors (most of the vector elements are zero); that's why the .toarray() method is used, which shows the sparse vectors as dense ones. The dimension of the sparse matrix is number of documents × size of dictionary.
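
A quick check of this, continuing from the snippets above:

# transform() returns a SciPy sparse matrix in CSR format
print(type(bow_matrix))
print(bow_matrix.shape)  # (4, 9): number of documents x size of dictionary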

Advantages of BoW

  1. Simple. It just counts the occurrences of each word in the document.
  2. Easy to implement. A few lines of code are enough to create the numeric vector of a text document.
  3. Flexible. It is easy to change into a binary BoW, as the sketch after this list shows.
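
As an illustration of that flexibility, a minimal sketch reusing the corpus and imports from above. CountVectorizer's binary=True option records only the presence or absence of a word instead of its count.

# binary=True records presence (1) or absence (0) instead of counts
binary_vectorizer = CountVectorizer(binary=True)
binary_matrix = binary_vectorizer.fit_transform(corpus)

# Document 2 ('This document is the second document.') now has a 1,
# not a 2, at the 'document' position
print(binary_matrix.toarray()[1])  # [0 1 0 1 0 1 1 0 1]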

Disadvantages of BoW

  1. It doesn't capture the semantic similarity between words (or sentences). For example, "good" and "nice" are semantically similar, but BoW treats them as entirely different words.
  2. It creates a high-dimensional sparse matrix, where most of the elements are zero. Because of this, it increases computational and storage complexity.
  3. It doesn't capture word importance or weights. BoW is simply a collection (bag) of distinct words, and it gives equal importance to each word. But in the real world, a word that is rare in the corpus may be an important word, and a word that occurs many times in the corpus may not be important. BoW doesn't give any weightage to a particular word. This is why Tfidf comes into the picture, as the sketch after this list hints.
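
As a quick preview of that idea, a minimal sketch using scikit-learn's TfidfVectorizer on the same corpus: words that appear in fewer documents get a higher idf (inverse document frequency) weight.

from sklearn.feature_extraction.text import TfidfVectorizer

# Words appearing in fewer documents get a higher idf weight;
# e.g., 'this' (in all 4 documents) gets a lower weight than 'and' (in 1)
tfidf = TfidfVectorizer()
tfidf.fit(corpus)

for word, idf in zip(tfidf.get_feature_names(), tfidf.idf_):
    print(word, round(idf, 2))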
