Fundamentals of Bag Of Words and TF-IDF

Prasoon Singh · Published in Analytics Vidhya · Sep 4, 2019 · 8 min read

This post will give you a basic understanding of how a machine checks whether two given sentences have a similar meaning or not. It involves some basic concepts of NLP.

Natural-language processing (NLP) is an area of computer science and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to fruitfully process large amounts of natural language data.

In simple terms, Natural language processing (NLP) is the ability of computers to understand human speech as it is spoken. NLP helps to analyze, understand, and derive meaning from human language in a smart and useful way.

Why do we need such models to check similarity between sentences?

A problem with modeling text is that it is messy, and machine learning algorithms prefer well-defined, fixed-length inputs and outputs.

Machine learning algorithms cannot work with the raw text directly; the text must be converted into numbers. Specifically, vectors of numbers.

Concept: similar texts must result in vectors that are close to each other.

“In language processing, the vectors x are derived from textual data, in order to reflect various linguistic properties of the text.”

Popular and simple methods of feature extraction from text data that are currently used are:

  1. Bag-of-Words
  2. TF-IDF
  3. Word2Vec

Bag Of Words (BOW):

The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity.

The bag-of-words model is commonly used in methods of document classification where the (frequency of) occurrence of each word is used as a feature for training a classifier.

The bag-of-words model is simple to understand and implement and has seen great success in problems such as language modeling and document classification.

It involves two things:

  1. A vocabulary of known words: This step revolves around collecting all the unique words present in the whole of the text data provided. It is sort of like a dictionary where each index corresponds to one word, and each word is a different dimension.

Example: Suppose we are given 4 reviews for an Italian pasta dish.

Review 1: This pasta is very tasty and affordable.

Review 2: This pasta is not tasty and is affordable.

Review 3: This pasta is delicious and cheap.

Review 4: Pasta is tasty and pasta tastes good.

Now if we count the number of unique words across all four reviews, we get a total of 12 unique words. Below are the 12 unique words:

  1. ‘This’
  2. ‘pasta’
  3. ‘is’
  4. ‘very’
  5. ‘tasty’
  6. ‘and’
  7. ‘affordable’
  8. ‘not’
  9. ‘delicious’
  10. ‘cheap’
  11. ‘tastes’
  12. ‘good’

2. A measure of the presence of known words: Now if we take the first review and count each word, we get the table below, where row 1 corresponds to the index of each unique word and row 2 to the number of times that word occurs in the review (here, review 1):

|  | This | pasta | is | very | tasty | and | affordable | not | delicious | cheap | tastes | good |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Index | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
| Count in review 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |

We will make a sparse vector of length d (the number of unique words), and for each document (review) we will fill it with the number of times the corresponding word occurs in that document.

Review 4: Pasta is tasty and pasta tastes good.

For example, in review 4 "pasta" has a count of 2, whereas in review 1 its count is 1.

After converting the reviews into such vectors, we can compare different sentences by calculating the Euclidean distance between them, so as to check whether two sentences are similar or not. If two sentences share no common words, the distance will be much larger; the more words they share, the smaller it will be.
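As a minimal sketch (assuming NumPy and the hand-filled count vectors for the 12-word vocabulary above), the comparison could look like this:

```python
import numpy as np

# Vocabulary order: This, pasta, is, very, tasty, and, affordable,
#                   not, delicious, cheap, tastes, good
review_1 = np.array([1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0])  # "This pasta is very tasty and affordable."
review_2 = np.array([1, 1, 2, 0, 1, 1, 1, 1, 0, 0, 0, 0])  # "This pasta is not tasty and is affordable."
review_3 = np.array([1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0])  # "This pasta is delicious and cheap."

# Reviews that share more words end up closer together
print(np.linalg.norm(review_1 - review_2))  # ~1.73
print(np.linalg.norm(review_1 - review_3))  # ~2.24 (fewer shared words, larger distance)
```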

BOW doesn’t work very well when there are small changes in the terminology we are using, as in our example, where sentences have similar meanings but use different words (e.g. “tasty” vs. “delicious”).

This results in a vector with lots of zero scores called a sparse vector or sparse representation.

Sparse vectors require more memory and computational resources when modeling and the vast number of positions or dimensions can make the modeling process very challenging for traditional algorithms.

As such, there is pressure to decrease the size of the vocabulary when using a bag-of-words model.

There are simple text cleaning techniques that can be used as a first step, such as the following (a small sketch appears after the list):

· Ignoring case

· Ignoring punctuation

· Ignoring frequent words that don’t contain much information, called stop words, like “a,” “of,” etc.

· Fixing misspelled words.

· Reducing words to their stem (e.g. “play” from “playing”) using stemming algorithms.
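A minimal cleaning sketch, assuming NLTK is available for stop words and stemming (the names and the exact pipeline here are illustrative, not a fixed recipe):

```python
import string

from nltk.corpus import stopwords   # requires: nltk.download('stopwords')
from nltk.stem import PorterStemmer

def clean(text):
    # Ignore case and punctuation
    text = text.lower().translate(str.maketrans('', '', string.punctuation))
    stop_words = set(stopwords.words('english'))
    stemmer = PorterStemmer()
    # Drop stop words and reduce the remaining words to their stems
    return [stemmer.stem(word) for word in text.split() if word not in stop_words]

print(clean("This pasta is very tasty and affordable."))
# e.g. ['pasta', 'tasti', 'afford']
```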

N-grams Model:

A more sophisticated approach is to create a vocabulary of grouped words. This both changes the scope of the vocabulary and allows the bag-of-words model to capture a little more meaning from the document.

In this approach, each word or token is called a “gram”. Creating a vocabulary of two-word pairs is, in turn, called a bigram model. Again, only the bigrams that appear in the corpus are modeled, not all possible bigrams.

An N-gram is an N-token sequence of words: a 2-gram (more commonly called a bigram) is a two-word sequence of words like “please turn”, “turn your”, or “your homework”, and a 3-gram (more commonly called a trigram) is a three-word sequence of words like “please turn your”, or “turn your homework”.
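A quick bigram sketch, assuming scikit-learn's CountVectorizer (any n-gram tokenizer would do):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["please turn your homework"]

# ngram_range=(2, 2) keeps only two-word sequences (bigrams)
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))
bigram_vectorizer.fit(docs)

print(bigram_vectorizer.get_feature_names_out())  # scikit-learn >= 1.0
# ['please turn' 'turn your' 'your homework']
```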

Code for BOW:
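A minimal sketch of a bag-of-words representation for the four reviews above, assuming scikit-learn (one common choice; not necessarily the exact code used in the original post):

```python
from sklearn.feature_extraction.text import CountVectorizer

reviews = [
    "This pasta is very tasty and affordable.",
    "This pasta is not tasty and is affordable.",
    "This pasta is delicious and cheap.",
    "Pasta is tasty and pasta tastes good.",
]

vectorizer = CountVectorizer()             # lowercases and strips punctuation by default
bow = vectorizer.fit_transform(reviews)    # sparse document-term matrix

print(vectorizer.get_feature_names_out())  # the 12 unique words (scikit-learn >= 1.0)
print(bow.toarray())                       # one count vector per review
```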

TF-IDF:

tf–idf or TFIDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. The tf–idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general. tf–idf is one of the most popular term-weighting schemes today; 83% of text-based recommender systems in digital libraries use tf–idf.

This concept includes:

· Counts. Count the number of times each word appears in a document.

· Frequencies. Calculate the frequency that each word appears in a document out of all the words in the document.

Term frequency :

Term frequency (TF) is used in connection with information retrieval and shows how frequently an expression (term, word) occurs in a document. Term frequency indicates the significance of a particular term within the overall document. It is the number of times a word wi occurs in a review rj, divided by the total number of words in review rj.

TF can be interpreted as the probability of finding a given word in a document (review).

Inverse document frequency:

The inverse document frequency is a measure of how much information the word provides, i.e., if it’s common or rare across all documents. It is used to calculate the weight of rare words across all documents in the corpus. The words that occur rarely in the corpus have a high IDF score. It is the logarithmically scaled inverse fraction of the documents that contain the word (obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient):
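In symbols (a sketch consistent with the prose definitions above, using the natural logarithm for IDF):

$$\mathrm{TF}(w_i, r_j) = \frac{\text{number of occurrences of } w_i \text{ in } r_j}{\text{total number of words in } r_j}, \qquad \mathrm{IDF}(w_i) = \ln\left(\frac{N}{n_i}\right)$$

where N is the total number of documents (reviews) in the corpus and n_i is the number of documents that contain w_i.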

Term frequency–Inverse document frequency:

TF–IDF is calculated as the product of the two quantities above:

TF-IDF(term, document) = TF(term, document) × IDF(term)

A high weight in tf–idf is reached by a high term frequency (in the given document) and a low document frequency of the term in the whole collection of documents; the weights hence tend to filter out common terms. Since the ratio inside the IDF's log function is always greater than or equal to 1, the value of IDF (and tf–idf) is greater than or equal to 0. As a term appears in more documents, the ratio inside the logarithm approaches 1, bringing the IDF and tf–idf closer to 0.

TF-IDF gives larger values to less frequent words in the document corpus. The TF-IDF value is high when both IDF and TF values are high, i.e. the word is rare across the whole corpus but frequent within a particular document.

TF-IDF also doesn’t capture the semantic meaning of words.

Let’s take an example to get a clearer understanding.

Sentence 1: The car is driven on the road.

Sentence 2: The truck is driven on the highway.

In this example, each sentence is a separate document.

We will now calculate the TF-IDF for the above two documents, which represent our corpus.
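A worked table for this corpus (assuming lowercased tokens, TF = word count / total words in the sentence, and IDF = ln(total documents / documents containing the word)) looks roughly like this:

| Word | TF (Sentence 1) | TF (Sentence 2) | IDF | TF-IDF (Sentence 1) | TF-IDF (Sentence 2) |
|---|---|---|---|---|---|
| the | 2/7 | 2/7 | ln(2/2) = 0 | 0 | 0 |
| car | 1/7 | 0 | ln(2/1) ≈ 0.69 | ≈ 0.10 | 0 |
| truck | 0 | 1/7 | ≈ 0.69 | 0 | ≈ 0.10 |
| is | 1/7 | 1/7 | 0 | 0 | 0 |
| driven | 1/7 | 1/7 | 0 | 0 | 0 |
| on | 1/7 | 1/7 | 0 | 0 | 0 |
| road | 1/7 | 0 | ≈ 0.69 | ≈ 0.10 | 0 |
| highway | 0 | 1/7 | ≈ 0.69 | 0 | ≈ 0.10 |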

From the above table, we can see that the TF-IDF of the common words is zero, which shows they are not significant. On the other hand, the TF-IDF of “car”, “truck”, “road”, and “highway” is non-zero. These words have more significance.

Reviewing: TF-IDF is the product of the TF and IDF scores of a term.

TF = number of times the term appears in the doc / total number of words in the doc

IDF = ln(number of docs / number of docs the term appears in)

The higher the TF-IDF score, the rarer the term is across the corpus (while still being frequent in the document at hand), and vice-versa.

TFIDF is successfully used by search engines like Google, as a ranking factor for content.

The whole idea is to weigh down the frequent terms while scaling up the rare ones.

Code for TF-IDF:
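A minimal sketch, assuming scikit-learn's TfidfVectorizer (its defaults smooth the IDF and L2-normalize each row, so the exact numbers differ from the hand calculation above, but the pattern is the same: shared words get low weight, distinctive words get high weight):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "The car is driven on the road.",
    "The truck is driven on the highway.",
]

# scikit-learn uses a smoothed IDF and L2-normalizes each document vector by default
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # vocabulary (scikit-learn >= 1.0)
print(tfidf.toarray().round(2))            # one TF-IDF vector per document
```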


Word2Vec :

The Word2Vec model is used for learning vector representations of words called “word embeddings”. This is typically done as a preprocessing step, after which the learned vectors are fed into a discriminative model (typically an RNN) to generate predictions and perform all sorts of interesting things. Unlike BOW and TF-IDF, Word2Vec captures the semantic meaning of words. I will cover Word2Vec in depth in my next blog: https://medium.com/analytics-vidhya/deep-dive-into-word2vec-7fcefa765c17
