NLP: Text Mining Algorithms

Explaining N-Grams, Bag Of Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF) algorithms and their implementation in Python

Farhad Malik
Jun 28 · 4 min read

This article aims to clearly explain the most widely used text mining algorithms in NLP projects. It covers three algorithms:

  1. N-Grams
  2. Bag of Words (BoW)
  3. Term Frequency-Inverse Document Frequency (TF-IDF)

1. N-Grams

N-Grams are an important concept to understand in text analytics. Essentially, an N-Gram is a contiguous sequence of N items (typically words) that occur next to each other in a text, where N is the number of items in the sequence.

When we type text into a search engine, its probabilistic model starts predicting the next set of words based on the context. This is known as the autocomplete feature of search engines.

N-Grams allow us to build this kind of predictive text model.

For instance, if the sentence is “FinTechExplained is a publication”, then:

  • 1-Gram would be: FinTechExplained, is, a, publication
  • 2-Gram would be: FinTechExplained is, is a, a publication
  • 3-Gram would be: FinTechExplained is a, is a publication

In Python, we can implement N-Grams using the NLTK library:

import nltk
from nltk.util import ngrams
from collections import Counter

text = 'FinTechExplained is a publication'
tokens = nltk.word_tokenize(text)
one_grams = ngrams(tokens, 1)
two_grams = ngrams(tokens, 2)
three_grams = ngrams(tokens, 3)
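The imported Counter can then tally how often each N-Gram occurs — the basis of the prediction model mentioned above. A minimal self-contained sketch (using a plain whitespace split instead of NLTK's tokenizer, and a made-up example sentence):

```python
from collections import Counter

def build_ngrams(tokens, n):
    # Slide a window of size n over the token list
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = 'FinTechExplained is a publication and FinTechExplained is free'.split()
bigrams = build_ngrams(tokens, 2)
counts = Counter(bigrams)
print(counts[('FinTechExplained', 'is')])  # this bigram occurs twice
```

A real autocomplete model would use such counts to estimate which word most often follows a given prefix.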

2. Bag of Words (BoW)

In this section, I will explain a concept that is gaining popularity in NLP projects. It’s known as Bag of Words (BoW).

Essentially, the algorithm revolves around the fact that text needs to be converted into numbers before it can be fed into mathematical algorithms. Once we convert the text to numbers, we can apply various techniques. One such technique is to count the occurrence of words in a document.

BoW is all about creating a matrix of words where the words (terms) are represented as the rows and the columns represent the documents. We then populate the matrix with the frequency of each term within each document, ignoring grammar and word order.

The matrix is referred to as the Term Document Matrix (TDM).

For instance, assume you extract tweets from Twitter and statuses from Facebook that contain the word “NLP”. You can then tokenise the sentences into words and populate a TDM where the columns are Facebook and Twitter, and the rows are the terms (words of the text). The matrix is then populated with the frequency of each term within each document.

We can achieve this using the scikit-learn library in Python:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# get_tweets() and get_fb_statuses() are each assumed to return a
# one-element list containing the combined text of that source
data = {'twitter': get_tweets(),
        'facebook': get_fb_statuses()}
vectoriser = CountVectorizer()
vec = vectoriser.fit_transform(data['twitter'] + data['facebook'])
df = pd.DataFrame(vec.toarray().transpose(),
                  index=vectoriser.get_feature_names())
df.columns = ['twitter', 'facebook']

3. Term Frequency-Inverse Document Frequency (TF-IDF)

In NLP projects, we are often required to determine the importance of each word. TF-IDF is a great statistical measure for this. It helps us understand the relevance of a term (word) to a document.

For each term in each document, a score is computed by performing the following three steps:

  1. Calculate the frequency of the term in the document. This is known as Term Frequency (TF). It is computed by dividing the number of times the term appears in the document by the total number of terms in the document.
  2. Calculate the inverse document frequency of the term: divide the total number of documents by the number of documents that contain the term, then take the log of that ratio. The inverse is taken so that the log yields a positive value. This is known as Inverse Document Frequency (IDF).
  3. Finally, multiply the result of step 1 by the result of step 2. This is known as TF-IDF.

Rows of the matrix represent the terms and the columns of the matrix are the document names.
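The three steps above can be sketched as a small from-scratch function. This is a simplified illustration using a base-10 log and a toy corpus of made-up token lists; scikit-learn's implementation differs in its smoothing and normalisation:

```python
import math

def tf_idf(term, document, corpus):
    # Step 1: term frequency within the document
    tf = document.count(term) / len(document)
    # Step 2: inverse document frequency across the corpus (base-10 log)
    docs_with_term = sum(1 for doc in corpus if term in doc)
    idf = math.log10(len(corpus) / docs_with_term)
    # Step 3: multiply TF by IDF
    return tf * idf

corpus = [['nlp', 'is', 'fun'],
          ['nlp', 'projects'],
          ['text', 'mining']]
score = tf_idf('nlp', corpus[0], corpus)
```

Note that a term appearing in every document gets an IDF of log(1) = 0, so its TF-IDF is zero everywhere — which is exactly the intuition that ubiquitous words carry little relevance.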

To understand it better, let’s assume there are 100 documents. 4 documents contain the term “FinTechExplained”.

The term is mentioned once in the first and second document, twice in the third document and thrice in the fourth document.

Also let’s consider that there are 100 words in each document.

  1. The term frequency for each document is:
  • Document 1: 1/100 = 0.01
  • Document 2: 1/100 = 0.01
  • Document 3: 2/100 = 0.02
  • Document 4: 3/100 = 0.03
  2. The ratio of total documents to documents containing the term is: 100/4 = 25
  3. IDF = log(25) = 1.398
  4. Finally, the TF-IDF of the term “FinTechExplained” in document 1 is 1.398 × 0.01 = 0.01398
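As a sanity check, the numbers in the worked example can be reproduced with a few lines of Python (base-10 log assumed, as in the example):

```python
import math

total_docs = 100
docs_with_term = 4
term_counts = [1, 1, 2, 3]  # occurrences of "FinTechExplained" per document
words_per_doc = 100

idf = math.log10(total_docs / docs_with_term)         # log10(25) ≈ 1.398
tf_scores = [c / words_per_doc for c in term_counts]  # 0.01, 0.01, 0.02, 0.03
tfidf_scores = [tf * idf for tf in tf_scores]
print(round(tfidf_scores[0], 5))  # ≈ 0.01398 for document 1
```

Document 4 ends up with the highest score, since the term appears there most often while the IDF factor is the same for all four documents.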

We can implement it in Python using the scikit-learn library:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# get_tweets() and get_fb_statuses() are each assumed to return a
# one-element list containing the combined text of that source
data = {'twitter': get_tweets(),
        'facebook': get_fb_statuses()}
vectoriser = TfidfVectorizer()
vec = vectoriser.fit_transform(data['twitter'] + data['facebook'])
df = pd.DataFrame(vec.toarray().transpose(),
                  index=vectoriser.get_feature_names())
df.columns = ['twitter', 'facebook']

Summary

This article explained the most widely used text mining algorithms in NLP projects: N-Grams, Bag of Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF), along with their implementation in Python.

FinTechExplained

This blog aims to bridge the gap between technologists, mathematicians and financial experts and helps them understand how fundamental concepts work within each field.

Farhad Malik

Written by

My personal blog, aiming to explain complex mathematical, financial and technological concepts in simple terms. Contact: FarhadMalik84@googlemail.com
