# NLP: Text Mining Algorithms

## Explaining N-Grams, Bag Of Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF) algorithms and their implementation in Python

This article aims to clearly explain the most widely used text mining algorithms in NLP projects. It covers three algorithms:

- N-Grams
- Bag of Words (BoW)
- Term Frequency-Inverse Document Frequency (TF-IDF).

# 1. N-Grams

N-Grams are an important concept to understand in text analytics. Essentially, an N-Gram is a contiguous sequence of N items that occur next to each other in a text, where N is the number of items (typically words) in the sequence.

When we type *text* into a search engine, we can see the probabilistic model of the search engine start predicting the next set of words based on the context. This is known as the *Autocomplete* feature of search engines.

N-Grams allow us to build this kind of text mining forecasting model: given the preceding words, they help predict the next words of a text.
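To make the idea concrete, here is a minimal sketch of such a predictor built from bigram counts; the tiny corpus and the `predict_next` helper below are purely illustrative:

```python
from collections import Counter, defaultdict

# A tiny illustrative corpus (hypothetical)
corpus = [
    'FinTechExplained is a publication',
    'FinTechExplained is a blog about finance',
    'FinTechExplained is popular',
]

# Count how often each word follows each preceding word (bigram counts)
next_word_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.lower().split()
    for current_word, next_word in zip(words, words[1:]):
        next_word_counts[current_word][next_word] += 1

def predict_next(word):
    """Return the word most often seen after `word` in the corpus."""
    counts = next_word_counts.get(word.lower())
    return counts.most_common(1)[0][0] if counts else None

print(predict_next('is'))  # 'a' follows 'is' twice, 'popular' once
```

Real autocomplete systems use far larger corpora and smoothing, but the principle is the same: count which words follow which, and rank the candidates.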

For instance, if the sentence is “FinTechExplained is a publication”, then:

- 1-Gram would be: FinTechExplained, is, a, publication
- 2-Gram would be: FinTechExplained is, is a, a publication
- 3-Gram would be: FinTechExplained is a, is a publication

In Python, we can implement N-Grams using the NLTK library:

```python
import nltk
from nltk.util import ngrams

# nltk.download('punkt')  # required once for word_tokenize

text = 'FinTechExplained is a publication'
tokens = nltk.word_tokenize(text)

one_grams = list(ngrams(tokens, 1))
two_grams = list(ngrams(tokens, 2))
three_grams = list(ngrams(tokens, 3))
```

# 2. Bag of Words (BoW)

In this section, I will explain a concept that is gaining popularity in NLP projects. It’s known as Bag of Words (BoW).

Essentially, the algorithm revolves around the fact that text needs to be converted into numbers before it can be fed into mathematical algorithms. Once the text is converted to numbers, we can apply various techniques. One of these techniques is to count the occurrence of words in a document.
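As a minimal sketch of that counting technique, Python's built-in `Counter` can tally word occurrences in a single document (the sample sentence below is illustrative):

```python
from collections import Counter

document = 'FinTechExplained is a publication and FinTechExplained is popular'

# Lowercase and split on whitespace, then count each word's occurrences
word_counts = Counter(document.lower().split())

print(word_counts['fintechexplained'])  # 2
print(word_counts['is'])                # 2
```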

BoW is all about creating a matrix of words in which the words (terms) are the rows and the documents are the columns. We can then populate the matrix with the frequency of each term within each document, ignoring the grammar and order of the terms.

The matrix is referred to as the Term Document Matrix (TDM).

Each row is a word vector. Each column is a document vector.

For instance, assume you extract tweets from Twitter and statuses from Facebook that contain the word “NLP”. You can then tokenise the sentences into words and populate the TDM, where the columns will be Facebook and Twitter and the rows will be the terms (the words of the text). The matrix is then populated with the frequency of each term within each document:

We can achieve this using the scikit-learn library in Python:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# get_tweets() and get_fb_statuses() are assumed to each return a
# one-element list of strings (all text from that source as one document)
data = {'twitter': get_tweets(),
        'facebook': get_fb_statuses()}

vectoriser = CountVectorizer()
vec = vectoriser.fit_transform(data['twitter'] + data['facebook'])

# Transpose so that terms are rows and documents are columns
df = pd.DataFrame(vec.toarray().transpose(),
                  index=vectoriser.get_feature_names_out())
df.columns = ['twitter', 'facebook']
```

# 3. Term Frequency-Inverse Document Frequency (TF-IDF)

In NLP projects, we are often required to determine the importance of each word. TF-IDF is a statistical measure that helps us understand the relevance of a term (word) in a document relative to a collection of documents.

For each term in each document, a matrix entry is computed by performing the following 3 steps:

- Calculate the frequency of a term in a document. This is achieved by dividing the number of times the term appears in the document by the total number of terms in the document. This is known as Term Frequency (TF).
- Calculate the inverse document frequency of the term. This is computed by dividing the total number of documents by the number of documents that contain the term, and then taking the logarithm of that ratio. Because the ratio is always at least 1, the logarithm is non-negative. This is known as Inverse Document Frequency (IDF).
- Finally, multiply the result of step 1 by the result of step 2. This is known as TF-IDF.

Rows of the matrix represent the terms and the columns of the matrix are the document names.

To understand it better, let’s assume there are 100 documents. 4 documents contain the term “FinTechExplained”.

*The term is mentioned once in the first and second document, twice in the third document and thrice in the fourth document.*

Also let’s consider that there are 100 words in each document.

1. Term frequency for each document:

- Document 1: 1/100 = 0.01
- Document 2: 1/100 = 0.01
- Document 3: 2/100 = 0.02
- Document 4: 3/100 = 0.03

2. The ratio of total documents to documents containing the term is 100/4 = 25.

3. IDF = log10(25) = 1.398

4. Finally, the TF-IDF of the term “FinTechExplained” in document 1 is 0.01 × 1.398 = 0.01398
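The arithmetic above can be checked with a few lines of Python (using log base 10, as in the example):

```python
import math

total_docs = 100
docs_with_term = 4          # documents containing "FinTechExplained"
words_per_doc = 100
term_counts = [1, 1, 2, 3]  # occurrences in documents 1-4

# IDF = log10(total documents / documents containing the term)
idf = math.log10(total_docs / docs_with_term)   # log10(25) = 1.398

for doc, count in enumerate(term_counts, start=1):
    tf = count / words_per_doc                  # Term Frequency
    print(f'Document {doc}: TF={tf:.2f}, TF-IDF={tf * idf:.5f}')
```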

We can implement it in Python using the scikit-learn library:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# get_tweets() and get_fb_statuses() are assumed to each return a
# one-element list of strings (all text from that source as one document)
data = {'twitter': get_tweets(),
        'facebook': get_fb_statuses()}

vectoriser = TfidfVectorizer()
vec = vectoriser.fit_transform(data['twitter'] + data['facebook'])

# Transpose so that terms are rows and documents are columns
df = pd.DataFrame(vec.toarray().transpose(),
                  index=vectoriser.get_feature_names_out())
df.columns = ['twitter', 'facebook']
```

# Summary

This article explained three of the most widely used text mining algorithms in NLP projects — N-Grams, Bag of Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF) — along with their implementation in Python.