Why is NLP the next big thing?

Ninad Mandavkar
Apr 30, 2024

With the advent of ChatGPT and the emergence of AI, everyone is amazed by the impact it has created on technology and how it has pushed the leading tech giants to incorporate AI into their workforce. While I am no one to comment on whether that is right or wrong, its repercussions on technology are something one should certainly take note of. It is exciting to wonder how such a generative AI tool was built in the first place. The answer lies in a domain of deep learning called Natural Language Processing (NLP). What exactly is NLP and what does it do? Let us try to understand.

1. What is NLP?

NLP, or Natural Language Processing, is a domain of deep learning that deals with text in human language. It converts input text into numbers in order to generate an output. A chatbot is a perfect example of NLP: the machine interprets the input text, translates it into a numeric representation, and generates an output. To put it in technical terms,

“It is the technology that is used by machines to understand, analyse, manipulate, and interpret human languages”.

Other examples of NLP are text translators, voice assistants, spam detection, sentiment analysis, speech recognition, etc.

2. Components of NLP

There are two components of NLP -

  1. Natural Language Understanding (NLU): NLU helps the machine understand and analyse human language by extracting metadata from content, such as concepts, entities, keywords, emotions, relations, and semantic roles. NLU is mainly used in business applications to understand the customer’s problem in both spoken and written language.
  2. Natural Language Generation (NLG): NLG acts as a translator that converts computerized data into a natural language representation. It mainly involves text planning, sentence planning, and text realization.

3. What are the steps to build an NLP pipeline?

3.1 Sentence Segmentation

The first step in building an NLP pipeline is sentence segmentation. In this step we break the entire corpus (paragraph) into individual sentences based on a delimiter. This delimiter can be ‘.’, ‘,’ or anything else depending on the use case. For example, consider the sketch below.
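A minimal sketch of sentence segmentation using NLTK (one of the two libraries mentioned later in this article); the sample paragraph is only an illustration:

```python
# pip install nltk
import nltk
nltk.download("punkt")  # sentence tokenizer data; newer NLTK versions may also need "punkt_tab"
from nltk.tokenize import sent_tokenize

corpus = "NLP is a fascinating field. It powers chatbots and translators. Let us explore it."
print(sent_tokenize(corpus))
# ['NLP is a fascinating field.', 'It powers chatbots and translators.', 'Let us explore it.']
```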

3.2 Word Tokenization

Word tokenization refers to breaking individual sentences into words, or ‘tokens’. For example, consider the sketch below:
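A minimal sketch, again with NLTK (it assumes the "punkt" data downloaded in the previous snippet):

```python
from nltk.tokenize import word_tokenize

sentence = "NLP is a fascinating field."
print(word_tokenize(sentence))
# ['NLP', 'is', 'a', 'fascinating', 'field', '.']
```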

3.3 Stemming & Lemmatization

Stemming and lemmatization both follow the same approach: they normalise the generated tokens/words into root words. However, lemmatization is more advanced and more accurate than stemming, because it correctly extracts the dictionary root word (lemma) rather than simply chopping off suffixes. To understand the concept, consider the sketch below:
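A minimal sketch contrasting the two with NLTK; the example words are arbitrary:

```python
import nltk
nltk.download("wordnet")  # data required by the WordNet lemmatizer (some versions also need "omw-1.4")
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("studies"), stemmer.stem("studying"))
# studi studi  -> stemming just strips suffixes and can produce non-words
print(lemmatizer.lemmatize("studies"), lemmatizer.lemmatize("studying", pos="v"))
# study study  -> lemmatization returns the correct dictionary root
```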

3.4 Filtering the stop words

Stop words are words in the input text that do not contribute much to the meaning of the sentence. Some of the most common stop words are “is”, “and”, “the”, and “a”.

NLP pipelines flag these words as stop words, and they are often filtered out before doing any statistical analysis. For example, consider the sketch below:
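A minimal sketch using NLTK’s built-in English stop word list, applied to one of the movie reviews used later in this article:

```python
import nltk
nltk.download("punkt")       # tokenizer data
nltk.download("stopwords")   # English stop word list
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

tokens = word_tokenize("This movie is very scary and long")
filtered = [t for t in tokens if t.lower() not in stopwords.words("english")]
print(filtered)
# ['movie', 'scary', 'long']
```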

3.5 Dependency parsing

Dependency parsing is used to find how all the words in a sentence are related to each other, as in the sketch below.
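A minimal sketch with spaCy’s small English model; the sentence is only an illustration:

```python
# pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
for token in nlp("The cat sat on the mat"):
    print(token.text, "->", token.dep_, "->", token.head.text)
# e.g. "cat -> nsubj -> sat": 'cat' is the nominal subject of the verb 'sat'
```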

3.6 Part of Speech (POS) tags

POS stands for part of speech, which includes noun, verb, adverb, and adjective. A POS tag indicates how a word functions, both in meaning and grammatically, within a sentence. A word can have one or more parts of speech depending on the context in which it is used. For example, consider the sketch below:
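A minimal sketch with NLTK’s POS tagger, using a sentence in which the same words act as both verbs and nouns (it also assumes the "punkt" data from the earlier snippets):

```python
import nltk
nltk.download("averaged_perceptron_tagger")  # newer NLTK versions may need "averaged_perceptron_tagger_eng"
from nltk import pos_tag, word_tokenize

print(pos_tag(word_tokenize("They refuse to permit us to obtain the refuse permit")))
# 'refuse' and 'permit' are tagged as verbs (VBP/VB) the first time and as nouns (NN) the second time
```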

3.7 Named Entity Recognition (NER)

An entity is a real-world object with a name, such as an organization, a person, or a place. Named Entity Recognition (NER) is the process of detecting named entities such as person names, movie names, organization names, or locations. For example, consider a sentence like “Steve Jobs introduced the iPhone at the Macworld Conference in San Francisco, California.”

In the above sentence, the entities are Steve Jobs, iPhone, Macworld Conference, San Francisco, and California.
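A minimal sketch of NER with spaCy on the same sentence; exact entity labels can vary with the model version, and small models may miss “iPhone”:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Steve Jobs introduced the iPhone at the Macworld Conference in San Francisco, California.")
for ent in doc.ents:
    print(ent.text, "->", ent.label_)
# e.g. Steve Jobs -> PERSON, San Francisco -> GPE, California -> GPE
```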

3.8 Chunking

Chunking is used to collect individual pieces of information (tokens) and group them into larger, meaningful phrases, as in the sketch below.
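A minimal sketch of noun-phrase chunking with NLTK’s RegexpParser; the grammar and sentence are illustrative (it assumes the tokenizer and tagger data downloaded in the earlier snippets):

```python
from nltk import pos_tag, word_tokenize, RegexpParser

# NP chunk = optional determiner, any number of adjectives, then a noun
chunker = RegexpParser("NP: {<DT>?<JJ>*<NN>}")
tagged = pos_tag(word_tokenize("The little yellow dog barked at the cat"))
print(chunker.parse(tagged))
# (S (NP The/DT little/JJ yellow/JJ dog/NN) barked/VBD at/IN (NP the/DT cat/NN))
```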

The two most important libraries used in NLP are NLTK (the Natural Language Toolkit) and spaCy.


4. Bag of Words (BoW) model

The Bag of Words (BoW) model is the simplest form of representing text as numbers. As the name suggests, we represent a sentence as a bag of words vector (a vector of word counts).

Example:

• Review 1: This movie is very scary and long

• Review 2: This movie is not scary and is slow

• Review 3: This movie is spooky and good

We will first build a vocabulary from all the unique words in the above three reviews. The vocabulary consists of these 11 words: ‘This’, ‘movie’, ‘is’, ‘very’, ‘scary’, ‘and’, ‘long’, ‘not’, ‘slow’, ‘spooky’, ‘good’.

We can now take each of these words and mark how many times it occurs in each of the three movie reviews above.

This will give us 3 vectors for 3 reviews:

Vector of Review 1: [1 1 1 1 1 1 1 0 0 0 0]

Vector of Review 2: [1 1 2 0 1 1 0 1 1 0 0]

Vector of Review 3: [1 1 1 0 0 1 0 0 0 1 1]

And that’s the core idea behind a Bag of Words (BoW) model.
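A minimal sketch of this counting in plain Python, using the vocabulary order defined above (scikit-learn’s CountVectorizer does the same job, but orders its vocabulary alphabetically):

```python
vocabulary = ["this", "movie", "is", "very", "scary", "and",
              "long", "not", "slow", "spooky", "good"]
reviews = [
    "This movie is very scary and long",
    "This movie is not scary and is slow",
    "This movie is spooky and good",
]

for review in reviews:
    words = review.lower().split()
    print([words.count(term) for term in vocabulary])
# [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
# [1, 1, 2, 0, 1, 1, 0, 1, 1, 0, 0]
# [1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1]
```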

5. Drawbacks of Bag of Words (BoW) model

In the above example, each review is represented by a vector of length 11. However, we start facing issues when we come across new sentences:

  1. If the new sentences contain new words, the vocabulary size increases and, with it, the length of the vectors.
  2. We retain no information about the grammar of the sentences, the importance of each word, or the ordering of the words in the text.

6. TF-IDF model

To overcome the drawbacks of the BoW model, we make use of the TF-IDF model.

TF-IDF stands for Term Frequency Inverse Document Frequency.

TF-IDF is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.

Term Frequency (TF) is a measure of how frequently a term ‘t’ appears in a document ‘d’:

TF(t, d) = n / (total number of terms in document ‘d’)

Here, in the numerator, n is the number of times the term ‘t’ appears in the document ‘d’.

Thus, each document and term would have its own TF value. We will again use the same vocabulary we had built in the Bag-of-Words model to show how to calculate the TF for Review #2:

Review 2: This movie is not scary and is slow

Here,

• Vocabulary: ‘This’, ‘movie’, ‘is’, ‘very’, ‘scary’, ‘and’, ‘long’, ‘not’, ‘slow’, ‘spooky’, ‘good’

• Number of words in Review 2 = 8

• TF for the word ‘this’ = (number of times ‘this’ appears in review 2)/(number of terms in review 2) = 1/8

Similarly,

• TF(‘movie’) = 1/8

• TF(‘is’) = 2/8 = 1/4

• TF(‘very’) = 0/8 = 0

• TF(‘scary’) = 1/8

• TF(‘and’) = 1/8

• TF(‘long’) = 0/8 = 0

• TF(‘not’) = 1/8

• TF(‘slow’) = 1/8

• TF(‘spooky’) = 0/8 = 0

• TF(‘good’) = 0/8 = 0

We can calculate the term frequencies for all the terms and all the reviews in this manner.

IDF (Inverse Document Frequency) is a measure of how important a term is. We need the IDF value because computing just the TF alone is not sufficient to understand the importance of words:

IDF(t) = log (number of documents / number of documents containing the term ‘t’)

We can calculate the IDF values for all the words in Review 2:

IDF(‘this’) = log (number of documents/number of documents containing the word ‘this’) = log (3/3) = log (1) = 0

Similarly,

• IDF(‘movie’) = log (3/3) = 0

• IDF(‘is’) = log (3/3) = 0

• IDF(‘not’) = log (3/1) = log (3) = 0.48

• IDF(‘scary’) = log (3/2) = 0.18

• IDF(‘and’) = log (3/3) = 0

• IDF(‘slow’) = log (3/1) = 0.48

We can calculate the IDF values for each word like this. Thus, the IDF values for the entire vocabulary would be: 0 for ‘this’, ‘movie’, ‘is’, and ‘and’; 0.18 for ‘scary’; and 0.48 for ‘very’, ‘long’, ‘not’, ‘slow’, ‘spooky’, and ‘good’.

We can now compute the TF-IDF score for each word in the corpus.

It is simply the product of the two: TF-IDF(t, d) = TF(t, d) * IDF(t).

Words with a higher score are more important, and those with a lower score are less important:

We can now calculate the TF-IDF score for every word in Review 2:

TF-IDF (‘this’, Review 2) = TF (‘this’, Review 2) * IDF(‘this’) = 1/8 * 0 = 0

Similarly,

• TF-IDF (‘movie’, Review 2) = 1/8 * 0 = 0

• TF-IDF (‘is’, Review 2) = 1/4 * 0 = 0

• TF-IDF (‘not’, Review 2) = 1/8 * 0.48 = 0.06

• TF-IDF (‘scary’, Review 2) = 1/8 * 0.18 = 0.023

• TF-IDF (‘and’, Review 2) = 1/8 * 0 = 0

• TF-IDF (‘slow’, Review 2) = 1/8 * 0.48 = 0.06

Similarly, we can calculate the TF-IDF scores for all the words with respect to all the reviews, as in the sketch below.
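A minimal sketch that reproduces the whole hand calculation (TF, IDF with log base 10, and their product) for every word and every review:

```python
import math

reviews = {
    "Review 1": "This movie is very scary and long",
    "Review 2": "This movie is not scary and is slow",
    "Review 3": "This movie is spooky and good",
}
vocabulary = ["this", "movie", "is", "very", "scary", "and",
              "long", "not", "slow", "spooky", "good"]
docs = {name: text.lower().split() for name, text in reviews.items()}

# IDF(t) = log10(total documents / documents containing t)
idf = {t: math.log10(len(docs) / sum(t in words for words in docs.values()))
       for t in vocabulary}

# TF-IDF(t, d) = TF(t, d) * IDF(t)
for name, words in docs.items():
    scores = {t: round(words.count(t) / len(words) * idf[t], 3) for t in vocabulary}
    print(name, scores)
# Review 2 gives 0.06 for 'not' and 'slow', about 0.02 for 'scary', and 0 elsewhere,
# matching the hand calculation above up to rounding.
```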

To summarize,

“TF-IDF also gives larger values for less frequent words and is high when both IDF and TF values are high i.e. the word is rare in all the documents combined but frequent in a single document.”

Code

1. Importing the libraries

Let us import the necessary libraries here. TfidfVectorizer (from scikit-learn) is the class we are going to use to extract the TF-IDF representation of the corpus.
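A minimal sketch of the imports, assuming the scikit-learn implementation of TfidfVectorizer:

```python
# pip install scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer
```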

2. Writing the user input

Let us consider three review sentences from different users and try to analyse them using TF-IDF.
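The exact review sentences from the original notebook are not reproduced here; the three hypothetical reviews below are consistent with the feature names and scores printed further down:

```python
sentences = [
    "the movies was good",
    "the movies was boring",
    "first half was good second half was boring",
]
```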

3. Applying TF-IDF Vectorization

Let us now apply TF-IDF vectorization to all three sentences above using the TfidfVectorizer() class. Here we store the vectorizer in the variable ‘vectorizer’ and then fit it on the sentences; the result is stored in a variable called ‘tfidf’.
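A minimal sketch of this step:

```python
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(sentences)  # sparse matrix of shape (number of sentences, number of features)
```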

4. Evaluating the importance of each word in the corpus

Let us first see how many feature names (words) the corpus is broken into:
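A sketch of this step, assuming scikit-learn 1.0 or newer (older versions use get_feature_names() instead):

```python
print(vectorizer.get_feature_names_out())
# ['boring' 'first' 'good' 'half' 'movies' 'second' 'the' 'was']
```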

Now let us see how much importance each of these feature names (words) has received after TF-IDF vectorization.
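A sketch of this step; each row of the printed matrix corresponds to one sentence and each column to one feature name:

```python
print(tfidf.toarray())
# the three rows printed here are interpreted below
```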

Let us interpret the output obtained. Here:

[0. 0. 0.52682017 0. 0.52682017 0. 0.52682017 0.40912286] → 1st sentence

[0.52682017 0. 0. 0. 0.52682017 0. 0.52682017 0.40912286] → 2nd sentence

[0.26006226 0.34195062 0.26006226 0.68390125 0. 0.34195062 0. 0.40392309] → 3rd sentence

Here, the values in each of these three rows are the individual importances of [‘boring’ ‘first’ ‘good’ ‘half’ ‘movies’ ‘second’ ‘the’ ‘was’] in the corresponding sentence.

So if one analyses the above result carefully, one sees that after TF-IDF, the maximum importance is given to words like half (0.68390125), good (0.52682017), movies (0.52682017), boring (0.52682017), and the (0.52682017).

Most of the time this analysis is right; the important thing to note here is that the adjectives ‘good’ and ‘boring’ received a high importance.

Such words contribute a lot to deciding the sentiment of the input, which is what we are going to learn in the next article. I hope you now have an overview of what NLP is, what components it has, what the NLP pipeline looks like, which text-representation models we covered, and how TF-IDF assigns importance to each feature name/word/token.

Check out the source code: 🖥️

Follow me on LinkedIn: 💼
