A Glance at Natural Language Processing (NLP)

Parth Mistry · Published in The Startup · 4 min read · Aug 19, 2020

What is it?

According to Wikipedia, natural language processing is a subfield of linguistics, computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human languages.

Let’s understand it more easily:

  • Personal assistants like Google Assistant, Siri, Amazon Alexa, and Samsung Bixby use natural language processing to understand us.
  • It is the field of artificial intelligence that gives machines the ability to read, understand, and derive meaning from human language.
  • We use NLP in our daily lives: autocomplete suggests the rest of a word or sentence, spell checkers fix misspellings and typos, Google Search interprets our queries, and much more.
  • Grammarly is a large-scale application of natural language processing that detects spelling mistakes and problems with tense, tone, clarity, and engagement.

Tokenization

As the word suggests, tokenization converts sentences into tokens. It separates the text into individual words and lets us throw away characters that carry no importance for the meaning of the sentence.

Let us take an example:
We have a paragraph consisting of three sentences. First, we will separate the sentences.

I have three visions for India. In 3000 years of our history people from all over the world have come and invaded us, captured our lands, conquered our minds. From Alexander onwards the Greeks, the Turks, the Moguls, the Portuguese, the British, the French, the Dutch, all of them came and looted us, took over what was ours.

We will use a Python library called the Natural Language Toolkit (NLTK). It is a suite of libraries and programs for symbolic and statistical natural language processing of English text, written in Python.
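Before running the examples, NLTK and its data files have to be available. A minimal setup sketch (the download names are the standard NLTK resource identifiers):

# One-time setup: install NLTK, then fetch the data the examples below rely on.
# pip install nltk
import nltk
nltk.download('punkt')      # models for sentence and word tokenization
nltk.download('stopwords')  # English stopword list
nltk.download('wordnet')    # lexical database used by WordNetLemmatizer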

import nltk

paragraph = "I have three visions for India. In 3000 years of our history people from all over the world have come and invaded us, captured our lands, conquered our minds. From Alexander onwards the Greeks, the Turks, the Moguls, the Portuguese, the British, the French, the Dutch, all of them came and looted us, took over what was ours."

# Split the paragraph into a list of sentence strings
sentences = nltk.sent_tokenize(paragraph)

The sent_tokenize method separates the sentences in the paragraph and stores them as three strings in a list called sentences.

sentences = [
 'I have three visions for India.',
 'In 3000 years of our history people from all over the world have come and invaded us, captured our lands, conquered our minds.',
 'From Alexander onwards the Greeks, the Turks, the Moguls, the Portuguese, the British, the French, the Dutch, all of them came and looted us, took over what was ours.'
]
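
Tokenization also works at the word level. As a quick sketch, NLTK's word_tokenize splits a sentence into word and punctuation tokens (the comment shows its output for the first sentence):

import nltk

# Split the first sentence into word-level tokens
words = nltk.word_tokenize(sentences[0])
print(words)
# ['I', 'have', 'three', 'visions', 'for', 'India', '.']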

But this alone is not enough for a computer to get the meaning of the sentences. The text may contain two or more inflected forms of the same word, and treating every form as a distinct token makes processing needlessly expensive.

So to solve that, we will use a technique called stemming.

Stemming

Stemming is a process where a word is reduced to its root form (its stem) by removing suffixes. The intent is to keep the meaning the same, although, as we will see, the stem is not always a real word.
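
To see what this looks like on individual words, here is a minimal sketch using NLTK's PorterStemmer (the stems in the comments are what the Porter algorithm produces):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print(stemmer.stem('visions'))   # vision
print(stemmer.stem('captured'))  # captur
print(stemmer.stem('history'))   # histori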

We will continue with the same example as above. We have already separated the sentences; now we will stem the words.

import nltk
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

sentences = nltk.sent_tokenize(paragraph)
stemmer = PorterStemmer()

for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])
    # Drop English stopwords, then stem each remaining word
    words = [stemmer.stem(word) for word in words
             if word not in set(stopwords.words('english'))]
    sentences[i] = ' '.join(words)

The stopword filter throws away common words such as "the" and "of", and the PorterStemmer removes the suffix of each remaining word where possible.
The sentences after stemming look like this:

sentences = [
'I three vision india .',
'In 3000 year histori peopl world come invad us , captur land , conquer mind .',
'from alexand onward greek , turk , mogul , portugues , british , french , dutch , came loot us , took .',
]

But the problem here is that some words lose their meaning after stemming, for example:

history -> histori
capture -> captur

Stems like these are no longer valid words, which can be a problem for anything that relies on the actual meaning. To overcome that, we will use another technique that is related to stemming but slightly different, and which retains the meaning of the words.

Lemmatization

Lemmatization in linguistics is the process of grouping together the inflected forms of a word so they can be analyzed as a single item, identified by the word’s lemma, or dictionary form.
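
As a small sketch of the difference, NLTK's WordNetLemmatizer returns a dictionary form instead of a chopped stem (it assumes a noun by default; the pos argument asks for a different part of speech):

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('histories'))          # history
print(lemmatizer.lemmatize('captured', pos='v'))  # capture
print(lemmatizer.lemmatize('visions'))            # vision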

The tokenized paragraph will now be lemmatized, and we will see how the result differs from stemming.

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

sentences = nltk.sent_tokenize(paragraph)  # start again from the original sentences
lemmatizer = WordNetLemmatizer()

for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])
    # Drop English stopwords, then lemmatize each remaining word
    words = [lemmatizer.lemmatize(word) for word in words
             if word not in set(stopwords.words('english'))]
    sentences[i] = ' '.join(words)

The WordNetLemmatizer only reduces a word as far as a valid dictionary form, so the word does not lose its meaning, and the result is quite different from stemming. (One quirk: "us" becomes "u" because the lemmatizer treats it as a plural noun by default.)

sentences = [
'I three vision India .',
'In 3000 year history people world come invaded u , captured land , conquered mind .',
'From Alexander onwards Greeks , Turks , Moguls , Portuguese , British , French , Dutch , came looted u , took .'
]

So the problem is now solved. The words that lost their meaning under stemming are kept intact here, and we can understand them just by reading them.
These sentences can now be cleaned further and fed to a neural network.
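
For illustration, one common way to turn the cleaned sentences into numbers that a model can consume is a bag-of-words representation. This is a minimal sketch assuming scikit-learn is installed (it is not used elsewhere in this post):

from sklearn.feature_extraction.text import CountVectorizer

# Build a vocabulary from the lemmatized sentences and
# count how often each word occurs in each sentence
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(sentences)

print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(X.toarray())                         # one count vector per sentence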

Parth Mistry
Just a student who is a machine learning and deep learning enthusiast.