Natural Language Processing

Shruthi Gurudath
Published in Analytics Vidhya · Aug 12, 2020

Weren't we all surprised the first time a smart device understood what we were telling it? And it even answered in the friendliest manner, didn't it? Apple's Siri and Amazon's Alexa understand us when we ask about the weather, for directions, or to play a certain genre of music. Ever since then I have wondered how these computers grasp our language. That long-overdue curiosity was rekindled, and I thought I would write a blog about it as a newbie.

In this article, I will be using a popular NLP library called NLTK. The Natural Language Toolkit (NLTK) is one of the most powerful and probably the most popular natural language processing libraries. Not only does it offer one of the most comprehensive toolkits for Python-based NLP programming, it also supports a large number of different human languages.
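Before following along, NLTK needs to be installed and a few of its data packages downloaded. Here is a minimal setup sketch; the exact set of downloads you need depends on which of the steps below you run:

import nltk  # install first, e.g.:  pip install nltk

# Download the resources used in the examples below (each is a one-time step).
for resource in ["punkt",                       # sentence/word tokenizer models
                 "stopwords",                   # stop word lists
                 "wordnet",                     # lexical database for lemmatization/synonyms
                 "averaged_perceptron_tagger",  # POS tagger model
                 "universal_tagset"]:           # universal POS tag mappings
    nltk.download(resource)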

What is Natural Language Processing?

Natural language processing (NLP) is a subfield of linguistics, computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human languages, in particular how to train computers to process and analyze large amounts of natural language data.

Why is handling unstructured data so important?

With every tick of the clock, the world generates an overwhelming amount of data, and the majority of it is unstructured. Formats such as text, audio, video, and images are classic examples of unstructured data. Unstructured data has no fixed dimensions or schema like the traditional row-and-column structure of relational databases, so it is harder to analyze and not easily searchable. That said, business organizations still need to find ways of addressing these challenges and embracing the opportunities to derive insights if they want to prosper in highly competitive environments. With the help of natural language processing and machine learning, this is changing fast.

Are computers confused by our natural language?

Human language is one of our most powerful tools of communication. The words, the tone, the sentences, and the gestures we use all carry information. There are countless ways of assembling words into a phrase, and words can have many shades of meaning, so comprehending human language with its intended meaning is a challenge. A linguistic paradox is a phrase or sentence that contradicts itself, for example, “oh, this is my open secret” or “can you please act naturally”. Though these sound pointedly foolish, we humans understand and use them in everyday speech; for machines, the ambiguity and imprecision of natural language are hurdles that are hard to sail past.

Most used NLP Libraries

In the past, only pioneers with superior knowledge of mathematics, machine learning, and linguistics could be part of NLP projects. Now developers can use ready-made libraries that simplify the pre-processing of text so that they can concentrate on building machine learning models. These libraries enable text comprehension, interpretation, and sentiment analysis with only a few lines of code. The most popular NLP libraries are:

Spark NLP, NLTK, PyTorch-Transformers, TextBlob, spaCy, Stanford CoreNLP, Apache OpenNLP, AllenNLP, Gensim, NLP Architect, scikit-learn.

The question is: where should we start, and how?

Have you ever observed how kids start to understand and learn a language? By picking up each word first and then forming sentences, right? Making computers understand our language is more or less similar.

Pre-processing Steps :

  1. Sentence Tokenization
  2. Word Tokenization
  3. Text Lemmatization and Stemming
  4. Stop Words
  5. POS Tagging
  6. Chunking
  7. Wordnet
  8. Bag-of-Words
  9. TF-IDF
1. Sentence Tokenization (Sentence Segmentation)
To make computers understand natural language, the first step is to break paragraphs into sentences. Punctuation marks are the easiest cue for splitting sentences apart.
import nltk
nltk.download('punkt')

text = "Home Farm is one of the biggest junior football clubs in Ireland and their senior team, from 1970 up to the late 1990s, played in the League of Ireland. However, the link between Home Farm and the senior team was severed in the late 1990s. The senior side was briefly known as Home Farm Fingal in an effort to identify it with the north Dublin area."

sentences = nltk.sent_tokenize(text)
print("The number of sentences in the paragraph:", len(sentences))
for sentence in sentences:
    print(sentence)
OUTPUT:
The number of sentences in the paragraph: 3
Home Farm is one of the biggest junior football clubs in Ireland and their senior team, from 1970 up to the late 1990s, played in the League of Ireland.
However, the link between Home Farm and the senior team was severed in the late 1990s.
The senior side was briefly known as Home Farm Fingal in an effort to identify it with the north Dublin area.

2. Word Tokenization (Word Segmentation)
Now that we have separated the sentences, the next step is to break them into words, which are often called tokens.

Just as creating some space in one's own life helps for the good, the space between words helps in breaking a phrase apart. We can treat punctuation marks as separate tokens as well, since punctuation has a purpose too.

for sentence in sentences:
    words = nltk.word_tokenize(sentence)
    print("The number of words in a sentence:", len(words))
    print(words)
OUTPUT:
The number of words in a sentence: 32
['Home', 'Farm', 'is', 'one', 'of', 'the', 'biggest', 'junior', 'football', 'clubs', 'in', 'Ireland', 'and', 'their', 'senior', 'team', ',', 'from', '1970', 'up', 'to', 'the', 'late', '1990s', ',', 'played', 'in', 'the', 'League', 'of', 'Ireland', '.']
The number of words in a sentence: 18
['However', ',', 'the', 'link', 'between', 'Home', 'Farm', 'and', 'the', 'senior', 'team', 'was', 'severed', 'in', 'the', 'late', '1990s', '.']
The number of words in a sentence: 22
['The', 'senior', 'side', 'was', 'briefly', 'known', 'as', 'Home', 'Farm', 'Fingal', 'in', 'an', 'effort', 'to', 'identify', 'it', 'with', 'the', 'north', 'Dublin', 'area', '.']

As a prerequisite for using the word_tokenize() or sent_tokenize() functions, the punkt package must be downloaded.

3. Stemming and Text Lemmatization

In every text document, we usually come across different forms of a word, like write, writes, and writing, which share a similar meaning and the same base word. But how do we make a computer analyze such words as one?
That's where text lemmatization and stemming come into the picture.

Stemming and lemmatization are normalization techniques built around the same idea of chopping a word down to its core. While both aim to solve the same problem, they go about it in entirely different ways: stemming is often a crude heuristic process of cutting off word endings, whereas lemmatization uses a vocabulary and morphological analysis to return the dictionary base form of a word. Let's take a closer look!

Stemming: words are reduced to their stem. A word stem need not be the same as the dictionary-based morphological root; it is just an equal or shorter form of the word.

from nltk.stem import PorterStemmer

# create an object of class PorterStemmer
porter = PorterStemmer()

# A list of words to be stemmed
word_list = ['running', ',', 'driving', 'sung', 'between', 'lasted', 'was', 'paticipated', 'before', 'severed', '1990s', '.']

print("{0:20}{1:20}".format("Word", "Porter Stemmer"))
for word in word_list:
    print("{0:20}{1:20}".format(word, porter.stem(word)))
OUTPUT:
Word Porter Stemmer
running run
, ,
driving drive
sung sung
between between
lasted last
was wa
paticipated paticip
before befor
severed sever
1990s 1990
. .

Stemming is not as easy as it looks :(
We might run into two issues: under-stemming and over-stemming of a word.
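A minimal sketch of both failure modes with the Porter stemmer (the word choices here are my own illustrative examples, not from the original text):

from nltk.stem import PorterStemmer

porter = PorterStemmer()

# Over-stemming: words with different meanings collapse to the same stem.
print([porter.stem(w) for w in ["universal", "university", "universe"]])
# -> ['univers', 'univers', 'univers']

# Under-stemming: related words end up with different stems.
print([porter.stem(w) for w in ["alumnus", "alumni"]])
# -> ['alumnu', 'alumni']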

Lemmatization: while stemming is a best-estimate method that snips a word based on how it appears, lemmatization is a more planned way of pruning the word. It resolves words against a dictionary; indeed, a word's lemma is its dictionary or canonical form.

nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

wordnet_lemmatizer = WordNetLemmatizer()

# A list of words to lemmatize
word_list = ['running', ',', 'drives', 'sung', 'between', 'lasted', 'was', 'paticipated', 'before', 'severed', '1990s', '.']

print("{0:20}{1:20}".format("Word", "Lemma"))
for word in word_list:
    print("{0:20}{1:20}".format(word, wordnet_lemmatizer.lemmatize(word)))
OUTPUT:
Word Lemma
running running
, ,
drives drive
sung sung
between between
lasted lasted
was wa
paticipated paticipated
before before
severed severed
1990s 1990s
. .

If speed is what you need, stemming is the better option; when accuracy matters, it's better to use lemmatization.
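Note that by default WordNetLemmatizer treats every word as a noun, which is why 'lasted' came back unchanged and 'was' came back as 'wa' in the output above. A small sketch of how passing the part of speech improves the result:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# The default pos is 'n' (noun); pass pos='v' to lemmatize verbs properly.
print(lemmatizer.lemmatize("was"))              # -> 'wa'   (treated as a noun)
print(lemmatizer.lemmatize("was", pos="v"))     # -> 'be'
print(lemmatizer.lemmatize("lasted", pos="v"))  # -> 'last'
print(lemmatizer.lemmatize("running", pos="v")) # -> 'run'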

4. Stop Words
Words like ‘in’, ‘at’, ‘on’, ‘so’, etc. are considered stop words. Stop words don't carry much meaning on their own, but removing them plays an important role in tasks such as sentiment analysis.

NLTK comes with stop word lists for 16 different languages.

nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))
print("The stop words in NLTK lib are:", stop_words)

para = """Home Farm is one of the biggest junior football clubs in Ireland and their senior team, from 1970 up to the late 1990s, played in the League of Ireland. However, the link between Home Farm and the senior team was severed in the late 1990s. The senior side was briefly known as Home Farm Fingal in an effort to identify it with the north Dublin area."""

tokenized_para = word_tokenize(para)
modified_token_list = [word for word in tokenized_para if word not in stop_words]
print("After removing the stop words in the sentence:")
print(modified_token_list)
OUTPUT:
The stop words in NLTK lib are:
{'about', 'ma', "shouldn't", 's', 'does', 't', 'our', 'mightn', 'doing', 'while', 'ourselves', 'themselves', 'will', 'some', 'you', "aren't", 'by', "needn't", 'in', 'can', 'he', 'into', 'as', 'being', 'between', 'very', 'after', 'couldn', 'himself', 'herself', 'had', 'its', 've', 'him', 'll', "isn't", 'through', 'should', 'was', 'now', 'them', "you'll", 'again', 'who', 'don', 'been', 'they', 'weren', "you're", 'both', 'd', 'me', 'didn', "won't", "you'd", 'only', 'itself', 'hadn', "should've", 'than', 'how', 'few', 're', 'down', 'these', 'y', "haven't", "mightn't", 'won', "hadn't", 'other', 'above', 'all', "doesn't", 'isn', "that'll", 'not', 'yourselves', 'at', 'mustn', "it's", 'on', 'the', 'for', "didn't", 'what', "mustn't", 'his', 'haven', 'doesn', "you've", 'are', 'out', 'hers', 'with', 'has', 'she', 'most', 'ain', 'those', 'when', 'myself', 'before', 'their', 'during', 'there', 'or', 'until', 'that', 'more', "hasn't", 'o', 'we', 'and', "shan't", 'which', 'because', "don't", 'why', 'shan', 'an', 'my', 'if', 'did', 'having', "couldn't", 'your', 'theirs', 'aren', 'just', 'further', 'here', 'of', "wouldn't", 'be', 'too', 'her', 'no', 'same', 'it', 'is', 'were', 'yourself', 'have', 'off', 'this', 'needn', 'once', "wasn't", 'against', 'wouldn', 'up', 'a', 'i', 'below', "weren't", 'over', 'own', 'then', 'so', 'do', 'from', 'shouldn', 'am', 'under', 'any', 'yours', 'ours', 'hasn', 'such', 'nor', 'wasn', 'to', 'where', 'm', "she's", 'each', 'whom', 'but'}
After removing the stop words in the sentence:
['Home', 'Farm', 'one', 'biggest', 'junior', 'football', 'clubs', 'Ireland', 'senior', 'team', ',', '1970', 'late', '1990s', ',', 'played', 'League', 'Ireland', '.', 'However', ',', 'link', 'Home', 'Farm', 'senior', 'team', 'severed', 'late', '1990s', '.', 'The', 'senior', 'side', 'briefly', 'known', 'Home', 'Farm', 'Fingal', 'effort', 'identify', 'north', 'Dublin', 'area', '.']

5. POS Tagging
Walking down memory lane to our early English grammar classes, we can all remember how our teachers taught us the basic parts of speech for effective communication. Good old days!
Let's teach parts of speech to our computers too. :)

The eight parts of speech are nouns, verbs, pronouns, adjectives, adverbs, prepositions, conjunctions, and interjections.

POS tagging is the process of identifying and assigning parts of speech to the words in a sentence. There are different tag sets available; we will be using the universal tagset.

nltk.download('averaged_perceptron_tagger')
nltk.download('universal_tagset')

# Tag the tokens of each sentence with the universal tagset
pos_tags = [nltk.pos_tag(nltk.word_tokenize(sentence), tagset="universal") for sentence in sentences]
print(pos_tags)
[[('Home', 'NOUN'), ('Farm', 'NOUN'), ('is', 'VERB'), ('one', 'NUM'), ('of', 'ADP'), ('the', 'DET'), ('biggest', 'ADJ'), ('junior', 'NOUN'), ('football', 'NOUN'), ('clubs', 'NOUN'), ('in', 'ADP'), ('Ireland', 'NOUN'), ('and', 'CONJ'), ('their', 'PRON'), ('senior', 'ADJ'), ('team', 'NOUN'), (',', '.'), ('from', 'ADP'), ('1970', 'NUM'), ('up', 'ADP'), ('to', 'PRT'), ('the', 'DET'), ('late', 'ADJ'), ('1990s', 'NUM'), (',', '.'), ('played', 'VERB'), ('in', 'ADP'), ('the', 'DET'), ('League', 'NOUN'), ('of', 'ADP'), ('Ireland', 'NOUN'), ('.', '.')]

One application of POS tagging is analyzing the qualities of a product from feedback: by picking out the adjectives in customers' reviews, we can gauge the sentiment of the feedback. For example, "How was your shopping with us?"
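A minimal sketch of that idea, using a made-up review string (the review text and variable names are illustrative, not from the article):

import nltk

review = "The delivery was quick and the fabric feels soft, but the packaging was terrible."

tokens = nltk.word_tokenize(review)
tagged = nltk.pos_tag(tokens, tagset="universal")

# Keep only the adjectives; these often carry the sentiment of the review.
adjectives = [word for word, tag in tagged if tag == "ADJ"]
print(adjectives)  # e.g. ['quick', 'soft', 'terrible']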

6. Chunking
Chunking adds more structure to the sentence on top of the part-of-speech (POS) tags; it is also known as shallow parsing. The resulting word groups are called "chunks." There are no predefined rules for chunking; you define the grammar patterns yourself.

Phrase structure conventions:

  • S(Sentence) → NP VP.
  • NP → {Determiner, Noun, Pronoun, Proper name}.
  • VP → V (NP)(PP)(Adverb).
  • PP → Preposition (NP).
  • AP → Adjective (PP).

I have never had a good time with complex regular expressions; I used to stay as far away from them as I could, but of late I have realized how important it is to have a grip on them in data science. Let's start by understanding a simple instance.

Suppose we need to tag nouns, verbs (past tense), adjectives, and coordinating conjunctions from the sentence. You can use the rule below:

chunk:{<NN.?>*<VBD.?>*<JJ.?>*<CC>?}

import nltk
from nltk import pos_tag, RegexpParser
from nltk.tokenize import word_tokenize

content = "Home Farm is one of the biggest junior football clubs in Ireland and their senior team, from 1970 up to the late 1990s, played in the League of Ireland. However, the link between Home Farm and the senior team was severed in the late 1990s. The senior side was briefly known as Home Farm Fingal in an effort to identify it with the north Dublin area."

tokenized_text = word_tokenize(content)
print("After Split:", tokenized_text)

tokens_tag = pos_tag(tokenized_text)
print("After Token:", tokens_tag)

patterns = """mychunk:{<NN.?>*<VBD.?>*<JJ.?>*<CC>?}"""
chunker = RegexpParser(patterns)
print("After Regex:", chunker)

output = chunker.parse(tokens_tag)
print("After Chunking:", output)
OUTPUT:
After Regex: chunk.RegexpParser with 1 stages: RegexpChunkParser with 1 rules: <ChunkRule: '<NN.?>*<VBD.?>*<JJ.?>*<CC>?'>
After Chunking
(S (mychunk Home/NN Farm/NN) is/VBZ one/CD of/IN the/DT
(mychunk biggest/JJS)
(mychunk junior/NN football/NN clubs/NNS) in/IN
(mychunk Ireland/NNP and/CC) their/PRP$
(mychunk senior/JJ)
(mychunk team/NN) ,/, from/IN 1970/CD up/IN to/TO the/DT (mychunk late/JJ) 1990s/CD ,/, played/VBN in/IN the/DT (mychunk League/NNP) of/IN (mychunk Ireland/NNP) ./.)
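The parse result is an nltk.Tree, so rather than reading the whole printed tree you can pull out just the chunks. A small follow-up sketch:

# 'output' is the Tree returned by chunker.parse(tokens_tag) above.
for subtree in output.subtrees(filter=lambda t: t.label() == "mychunk"):
    print(subtree.leaves())  # e.g. [('Home', 'NN'), ('Farm', 'NN')]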

7. Wordnet

WordNet is an NLTK corpus reader, a lexical database for English. It can be used to look up the synonyms and antonyms of a word.

from nltk.corpus import wordnet

synonyms = []
antonyms = []

for syn in wordnet.synsets("active"):
    for lemmas in syn.lemmas():
        synonyms.append(lemmas.name())

for syn in wordnet.synsets("active"):
    for lemmas in syn.lemmas():
        if lemmas.antonyms():
            antonyms.append(lemmas.antonyms()[0].name())

print("Synonyms are:", synonyms)
print("Antonyms are:", antonyms)
OUTPUT:
Synonyms are: ['active_agent', 'active', 'active_voice', 'active', 'active', 'active', 'active', 'combat-ready', 'fighting', 'active', 'active', 'participating', 'active', 'active', 'active', 'active', 'alive', 'active', 'active', 'active', 'dynamic', 'active', 'active', 'active']
Antonyms are: ['passive_voice', 'inactive', 'passive', 'inactive', 'inactive', 'inactive', 'quiet', 'passive', 'stative', 'extinct', 'dormant', 'inactive']
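As a small aside (my own example, not from the article), WordNet can also score how semantically related two words are via their synsets:

from nltk.corpus import wordnet

ship = wordnet.synset("ship.n.01")
boat = wordnet.synset("boat.n.01")
car = wordnet.synset("car.n.01")

# Wu-Palmer similarity: between 0 and 1, higher means more closely related in the WordNet hierarchy.
print(ship.wup_similarity(boat))  # relatively high
print(ship.wup_similarity(car))   # lower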

8. Bag of Words
A bag-of-words model splits raw text into words and counts the frequency of each word in the text.

import nltk
import re  # to match regular expressions
import numpy as np

text = "Home Farm is one of the biggest junior football clubs in Ireland and their senior team, from 1970 up to the late 1990s, played in the League of Ireland. However, the link between Home Farm and the senior team was severed in the late 1990s. The senior side was briefly known as Home Farm Fingal in an effort to identify it with the north Dublin area."

sentences = nltk.sent_tokenize(text)

# Lowercase each sentence and strip punctuation and extra whitespace
for i in range(len(sentences)):
    sentences[i] = sentences[i].lower()
    sentences[i] = re.sub(r'\W', ' ', sentences[i])
    sentences[i] = re.sub(r'\s+', ' ', sentences[i])

# Count how often each word occurs
bag_of_words = {}
for sentence in sentences:
    words = nltk.word_tokenize(sentence)
    for word in words:
        if word not in bag_of_words.keys():
            bag_of_words[word] = 1
        else:
            bag_of_words[word] += 1

print(bag_of_words)
OUTPUT:
{'home': 3, 'farm': 3, 'is': 1, 'one': 1, 'of': 2, 'the': 8, 'biggest': 1, 'junior': 1, 'football': 1, 'clubs': 1, 'in': 4, 'ireland': 2, 'and': 2, 'their': 1, 'senior': 3, 'team': 2, 'from': 1, '1970': 1, 'up': 1, 'to': 2, 'late': 2, '1990s': 2, 'played': 1, 'league': 1, 'however': 1, 'link': 1, 'between': 1, 'was': 2, 'severed': 1, 'side': 1, 'briefly': 1, 'known': 1, 'as': 1, 'fingal': 1, 'an': 1, 'effort': 1, 'identify': 1, 'it': 1, 'with': 1, 'north': 1, 'dublin': 1, 'area': 1}

9. TF-IDF

TF-IDF stands for Term Frequency — Inverse document frequency.

Text data needs to be converted into a numerical format in which each word (or document) is represented as a vector. In the simplest one-hot encoding, the element corresponding to a given word is set to one and all other elements are zero; TF-IDF refines this idea by weighting each term by how frequent and how distinctive it is, making it one of the most common text-vectorization techniques.

TF-IDF works on two concepts:

TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)

IDF(t) = log_e(Total number of documents / Number of documents with term t in it)
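A tiny worked example of those two formulas (the numbers are made up for illustration): suppose the term "farm" appears 3 times in a 100-word document, and appears in 2 out of 3 documents in the collection.

import math

tf = 3 / 100           # term frequency within the document
idf = math.log(3 / 2)  # inverse document frequency across the collection (natural log)
print(tf * idf)        # TF-IDF weight, roughly 0.0122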

from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

docs = ["Home Farm is one of the biggest junior football clubs in Ireland and their senior team, from 1970 up to the late 1990s, played in the League of Ireland",
        "However, the link between Home Farm and the senior team was severed in the late 1990s",
        " The senior side was briefly known as Home Farm Fingal in an effort to identify it with the north Dublin area"]

# instantiate CountVectorizer()
cv = CountVectorizer()

# this step generates word counts for the words in your docs
word_count_vector = cv.fit_transform(docs)

tfidf_transformer = TfidfTransformer(smooth_idf=True, use_idf=True)
tfidf_transformer.fit(word_count_vector)

# print idf values
df_idf = pd.DataFrame(tfidf_transformer.idf_, index=cv.get_feature_names(), columns=["idf_weights"])

# sort ascending
df_idf.sort_values(by=['idf_weights'])

# count matrix
count_vector = cv.transform(docs)

# tf-idf scores
tf_idf_vector = tfidf_transformer.transform(count_vector)

feature_names = cv.get_feature_names()

# get the tfidf vector for the first document
first_document_vector = tf_idf_vector[0]

# print the scores
df = pd.DataFrame(first_document_vector.T.todense(), index=feature_names, columns=["tfidf"])
df.sort_values(by=["tfidf"], ascending=False)
OUTPUT:
tfidf
of 0.374810
ireland 0.374810
the 0.332054
in 0.221369
1970 0.187405
football 0.187405
up 0.187405
as 0.000000
an 0.000000
and so on..

What are these scores telling us? The more common a word is across documents, the lower its score; the rarer or more unique a word is, the higher its score will be.

So far, we have learned the steps for cleaning and preprocessing text. What can we do with the processed data after all this? We could use it for sentiment analysis, chatbots, or market intelligence, or perhaps build a recommender system based on user purchases or item reviews, or do customer segmentation with clustering.

Computers are still not as accurate with human language as they are with numbers. With the massive volume of text data generated every day, NLP is becoming ever more important for making sense of that data, and it is being used in many other applications. There are endless ways to explore NLP.
