NLP using Natural Language Toolkit

Kushagra Jindal
Published in Integraate · Jun 19, 2020 · 4 min read
From unsplash.com by Charles Deluvio

Machines do not understand natural human language on their own. NLP, or Natural Language Processing, is a part of Artificial Intelligence that helps machines read and understand human language. Speech recognition, sentiment analysis, question answering, autocomplete and autocorrect are some of the use cases of NLP.

In this blog, I will share the steps required in data preprocessing and how we can carry them out in Python using the Natural Language Toolkit.

Natural Language Toolkit

NLTK is an open-source Python package used to build programs related to language processing. It provides more than 50 corpora and lexical resources, along with tools for tokenization, stemming, part-of-speech tagging, sentiment analysis and more.

pip3 install nltk

The next step is to download the required corpora. This can be done manually by downloading the data from the nltk_data website or by using a Python script.

import nltk
nltk.download()

On running the script, the NLTK Downloader window will appear, where you can download the required corpora.

NLTK Downloader
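If you already know which resources you need, you can also fetch them directly from a script instead of the GUI. A minimal sketch, downloading only the resources used in the examples below:

import nltk

nltk.download('punkt')                       # tokenizer models
nltk.download('averaged_perceptron_tagger')  # POS tagger model
nltk.download('stopwords')                   # stop word lists
nltk.download('wordnet')                     # data for the lemmatizer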

Text Analysis Operations

(A) Tokenization- The text that needs to be analyzed usually arrives in the form of paragraphs. Before processing, we need to identify the individual words in the string. Tokenization means dividing a string into smaller tokens or chunks. These tokens can be words or small sentences.

Let’s take an example — ‘It was an apple pie’

After tokenization — [‘It’, ‘was’, ‘an’, ‘apple’, ‘pie’].

from nltk.tokenize import word_tokenize

text = """Tokenization means dividing a string into smaller tokens or chunks. These tokens can be words or small sentences."""
words = word_tokenize(text)
print(words)

Output:-
['Tokenization', 'means', 'dividing', 'a', 'string', 'into', 'smaller', 'tokens', 'or', 'chunks', '.', 'These', 'tokens', 'can', 'be', 'words', 'or', 'small', 'sentences', '.']
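Tokens do not have to be words. If you want to split the same text into sentences instead, NLTK provides sent_tokenize; a minimal sketch:

from nltk.tokenize import sent_tokenize

text = """Tokenization means dividing a string into smaller tokens or chunks. These tokens can be words or small sentences."""
# splits the paragraph into its two sentences
sentences = sent_tokenize(text)
print(sentences)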

POS tagging tells us whether a given word is a noun, adjective, pronoun, adverb or verb.

import nltk
from nltk.tokenize import word_tokenize

text = """Tokenization means dividing a string into smaller tokens or chunks."""
words = word_tokenize(text)
tags = nltk.pos_tag(words)
print(tags)

Output:-
[('Tokenization', 'NN'), ('means', 'VBZ'), ('dividing', 'VBG'), ('a', 'DT'), ('string', 'NN'), ('into', 'IN'), ('smaller', 'JJR'), ('tokens', 'NNS'), ('or', 'CC'), ('chunks', 'NNS'), ('.', '.')]
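If a tag abbreviation such as 'NN' or 'VBZ' is unfamiliar, NLTK can describe it for you; a small sketch, assuming the 'tagsets' resource has been downloaded via the NLTK Downloader:

import nltk

# prints the meaning of the tag along with example words
nltk.help.upenn_tagset('NN')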

(B) Stop Words- Regular text contains many common words that add little meaning and act as noise. These words are termed stop words. Some of the common stop words are ‘is’, ‘am’, ‘are’, ‘this’, ‘a’, ‘an’ and ‘the’.

example — [‘It’, ‘was’, ‘an’, ‘apple’, ‘pie’].

After removing stop words — [‘apple’, ‘pie’].

from nltk.corpus import stopwords

# reuses the words list from the tokenization example above
stop_words = set(stopwords.words("english"))
new_words = []
for word in words:
    if word not in stop_words:
        new_words.append(word)
print(new_words)

Output:-
['Tokenization', 'means', 'dividing', 'string', 'smaller', 'tokens', 'chunks', '.']
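Note that the list returned by stopwords.words("english") is all lowercase, so it is common to lowercase each token before comparing. A minimal sketch of the same filtering written as a list comprehension (again reusing the words variable from above):

from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))
# compare lowercased tokens so that capitalized stop words such as 'The' are also removed
filtered_words = [word for word in words if word.lower() not in stop_words]
print(filtered_words)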

(C) Stemming & Lemmatization- For grammatical reasons, different affixes are attached to the words. Additionally, there are many words with similar meanings, removing the affixes is termed as stemming.

example — drive, drove and driven.

from nltk.stem import PorterStemmer

porterStemmer = PorterStemmer()
new_stemmed_words = []
for word in new_words:
    new_stemmed_words.append(porterStemmer.stem(word))
print(new_stemmed_words)

Output:-
['token', 'mean', 'divid', 'string', 'smaller', 'token', 'chunk', '.']

Some words become meaningless after stemming; for example, ‘saw’ can be reduced to ‘s’ by a stemmer, whereas lemmatization can be used to return ‘see’.

from nltk.stem.wordnet import WordNetLemmatizer

wordNetLemmatizer = WordNetLemmatizer()
token = "running"
print(wordNetLemmatizer.lemmatize(token, "v"))

Output:-
run
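To see the difference between the two approaches side by side, here is a small sketch comparing the Porter stemmer and the WordNet lemmatizer on the same word (the outputs noted in the comments are what these tools typically produce):

from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# stemming just strips affixes and may produce a non-word: 'studies' -> 'studi'
print(stemmer.stem("studies"))
# lemmatization maps to a dictionary word: 'studies' -> 'study' (the default part of speech is noun)
print(lemmatizer.lemmatize("studies"))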

(D) Bag of Words- Models require numerical data, so we need to convert the text into numbers or vectors. Bag of Words is one method for doing this: it builds a vector for each document from the occurrence counts of the words it contains.

from nltk.tokenize import word_tokenize

data = "hello, anyone there, hello hello"
words = word_tokenize(data)
bag_of_words = {}
for word in words:
    if word not in bag_of_words.keys():
        bag_of_words[word] = 1
    else:
        bag_of_words[word] += 1
print(bag_of_words)
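The same counting can be written more compactly with Python’s collections.Counter, which is a dictionary specialized for counting; a minimal equivalent sketch:

from collections import Counter
from nltk.tokenize import word_tokenize

data = "hello, anyone there, hello hello"
# builds the word -> occurrence-count mapping in one step
bag_of_words = Counter(word_tokenize(data))
print(bag_of_words)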

A problem with Bag of Words is that longer documents get larger counts and therefore more weight. To address this we have TF-IDF (Term Frequency - Inverse Document Frequency). IDF measures the amount of information a word provides across the documents in a corpus.

idf(w) = log(number of documents / number of documents containing the word w)
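As a concrete illustration, here is a minimal sketch that computes IDF values directly from the formula above on a tiny made-up corpus (the documents and words are invented for the example):

import math
from nltk.tokenize import word_tokenize

# toy corpus of three "documents"
documents = [
    "it was an apple pie",
    "the apple was red",
    "hello anyone there",
]
tokenized = [set(word_tokenize(doc)) for doc in documents]

def idf(word):
    # number of documents containing the word
    containing = sum(1 for doc in tokenized if word in doc)
    return math.log(len(documents) / containing)

print(idf("apple"))  # appears in 2 of 3 documents -> log(3/2)
print(idf("hello"))  # appears in 1 of 3 documents -> log(3/1)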

Conclusion

In this blog, you have seen how easy it is to apply text analysis operations using the Natural Language Toolkit. Once we have preprocessed the text, we can easily train a model on it.

I’d love to hear any feedback or questions. You can ask questions by leaving a comment, and I will try my best to answer them.
