Day 40 of 100DaysofML

Charan Soneji
Published in 100DaysofMLcode
6 min read · Jul 26, 2020

NLTK, the Natural Language Toolkit, is one of the most important tools used for NLP tasks, and I shall explain a few of its functions using Python in a simple manner. A couple of days back, I wrote about NLP and its applications, so I thought of following up and explaining the implementation with a bit of code.

Before I start, here is the reference to the NLTK documentation: https://www.nltk.org/

Let us start by importing all the required libraries.

import nltk             # Core package for NLP
import re               # Library for regular expressions
import heapq            # Heap queue (not used below, but handy for later posts)
import numpy as np      # Numerical arrays
nltk.download('punkt')  # Pretrained Punkt sentence tokenizer models

You should see something like this in the output.

[Image: output obtained after installing the package]

The next step would be to initialize a random text corpus.

paragraph = """Thank you all so very much. Thank you to everybody at
Fox and New Regency … my entire team. I have to thank
everyone from the very onset of my career … To my parents;
none of this would be possible without you. And to my
friends, I love you dearly; you know who you are. And lastly,
I just want to say this: Making The Revenant was about
man's relationship to the natural world. A world that we
collectively felt in 2015 as the hottest year in recorded
history. Our production needed to move to the southern
tip of this planet just to be able to find snow. We need to
support leaders around the world who do not speak for the
big polluters, but who speak for all of humanity, for the
indigenous people of the world, for the billions and
billions of underprivileged people out there who would be
most affected by this. I thank you all for this
amazing award tonight. Let us not take this planet for
granted. I do not take tonight for granted. Thank you so very much."""

Sentence Tokenization

sentences = nltk.sent_tokenize(paragraph)  # Sentence tokenizer

This one line of code separates the corpus into sentence tokens, where each token represents a sentence. Sentence boundaries are detected using punctuation such as full stops, so make sure to keep a space after a full stop, because only then is it recognized as the end of a sentence.

[Image: sentences after sentence tokenization]

The above output is obtained after sentence tokenization. It looks a bit messy, mainly because of the spaces and newlines embedded in the corpus, but that is nothing to worry about.
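As a quick sanity check, here is a minimal sketch (using the same nltk import) of how sent_tokenize behaves on a made-up two-sentence string:

# Minimal sketch: sent_tokenize splits the text at sentence boundaries
for i, sent in enumerate(nltk.sent_tokenize("I love NLP. It is fun!")):
    print(i, sent)
# 0 I love NLP.
# 1 It is fun!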

Word Tokenizer

In case we want to separate by words, where each word represents a token, we use the word tokenizer with the syntax given below.

words = nltk.word_tokenize(paragraph)   #Word tokenizer
[Image: words after word tokenization]
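Note that word_tokenize treats punctuation marks as tokens of their own. A quick check on one sentence from the corpus shows this:

# Punctuation such as the full stop becomes its own token
print(nltk.word_tokenize("Let us not take this planet for granted."))
# ['Let', 'us', 'not', 'take', 'this', 'planet', 'for', 'granted', '.']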

Stemming

Stemming is the process of reducing an inflected word to its word stem by stripping suffixes and prefixes; the resulting stem is not necessarily a valid dictionary word. Follow the syntax given below to perform stemming.

nltk.download('wordnet')                    # WordNet corpus (needed for lemmatization later)
sentences = nltk.sent_tokenize(paragraph)   # Step 1: tokenize by sentence
from nltk.stem import PorterStemmer         # Import the PorterStemmer class from nltk.stem
stemmer = PorterStemmer()                   # Create the stemmer object
for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])        # Step 2: tokenize each sentence into words
    words = [stemmer.stem(word) for word in words]  # Step 3: stem every word
    sentences[i] = ' '.join(words)                  # Step 4: rejoin the stemmed words
print(sentences)

So basically, over here we create sentence tokens, then loop over each sentence, tokenize it into words, and replace each word with its stem.

[Image: words after stemming]
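To see what the Porter stemmer actually does to individual words, here is a tiny standalone sketch; notice that a stem such as "happi" is not a real word:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for w in ["connected", "connecting", "connection", "happiness"]:
    print(w, "->", stemmer.stem(w))
# connected -> connect
# connecting -> connect
# connection -> connect
# happiness -> happi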

Lemmatization

Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.

Try using the below given code in order to understand the concept.

from nltk.stem import WordNetLemmatizer
sentences = nltk.sent_tokenize(paragraph)
lemmatizer = WordNetLemmatizer()
for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])
    words = [lemmatizer.lemmatize(word) for word in words]
    sentences[i] = ' '.join(words)
print(sentences)

So the concept is very similar: we convert the paragraph into sentence tokens, iterate through them, and apply the lemmatizer to each word token. The output obtained looks a little like this:

[Image: output of lemmatization]
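To see the difference from stemming, compare the two on a couple of words in a small sketch. One thing worth knowing: lemmatize() assumes the noun part of speech by default, and you can pass pos="v" for verbs:

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
print(stemmer.stem("leaves"))                # leav  (not a dictionary word)
print(lemmatizer.lemmatize("leaves"))        # leaf  (a proper lemma; default POS is noun)
print(lemmatizer.lemmatize("was", pos="v"))  # be    (verbs need pos='v')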

Removal of stopwords

Words such as articles and some verbs are usually considered stop words because they do not help us find the context or the true meaning of a sentence. These are words that can be removed without any negative consequences for the final model that you are training. The steps for this are pretty straightforward, so just follow the code given below:

nltk.download('stopwords')
from nltk.corpus import stopwords
for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])
    words = [word for word in words if word not in stopwords.words('english')]
    sentences[i] = ' '.join(words)

Here, we tokenize each sentence into words and filter out every word that appears in NLTK's predefined set of English stopwords; the remaining words are then joined back into a sentence with ' '.join() in Python. Note that the comparison is case-sensitive, so a capitalized stopword such as "I" at the start of a sentence survives unless you lowercase the words first.

[Image: sentences after stopword removal]
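Two practical notes, shown in a small sketch: converting the stopword list to a set makes the membership test much faster, and lowercasing each word before the check also catches capitalized stopwords:

from nltk.corpus import stopwords

stop_set = set(stopwords.words('english'))  # set lookup is O(1) vs O(n) for a list
sample = nltk.word_tokenize("I do not take tonight for granted")
print([w for w in sample if w.lower() not in stop_set])
# ['take', 'tonight', 'granted']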

POS tagging

A POS tag (or part-of-speech tag) is a special label assigned to each token (word) in a text corpus to indicate the part of speech and often also other grammatical categories such as tense, number (plural/singular), case etc. POS tags are used in corpus searches and in text analysis tools and algorithms.

[Image: an example of POS tagging]

Follow the lines of code given below to get a rough idea of how it is done.

nltk.download('averaged_perceptron_tagger')  # Pretrained POS tagger model
words = nltk.word_tokenize(paragraph)        # Step 1: word tokenize
tagged_words = nltk.pos_tag(words)           # Step 2: POS tagging (the main step)
tagged_words

So here, we convert everything into word tokens, since we need to tag each word, and then apply the library function directly to the word tokens. The output is a list of (word, tag) tuples and looks a little like this.

[Image: POS tags obtained]
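If you are unsure what a tag such as 'NN' or 'VBZ' stands for, NLTK ships built-in help for the Penn Treebank tagset (it needs the 'tagsets' resource downloaded first):

nltk.download('tagsets')
nltk.help.upenn_tagset('NN')  # prints the definition and examples for 'NN'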

We can combine all of the POS tags and attach them back into a sentence if needed using the below given code.

# Transforming the paragraph into the format word_POStag
word_tags = []
for tw in tagged_words:
    word_tags.append(tw[0] + "_" + tw[1])  # e.g. ('planet', 'NN') -> 'planet_NN'

tag_para = ' '.join(word_tags)

The output obtained then would look like this.

[Image: POS tags joined back into a sentence]
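For a short sentence, the joined output would look something like this (the exact tags depend on the tagger):

# Tag a short sentence and join each word with its tag
tw = nltk.pos_tag(nltk.word_tokenize("We need snow"))
print(' '.join(word + "_" + tag for word, tag in tw))
# We_PRP need_VBP snow_NN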

That's it for the main functions and concepts needed for NLP. In case you need clarity on any specific topic, you can always Google it, and for syntax, I have provided the link to the documentation above.

Thanks for reading. Keep Learning.

Cheers.
