Day 40 of 100DaysofML

Charan Soneji
Published in 100DaysofMLcode
6 min read · Jul 26, 2020

NLTK, the Natural Language Toolkit, is one of the most important tools used for NLP tasks, and I shall explain a few of its functions using Python in a simple manner. A couple of days back, I wrote about NLP and its applications, so I thought of following up and explaining the implementation with a bit of code.

Before I start, here is the reference to the NLTK documentation: https://www.nltk.org/

Let us start by importing all the required libraries.

import nltk             # Core package for NLP
import re               # Library for regular expressions
import heapq            # Heap queue (not used below, but handy for later posts)
import numpy as np      # Numerical arrays
nltk.download('punkt')  # Pretrained Punkt sentence tokenizer models

You should see something like this in the output.

[Image: output obtained after installing the package]

The next step would be to initialize a random text corpus.

paragraph = """Thank you all so very much. Thank you to everybody at
Fox and New Regency … my entire team. I have to thank
everyone from the very onset of my career … To my parents;
none of this would be possible without you. And to my
friends, I love you dearly; you know who you are. And lastly,
I just want to say this: Making The Revenant was about
man's relationship to the natural world. A world that we
collectively felt in 2015 as the hottest year in recorded
history. Our production needed to move to the southern
tip of this planet just to be able to find snow. We need to
support leaders around the world who do not speak for the
big polluters, but who speak for all of humanity, for the
indigenous people of the world, for the billions and
billions of underprivileged people out there who would be
most affected by this. I thank you all for this
amazing award tonight. Let us not take this planet for
granted. I do not take tonight for granted. Thank you so very much."""

Sentence Tokenization

sentences = nltk.sent_tokenize(paragraph)  # Sentence tokenizer

This one line of code separates the corpus into sentence tokens, where each token represents a sentence. Sentence boundaries are detected using punctuation such as full stops, so make sure to keep a space after a full stop, because only then is it recognized as the end of a sentence.

[Image: sentences after sentence tokenization]

The above output is obtained after sentence tokenization. It looks a bit messy, mainly because of the spaces and newlines embedded in the corpus, but that is nothing to worry about.
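As a quick sanity check, here is a minimal sketch (using the same nltk import) of how sent_tokenize behaves on a made-up two-sentence string:

# Minimal sketch: sent_tokenize splits the text at sentence boundaries
for i, sent in enumerate(nltk.sent_tokenize("I love NLP. It is fun!")):
    print(i, sent)
# 0 I love NLP.
# 1 It is fun!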

Word Tokenizer

In case we want to separate by words, where each word represents a token, we use the word tokenizer with the syntax given below.

words = nltk.word_tokenize(paragraph)   #Word tokenizer
[Image: words after word tokenization]
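Note that word_tokenize treats punctuation marks as tokens of their own. A quick check on one sentence from the corpus shows this:

# Punctuation such as the full stop becomes its own token
print(nltk.word_tokenize("Let us not take this planet for granted."))
# ['Let', 'us', 'not', 'take', 'this', 'planet', 'for', 'granted', '.']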

Stemming

Stemming is the process of reducing an inflected word to its word stem by stripping suffixes and prefixes; the resulting stem is not necessarily a valid dictionary word. Follow the syntax given below to perform stemming.

nltk.download('wordnet')                    # WordNet corpus (needed for lemmatization later)
sentences = nltk.sent_tokenize(paragraph)   # Step 1: tokenize by sentence
from nltk.stem import PorterStemmer         # Import the PorterStemmer class from nltk.stem
stemmer = PorterStemmer()                   # Create the stemmer object
for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])        # Step 2: tokenize each sentence into words
    words = [stemmer.stem(word) for word in words]  # Step 3: stem every word
    sentences[i] = ' '.join(words)                  # Step 4: rejoin the stemmed words
print(sentences)

So basically, over here we create sentence tokens, then loop over each sentence, tokenize it into words, and replace each word with its stem.

[Image: words after stemming]
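To see what the Porter stemmer actually does to individual words, here is a tiny standalone sketch; notice that a stem such as "happi" is not a real word:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for w in ["connected", "connecting", "connection", "happiness"]:
    print(w, "->", stemmer.stem(w))
# connected -> connect
# connecting -> connect
# connection -> connect
# happiness -> happi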

Lemmatization

Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.

Try using the below given code in order to understand the concept.

from nltk.stem import WordNetLemmatizer
sentences = nltk.sent_tokenize(paragraph)
lemmatizer = WordNetLemmatizer()
for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])
    words = [lemmatizer.lemmatize(word) for word in words]
    sentences[i] = ' '.join(words)
print(sentences)

So the concept is very similar: we convert the paragraph into sentence tokens, iterate through them, and apply the lemmatizer to each word token. The output obtained looks a little like this:

[Image: output of lemmatization]
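To see the difference from stemming, compare the two on a couple of words in a small sketch. One thing worth knowing: lemmatize() assumes the noun part of speech by default, and you can pass pos="v" for verbs:

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
print(stemmer.stem("leaves"))                # leav  (not a dictionary word)
print(lemmatizer.lemmatize("leaves"))        # leaf  (a proper lemma; default POS is noun)
print(lemmatizer.lemmatize("was", pos="v"))  # be    (verbs need pos='v')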

Removal of stopwords

Words such as articles and some verbs are usually considered stop words because they do not help us find the context or the true meaning of a sentence. These are words that can be removed without any negative consequences for the final model that you are training. The steps for this are pretty straightforward, so just follow the code given below:

nltk.download('stopwords')
from nltk.corpus import stopwords
for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])
    words = [word for word in words if word not in stopwords.words('english')]
    sentences[i] = ' '.join(words)

Here, we tokenize each sentence into words and filter out every word that appears in NLTK's predefined set of English stopwords; the remaining words are then joined back into a sentence with ' '.join() in Python. Note that the comparison is case-sensitive, so a capitalized stopword such as "I" at the start of a sentence survives unless you lowercase the words first.

[Image: sentences after stopword removal]
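Two practical notes, shown in a small sketch: converting the stopword list to a set makes the membership test much faster, and lowercasing each word before the check also catches capitalized stopwords:

from nltk.corpus import stopwords

stop_set = set(stopwords.words('english'))  # set lookup is O(1) vs O(n) for a list
sample = nltk.word_tokenize("I do not take tonight for granted")
print([w for w in sample if w.lower() not in stop_set])
# ['take', 'tonight', 'granted']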

POS tagging

A POS tag (or part-of-speech tag) is a special label assigned to each token (word) in a text corpus to indicate the part of speech and often also other grammatical categories such as tense, number (plural/singular), case etc. POS tags are used in corpus searches and in text analysis tools and algorithms.

[Image: an example of POS tagging]

Follow the lines of code given below to get a rough idea of how it is done.

nltk.download('averaged_perceptron_tagger')  # Pretrained POS tagger model
words = nltk.word_tokenize(paragraph)        # Step 1: word tokenize
tagged_words = nltk.pos_tag(words)           # Step 2: POS tagging (the main step)
tagged_words

So here, we convert everything into word tokens, since we need to tag each word, and then apply the library function directly to the word tokens. The output is a list of (word, tag) tuples and looks a little like this.

[Image: POS tags obtained]
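If you are unsure what a tag such as 'NN' or 'VBZ' stands for, NLTK ships built-in help for the Penn Treebank tagset (it needs the 'tagsets' resource downloaded first):

nltk.download('tagsets')
nltk.help.upenn_tagset('NN')  # prints the definition and examples for 'NN'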

We can combine all of the POS tags and attach them back into a sentence if needed using the below given code.

# Transforming the paragraph into the format word_POStag
word_tags = []
for tw in tagged_words:
    word_tags.append(tw[0] + "_" + tw[1])  # e.g. ('planet', 'NN') -> 'planet_NN'

tag_para = ' '.join(word_tags)

The output obtained then would look like this.

[Image: POS tags joined back into a sentence]
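For a short sentence, the joined output would look something like this (the exact tags depend on the tagger):

# Tag a short sentence and join each word with its tag
tw = nltk.pos_tag(nltk.word_tokenize("We need snow"))
print(' '.join(word + "_" + tag for word, tag in tw))
# We_PRP need_VBP snow_NN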

That's it for the main functions and concepts needed for NLP. In case you need clarity on any specific topic, you can always Google it, and for syntax, I have provided the link to the documentation above.

Thanks for reading. Keep Learning.

Cheers.
