A Quick Guide to the Natural Language Toolkit (NLTK)

Venkatesh Gowda
3 min read · Apr 8, 2023


Before starting, what is Natural Language Processing? “NLP refers to the branch of Artificial Intelligence that allows machines to read, understand, and derive meaning from human language. NLP combines linguistics and computer science to decipher language structure and rules, and to build models that can comprehend, break down, and extract significant details from text and speech.”

How do I Start NLP using Python?

The Natural Language Toolkit (NLTK) is a platform used for building Python programs that work with human language data for application in statistical natural language processing (NLP). It contains text-processing libraries for tokenization, parsing, classification, stemming, tagging, and semantic reasoning.

Install NLTK

Please go through the official NLTK installation guide to install the library and download the nltk data packages.
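If you just want to follow along, a minimal setup is sketched below (assuming pip and Python 3; the identifiers passed to nltk.download are the standard NLTK data package names used by the examples in this post):

pip install nltk

import nltk

# Download the data packages used in the examples below (run once)
nltk.download('punkt')                        # tokenizers
nltk.download('stopwords')                    # stop word lists
nltk.download('wordnet')                      # WordNet, used by the lemmatizer
nltk.download('averaged_perceptron_tagger')   # POS tagger, used for chunking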

Installation done. Let’s get started.

There are several common steps involved in Natural Language Processing, namely:

  1. Tokenization
  2. Stop Words
  3. Stemming
  4. Lemmatization
  5. Chunking

Tokenization: Tokenization is the process of splitting a phrase, sentence, paragraph, or an entire text document into smaller units, such as individual words or terms. Each of these smaller units is called a token.

Let us take an example.

from nltk.tokenize import sent_tokenize, word_tokenize

example_text = ("Hello Mr. Python how are you? I hope you are doing well. "
                "I should mention you are doing great in programming.")

# Sentence tokenizing
print(sent_tokenize(example_text))

# Word tokenizing
print(word_tokenize(example_text))

In the above example, sent_tokenize splits the paragraph into separate sentences, while word_tokenize splits it into individual words called tokens.

#sent_tokenize
['Hello Mr. Python how are you?', 'I hope you are doing well.',
'I should mention you are doing great in programming.']
#word_tokenize
['Hello', 'Mr.', 'Python', 'how', 'are', 'you', '?', 'I',
'hope', 'you', 'are', 'doing', 'well', '.', 'I', 'should', 'mention',
'you', 'are', 'doing', 'great', 'in', 'programming', '.']

Stop Words: A stop word is a commonly used word (such as ‘the’, ‘a’, ‘an’, ‘in’) that a search engine has been programmed to ignore, both when indexing entries for searching and when retrieving them as the result of a search query.

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

example_sentence = "This is an example of showing Stopword filtration."
stop_words = set(stopwords.words('english'))

words = word_tokenize(example_sentence)

filtered_sentence = []

for w in words:
    if w not in stop_words:
        filtered_sentence.append(w)

print(filtered_sentence)
#output
['This', 'example', 'showing', 'Stopword', 'filtration', '.']
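Notice that 'This' survives the filter: NLTK's stop word list is all lowercase and the membership check is case-sensitive. A common refinement, sketched below as my own addition rather than part of the original example, is to compare lowercased tokens:

filtered = [w for w in words if w.lower() not in stop_words]
print(filtered)
#output
['example', 'showing', 'Stopword', 'filtration', '.']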

Stemming: Stemming is a technique used to extract the base form of words by removing affixes from them. It is just like cutting down the branches of a tree to its stem. For example, the words creates, created, and creating all reduce to the common stem creat.

In simple words, consider these two sentences: ‘I was taking a ride in the car’ and ‘I was riding in the car’.

Here the words riding and ride carry the same meaning. Stemming is a process where we normalize such words into their root form.

from nltk.stem import PorterStemmer

ps = PorterStemmer()

example_words = ["creates", "created", "creating"]

for w in example_words:
    print(ps.stem(w))
#output for Stemming
creat
creat
creat

Lemmatization: It is the process of grouping together the different inflected forms of a word so they can be analyzed as a single item. Lemmatization is similar to stemming, but it brings context to the words.

Most of the time, lemmatization is preferred over stemming.

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

print("rocks :", lemmatizer.lemmatize("rocks"))
print("corpora :", lemmatizer.lemmatize("corpora"))

# pos="a" tells the lemmatizer to treat the word as an adjective
print("better :", lemmatizer.lemmatize("better", pos="a"))
#output 
rocks : rock
corpora : corpus
better : good
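To see why lemmatization is often preferred, here is a small side-by-side sketch (my own illustration, not from the original example) comparing the Porter stemmer with the WordNet lemmatizer on the same words:

from nltk.stem import PorterStemmer, WordNetLemmatizer

ps = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Compare the chopped stem with the dictionary lemma
for word in ["studies", "corpora", "rocks"]:
    print(word, "| stem:", ps.stem(word), "| lemma:", lemmatizer.lemmatize(word))
#output
studies | stem: studi | lemma: study
corpora | stem: corpora | lemma: corpus
rocks | stem: rock | lemma: rock

The stemmer simply chops suffixes, which can produce non-words like ‘studi’, while the lemmatizer maps each word to a real dictionary form.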

Chunking: Text chunking, also referred to as shallow parsing, is a task that follows part-of-speech tagging and adds more structure to the sentence. The result is a grouping of the words into ‘chunks’.

import nltk
from nltk import pos_tag
from nltk import RegexpParser

#Text
txt = "This is NLP Chunking NoteBook"
text = txt.split()

#POS Tags
POS_tag = pos_tag(text)
print("After POS tags:", POS_tag)

#Group an optional determiner, any adjectives, and one or more nouns into NP chunks
chunker = RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")
tree = chunker.parse(POS_tag)
print("After chunking:", tree)

#For a visual representation of the chunk tree
tree.draw()
#output
After POS tags: [('This', 'DT'), ('is', 'VBZ'), ('NLP', 'NNP'),
('Chunking', 'NNP'), ('NoteBook', 'NNP')]
After chunking: (S This/DT is/VBZ (NP NLP/NNP Chunking/NNP NoteBook/NNP))

Applications of Natural Language Processing

  1. Virtual assistants and chatbots for conversational interfaces.
  2. Sentiment analysis for analyzing opinions and feedback from customers (see the small sketch after this list).
  3. Machine translation for language localization in global communication.
  4. Text summarization for generating concise summaries from lengthy text.
  5. Named entity recognition for extracting specific information such as names, dates, and locations from text.
  6. Speech recognition for enabling voice commands and dictation.
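As a quick taste of sentiment analysis with NLTK itself, the library ships with the VADER analyzer. A minimal sketch, assuming the vader_lexicon data package has been downloaded:

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

# Requires: nltk.download('vader_lexicon')
sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores("NLTK makes learning NLP easy and fun!"))

polarity_scores returns a dictionary of negative, neutral, positive, and compound scores for the sentence.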

Happy Learning…
