NLP: The Tools, Techniques and Applications

Published in

Analytics Vidhya

4 min read · Jul 12, 2020

“Congratulations!! You have won 25000$. Click on this link to redeem your money $$$.” Did you notice that you no longer see emails like this in your inbox? These kinds of emails are called “spam”, and the reason you rarely see them anymore is NLP (Natural Language Processing).

Natural Language Processing is a subfield of AI that deals with analysing text and voice data and building intelligent solutions on top of it. Because NLP relies on AI techniques (deep learning, machine learning and some data science), its most important commodity is data. Before the data can be used for training, it must be cleaned. The original data is called raw data, and the process of cleaning raw data into a useful form is called processing. Processing textual data is called text processing.

Text Processing, Tools and Techniques

There are various methods and procedures for processing text. The NLTK library in Python provides most of the methods required to preprocess the data. The main text preprocessing methods are:

1. Removing the Noise: Noise is unwanted information present in a document, such as images, stray numbers, special symbols and even HTML tags like <div>, <img>, <h1> and many more. These tags don’t provide any useful information to the programmer and only add clutter. We can remove this noise using regular expressions (Python’s regular expression module is called re).

import re
text = "<h1> Hello this is how Regular Expressions work in python and we can remove the noise using re module </h1>"
# replace the <h1> and </h1> tags with an empty string
new_text = re.sub(r'</?h1>', '', text)
print(new_text.strip())
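
The snippet above only targets h1 tags. A slightly more general sketch, assuming the goal is to drop every HTML tag, digit and special symbol, might look like the following. The helper name remove_noise and the sample string are made up for illustration, and a regular expression is only a rough substitute for a real HTML parser.

```python
import re

def remove_noise(text):
    """Strip HTML tags, digits and special symbols (illustrative helper)."""
    text = re.sub(r"<[^>]+>", " ", text)      # drop anything that looks like an HTML tag
    text = re.sub(r"[^A-Za-z\s]", " ", text)  # drop digits, punctuation and other symbols
    return re.sub(r"\s+", " ", text).strip()  # collapse the leftover whitespace

raw = "<div><h1>Hello!!</h1> You have won 25000$ <img src='prize.png'></div>"
print(remove_noise(raw))  # Hello You have won
```

Whether digits should count as noise depends on the task; for something like the NBC example later in this article, the year is worth keeping.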

2. Tokenizing: The process of breaking a document into sentences or words. NLTK offers two types of tokenizers, a word tokenizer and a sentence tokenizer.
a) Word Tokenizer: Breaks the text into words.

from nltk.tokenize import word_tokenize
text = "This text is to be Tokenized"
tokenized = word_tokenize(text)
print(tokenized)  # ['This', 'text', 'is', 'to', 'be', 'Tokenized']

b) Sentence Tokenizer: Breaks the text into sentences.

from nltk.tokenize import sent_tokenize
text = "Hi, this is a sentence. This is another sentence."
tokenized = sent_tokenize(text)
print(tokenized)  # ['Hi, this is a sentence.', 'This is another sentence.']

Which tokenizer to use depends on the application. Tokenization is an important step in building applications that use NLP.
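
To see roughly what a tokenizer does under the hood, here is a dependency-free sketch built on regular expressions. It is a simplified stand-in, not how NLTK's word_tokenize or sent_tokenize work internally, and the names simple_word_tokenize and simple_sent_tokenize are hypothetical.

```python
import re

def simple_word_tokenize(text):
    # A token is either a run of letters/digits/apostrophes or one punctuation mark.
    return re.findall(r"[A-Za-z0-9']+|[^\w\s]", text)

def simple_sent_tokenize(text):
    # Split on whitespace that follows ".", "!" or "?".
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

text = "Hi, this is a sentence. This is another sentence."
print(simple_sent_tokenize(text))
print(simple_word_tokenize("Hi, this is a sentence."))
```

A rule this simple breaks on abbreviations such as "Dr. Smith", which is one reason to prefer a library tokenizer for real text.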

3. Lemmatization: A scalpel for bringing words down to their root forms. For example, NLTK’s savvy lemmatizer knows that “am” and “are” are related to “be”.

from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
tokenized = ["NBC", "was", "founded", "in", "1926"]
lemmatized = [lemmatizer.lemmatize(token) for token in tokenized]
print(lemmatized)
# ["NBC", "wa", "founded", "in", "1926"]

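Lemmatization follows tokenization: the lemmatizer converts each token to its root form. By default, lemmatize() treats every token as a noun, which is why “was” comes out as “wa” above; passing the part of speech, as in lemmatizer.lemmatize("was", pos="v"), returns “be”. This process is used extensively in building chatbots.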

4. Stemming: The process of stripping affixes from a word (the Porter stemmer used below strips suffixes). Since these endings rarely add anything to the information we need, it is usually better to remove them. For example, “going” becomes “go”, “raining” becomes “rain” and “lighted” becomes “light” after stemming.

from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
tokenized = ['NBC', 'was', 'founded', 'in', '1926', '.', 'This', 'makes', 'NBC', 'the', 'oldest', 'major', 'broadcast', 'network', '.']
# stem each token (the stemmer also lowercases it)
stemmed = [stemmer.stem(token) for token in tokenized]
print(stemmed)
# ['nbc', 'wa', 'found', 'in', '1926', '.', 'thi', 'make', 'nbc', 'the', 'oldest', 'major', 'broadcast', 'network', '.']

Stemming is a common method used by search engines to improve matching between user input and website hits.
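
To illustrate that matching idea, here is a toy sketch. The crude_stem function is a deliberately rough suffix stripper written for this example, not the Porter algorithm, and the pages and query are invented.

```python
def crude_stem(word):
    """Very rough suffix stripper for illustration; not the Porter algorithm."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def matches(query, document):
    """True if any stemmed query word also appears among the stemmed document words."""
    doc_stems = {crude_stem(w) for w in document.split()}
    return any(crude_stem(w) in doc_stems for w in query.split())

pages = ["It rained all night", "Lighted candles on the table", "Stock markets today"]
hits = [page for page in pages if matches("raining lights", page)]
print(hits)  # ['It rained all night', 'Lighted candles on the table']
```

Without stemming, “raining” would not match “rained” and “lights” would not match “Lighted”, so neither page would be returned.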

5. Stop Word Removal: Stop words are words that we remove during preprocessing when we don’t care about sentence structure. Examples of stop words are “and”, “an” and “the”. Luckily for us, the NLTK library provides a list of stop words that we can use to filter out the unwanted words.

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
stop_words = set(stopwords.words('english'))
nbc_statement = "NBC was founded in 1926 making it the oldest major broadcast network in the USA"
# tokenize nbc_statement
word_tokens = word_tokenize(nbc_statement)
# keep only the tokens that are not stop words
statement_no_stop = [word for word in word_tokens if word not in stop_words]
print(statement_no_stop)
# ['NBC', 'founded', '1926', 'making', 'oldest', 'major', 'broadcast', 'network', 'USA']
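
One thing to watch for: NLTK's English stop-word list is lowercase, so a capitalised token such as “The” slips through the filter above unless tokens are lowercased for the comparison. A minimal sketch of the difference, using a small hand-written stop-word set (an assumption standing in for stopwords.words('english') so the snippet runs without any downloads):

```python
# Small illustrative stop-word set; in practice use stopwords.words('english').
stop_words = {"the", "was", "in", "it", "a", "an", "and", "is"}

tokens = ["The", "network", "was", "founded", "in", "1926"]

# Exact comparison: "The" is kept because only "the" is in the set.
case_sensitive = [t for t in tokens if t not in stop_words]

# Lowercase each token for the lookup, but keep its original form in the output.
case_insensitive = [t for t in tokens if t.lower() not in stop_words]

print(case_sensitive)    # ['The', 'network', 'founded', '1926']
print(case_insensitive)  # ['network', 'founded', '1926']
```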

Text processing is the first step in building applications using Natural Language Processing. Further steps that can follow include:

Part of Speech Tagging
Parsing the text
Converting a word to a vector (Word2Vec)
Finding the similarity between texts
Topic Modelling
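
Of these steps, text similarity is easy to illustrate once the text has been tokenized. Below is a minimal sketch, assuming a plain bag-of-words representation and cosine similarity. This is only one of several possible approaches (Word2Vec vectors would be another), and the sample token lists are invented, roughly in the shape the preprocessing steps above would produce.

```python
import math
from collections import Counter

def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two token lists, using raw word counts."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    # Dot product over the words that both texts share.
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)

doc1 = "nbc found 1926 oldest broadcast network".split()
doc2 = "nbc oldest major broadcast network usa".split()
doc3 = "won money click link redeem".split()

print(round(cosine_similarity(doc1, doc2), 2))  # 0.67
print(cosine_similarity(doc1, doc3))            # 0.0
```

The first two texts share four of their six words, so they score well above the unrelated third one.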

Applications

Natural Language Processing, a subfield of Artificial Intelligence, is used to build a lot of products, such as:

Chat Bot
Spam Classifier
Sentiment Analysis (Positive or Negative Feedback)
Plagiarism Checker
And Much More

Finally, a reward: check out the Sentiment Analysis program and see whether a piece of feedback is positive or negative.

Written by Srivatsan Murali