NLP: The Tools, Techniques and Applications

Srivatsan Murali
Published in Analytics Vidhya · 4 min read · Jul 12, 2020

Natural Language Processing

Congratulations!! You have won $25,000. Click on this link to redeem your money $$$. Did you notice that you no longer see emails like this in your inbox? These kinds of emails are called "spam", and the reason you no longer see them is NLP (Natural Language Processing).

Natural Language Processing is a subset of AI that deals with analysing text and voice data and building intelligent solutions on top of it. Since NLP draws on AI (deep learning, machine learning and some data science), its most important commodity is data. Before the data can be used for training, it must be cleaned. The original data is called raw data, and the process of converting raw data into useful data is called processing; processing textual data is called text processing.

The Relation Set

Text Processing, Tools and Techniques

There are various methods and procedures to process text. The NLTK library provides all the methods required to preprocess the data. The main text preprocessing methods are:

Text Processing
1. Removing the Noise: Noise is unwanted information present in the document, such as images, stray numbers, special symbols and HTML tags like <div>, <img> and <h1>. These tags don't provide any useful information and just add clutter, so we can remove them using regular expressions (in Python, the regular expression library is the re module).
import re

text = "<h1> Hello this is how Regular Expressions work in python and we can remove the noise using the re module </h1>"
# '</?h1>' matches both the opening <h1> tag and the closing </h1> tag
new_text = re.sub(r'</?h1>', '', text)

2. Tokenizing: the process of breaking a document of information into sentences or words. NLTK offers two kinds of tokenizers: a word tokenizer and a sentence tokenizer.
a) Word Tokenizer: breaks the text into words

from nltk.tokenize import word_tokenize

text = "This text is to be Tokenized"
tokenized = word_tokenize(text)
print(tokenized)
# ['This', 'text', 'is', 'to', 'be', 'Tokenized']

b) Sentence Tokenization : Sentence Tokenization is the process of breaking the text into sentences.

from nltk.tokenize import sent_tokenize

text = "hi this is a sentence. this is another sentence"
tokenized = sent_tokenize(text)
print(tokenized)
# ['hi this is a sentence.', 'this is another sentence']

Both tokenization methods are used depending on the application. Tokenization is an important step in building applications that use NLP.

3. Lemmatization: a scalpel that brings words down to their root forms. For example, NLTK's lemmatizer knows that "am" and "are" are related to "be", provided it is told each word's part of speech.

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
tokenized = ["NBC", "was", "founded", "in", "1926"]
lemmatized = [lemmatizer.lemmatize(token) for token in tokenized]
print(lemmatized)
# ['NBC', 'wa', 'founded', 'in', '1926']
# Note: without a part-of-speech argument the lemmatizer treats every word
# as a noun, which is why "was" wrongly becomes "wa"; passing pos='v'
# gives lemmatizer.lemmatize("was", pos='v') == 'be'

After tokenizing the text, lemmatization converts the tokenized words to their root forms. This process is used extensively in building chatbots.

4. Stemming: the process of removing prefixes and suffixes from a word. Since the suffixes usually don't add significant information, it is better to remove them. For example, "going" becomes "go" after stemming, "raining" becomes "rain" and "lighted" becomes "light".

from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
tokenized = ['NBC', 'was', 'founded', 'in', '1926', '.', 'This', 'makes', 'NBC', 'the', 'oldest', 'major', 'broadcast', 'network', '.']
stemmed = [stemmer.stem(token) for token in tokenized]
print(stemmed)
# ['nbc', 'wa', 'found', 'in', '1926', '.', 'thi', 'make', 'nbc', 'the', 'oldest', 'major', 'broadcast', 'network', '.']

Stemming is a common method used by search engines to improve matching between user input and website hits.
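To see why this helps search, here is a minimal sketch in plain Python. It matches a query against documents after stemming both sides; the toy suffix stripper below is a deliberately simplified stand-in for a real stemmer such as Porter, not NLTK code.

```python
def toy_stem(word):
    """Very naive suffix stripper -- a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)]
    return word

def matches(query, document):
    """True if every stemmed query word appears in the stemmed document."""
    doc_stems = {toy_stem(w) for w in document.lower().split()}
    return all(toy_stem(w) in doc_stems for w in query.lower().split())

print(matches("raining", "it rained all day"))   # True: both stem to "rain"
print(matches("raining", "a sunny afternoon"))   # False: no shared stem
```

Because "raining" and "rained" both reduce to "rain", the query hits a document that never contains the exact user input, which is the matching improvement described above.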

5. Stop Word Removal: stopwords are common words that we remove during preprocessing when we don't care about sentence structure. Examples of stop words are "and", "an" and "the". Luckily for us, the NLTK library provides a list of stopwords so we can filter out the unwanted words.

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))
nbc_statement = "NBC was founded in 1926 making it the oldest major broadcast network in the USA"
word_tokens = word_tokenize(nbc_statement)  # tokenize nbc_statement
statement_no_stop = [word for word in word_tokens if word not in stop_words]
print(statement_no_stop)
# ['NBC', 'founded', '1926', 'making', 'oldest', 'major', 'broadcast', 'network', 'USA']

Text processing is the first step in building applications with Natural Language Processing. Further steps to follow include:

  • Part of Speech Tagging
  • Parsing the text.
  • Converting a word to a vector (Word2Vec)
  • Finding the Similarity between the texts.
  • Topic Modelling.
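As a taste of the last two items, here is a minimal sketch in plain Python of turning texts into vectors and measuring their similarity. Simple bag-of-words counts stand in for a trained Word2Vec model, and cosine similarity compares the resulting vectors; this is an illustration of the idea, not gensim or NLTK code.

```python
import math
from collections import Counter

def text_to_vector(text):
    """Bag-of-words vector: word -> count (a stand-in for learned embeddings)."""
    return Counter(text.lower().split())

def cosine_similarity(v1, v2):
    """Cosine of the angle between two sparse count vectors."""
    shared = set(v1) & set(v2)
    dot = sum(v1[w] * v2[w] for w in shared)
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

a = text_to_vector("NBC is a broadcast network")
b = text_to_vector("NBC is a major network")
print(cosine_similarity(a, b))  # 0.8: four of five words are shared
```

Real Word2Vec embeddings capture meaning learned from large corpora, so they can also find similarity between texts that share no words at all; the count-based version here only measures word overlap.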

Applications

The subset of Artificial Intelligence called Natural Language Processing is used to build a lot of products, such as:

  • Chat Bot
  • Spam Classifier
  • Sentiment Analysis (Positive or Negative Feedback)
  • Plagiarism Checker
  • And Much More

Finally, a reward: try a sentiment analysis program and check whether a piece of feedback is positive or negative.
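As a hedged stand-in for such a program, here is a minimal lexicon-based sketch in plain Python. The tiny hand-picked word lists are illustrative assumptions, not a trained model like NLTK's VADER, so it only handles feedback whose sentiment words appear in the lists.

```python
# Illustrative mini-lexicons; a real system would use a full sentiment lexicon.
POSITIVE = {"good", "great", "excellent", "love", "awesome", "happy"}
NEGATIVE = {"bad", "terrible", "poor", "hate", "awful", "sad"}

def sentiment(feedback):
    """Count positive vs. negative lexicon hits and return a label."""
    words = feedback.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("I love this product, it is awesome"))  # positive
print(sentiment("terrible service, I hate it"))         # negative
```

This keyword-counting approach is where most sentiment classifiers started; modern ones replace the fixed lists with scores learned from labelled feedback.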

An aspiring engineering student pursuing an undergraduate degree in Computer Science Engineering. Interested in Computer Vision, AI and Data Science.