Published in Analytics Vidhya


Understanding Natural Language Processing (NLP)

Overview:

Natural language is language that humans read and write, such as text and messages. Processing such language by machine for use in different applications is called Natural Language Processing, or NLP.

Some practical examples of NLP are sentiment analysis, analyzing restaurant reviews, and Google/Alexa voice search, which converts speech into text and then uses that text for internal processing.

Topics Covered:

1. Details about NLP Application

2. Applications using Natural Language Processing (NLP)

3. Understand NLP using Python & NLTK library

4. Word Tokenizer and Sentence Tokenizer

5. Part-of-speech tagging

6. Stemming and Lemmatization

Details about NLP Application:

Today almost everyone has a smartphone or laptop and easy access to the internet. As a result, millions of gigabytes (GB) of data are generated every day.

Companies use this data, generated from sources such as social media sites (Facebook, LinkedIn, Twitter) and messaging systems (WhatsApp, Telegram), for their business purposes.

They analyze the data to understand the needs of individuals, such as their food choices, clothing brands and holiday preferences. They then use applications like recommendation systems to target these customers, which is a great source of revenue.

Applications using Natural Language Processing (NLP):

1. Spam or ham detectors in our mailbox, which filter most spam into the spam folder.

2. Search engines like Google, which understand your query and show results related to your interests.

3. Social media websites like Facebook and LinkedIn, which show news and articles related to your interests.

4. Speech recognition engines like Amazon Alexa and Apple Siri.

5. Websites like Amazon and Flipkart, which show brands matching your interests and choices.

Understand NLP using Python & NLTK library:

We are going to use Python for all our NLP implementations. To begin with, let's first install the most important library in Python (if it is not already installed).

NLTK is the most common and heavily used library for implementing NLP applications. NLTK stands for Natural Language Toolkit.
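The setup step can be sketched as follows. This is a minimal sketch, assuming NLTK is installed from PyPI with `pip install nltk`. The data package names in `RESOURCES` are an assumption about what the tokenizer, tagger and lemmatizer need; the exact names required depend on the installed NLTK version, so the sketch simply requests both the older and newer variants.

```python
# Install once from a terminal:  pip install nltk
import nltk

# Data packages used by the tokenizers, the POS tagger and the lemmatizer.
# Which names are looked up depends on the NLTK version, so both the plain
# names and the "_tab" / "_eng" variants are requested here.
RESOURCES = [
    "punkt", "punkt_tab",
    "averaged_perceptron_tagger", "averaged_perceptron_tagger_eng",
    "wordnet", "omw-1.4",
]

for name in RESOURCES:
    # quiet=True suppresses the progress output of the downloader
    nltk.download(name, quiet=True)

print("NLTK version:", nltk.__version__)
```

The downloads are only needed once per machine; later runs find the data already in place.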

Sentence Tokenizer and Word Tokenizer:

A sentence tokenizer is used to split a paragraph or corpus into sentences. When we read a paragraph in an NLP application, we need to split it into sentences in order to analyze it further.

A word tokenizer is used to split a sentence into individual words for further processing.

Part-of-speech tagging (POS tagging):

nltk.org defines part-of-speech tagging as follows:

https://www.nltk.org/book/ch05.html

The process of classifying words into their parts of speech and labeling them accordingly is known as part-of-speech tagging, POS-tagging, or simply tagging. Parts of speech are also known as word classes or lexical categories.

POS tag list:

  • CC coordinating conjunction
  • CD cardinal number
  • DT determiner
  • EX existential there (like: “there is” … think of it like “there exists”)
  • FW foreign word
  • IN preposition/subordinating conjunction
  • JJ adjective ‘big’
  • JJR adjective, comparative ‘bigger’
  • JJS adjective, superlative ‘biggest’
  • LS list marker 1)
  • MD modal could, will
  • NN noun, singular ‘desk’
  • NNS noun plural ‘desks’
  • NNP proper noun, singular ‘Harrison’
  • NNPS proper noun, plural ‘Americans’
  • PDT predeterminer ‘all the kids’
  • POS possessive ending parent’s
  • PRP personal pronoun I, he, she
  • PRP$ possessive pronoun my, his, her
  • RB adverb very, silently
  • RBR adverb, comparative better
  • RBS adverb, superlative best
  • RP particle give up
  • TO to go ‘to’ the store.
  • UH interjection errrrrrrrm
  • VB verb, base form take
  • VBD verb, past tense took
  • VBG verb, gerund/present participle taking
  • VBN verb, past participle taken
  • VBP verb, singular present, non-3rd person take
  • VBZ verb, 3rd person sing. present takes
  • WDT wh-determiner which
  • WP wh-pronoun who, what
  • WP$ possessive wh-pronoun whose
  • WRB wh-adverb where, when

Stemming and Lemmatization:

Stemming:

Stemming is used to get the root word of any given word in a corpus. Several different words often carry the same meaning once they are reduced to their base word, which is helpful in NLP processing.

Example: ‘Eating’ should be converted to ‘eat’, its root/base word, by removing the suffix ‘ing’.

The library used for stemming is PorterStemmer from nltk:

from nltk.stem.porter import PorterStemmer

In stemming, a word is sometimes not reduced to a real root word and so becomes meaningless. For example, the word “Flying” is converted to “fli”. No such word exists, so the result is meaningless.
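Both stemming examples can be reproduced with a short sketch using the PorterStemmer import shown earlier. The variable names are illustrative only; the stemmer needs no extra data download.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# stem() lowercases the word and strips common suffixes
words = ["Eating", "Flying"]
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(word, "->", stem)
# Eating -> eat
# Flying -> fli
```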

Lemmatization:

To overcome this problem we use lemmatization. Stemming can often create non-existent words, whereas lemmatization converts words to actual dictionary words.

The library used for lemmatization is WordNetLemmatizer from nltk:

from nltk.stem import WordNetLemmatizer

Conclusion: Natural Language Processing is a vast area of research and development. It is used heavily in many applications because users generate huge amounts of data, and drawing insight from that data helps businesses succeed. I will cover other NLP topics in my next articles.

Please write your queries & comments and share your feedback.

I hope you liked my article. Please hit Clap 👏 (50 times) to motivate me to write more.

Want to connect:

LinkedIn: https://www.linkedin.com/in/anjani-kumar-9b969a39/

If you like my posts here on Medium and would like me to continue this work, consider supporting me on Patreon.

Anjani Kumar

Data Science, ML & NLP Enthusiast