First ladder towards NLP

Sanket Doshi
4 min read · Jan 22, 2019


As everyone knows, NLP, or Natural Language Processing, is one of the most talked-about topics today. Because of NLP, our computers can understand what we are saying or what we are trying to tell them.

We always imagine how good it would be if computers understood what we want to say. What if a computer understood our sarcasm? What if it recommended savage replies? These questions are interesting and will excite every developer, and such things become possible only through NLP.

NLP trains our software to understand language the way humans do. NLP can make a computer talk like a human, so that a person cannot tell the human and the computer apart.

A large amount of data is available in an unstructured format that, until recently, could not be analyzed programmatically. NLP has made it possible to learn from and understand this information. Recent developments in Machine Learning and Deep Learning have made it possible to build more complex models that understand language much like humans do.

Let us see an example of how hard it is to interpret the meaning of a sentence.

They think John is a lion but he is a chicken.

Here, humans clearly understand that everyone thinks John is a lion, that is, bold, strong, and able to fight, but in reality he is a chicken, that is, not strong and afraid of fighting.

But a machine may understand that John is the name of a chicken that looks like a lion.

So, you see how complex it is to understand the real meaning of a sentence.

We’ll be using the nltk library in Python. Install nltk using pip, and run nltk.download() once before using its corpora or tokenizers.
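A minimal setup sketch (downloading only the specific resources used later, rather than the full interactive nltk.download()):

```shell
pip install nltk
# Fetch just the resources this walkthrough needs:
python -c "import nltk; nltk.download('stopwords'); nltk.download('wordnet')"
```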

We’ll learn three basic and important steps in NLP.

Tokenization

In this step, the sentence is divided into tokens that help the machine understand the meaning of the sentence. A token can be a single word or a group of words, depending on the application. For example, take the phrase not happy: with single-word tokens the machine may see the sentence as neutral when in fact it is negative. So selecting the number of words per token is an important choice and may change the results. The count of words in a token is known as an n-gram. In semantic analysis, the choice of n plays an important role and can drastically reduce or increase the accuracy of a model.
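As a quick sketch of n-grams, nltk provides a helper that groups consecutive tokens; with bigrams (n = 2), the negation in not happy stays together as one unit (the example sentence here is made up for illustration):

```python
from nltk.util import ngrams

# A toy token list; with n=2 the negation "not happy" survives as a pair.
tokens = ['not', 'happy', 'with', 'the', 'service']
bigrams = list(ngrams(tokens, 2))
print(bigrams)
# [('not', 'happy'), ('happy', 'with'), ('with', 'the'), ('the', 'service')]
```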

There are various types of tokenizers; we’ll be using RegexpTokenizer, which creates tokens based on a given regular expression.

Example of tokenization

Citizens of India are known as Indians.

After applying the tokenizer to this sentence we get

tokens = ['Citizens', 'of', 'India', 'are', 'known', 'as', 'Indians']

Tokens are created based on the \w+ regular expression. This regex matches runs of alphanumeric characters (and underscores), so the tokenizer splits the sentence wherever a space or other special character occurs.

Removing Stop Words

Stop words are words that are not significant or do not help in understanding a sentence. Stop words are not fixed and may change based on the use case of a model. Generally, stop words are identified by word frequency: the more frequent a word, the more likely it is to be a stop word. Stop words are also the words humans normally skim over; the set includes words such as ‘of’, ‘the’, ‘in’, ‘a’, etc.

But before defining the stop words you need to know the application of the model. For example, in the phrase Messi is the God we cannot ignore the words ‘is’ and ‘the’, as the phrase Messi God doesn’t make sense.

Now, we’ll remove stop words from the previous sentence

new_tokens = ['Citizens', 'India', 'known', 'Indians']
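The removal step can be sketched with a list comprehension. A small hand-picked stop-word set is used here for illustration; nltk.corpus.stopwords.words('english') provides a fuller list (after nltk.download('stopwords')):

```python
# Illustrative stop-word set; in practice use nltk.corpus.stopwords.
stop_words = {'of', 'the', 'in', 'a', 'are', 'as'}

tokens = ['Citizens', 'of', 'India', 'are', 'known', 'as', 'Indians']
# Lowercase each token for the lookup so capitalized stop words are caught too.
new_tokens = [t for t in tokens if t.lower() not in stop_words]
print(new_tokens)
# ['Citizens', 'India', 'known', 'Indians']
```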

Lemmatization

This process transforms words into their root forms. Lemmatizers are used for this process.

[Image: lemmatization, mapping many word forms to a single root word]

As we can see, many word forms can be transformed into a single root word. Lemmatization is a complex process that uses a dictionary to map words to their roots, so different lemmatizers can produce different root words.

We’ll be using two different lemmatizers

WordNetLemmatizer

This lemmatizer produces the following root words

['Citizens', 'India', 'known', 'Indians']

As we can see, nothing changed: this lemmatizer is conservative and only transforms words it can find in its WordNet dictionary, leaving capitalized forms like these untouched.

PorterStemmer

Strictly speaking, PorterStemmer is a stemmer rather than a lemmatizer: instead of looking words up in a dictionary, it aggressively chops off suffixes using a set of rules.

This lemmatizer produces the following root words

['citizen', 'india', 'known', 'indian']

We can see that all the capital letters are converted to lowercase and plurals are reduced to singular.

Implementation
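A minimal end-to-end sketch chaining the three steps (the stop-word set is hand-picked for this example; a real pipeline would use nltk.corpus.stopwords):

```python
from nltk.tokenize import RegexpTokenizer
from nltk.stem import PorterStemmer

STOP_WORDS = {'of', 'the', 'in', 'a', 'are', 'as'}  # illustrative list

def preprocess(sentence):
    # 1. Tokenization: split on runs of alphanumeric characters.
    tokens = RegexpTokenizer(r'\w+').tokenize(sentence)
    # 2. Stop-word removal.
    tokens = [t for t in tokens if t.lower() not in STOP_WORDS]
    # 3. Stemming (PorterStemmer also lowercases by default).
    stemmer = PorterStemmer()
    return [stemmer.stem(t) for t in tokens]

print(preprocess("Citizens of India are known as Indians."))
# ['citizen', 'india', 'known', 'indian']
```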

Applications of NLP

  1. Automatic Summarization.
  2. Sentiment Analysis.
  3. Text Classification.
  4. Question and Answering (Chatbots).
  5. Customer Service.

We’ve covered the basic steps used while processing text and also learned some applications of NLP. I hope this blog helps you take your first step towards NLP.


Sanket Doshi

Currently working as a Backend Developer. Exploring how to make machines smarter than me.