Everything to get started with NLP

This is a three-article series in which I will be covering the following topics in detail.
In this article:
- Overview, history, and difficulties with NLP.
- Major Applications.
- How it works: Tokenization, Normalization, and feature extraction (BOW, TF-IDF).
- How to make a Sentiment Classifier.
- Overview of NLP libraries, including spaCy and NLTK.
- Pros and Cons of these libraries.
- How to use these libraries.
- Overview of NLP Pre-trained models.
Without wasting any time, let’s get started…
What is NLP?
NLP stands for Natural Language Processing. It is the field of study that focuses on the interactions between human language and computers. It is an amalgamation of computer science, artificial intelligence, and computational linguistics.
NLP is a sub-field of Artificial Intelligence that is focused on enabling computers to understand and process human languages, to get computers closer to a human-level understanding of language.
The road trip of NLP

Initial KMs
In the initial days, most NLP tasks used hand-coded rules: for example, grammatical rules, or heuristic rules for stemming. As expected, it was quite a cumbersome process with very low accuracy.
After some hundred KMs
Later on, between 1980 and 1990, researchers started applying machine learning algorithms to NLP tasks. They tried to use the enormous amount of real-world data available to them to learn the rules that were earlier hand-coded. But since they didn't have the advanced computers of today, the approach and its accuracy were limited.
Later on the road
Then, with the advancement of computer processing, scientists were able to exploit data in a much better way. They focused on statistical models, which make probabilistic decisions by attaching real-valued weights to each input feature. Such models give their answers as probabilities, so that more than one option can be considered.
Why is NLP difficult or not easy?
If you are human, you can understand what the statement below is trying to convey; if you are not, and you still understood it, then NLP is working at its imagined best.
Virat Kohli was wreaking havoc last night. He literally smashed every ball out of the park.
What can we understand from this news statement?
To us humans, it is very obvious what this means. We know that Virat Kohli is a Batsman and he played very brilliantly and scored lots of runs against some opponent. The park here refers to a stadium.
But computers don't work like humans, and they will not be able to read between the lines. For them, wreaking havoc means some disaster being caused by Virat Kohli, and park means a park, not a stadium.
Now, this is what actually NLP is:
NLP is a fitness program for computers, which they follow to try to reach the level of human strength.
Based on their performance, researchers keep improving the program.
Major Applications of NLP
The following is a list of some of the most common applications of natural language processing.
- Machine Translation
- Speech Recognition
- Spell Checking
- Language Generation
- ChatBots
- Sentiment Analysis
NLP at Work

To understand how NLP works, let us try to build a system that performs Sentiment Analysis. Sentiment Analysis is a text classification task: we will try to determine whether a given text expresses positive or negative sentiment.
For a restaurant owner who wants to know how happy the customers are, sentiment analysis of reviews is very important. Likewise, this task is very important in many different fields.
Note: This is a very basic example of NLP, but I hope it gives you a gist of how things work in the NLP world.
Use case
We will be trying to make a sentiment classifier for movie reviews. That is, we will try to classify a review as positive or negative based on the marker words in the review.
Let us see an example of a positive and a negative review.
Positive:
The movie was very nice. It was full of surprises and fun. Characters were at their best.
Negative:
This movie fell short of my expectations. I had expected much more from this director. Not a good movie.
Let us formally write our input and output
Input: Text of review.
Output: Class of sentiment: Positive or Negative.
Note: There could be more output classes, such as slightly positive, but for the sake of understanding we have taken just two.
Before feeding the data into an ML algorithm, we must do some pre-processing. We will understand different pre-processing steps with examples. So just hold on tight. This is gonna be interesting.
Text Pre-processing
The most important step.
Broadly categorizing, we have two main steps in Text Pre-processing:
- Cleaning up Text: Since the sentiment is decided by marker words such as beautiful, not good, great direction, and fantastic, and not by words like my, his, and her, we try to remove words that have no effect on our classifier. We call these words stopwords.
- Feature extraction: Since the reviews are in string format, and most of the Machine Learning algorithms take numeric features as input, we have to somehow convert the reviews into numeric features.
1. Cleaning up Text
The following are the basic and most important cleaning steps for most NLP tasks.
1. Tokenization
2. Token Normalization
3. Further Normalization
Let us look at each one by one.
1. Tokenization
First, we would like to split the input sequence into individual tokens. A token can be thought of as a useful unit for semantic processing; simply put, we are breaking the text into small, meaningful segments.
There are three very famous ways of Tokenization.
- White Space Tokenizer.
- Tokenizing on punctuation.
- Using Grammar to tokenize.
Let us look at how to use them and their limitations. We will use Python’s NLTK library.

White Space Tokenizer
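A minimal sketch with NLTK's WhitespaceTokenizer (the sample sentence is my own):

```python
from nltk.tokenize import WhitespaceTokenizer

text = "This is Andrew's text, isn't it?"
tokens = WhitespaceTokenizer().tokenize(text)
print(tokens)
# ['This', 'is', "Andrew's", 'text,', "isn't", 'it?']
```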

Problem: Here it? is a separate token and would not match it if compared, even though it and it? have the same meaning. So we would like to merge them, i.e. treat them as the same thing.
Let us try Tokenizing using punctuations.
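A sketch with NLTK's WordPunctTokenizer on the same sample sentence:

```python
from nltk.tokenize import WordPunctTokenizer

text = "This is Andrew's text, isn't it?"
tokens = WordPunctTokenizer().tokenize(text)
print(tokens)
# ['This', 'is', 'Andrew', "'", 's', 'text', ',', 'isn', "'", 't', 'it', '?']
```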

Problem: s, isn, and t are not very meaningful tokens. Let us try to tokenize on the basis of some grammar rules of English.
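A sketch with NLTK's TreebankWordTokenizer, which uses Penn Treebank rules, on the same sample sentence:

```python
from nltk.tokenize import TreebankWordTokenizer

text = "This is Andrew's text, isn't it?"
tokens = TreebankWordTokenizer().tokenize(text)
print(tokens)
# ['This', 'is', 'Andrew', "'s", 'text', ',', 'is', "n't", 'it', '?']
```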

Here you can see that it? is no longer there. Since it and it? have the same meaning, this tokenizer has given only it as output.
Now that you know three types of tokenizers, you can use whichever you prefer. I prefer the third one.
2. Token Normalization
Take a look at these word pairs
- wolf, wolves
- talk, talks
- kick, kicked
The words in each pair may look different but actually convey the same meaning, so it would be very beneficial to have a single representation for both. The same applies to any number of variants of the same word.
For this, we can do Token Normalization. We will understand it through code, but first some theory for under-the-hood knowledge.
There are basically two very famous processes to do Token Normalization.
- Stemming:
A process of removing and replacing suffixes to get to the root form of a word, which is called the stem. It usually refers to heuristics that chop off suffixes.
Example: Porter's stemmer.
The oldest and most famous stemmer for the English language. It applies five sets of rules sequentially. It fails on irregular forms and produces non-words, but it still works well in practice.
- Lemmatization:
Usually refers to doing things properly, with the use of a vocabulary and morphological analysis. It returns the base or dictionary form of a word, which is called a lemma.
Example: WordNet lemmatizer.
Uses the WordNet database to look up lemmas. Not all forms are reduced.
Comparison between Stemming and Lemmatization on some words.

Now let’s perform the same using Python’s NLTK
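First, stemming with NLTK's PorterStemmer on a few sample words (my own picks):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["feet", "wolves", "cats", "talked", "football"]:
    print(word, "->", stemmer.stem(word))
# feet -> feet
# wolves -> wolv
# cats -> cat
# talked -> talk
# football -> footbal
```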


Problem: feet is not converted into foot. Also, there are non-words like wolv, footbal.

Problem: talked is not converted into talk. Also tripping is not converted.
We need to try both stemming and lemmatization and choose whichever works best for our task. We can also apply one after the other; the effectiveness will depend on the output.
3. Further Normalization
Some of the other Normalization that we can do are:
- Normalizing capital letters:
Us and us should both become us if both are pronouns. But US could also be the country, and it is tricky to know whether a given occurrence is the country or a pronoun.
We can use heuristics for this:
1. Lowercase the beginning of a sentence, since sentences typically start with an uppercase letter.
2. Lowercase words in titles.
3. Leave mid-sentence words as they are, because a capitalized word in mid-sentence may be a named entity.
- Dealing with acronyms:
The same acronym can be written in multiple forms. For example, eta, e.t.a., and E.T.A. are all acronyms for Estimated Time of Arrival, so it is better to convert all the forms into one single form.
- Removing stopwords
- Removing Symbols
- And many more…
2. Feature Extraction

Now that we have pre-processed the text and it is in the form of normalized tokens, we will extract features from it to feed into a machine learning algorithm.
For this task, we will do vectorization of the text. In simple words, we will convert each text into a numeric vector. There are three famous approaches for this, each with its own merits and demerits. We will look at all three, one by one.
1. Bag Of Words(BOW)
In BOW we count the number of occurrences of every token in our text. The motivation for this approach is that we are looking for marker words, like excellent and disappointing, which can help discriminate between a positive and a negative review. After counting occurrences we end up with a feature vector for the whole text as well as for every individual token. Let us see a very small visual representation of actual BOW vectorization.
For this example we will take three movie reviews:
- good movie
- not a good movie
- did not like

As we can see from the table, we have a count for every word in the text. Please note that counts can be more than 1. Each row is taken as the vector for the corresponding movie review, and each column is a vector representation of the corresponding word.
Problems with this method
- Since it is a BOW representation, we lose the word order.
- Counters are not normalized
2. BOW with N-grams
The problem of losing word order can be resolved by n-grams, where n is a positive integer.
Let us see what do we mean by n-grams:
1-gram: Tokens. example: good, movie, etc.
2-gram: Token pairs. example: good movie, did not, etc.
and so on…
Now let us see how it looks. We will take the same three reviews as in the case of BOW. And we will include 2-grams also.

Now we have preserved some word ordering using 2-grams. As in the last case, each row corresponds to a movie review. Please note that in the table above we have removed stop words (too-frequent n-grams).
Problems with n-gram representation
- Too many features: we added just 2-grams for only 3 small reviews and got 9 features. What if we had hundreds of reviews and thousands of tokens? Even adding just 2-grams, the number of features would run into the millions.
- Counters not Normalized
Resolving the issue of too many features
Since the features can be too many, we can remove some n-grams based on their occurrence frequency in the documents of our corpus.
- High-frequency n-grams: articles, prepositions, etc. For example: and, a, the. These are stopwords with no discriminating power.
- Low-frequency n-grams: typos and rare n-grams. We do not need these because they are very likely to cause overfitting: a low-frequency n-gram can look like a perfect feature to our future classifier, which could then just check for that n-gram and give an output. We don't want such dependencies.
3. TF-IDF [Term frequency–Inverse document frequency]
As we saw in the previous case, there are some high- and low-frequency n-grams that must be removed. What remain are the medium-frequency n-grams, and since we deal with a lot of data, even these can be very numerous.
We will try to determine which medium frequency n-grams are more useful than others in the task at hand.
The motivation behind TFIDF:
N-grams with smaller frequency can be more discriminating than others because they can capture a specific issue in the review.
For example, a particular hotel review may mention a wi-fi problem, which is a big issue but not one that appears in every review. We want to extract n-grams that are more common in one document and less common in others, so that they can highlight some particular issue.
For this task, we can use TF-IDF Vectorization.
TF-IDF: A better BOW
We can replace the counters in the BOW vectors with TF-IDF values and then normalize the result row-wise. This resolves the non-normalized-counters issue of simple BOW vectors.
All this can be done using scikit-learn's TfidfVectorizer. I have given you the essence of TF-IDF; for the formulas and theory, please see the Wikipedia article on TF-IDF.
Let us now look at the implementation of TF-IDF Vectorization in Python.

In the output, you can see that for each movie review in the reviews list we have a row with the TF-IDF values corresponding to the features in the columns.
The next article in this series has the full Python implementation with inline code blocks; you can copy the code from there to try it out.
Summary of feature extraction:
1. We made simple count features in a BOW manner.
2. We can also add n-grams.
3. We can replace counters with TF-IDF values.
Making the Sentiment Classifier
Now that we have extracted features, we can feed them into any classification machine learning model. Though any is a relative term, and there will always be some constraints: in our case the features are very long, sparse vectors (mostly zeros), so decision-tree models would take a huge amount of time and give very low accuracy. One model that performs very well with long sparse vectors is Logistic Regression, a linear classification model that is very fast to train.
We will cover the coding in the next Article.


