Getting started with Natural Language Processing: Easy, Quick & In-Depth (Part I)
I have gone through many blogs on Natural Language Processing, but couldn't find one that explains how a specific algorithm works or the mathematical theory behind NLP strategies. My aim is to explain NLP in a very easy manner without sacrificing the quality of the content.
What is Natural Language Processing (NLP)?
In simple words, NLP means processing and analyzing textual data.
Technical definition:
Natural language processing is a sub-field of computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human languages, in particular how to program computers to process and analyse large amounts of natural language data.
What is the use of NLP? OR Why do we need NLP?
In the era of the internet, most of the data generated is raw text, which is, in fact, unstructured data. This data is not useful unless it is processed and analyzed. We use NLP techniques to turn "data into information". Using NLP we can gain a deep understanding of data, and we can also take actions based on its outcomes.
Some applications where NLP is used:
Sentiment Analysis: analyzing if the text is positive or negative,
Automated Summarization: summarizing the meaning of data,
Text Classification: organizing data into predefined categories.
Let’s get started :
To understand what NLP actually does, we are going to build a sample project on sentiment analysis, checking whether a given review is positive or negative.
This pizza is very tasty. (positive statement)
This Pizza is not tasty. (negative statement)
Prerequisites:
- Programming with Python (basic knowledge)
- Basics of Machine Learning
- Naive Bayes (a machine learning algorithm)
Getting hands dirty:
NLP can be done using machine learning as well as deep learning; to keep this blog simple, I will use the machine learning approach.
Firstly, we need to install the Natural Language Toolkit (nltk). Installation instructions are at https://www.nltk.org/install.html.
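A minimal sketch of the setup, assuming pip is available; the second command pre-downloads the nltk corpora (stop words and WordNet) used later in this post:

```shell
# Install nltk and fetch the corpora used in this post
pip install nltk
python -c "import nltk; nltk.download('stopwords'); nltk.download('wordnet')"
```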
Before applying any NLP / machine learning algorithm we need to clean the text and put it in a proper format, so that processing and analysis functions can be applied to it.
Importing the dataset:-
The simplest dataset I found for sentiment analysis is available on superdatascience.com. Go to https://www.superdatascience.com/machine-learning/ and search for Natural-Language-Processing.zip (this downloads all files, i.e., the dataset and code files). The dataset contains only 1,000 rows and two columns, ‘Review’ and ‘Liked’.
Note: The code in the downloaded files is different from what is written in this blog, so don’t run it directly; it may lead to unexpected behavior.
Since the dataset is so small, there is not much Exploratory Data Analysis we can do. But if a dataset is huge and has many features (columns), we need to explore and analyze which data is needed for the analysis and which can be skipped.
There are many advanced and complex ways to do Natural Language Processing, but I will be using a very simple and limited pipeline, as shown below.
Text cleaning steps:-
- Remove Duplicates
- Remove Punctuations
- Remove Numbers, alphanumerics
- Remove HTML tags
We need to remove duplicate entries as they may affect our accuracy while predicting the output (since our dataset is small and has few columns, finding and removing duplicates is straightforward). Punctuation marks do not carry much meaning in the reviews, so they can be eliminated. Numbers and alphanumerics can also be eliminated, as we are not quantifying the review; we are just analyzing whether it is positive or negative.
We can use regular expressions to remove punctuation marks or we can use some inbuilt library.
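A minimal regex-based cleaning sketch for the steps above; `clean_review` is a hypothetical helper name, not from the downloaded code:

```python
import re

def clean_review(text):
    """Drop numbers/alphanumerics, then strip punctuation."""
    # Drop any token that contains a digit (numbers and alphanumerics)
    text = " ".join(tok for tok in text.split() if not re.search(r"\d", tok))
    # Replace anything that is not a letter or whitespace with a space
    text = re.sub(r"[^A-Za-z\s]", " ", text)
    # Collapse repeated whitespace
    return re.sub(r"\s+", " ", text).strip()

print(clean_review("Wow... Loved this place!!! Table no. 12b, bill 300."))
```

HTML tags (like `<br>`) would need an extra pattern or a parser such as BeautifulSoup; they are left out here for brevity.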
Text pre-processing steps:-
- Remove stop words
Tokenization: In sentiment analysis, we give more importance to words that carry meaning and remove meaningless words (stop words). So we need to work with individual words rather than a complete review/sentence, and it is better to divide the data into smaller chunks. This process of dividing text into smaller chunks (tokens) is known as tokenization. Using tokenization we can give more weight to words rather than sentences, and we can often tell whether a review is positive or negative just by looking at specific words.
Removing stopwords: In a sentence, there are many filler words. Examples: ‘is’, ‘and’, ‘are’, ‘the’, etc. These words don’t add any meaning to the sentence, so we can remove them.
Making all words lowercase helps identify and merge identical words, which reduces the vector size in Bag of Words or any other approach (I will explain this in a few minutes… keep reading).
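The three pre-processing steps above can be sketched as follows. For illustration a small hand-picked stop-word set is used; `nltk.corpus.stopwords.words('english')` provides a much fuller list, and whether ‘not’ belongs in that list matters later in this post:

```python
# Hypothetical, hand-picked stop-word set for illustration only
STOP_WORDS = {"is", "and", "are", "the", "this", "a", "an"}

def preprocess(review):
    tokens = review.lower().split()           # tokenize + lowercase
    return [t for t in tokens if t not in STOP_WORDS]  # drop stop words

print(preprocess("This Pizza is not tasty"))  # ['pizza', 'not', 'tasty']
```

Note that ‘not’ is deliberately kept out of the stop-word set here; removing it would flip the meaning of the review, as discussed below.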
Stemming: Now we have separate words to analyze, but some of them share the same base/root word, so we can merge them into a single word; we don’t want to increase the size of the data to be analyzed. Stemming converts a word to its root form. Example: sing, singing, singer belong to the root word ‘sing’, so it is better to keep the single word ‘sing’ and remove the rest. Sometimes the output of a stemmer is not a meaningful word. Example: calves -> calv (a stemmer simply tries to remove or substitute suffixes). There are 3 commonly used stemmers; their comparison is below:
We can observe here that PorterStemmer is the least strict, LancasterStemmer is the strictest (but really fast), and SnowballStemmer is a good trade-off between speed and strictness. So we will use SnowballStemmer for stemming our reviews.
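A quick way to compare the three stemmers yourself; all of them ship with nltk and need no extra downloads:

```python
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()
snowball = SnowballStemmer("english")

# Print each word next to its three stemmed forms
for w in ["singing", "singer", "calves", "tasty"]:
    print(w, "->", porter.stem(w), lancaster.stem(w), snowball.stem(w))
```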
Lemmatization: Like stemming, lemmatization also converts words to their base form; it is just more advanced. A lemmatizer uses a knowledge base to convert words to their root form. Example: calves -> calf (noun lemmatizer output). Here the lemmatizer doesn’t simply strip suffixes. Below is the comparison between the noun lemmatizer and the verb lemmatizer; you can use either, but keep it consistent throughout the code.
The knowledge base it uses is called WordNet. Because of this knowledge base, lemmatization can even handle words whose root has a different spelling, which stemmers cannot solve, for example converting “came” to “come”. The output of a stemmer can be meaningless, but the output of a lemmatizer will always be a meaningful word.
Stemming or lemmatization or both? It all depends on the dataset and how you analyze it. It is best to use trial and error, because no two datasets are the same; after a few trials it becomes clear whether to use stemming, lemmatization, or both.
From our dataset, I took an extreme case of reviews, where some people wrote crazy reviews with typos and deliberate spelling mistakes to express emotions, and checked which works better, stemming or lemmatization.
As you can see, from the trials I found that SnowballStemmer works better than the other stemmers and even better than the lemmatizer. I will go with SnowballStemmer, but if my accuracy at the end of the complete sentiment analysis is not as expected, I may try different methods.
Machine learning is not only about how efficiently your model runs, but about how accurately it performs. We won’t get the most accurate result on the first attempt; we need to optimize our methods again and again.
Now that we are done with cleaning and pre-processing the text, we need to analyze and process it. For that, we have a few strategies and models, discussed below.
Feature Engineering Strategies/models to analyze text:
- Bag of Words
- TF-IDF (Term Frequency * Inverse Document Frequency)
- Average Word2Vec
- TF-IDF weighted word2Vec
As this blog is getting too large, I will be explaining only the Bag of Words approach in this part.
Bag of Words:
I told you something about ‘vectors’ in the above sections. Remember?
Our computer cannot perform operations directly on pure textual data, so we need to convert the data into a form a computer can understand. The computer understands the language of numbers and mathematics, so we convert our data into something called a vector, on which mathematical operations can be performed easily.
A vector is nothing but a single-dimension array of size ‘n’.
A matrix is a collection of vectors. Once we have converted reviews to vectors, we can leverage the power of linear algebra to analyze them. The distance between two similar vectors will be less than the distance between two dissimilar vectors, so similar vectors can be plotted near each other and dissimilar ones away from each other in 2D, 3D, or n-D space. We can then use a line or a plane in n-dimensional space to separate the groups of positive and negative points, and in this way distinguish positive reviews from negative reviews.
Let’s assume we have the vector representation of the reviews in d-dimensional space, as shown in the above image. We can separate them with a plane in d-dimensional space using ‘w’ (the normal, i.e., perpendicular, to the plane). Finding ‘w’ is a complex topic for now, so let’s just assume that points on the side of the plane that ‘w’ points toward are classified as positive, and points on the opposite side as negative.
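A toy 2-D sketch of that decision rule: the sign of the dot product w·x tells us which side of the plane a review vector x falls on. The vectors here are made up for illustration:

```python
import numpy as np

w = np.array([1.0, -1.0])          # hypothetical normal to the plane
positive = np.array([2.0, 0.5])    # hypothetical "positive" review vector
negative = np.array([0.5, 2.0])    # hypothetical "negative" review vector

print(np.dot(w, positive) > 0)  # True  -> same side as w, classified positive
print(np.dot(w, negative) > 0)  # False -> opposite side, classified negative
```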
A review is also called a document. And the collection of reviews or documents is called a corpus.
Let’s assume we have 3 pre-processed reviews (where we have also joined the tokens back into single sentences).
Now we can say that we have a corpus and it has 3 reviews/documents.
The Bag of Words strategy converts all the reviews in the dataset into vectors, one vector per review. Each vector stores the count of every word of the corpus in that review.
As the name ‘Bag of Words’ suggests, it creates a bag (vector) containing the word counts for each review/document. You can see that this matrix is a sparse matrix (most of its elements are zeros). If pre-processing had not been done, the matrix would have been even sparser and the number of words would have increased, making it less efficient and less accurate. Also, by lowercasing words, we avoided extra entries in the vector. Example: ‘Food’ and ‘food’ would be separate entries/features in the vector if they were not lowercased.
Code (bag of words): We will use CountVectorizer to convert our reviews to vectors. CountVectorizer is available in the scikit-learn library (sklearn).
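A minimal sketch with three toy pre-processed reviews standing in for the corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three toy pre-processed reviews (stemmed, stop words removed)
corpus = ["pizza tasti", "pizza not tasti", "love pizza"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)      # sparse document-term matrix

print(sorted(vectorizer.vocabulary_))     # ['love', 'not', 'pizza', 'tasti']
print(X.toarray())                        # one count vector per review
```

Each row is one review’s vector; each column is one word of the corpus, which is why the matrix is mostly zeros.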
The next steps are the basic steps in machine learning:
- Splitting data into a Training set and Test set.
- Fitting a machine learning algorithm to the Training set (here we will use the Naive Bayes machine learning algorithm).
- Predicting the Test set results.
- Checking the accuracy of the model.
The Naive Bayes classifier assumes that each feature in the vector is independent of the others. Thus it considers each feature to contribute independently to the probability of the review being positive or negative, without any correlation between features.
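The four steps above can be sketched end to end as follows. Note this uses a tiny made-up corpus in place of the restaurant-review dataset, and MultinomialNB (a Naive Bayes variant well suited to word counts); the 74% accuracy quoted below comes from the real dataset, not this toy:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy labelled corpus standing in for the cleaned reviews (1 = liked)
reviews = ["tasty pizza", "loved the food", "great place", "good service",
           "not tasty", "bad food", "worst place", "not good service"]
liked = [1, 1, 1, 1, 0, 0, 0, 0]

# Bag of Words vectors
X = CountVectorizer().fit_transform(reviews)

# Split, fit, predict, evaluate
X_train, X_test, y_train, y_test = train_test_split(
    X, liked, test_size=0.25, random_state=0)
clf = MultinomialNB().fit(X_train, y_train)
y_pred = clf.predict(X_test)

print("accuracy:", accuracy_score(y_test, y_pred))
print("confusion matrix:\n", confusion_matrix(y_test, y_pred))
```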
We can see that the accuracy of the model is 74%, which is not bad. Looking at the confusion matrix we observe :
We can see that 41 reviews were false positives, i.e., reviews that were actually negative but were predicted as positive. Strange?
This happened because while removing stopwords, we also removed ‘not’. Removing ‘not’ can completely change the meaning of the review.
1. This place is not good. (original review)
2. place good (after removing stop words; the meaning changes completely)
The Concept of uni-gram, bi-grams, and n-grams
To deal with situations like the false positives above, we need to work with more than one word at a time.
A uni-gram is a single word. When converting reviews to vectors using CountVectorizer, the uni-gram method is used by default: each single word becomes a feature in the vector.
We can also use bi-grams (2 words) or tri-grams (3 words) to handle phrases like ‘not good’ (bi-gram) or ‘not so tasty’ (tri-gram); it is very rare to go beyond tri-grams. In this way, we can increase the accuracy of our model.
Example review: Place is not that good as compared to others. (after pre-processing, without removing the word ‘not’)
Note: Now, since we will be using ‘not’ in our reviews, removing stop-words like ‘not’ should be avoided.
I used the bi-gram and tri-gram approach, but instead of getting better results, my accuracy kept decreasing as I increased the n-gram range. For bi-grams and tri-grams I got an accuracy of 73.5% (a decrease of 0.5%). It might be because I avoided removing stop words like ‘not’, and using custom stop words might not work either.
Code for building n-grams:
I have pushed the code to my repository here, you can try different stemmers, lemmatizers, can use custom stopwords, change the range of n-grams, change Machine Learning algorithms and figure out which one gives you better accuracy.
Bag of Words is a very primitive approach to NLP. We saw that removing stop words can decrease our accuracy, while using n-grams can increase it, but then we need to avoid removing stop words like ‘not’, and we cannot manually vet stop words one by one. So a different approach is needed. Maybe TF-IDF will give better results, or some other machine learning or deep learning approach. Let’s figure it out.
I would be discussing TF-IDF in the next part.
Thanks for reading patiently.
https://www.appliedaicourse.com/ for all theory and deep concepts of mathematics, ML and NLP.
https://www.superdatascience.com/ for the simplest dataset that I could find for NLP. Also, some parts of the code were referenced from here.
All the images in the blog without source mentioned are self-made by me using MS Paint, referring concepts taught in https://www.appliedaicourse.com/