An Introduction to NLP

Smit Shah
Published in Code Dementia · 4 min read · May 16, 2019

Let’s get started with the first question that comes to mind: what is Natural Language Processing, and why is it useful? Well, the answer to the first question is in the name itself; more formally, it is a set of methods for making machines comprehend human language. And as for why, you’ve probably seen the applications already: almost all of you have used the website Grammarly, or had fun with a voice assistant like Siri or Google Assistant.

Now, let’s get started on the techniques used in NLP. This introduction assumes that you have a rough idea about some machine learning algorithms such as Logistic Regression or Naive Bayes. If you don’t, it suffices for now to know that they are basic classification algorithms that output probabilistic predictions.

Preprocessing the Data:

This is the most important step in any NLP problem. If you select the preprocessing techniques properly, a huge chunk of your work is done. However, the most challenging part is that there is no perfect way to do it: every method has its own advantages and shortcomings, and knowing which technique to use relies on intuition and your choice of algorithm. Now, let’s look at some preprocessing methods that are almost always used.

Removing the stopwords:

Consider two sentences: “The cat is eating chocolates” and “cat eating chocolates”. Grammatically, the first sentence is better formed; however, you can understand the meaning from the second sentence alone. So, which one do you think would be better for the computer to work with?

If you guessed the latter, you are indeed correct. Words like “the”, “it” and “are” do add grammatical structure to the sentence, but they are not always necessary to understand the meaning behind it. Removing them makes the computer process fewer words and hence might improve performance on your metric.
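Here is a minimal sketch of stopword removal using NLTK’s built-in English stopword list (the sentence and the simple whitespace split are just illustrative assumptions):

```python
# Minimal sketch: removing English stopwords with NLTK's stopword list.
import nltk
nltk.download("stopwords", quiet=True)

from nltk.corpus import stopwords

sentence = "The cat is eating chocolates"
stop_words = set(stopwords.words("english"))

# A simple whitespace split stands in for a proper tokenizer here.
filtered = [word for word in sentence.lower().split() if word not in stop_words]

print(filtered)  # ['cat', 'eating', 'chocolates']
```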

Stemming:

Let’s take a look at the words playing, plays and played. These words carry grammatical information, in particular about tense. However, in a prediction algorithm you generally want to ignore the tense, and the reason is simple: it improves efficiency. If you had an extremely large amount of data and hundreds or thousands of gigabytes of VRAM, it wouldn’t matter much to take these words as they are. However, we have limited data and computational power, and hence need efficiency. Thus, stemming is a way of cutting a word back to its stem, i.e. in the above example, playing, plays and played can all be reduced to “play” (or even a non-word like “pla”).
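As a quick sketch, NLTK’s Porter stemmer (one of several available stemmers) reduces all three forms to the same stem:

```python
# Minimal sketch: stemming with NLTK's Porter stemmer.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["playing", "plays", "played"]:
    print(word, "->", stemmer.stem(word))
# playing -> play
# plays -> play
# played -> play
```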

Lemmatizing:

Lemmatizing is another way of reducing words to their root form, except that it uses a vocabulary (and usually the word’s part of speech) instead of just chopping off suffixes. For example, in the case of good, better, best, a stemmer would be confused and might give unsatisfactory results, but a lemmatizer can map them back to the root word “good”.
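Here is a minimal sketch with NLTK’s WordNet lemmatizer, contrasted with the Porter stemmer (note that the lemmatizer needs to be told the part of speech; “a” means adjective):

```python
# Minimal sketch: lemmatizing with NLTK's WordNet lemmatizer.
import nltk
nltk.download("wordnet", quiet=True)

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("better"))                   # better (the stemmer can't help here)
print(lemmatizer.lemmatize("better", pos="a"))  # good   (the lemmatizer knows the root)
```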

TF-IDF:

TF-IDF stands for term frequency–inverse document frequency. The basic idea behind TF-IDF is to measure how important a word is to a document within a collection of documents. In the case of the term frequency tf(t,d), the simplest choice is to use the raw count of a term in a document, i.e., the number of times that term t occurs in document d. If we denote the raw count by f(t,d), then the simplest tf scheme is tf(t,d) = f(t,d). The inverse document frequency is a measure of how much information the word provides, i.e., whether it is common or rare across all documents. It is the logarithmically scaled inverse fraction of the documents that contain the word (obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient):

idf(t, D) = log( N / |{d ∈ D : t ∈ d}| )

where N is the total number of documents in the corpus and the denominator of the log term is the number of documents in which the term occurs. The final score is then the product of the two: tf-idf(t, d, D) = tf(t, d) × idf(t, D).

To understand this better, let us consider an example. If a term is really frequent in one document but is not present in any other document, the idf term would be log(N), as the denominator would be 1, and the tf term would be f (the frequency of the word). However, if the word were present in all the documents, the idf term would be 0, which would make the tf-idf 0 as well.
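Here is a minimal sketch using scikit-learn’s TfidfVectorizer on a tiny made-up corpus (note that scikit-learn applies a smoothed variant of the idf formula by default, so the exact numbers differ slightly from the textbook version above, but the intuition is the same):

```python
# Minimal sketch: TF-IDF features with scikit-learn on a toy corpus.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat is eating chocolates",
    "the cat plays with the dog",
    "the dog played in the garden",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(corpus)  # sparse matrix of shape (documents, terms)

# Rare words such as "chocolates" or "garden" get higher weights than
# words that appear in several documents, such as "cat" or "dog".
print(vectorizer.get_feature_names_out())
print(tfidf.toarray().round(2))
```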

So, here you have it, a basic understanding of all of these terms. In my next post, we’ll see how to implement these using libraries like nltk, sklearn, etc.
