Natural Language Processing

Teaching machines to understand words and sentiments

Smruti Ranjan Pradhan
AlmaBetter
7 min read · Apr 29, 2021


What is Natural Language Processing?

Natural language processing, NLP for short, is a comprehensive field of machine learning that deals with teaching machines to learn patterns from written text and documents. There are numerous techniques, which we will be discussing in this article, through which a computer learns to make inferences about an object or document just by interpreting the associated textual data. With the help of NLP, we are currently able to detect sentiment, find keywords and topics associated with articles, build recommender systems based on textual reviews, and so on.

How are machines so smart?

It isn’t as complicated as it may sound at first go. In basic natural language processing systems, algorithms treat the frequency of each word in a particular document as a feature. So if a document has, let’s say, 10,000 unique words with certain term frequencies, then we’ve got 10,000 unique features. The machine learning model then tries to learn the relevance of each feature based on its term frequency. However, the frequency of certain words can be a misleading feature for judging a particular document, because words such as ‘the’, ‘have’ and ‘they’ occur many times in English text irrespective of the document class. Hence we need to run a couple of preprocessing steps before fitting any model to our textual data set.
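To make the idea of term-frequency features concrete, here is a minimal sketch (the toy sentence is made up for the example) that counts each unique word in a document with Python’s collections.Counter:

```python
from collections import Counter

doc = "the movie was fun and the plot was fun too"

# each unique word becomes a feature; its count in the document is the feature value
term_frequency = Counter(doc.split())
print(term_frequency)
# Counter({'the': 2, 'was': 2, 'fun': 2, 'movie': 1, 'and': 1, 'plot': 1, 'too': 1})
```

Notice how a generic word like ‘the’ already scores as high as the genuinely informative word ‘fun’, which is exactly why the preprocessing steps below matter.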

Preprocessing Texts

  • Tokenization: We need to convert our text document, comprising sentences and paragraphs, into a collection of words/phrases/tokens before performing any further text processing. Tokenization can be performed in different ways based on the maximum and minimum number of words you want in each token. A tokenizer in which every token consists of only one word is known as a 1-gram tokenizer. Similarly, a tokenizer that splits the text such that each token consists of a maximum of n words is known as an n-gram tokenizer. (These preprocessing steps are illustrated in a short code sketch after this list.)
  • Removing punctuation: Every piece of writing contains punctuation to help readers comprehend it. However, including punctuation in our machine learning model is usually a bad idea, since punctuation marks hold little significance when it comes to figuring out topics or understanding the sentiment of texts. Keeping punctuation can also trick the model into interpreting ‘fun.’ with a full stop and ‘fun’ without one as two different features, leading to multicollinearity and unnecessary expansion of the feature set. Hence it is almost always a good idea to remove punctuation.
  • Removing stop words: Stop words are the most frequently occurring, largely redundant words in any language. ‘the’, ‘is’, ‘have’, ‘shouldn’t’ etc. are some examples of stop words in English. Since these words carry little information about the content, it is often worthwhile to remove them, especially when the feature set is already high dimensional.
  • Removing high and low frequency words: Even after the removal of stop words, the feature set might still be huge due to the vastness of the vocabulary. It might then be a good idea to remove words that occur with very high frequency across documents, since such words may not contribute much to the classification or clustering task at hand, for the same reasons that hold true for stop words. Similarly, words that occur with very low frequency across documents might also not convey much about the nature of the documents or the sentiment of the texts.
  • Stemming and Lemmatisation: In most text, we see words that come from the same root but take different forms based on tense, sentence structure and part of speech. For example, ‘transforming’ and ‘transformation’ come from the same root word ‘transform’, but differ based on their use cases. In these cases, it is essential that we reduce each word to its root so that the variants are not treated as different features by the model. There is a fine line between stemming and lemmatisation: stemming doesn’t guarantee a meaningful word as output, because it follows a fixed set of rules defined by algorithms such as the Porter, Lancaster and Snowball stemmers, without any knowledge of context. For example, the word ‘family’ gets transformed to ‘famili’ by the Porter stemmer, which has no meaning as such. Unlike stemming, lemmatisation attempts to select the correct lemma depending on the context.
[Image: Introduction to Stemming vs Lemmatization (NLP)]
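The sketch below strings these preprocessing steps together using NLTK. It is a minimal illustration under my own assumptions (the example sentence and the choice of NLTK are mine, and the exact resource downloads can vary slightly between NLTK versions), not a production pipeline:

```python
import string
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

# one-time downloads: tokenizer model, stop-word list and WordNet
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

text = "The transformation of raw text into clean tokens is fun."

# 1-gram tokenization: split the sentence into individual word tokens
tokens = word_tokenize(text.lower())

# remove punctuation and English stop words
stop_words = set(stopwords.words('english'))
tokens = [t for t in tokens if t not in string.punctuation and t not in stop_words]

# stemming vs lemmatisation on the surviving tokens
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
print([stemmer.stem(t) for t in tokens])         # e.g. 'transformation' -> 'transform'
print([lemmatizer.lemmatize(t) for t in tokens]) # lemmas stay valid dictionary words
```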

Now that we are done with preprocessing the text, we are left with one last, very important step before using it for any machine learning task such as classification or clustering.

Text Vectorization

As we have already discussed, in NLP we treat the unique words in the final processed text as our feature set, and their counts in a document as the values of those features. CountVectorizer is a well known class in scikit-learn that converts textual data into feature vectors. What it essentially does is find the list of all unique words across all the documents combined, and then, for each document, create a vector in which each component represents the count of that particular word/feature in that particular document.
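A minimal sketch of this with scikit-learn (the two toy documents are made up for the example, and get_feature_names_out assumes a reasonably recent scikit-learn version):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the movie was fun and exciting",
    "the movie was boring and too long",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)          # sparse document-term matrix of counts

print(vectorizer.get_feature_names_out())   # the unique words, i.e. the features
print(X.toarray())                          # one row of counts per document
```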

However, the count vectorizer is not always the best choice for vectorizing texts. There might be some words that are specific to a given document and not very frequent in other documents. It is intuitive that such words should carry more weight than words that are equally frequent in all the documents. This is where the tf-idf vectorizer has an important role to play. ‘tf’ stands for term frequency and ‘idf’ stands for inverse document frequency. It takes into account not only the term frequency of a word in a given document, but also the inverse of the frequency of documents containing that word. The higher the document frequency, the lower the weight the word gets in a particular document. To normalize and limit the explosion of the inverse document frequency, the actual formula uses a log transformation of the inverse document frequency.
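The same toy documents can be vectorized with scikit-learn’s TfidfVectorizer; this is a sketch of the idea rather than a tuned setup, and it relies on scikit-learn’s default behaviour of applying a smoothed, log-scaled idf followed by L2 normalization of each document vector:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the movie was fun and exciting",
    "the movie was boring and too long",
]

# default settings: smoothed, log-scaled idf and L2-normalised document vectors
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)

print(tfidf.get_feature_names_out())
# words that appear in only one document ('fun', 'boring', ...) get a higher
# weight than words shared by both documents ('the', 'movie', ...)
print(X.toarray().round(2))
```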

Now that we are done with vectorizing our texts and documents, it’s time to feed the processed feature set into various machine learning tasks based on the problem statement.

Modelling

Let us consider a simple task of teaching a machine learning model to differentiate between positive and negative tweets. We are given a set of labelled data containing tweets made by well known celebrities, sportspersons and politicians in India, and somebody has already done the herculean task of labelling them. Now it’s time to teach the machine to tell us the sentiment of any further tweet, so as to automate the process of sentiment analysis in the future. This is, of course, quite a common task in industry today.

We can process the data as described above. Depending on the use case, we can choose to keep the main text and hashtags while removing the @mentions. We remove any unnecessary punctuation and stop words. Since tweets are generally short and to the point, we might not need to remove high and low frequency words. We can choose to lemmatize the text and then vectorize it. Once preprocessing is done, we can pass the vectorized features as independent variables and the positive/negative tags as the target variable to a supervised learning algorithm such as a support vector machine. The evaluation metric depends on the problem statement: if our objective is to recognize all the negative tweets correctly and reduce the false negative rate, recall (on the negative class) is the evaluation metric we are looking for.
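A minimal end-to-end sketch under these assumptions follows; the four tweets and their labels are invented for illustration, and LinearSVC stands in for the support vector machine mentioned above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score

# toy labelled tweets: 1 = positive sentiment, 0 = negative sentiment
tweets = [
    "what a fantastic win for the team today",
    "so proud of this performance, brilliant innings",
    "terrible performance, really disappointed",
    "worst match I have watched in years",
]
labels = [1, 1, 0, 0]

X_train, X_test, y_train, y_test = train_test_split(
    tweets, labels, test_size=0.5, random_state=42, stratify=labels)

# vectorise the raw text and fit a linear support vector machine in one pipeline
model = make_pipeline(TfidfVectorizer(stop_words="english"), LinearSVC())
model.fit(X_train, y_train)

preds = model.predict(X_test)
# pos_label=0 treats the *negative* tweets as the class of interest,
# so this is the share of negative tweets the model actually caught
print(recall_score(y_test, preds, pos_label=0))
```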

There could be other use cases for the tweets. We might want to know the top keywords that the current tweets are talking about; to do that, we might look for the words that are most frequently used across most of the tweets. There are also more advanced algorithms such as Latent Dirichlet Allocation (LDA) that specifically deal with finding the topics that different tweets are talking about. We might consider covering it in detail in a later article. Apart from that, clustering problems are quite common in NLP, wherein we want to categorize texts based on some similarity between the topics in those texts.
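Just to give a flavour (a fuller treatment is left for a later article), here is a minimal LDA sketch with scikit-learn; the four toy tweets and the choice of two topics are assumptions made purely for the example:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

tweets = [
    "election results announced in the state today",
    "the team won the cricket match by six wickets",
    "new tax policy announced by the government",
    "what a brilliant innings by the captain",
]

counts = CountVectorizer(stop_words="english")
X = counts.fit_transform(tweets)

# ask LDA to uncover 2 latent topics from the word counts
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

words = counts.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top_words = [words[j] for j in topic.argsort()[-3:]]   # 3 heaviest words per topic
    print(f"topic {i}: {top_words}")
```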

I hope this article helped you gain some useful insight into the world of natural language processing. Until next time, bye!
