Understanding Human Language Through AI

Radhika Bansal
AI Graduate
9 min read · Jan 11, 2019

“This is simply one of the best films ever made and I know I am not the first to say that and I certainly won’t be the last”

— An excerpt from an IMDb review of The Shawshank Redemption

Do you think the reviewer liked the movie or disliked it? You are right: the reviewer loved the movie and rated it highly. But how did you know that this is a positive, appreciative review? More importantly, would a computer or an algorithm be able to guess whether this is appreciation or criticism? Tasks like classifying a review as positive or negative based on its text are classic NLP, or Natural Language Processing, problems. Understanding human language and the emotions, sentiments and meanings behind it is a hard task for humans, let alone for computers.

The daunting task of having an algorithm understand and process human language to extract useful information is known as Natural Language Processing, aka NLP. For instance, extracting the words or phrases that carry a positive or negative sentiment in a movie review.

IMDb review for the movie October. The phrases in green boxes carry a positive sentiment
IMDb review for movie Judwaa-2. Red boxes enclose negative sentiment extracts

Another example could be breaking the following sentence into its syntactic elements:

Karna embarks on a worldwide military campaign, otherwise called Digvijaya Yatra,

Source: HackerEarth

Real world applications of NLP

Automatic Summarization — Imagine you are an employee at Amazon. Your task is to create a short two-line description of a product from a document describing it. This two-line summary will be visible to customers coming to the website. You can’t sit and read the documents for each of the millions of products listed on Amazon and create a summary for each of them manually; that would take years! This is where NLP rises to the occasion, through automatic summarization.

Source: https://medium.com/@social_20188/text-summarization-cfdbbd6fb800

Topic Categorization — How many articles published on Medium are about technology? Or sports? You cannot read all the articles and categorize them manually. NLP can help you categorize these articles by assigning tags to them.

Source: https://towardsdatascience.com/machine-learning-nlp-text-classification-using-scikit-learn-python-and-nltk-c52b92a7c73a

Sentiment Analysis — Guessing the sentiment of a text, e.g. classifying a review as positive or negative. Sentiment analysis is used a lot in analyzing tweets. For example, you can look at all the tweets related to a movie and see whether the general sentiment is positive, negative or neutral.

Source: https://www.kdnuggets.com/2018/03/5-things-sentiment-analysis-classification.html

Named Entity Recognition — Tagging each word in a text with predefined categories such as person, location, organization or something else.

Source: https://www.kdnuggets.com/2018/10/named-entity-recognition-classification-scikit-learn.html

NLP In Action — Movie Review Sentiment Analysis

Let’s start with a very simple approach: keep a dictionary of positive and negative words, and additionally store a weight (importance) for each word in the dictionary. Then count the positive and negative words in a review. Think of this as a weighted sum: more negative words with high weights means a negative review; more positive words with high weights means a fabulous movie. To do that, we first have to apply a very basic NLP technique — Tokenization
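As a sketch, the weighted-dictionary idea might look like this in Python. The word lists and weights here are made up for illustration, and the tokenization is naive whitespace splitting:

```python
# Hypothetical sentiment dictionaries with per-word weights (importance).
POSITIVE = {"good": 1.0, "great": 2.0, "best": 3.0, "thrilling": 2.0}
NEGATIVE = {"bad": 1.0, "boring": 2.0, "worst": 3.0}

def dictionary_sentiment(review: str) -> str:
    # Naive whitespace tokenization for the sake of the sketch.
    tokens = review.lower().split()
    pos_score = sum(POSITIVE.get(t, 0.0) for t in tokens)
    neg_score = sum(NEGATIVE.get(t, 0.0) for t in tokens)
    if pos_score > neg_score:
        return "positive"
    if neg_score > pos_score:
        return "negative"
    return "neutral"

print(dictionary_sentiment("one of the best films ever made"))  # positive
print(dictionary_sentiment("a boring movie with bad acting"))   # negative
```

Note that there is no learning here: the quality of the result depends entirely on the word lists and their weights.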

Tokenization is the process of breaking text down into smaller chunks, such as words or phrases. These chunks, called tokens, let us compute statistics such as token counts and token frequencies. Tokens give us a mathematical way of organizing the text, and so they become important features. What are features? Take a quick peek at this article about Neural Networks. Once we have tokens, we can analyze the statistics of the text at hand.
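A minimal tokenizer can be sketched with a regular expression (real systems use library tokenizers, such as those in NLTK or spaCy):

```python
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    # Pull out runs of letters, digits and apostrophes as tokens.
    return re.findall(r"[A-Za-z0-9']+", text)

tokens = tokenize("This is simply one of the best films ever made")
print(len(tokens))      # token count: 10
print(Counter(tokens))  # token frequencies
```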

Point to note: tokenization is hugely dependent on the language. Languages like Chinese and Japanese need dedicated word breakers because they do not put spaces between words, and compound-heavy languages like German bring their own splitting challenges.

But does tokenization resolve all our woes? Not really. As I said earlier, human language is complex. Let’s deal with the possible pitfalls of analyzing text through vanilla tokenization.

Suppose our dictionary of positive and negative words contains the positive word “good”, but, as commonly happens, people write “gud” instead of “good” in the review text. To handle such variants of the same word we have to use the second most common technique — Word Normalization.

Word Normalization “equalizes” similar words written in different ways. For instance, you may want to match:

  • U.S.A → USA (Remove Special Characters)
  • USA → usa (convert text to lower case),
  • Car, Cars, car’s, cars’ → “Car” (Lemmatization)

All these are ways to normalize a token so that it can be applied in a generic way.
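A tiny normalizer covering the first two bullets, plus possessives, might look like this. Full lemmatization (Cars → car) needs a dictionary-backed lemmatizer, such as NLTK’s WordNetLemmatizer:

```python
import re

def normalize(token: str) -> str:
    token = token.replace(".", "")      # U.S.A -> USA
    token = re.sub(r"'s?$", "", token)  # car's -> car
    return token.lower()                # USA -> usa

print(normalize("U.S.A"))  # usa
print(normalize("car's"))  # car
```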

So far so good. But a review often contains multiple inflected variations of the same word, like “thrilling” and “thrilled”. We have to map these variations onto a single entry in the dictionary. One of the NLP techniques used to do this is called Stemming.

Stemming is used to extract the stem from a word — automat(es), automat(ic), automat(ion) → automat

https://chrisalbon.com/machine_learning/preprocessing_text/stemming_words/
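Production systems use algorithms like the Porter stemmer (available as nltk.stem.PorterStemmer); a toy suffix-stripping stemmer conveys the idea:

```python
def stem(word: str) -> str:
    # Strip one known suffix, keeping at least three characters of stem.
    # A toy rule set, nowhere near as careful as the Porter algorithm.
    for suffix in ("ion", "ing", "ed", "ic", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(stem("automates"), stem("automatic"), stem("automation"))
# automat automat automat
print(stem("thrilling"), stem("thrilled"))
# thrill thrill
```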

“To be, or not to be, that is the question”. Well that is not really the question but you see how adding a negation or a negative modifier before a word or a phrase changes its meaning. That is a problem again to be tackled by NLP.

The use of positive words with a negative modifier (or vice versa), e.g. “the movie was not good” or “the trailer was thrilling but I didn’t like the movie”, changes the sentiment of the review. These are negative comments built from positive words like “good” and “thrilling”, so we might end up classifying them as good reviews. To avoid this problem we can use our third technique — Negation Analysis.

Negation Analysis: Based on your need you can create your own negation list and then apply the below mentioned technique.

Negation word (modifier) + positive word → increment the negative sentiment value
Negation word (modifier) + negative word → increment the positive sentiment value

This technique is mostly specific to sentiment analysis.
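Under these rules, a negation-aware scorer might be sketched as follows (the negation and sentiment lists here are illustrative, not exhaustive):

```python
NEGATIONS = {"not", "no", "never", "didn't", "wasn't"}
POSITIVE = {"good", "great", "thrilling"}
NEGATIVE = {"bad", "boring"}

def sentiment_counts(tokens):
    # Returns (positive, negative) counts; a negation word flips the
    # polarity of the sentiment word immediately following it.
    pos = neg = 0
    negated = False
    for token in tokens:
        token = token.lower()
        if token in NEGATIONS:
            negated = True
            continue
        if token in POSITIVE:
            if negated:
                neg += 1
            else:
                pos += 1
        elif token in NEGATIVE:
            if negated:
                pos += 1
            else:
                neg += 1
        negated = False
    return pos, neg

print(sentiment_counts("movie was not good".split()))  # (0, 1)
```

Real negation handling is harder than this (the negation can be several words away from the sentiment word), but the increment rules above are the core of it.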

One other useful NLP technique removes high frequency words — Stop Word Removal

Stop word removal strips high-frequency words like “the”, “a”, “there” and “in”. These words usually don’t carry much information.
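In code, stop word removal is just a set filter. The stop list here is a tiny illustrative sample; libraries like NLTK ship much fuller lists (nltk.corpus.stopwords):

```python
STOP_WORDS = {"the", "a", "an", "there", "in", "is", "of", "to"}

def remove_stop_words(tokens):
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words("one of the best films ever made".split()))
# ['one', 'best', 'films', 'ever', 'made']
```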

As we have seen, words themselves are very important features in NLP. The dictionary-based approach described here uses several techniques to clean up and process the text, and then does a basic weighted count of positive and negative words. Note that there is no learning algorithm at work here.

There are a few other ways to do sentiment analysis, by wielding the power of machine learning. Most machine learning algorithms require features to be extracted from the data. Let’s look at the most common technique for converting words into features in NLP — TF-IDF

What is TF-IDF?

TF-IDF stands for term frequency–inverse document frequency. It is a weight assigned to each word to evaluate how important that word is to a document in a collection. The importance increases proportionally to the number of times the word appears in the document, but is offset by the frequency of the word in the corpus (the collection of documents).

TF-IDF can be used even without stop word removal, since the frequency of a word across the corpus automatically takes care of high-frequency words.

Let’s look at TF-IDF in a bit more detail.

Computing TF-IDF

Term Frequency (TF) measures how frequently a term occurs in a document. Since every document is different in length, a term is likely to appear more often in long documents than in short ones. Thus, the term frequency is often divided by the document length, i.e. the total number of words in that document.
Mathematically:

TF(t, d) = (number of times term t appears in document d) / (total number of terms in d)

For example, consider a document containing 100 words in which the word cat appears 3 times. The TF for cat is then 3 / 100 = 0.03. A 300-word document containing the word cat 9 times will still have the same TF.

Inverse Document Frequency (IDF) measures how important a term is. While computing TF, all terms are considered equally important. However, certain terms, such as “is”, “of” and “that”, may appear many times but carry little importance, while rare words such as “internet”, “cricket” or “Donald Trump” should carry more. Thus we need to weigh down the frequent terms while scaling up the rare ones, by computing the following (the examples below use a base-10 logarithm):

IDF(t) = log(total number of documents / number of documents containing term t)

Now, assume we have 10 million documents and the word cat appears in one thousand of them. The inverse document frequency is then IDF = log(10,000,000 / 1,000) = 4. Thus, the TF-IDF weight is the product of the two quantities: 0.03 × 4 = 0.12.

Instead, look at the word “the”: it would appear in all 10 million documents, so its IDF = log(10M / 10M) = 0 and its TF-IDF weight also becomes zero. Hence, no importance.
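The worked example above can be reproduced in a few lines, using the base-10 logarithm the numbers imply:

```python
import math

def tf(term_count: int, doc_length: int) -> float:
    # Term frequency, normalized by document length.
    return term_count / doc_length

def idf(total_docs: int, docs_with_term: int) -> float:
    # Inverse document frequency, base-10 log as in the example.
    return math.log10(total_docs / docs_with_term)

tf_cat = tf(3, 100)                 # 0.03
idf_cat = idf(10_000_000, 1_000)    # 4.0
print(tf_cat * idf_cat)             # 0.12
print(idf(10_000_000, 10_000_000))  # 0.0, so "the" gets zero weight
```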

Using these TF-IDF features, we can apply any machine learning classification algorithm, such as logistic regression, or train a neural network, to classify a review as positive, negative or neutral.
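Assuming scikit-learn is available, such a classifier can be sketched in a few lines (the labelled reviews here are made up for illustration; a real model would be trained on thousands of reviews):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labelled training data, invented for illustration.
reviews = [
    "one of the best films ever made",
    "a thrilling and wonderful movie",
    "boring plot and terrible acting",
    "the worst film I have ever seen",
]
labels = ["positive", "positive", "negative", "negative"]

# TF-IDF features feeding a logistic regression classifier.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(reviews, labels)

print(model.predict(["a wonderful thrilling film",
                     "terrible boring movie"]))
```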

NLP, ML and DL

People often confuse NLP with machine learning. Machine learning is a set of techniques that can be used in NLP and usually augments it. The image below shows how NLP is related to both ML and Deep Learning.

Deep Learning is one of the techniques in the area of Machine Learning — there are several other techniques such as Regression, K-Means, and so on. Deep Learning is used quite extensively for vision based classification (e.g. distinguishing images of airplanes from images of dogs). Deep Learning can be used for NLP tasks as well.

However, just as NLP is not exclusive to ML, it is important to note that Deep Learning algorithms do not exclusively deal with text.

Source: https://sonix.ai/articles/difference-between-artificial-intelligence-machine-learning-and-natural-language-processing

NLP is a burgeoning field with loads of applications, from translation to detecting spam emails. Watson, IBM’s AI masterpiece, is being used to draft early-stage legal responses for big law firms.

Once computers start understanding our languages maybe we can finally start having better communication!

https://marketoonist.com/2017/09/ai.html

X8 aims to organize and build a community for AI that is not only open source but also looks at its ethical and political aspects. More such experiment-driven, simplified AI concepts will follow. If you liked this, or have feedback or follow-up questions, please comment below.

Thanks for reading!
