A²I : Sensibility in Data around us — By us, of us and for us:)

NLP in Digital Text Analytics

Reshma Unnikrishnan

Published in

Arnekt-AI

9 min read · Aug 2, 2018

Day-to-day data

From the daily chores of our lives to the things we do only occasionally, almost everything has been digitized. The day may start with a ‘To-do list’ where we jot down the activities we need to get done, or with a student’s ‘Time-table’ that assigns a specific period to each subject. It can also be the ‘Shopping list’ a homemaker keeps for the month’s groceries, an individual’s ‘Collection of favorite quotes and proverbs’ kept as a hobby, or a search engine helping a writer find the right word when he/she gets stuck. Noting down such things is no longer a pen-and-paper matter; it is all stored as digital text. These days we often do not remember the spelling of even the simplest words, because the prediction systems in our devices do it for us. Be it a movie name or a song we would like to listen to, typing just a few keywords gets us exactly what we are looking for. Just as the telegram, one of the older means of communication, has vanished, in the near future we may see people relying on their mobile phones for almost everything. Many people have already stopped watching TV and reading printed newspapers. Since TV programs are not broadcast at the viewer’s convenience, technology for recording and catching up on missed shows has come up. Newspapers have also been digitized and turned into applications that can be read whenever convenient.

Apart from all the data created from these simple chores, the world has moved into another phase of generating digital data: the use of social media as part of one’s life. These days some people seem to care less about whether they have eaten or slept than about social networking. It has become a significant part of many people’s lives. There are people who do not just chat or post important events, but also keep updating their day-to-day activities in public on social media. From the above scenarios we can see that everything is just one click away, at our fingertips. If one looks closely at the amount of data generated by all these human activities, text would stand first among all the kinds of data (voice, video, image, etc.). So what is text data, and how is it useful?

“Data is the new oil. It’s valuable, but if unrefined it cannot really be used. It has to be changed into gas, plastic, chemicals, etc to create a valuable entity that drives profitable activity; so must data be broken down, analyzed for it to have value.” — Clive Humby

Text

Text data could be anything from a message we send a friend to the content we write for a journal. It falls into structured, semi-structured and unstructured categories. Structured text follows pre-defined formats, as in application forms such as loan forms, birth certificates and mark lists, while semi-structured text includes journals, conference proceedings, etc. The rest of the mass of data an individual generates, such as chats, posts, comments, reviews, articles, blogs and descriptions, comes under the unstructured category. In real life, unstructured data grows much faster than the other two categories. This has led to the development of various methodologies that handle real-time data, extract information from it and benefit people in return.

Opinion evaluator ;) :) :(!!!

Faces — index of mind (NLP reads our mind, from the texts we generate)

One such methodology is Sentiment Analysis! People also refer to it as an Opinion miner or Emotion analyzer.

We humans always have a tendency, or maybe an anxiety, to know what others think about us. Some even want to know what others think in general. Nobody really knows what we think of each other; at times even our own thoughts are ambiguous. With such complexity in one’s thinking process, there is no superhuman who can read a person’s mind. Although the problem seems unsolvable when approached without any basis for a solution, it can be addressed with text data as the basis. As seen earlier, the amount of data people generate has no limit or boundary. It just keeps pouring out like the Niagara Falls :)!!! Taking this vast amount of human-generated (unstructured text) data as a foundation, one can analyze to some extent what people think about us and what they think in general. This is where Sentiment analysis plays its role.

In plain words, Sentiment analysis is the idea of examining natural text generated by a human to bring out the emotion behind it. The emotion can be, for example, happy, sad or indifferent (often labeled positive, negative and neutral). From a technical (Artificial Intelligence) point of view, the prime target of this methodology is to determine which of these three categories a text falls into.
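To make the three-category idea concrete, here is a minimal sketch of a lexicon-based scorer. It is only an illustration: the word list, the weights and the threshold are assumptions made up for this example, not a published lexicon or the method of any particular tool.

```python
# Toy sentiment lexicon: the words and weights are illustrative assumptions,
# not taken from any published lexicon.
LEXICON = {
    "good": 1.0, "great": 2.0, "love": 2.0, "happy": 1.5,
    "bad": -1.0, "terrible": -2.0, "hate": -2.0, "sad": -1.5,
}


def sentiment_score(text):
    """Sum the lexicon weights of the words found in the text."""
    words = [w.strip(".,!?;:").lower() for w in text.split()]
    return sum(LEXICON.get(w, 0.0) for w in words)


def classify(text, threshold=0.5):
    """Map a score to one of three labels; the threshold is a hypothetical choice."""
    score = sentiment_score(text)
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"


print(classify("I love this great phone!"))
print(classify("The service was terrible."))
print(classify("The parcel arrived on Monday."))
```

A scorer like this knows nothing about context, negation or irony, which is exactly where the real difficulty of the problem lies.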

As humans we are blessed with a brain that functions like a neural network, encoding and decoding any problem put to it. We are able to judge the positive or negative feel of a person’s text from its context. Context is something that can be understood only when one is clear about the language (its grammatical rules) as well as the subject being talked about. Contexts vary from topic to topic, which makes this hard even for humans, who are usually not familiar with every field. Texts with a riddling structure (irony) also exist and can confuse us. With such complexities in text structure and word usage, building a reliable model that replicates human judgment is very challenging. Despite these intricacies, various use cases have been deployed and are used in daily life. Some of them are listed below.

From a business-strategy point of view, Sentiment analysis has led to interesting applications: gauging customer satisfaction from the comments customers post about the products they purchase; recommender systems that analyze frequently bought items to recommend similar products; ad placement based on interest in frequently bought products; regulatory compliance for product liability based on customer chats, reviews and other features; conversational models or chat bots that respond and query according to the end user’s mood; and lots more.

These are only a few. Opinion mining is applied in various other domains as well: election analysis and reporting systems that have been used to predict a party’s chances of winning from the tweets people post for or against the party leader; devices that play songs to suit your mood; identifying detractors and promoters in market movement; workforce analytics in organizations and small firms; product management based on customer reviews; and many more.

Text-Sentiment-Technology

Now that we know the areas where the Sentiment analyzer has been used, we also need to know the technology behind these applications. Natural Language Processing (NLP) is a sub-field of Artificial Intelligence (AI) that works with text data in order to bring out the information hidden in natural language texts (digital text). From the notes a student takes down to massively generated chat texts, information can be extracted, retrieved, classified, summarized and translated. This is done through the various stages of NLP: morphological, lexical, syntactic, semantic, pragmatic and discourse analysis. According to “Jumping NLP Curves: A Review of Natural Language Processing Research”, we have so far reached a position between syntactic and semantic analysis. Text classification accounts for about eighty percent of Text analytics applications. Representing the text as numerical values and deriving inference from them through a classifier are the primary elements of any Text classification application.
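The two primary elements, a numerical representation and a classifier, can be sketched in a few lines of plain Python. What follows is only a toy illustration under stated assumptions: the representation is a simple bag-of-words count, the classifier is a logistic regression trained by stochastic gradient descent, and the four training sentences, the learning rate and the function names are all made up for this example.

```python
import math


def build_vocabulary(texts):
    """Assign each distinct lowercase word an index."""
    vocab = {}
    for text in texts:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def vectorize(text, vocab):
    """Bag-of-words: count how often each vocabulary word occurs."""
    vec = [0.0] * len(vocab)
    for word in text.lower().split():
        if word in vocab:
            vec[vocab[word]] += 1.0
    return vec


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(X, y, lr=0.5, epochs=200):
    """Fit weights and bias by plain stochastic gradient descent."""
    weights = [0.0] * len(X[0])
    bias = 0.0
    for _ in range(epochs):
        for x, label in zip(X, y):
            pred = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            error = pred - label
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias


def predict(text, vocab, weights, bias):
    """Return the estimated probability that the text is positive."""
    x = vectorize(text, vocab)
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


# Tiny made-up training set: 1 = positive, 0 = negative.
train_texts = ["good movie", "great acting", "bad movie", "awful acting"]
train_labels = [1, 1, 0, 0]

vocab = build_vocabulary(train_texts)
X = [vectorize(t, vocab) for t in train_texts]
weights, bias = train_logistic_regression(X, train_labels)

print(predict("good acting", vocab, weights, bias))  # above 0.5: leans positive
print(predict("awful movie", vocab, weights, bias))  # below 0.5: leans negative
```

Here the representation step (`vectorize`) and the inference step (`predict`) are visibly separate pieces, which is typical of the conventional approach.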

Sentiment analysis falls under the category of Text classification. In Conventional Machine Learning the two primary elements mentioned above (representation and classification) each play a distinct role as separate steps, whereas with Deep Learning they get merged into a single model.

NLP is a combination of Machine Learning and Computational Linguistics. Computational Linguistics is concerned with understanding written and spoken language from a computational perspective, while Machine Learning focuses on deriving statistical inference from written and spoken language. With the commercial use of GPUs, Conventional Machine Learning shifted to Statistical Machine Learning, which then paved the way for Deep Learning. Until then, Text classification problems were solved through algorithm-driven models. A few of the successfully used Conventional Machine Learning algorithms are Support Vector Machines, Logistic Regression, Decision Trees and Ensemble methods. The limits of Conventional Machine Learning in scaling with data and computational complexity, along with its performance limitations, eventually pushed practitioners towards Deep Learning in NLP.

R&D at Arnekt

For the full details: https://arxiv.org/abs/1804.03673

Having seen the technology behind Sentiment analysis applications, let’s now see how Arnekt has made use of the available resources.

Arnekt kick-started its work in AI by crawling a few lakh (hundreds of thousands of) news articles in real time and extracting sentiment information from them. A predictive model was built on this data, with VADER and a few other unsupervised methods taking care of tagging (labeling) the data. With data in the lakhs, it made sense to go with Deep Learning, which is data hungry by nature. The most recent research at the time was on applying Convolutional Neural Networks (CNNs) to applications outside Computer Vision (CV), and this is when CNNs saw their major growth in NLP. Researchers had started experimenting with CNNs (until then a tool mainly for CV applications) on text data to produce state-of-the-art models. After a thorough study of research papers and materials on using CNNs on text, we decided to apply a CNN to our news data. We were able to extend our simple experiment to a product-level deployment, ending up with strong results in Sentiment analysis of news data. Though VADER was the basic tool we used for labeling, the other unsupervised methods helped us arrive at a well-established tagged corpus for building a reliable product.
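For readers wondering what a CNN does with text, here is a minimal sketch of the core operation: one filter sliding over a sequence of word vectors, followed by max-over-time pooling. This is a generic illustration and not Arnekt's actual architecture; the two-dimensional embeddings, the filter weights and the function name `conv1d_max_pool` are all hypothetical values chosen so that the arithmetic is easy to follow.

```python
def conv1d_max_pool(embeddings, kernel, bias=0.0):
    """Slide one filter over a sentence and keep its strongest response.

    embeddings: list of word vectors, one per token (dimension d)
    kernel:     list of k weight vectors (dimension d), i.e. a filter of width k
    """
    k = len(kernel)
    responses = []
    for start in range(len(embeddings) - k + 1):
        window = embeddings[start:start + k]
        # Dot product between the filter and the k-word window.
        activation = bias + sum(
            w * x
            for w_vec, x_vec in zip(kernel, window)
            for w, x in zip(w_vec, x_vec)
        )
        responses.append(max(0.0, activation))  # ReLU non-linearity
    # Max-over-time pooling: one number per filter, whatever the sentence length.
    return max(responses) if responses else 0.0


# Hypothetical 2-dimensional word embeddings, hand-picked for illustration.
EMBEDDINGS = {
    "not": [0.0, 1.0],
    "good": [1.0, 0.0],
    "very": [0.0, 0.0],
    "movie": [0.0, 0.0],
}


def embed(sentence):
    """Look up each word; unknown words map to a zero vector."""
    return [EMBEDDINGS.get(w, [0.0, 0.0]) for w in sentence.lower().split()]


# A width-2 filter hand-set to respond most strongly to the bigram "not good".
not_good_filter = [[0.0, 1.0], [1.0, 0.0]]

print(conv1d_max_pool(embed("movie not good"), not_good_filter))   # 2.0
print(conv1d_max_pool(embed("very good movie"), not_good_filter))  # 1.0
```

In a real model the embeddings and many such filters are learned from the tagged data, and the pooled values feed a final classification layer, so representation and classification are trained together.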

Boost your thought process!

It’s always good to start practicing an idea we take in, rather than just reading, listening to or analyzing it. This is what we have done. So far we have covered the data we generate, knowingly or unknowingly; its broad segregation into text, image and speech; the detailed structure of text; one of its uses, knowing what people think about something (Sentiment Analysis); and the application Arnekt has come up with. The technology behind the Emotion Analyzer or Opinion miner is also a good base for building new lines of thought and coming up with stable applications that benefit us humans in return. So why wait! Get your hands dirty in the area of Sentiment Analysis by getting to know the ideas behind it.