Natural Language Processing by Intuition

Karthik Vadhri · Published in Intuition Matters · Jan 5, 2022 · 13 min read

The process of extracting information and uncovering actionable insights from unstructured text is referred to as Text Analytics (also known as text mining or text data mining).

A popular book, Practical Text Mining and Statistical Analysis for Non-structured Text Data Applications, by John Elder, Gary Miner, Rob Nisbet, Andrew Fast, Dursun Delen & Thomas Hill categorizes text analytics into 7 different practice areas.

  1. Search & Information Retrieval
  2. Document Clustering
  3. Document Classification
  4. Web Mining
  5. Information Extraction
  6. Natural Language Processing
  7. Concept Extraction

A lot of data scientists use Text Analytics & NLP interchangeably for anything that relates to working with text data.

NLP has always been a clichéd area for data scientists, and the trending words keep changing every year. A Data Scientist's resume isn't complete without a mention of NLP and at least two NLP use cases.
It all started with word clouds: the first thing we used to do with any text data was build a word cloud. Then TF-IDF became a trend, followed by Word2Vec, and now BERT, Transformers, etc. are the most talked-about algorithms/techniques.
The focus on text analytics & NLP can be attributed to the UNSTRUCTURED DATA AND THE 80 PERCENT RULE, which indicates that 80% of business-relevant information originates in unstructured form, primarily text.

80% of usable business information exists in an unstructured representation. Merrill Lynch - 1998

Text Analytics is nothing but the application of data mining & analytical techniques to text data to extract insights. Have you ever thought about the similarities between a text analytics problem and a classification problem?
Like I said in the previous post on Why Intuition Matters, understanding things with intuition can open up a plethora of opportunities for innovation.

Though NLP is popular, most of us simply take code snippets of pre-trained models and execute them to achieve results, with just a superficial understanding.

NLP has lately become an acronym for Never Learnt Properly

Let's understand NLP with intuition.

NLP is a set of techniques that enable computers to detect nuances (entities, relationships, context, meaning, etc.) in human language the same way humans do. NLP helps analyze & detect patterns in unstructured, open-ended text data, which cannot be measured along any of the four primary data scales (nominal, ordinal, interval and ratio). These patterns can be leveraged for a lot of use cases: translation software, chatbots, spam filters, search engines, grammar correction software, voice assistants, social media monitoring tools and many more.
Now, let's do an anatomy of a use case and understand NLP by intuition.
Let's take a sample of user reviews for Zoom — a secure, reliable video platform that powers all of your communication needs, including meetings, chat, phone, webinars, and online events.

Sample reviews for Zoom — the video communication application

Assume we are given a task to classify these reviews by sentiment into Positive, Neutral & Negative.
As humans, we look for patterns and associate these sentences with classes.
Below are examples of the patterns we might form.

Patterns a human brain would analyze

How can a machine understand the nuances of language from the reviews above, and create patterns to classify the reviews by sentiment?

The set of reviews collected for the Zoom app is called a Corpus.
A corpus is a collection of texts, stored in a database. There is no limit to the size of a corpus: it can be small (5–10 sentences) or go all the way up to millions of sentences/words (a large corpus).

A machine or an algorithm considers all words equally, and does not necessarily understand the nuances of English grammar.
E.g., Installation & Installed are just different forms of the same word, Install. How will the ML algorithm know that?

In the process of understanding this text, we want the ML algorithm to pick the right words, and not form patterns on the wrong ones.

This step is known as Pre-Processing, where we try to remove the unwanted signals, to prevent the model from picking up the wrong signals for generalization. A few commonly known pre-processing techniques are:

  • Stemming & Lemmatization
    Text Processing technique used to convert words in different parts of speech into their root word.
    Eg. Helping & helped are different forms of the rood word “help” It is important to be cognizant of Under-stemming & over-stemming.
    Under-stemming happens when two related words should be reduced to the same stem but aren’t. This is a false negative.
    Over- stemming happens when two unrelated words are reduced to the same stem even though they shouldn’t be. This is a false positive.
    This is a research area in itself, and there are multiple implementations like Porter Stemmer, WordnetLemmatizer, SnowballStemmer , etc.
  • Spelling Correction
    Typos are common in sentences entered by humans, and it is obvious that a machine would understand two misspelled words are different words. The words highlighted in yellow in the human tagged zoom reviews, are all misspelled words.
  • Remove Unwanted words
    English language has a lot of sentence constructers, which could be false signals for the algorithm, and can be removed.
    These are called as stop words, and prebuilt libraries are available which contain a list of stop words in English language.
    Note that, there could be domain specific stop words which should be added to the base list.
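To make these steps concrete, here is a minimal sketch using NLTK (one of several libraries that implement stemming, lemmatization and stop-word lists). The review string and the extra stop words "zoom" and "app" are invented examples, not part of the original corpus.

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import stopwords

# one-time downloads of the NLTK resources used below
nltk.download("wordnet")
nltk.download("stopwords")

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("installation"), stemmer.stem("installed"))  # both typically reduce to "instal"
print(lemmatizer.lemmatize("meetings"))                         # -> "meeting"

# stop-word removal, with a couple of hypothetical domain-specific additions
stop_words = set(stopwords.words("english"))
stop_words.update({"zoom", "app"})
review = "the zoom app keeps crashing during meetings"
kept = [word for word in review.split() if word not in stop_words]
print(kept)  # e.g. ['keeps', 'crashing', 'meetings']
```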

This Analytics Vidhya article, Must Known Techniques for text preprocessing in NLP, clearly articulates the most popular pre-processing techniques.

Tokenization

Tokenization is the process of breaking up a cleaned corpus into individual terms composed of one, two, or more word phrases.
Tokens can be words, characters, or subwords. Tokenization can be broadly classified into 3 types:

  1. Word — splits a piece of text into individual words based on a certain delimiter
  2. Character — splits a piece of text into a set of characters
  3. Subword (n-gram characters) — splits a piece of text into subwords (or n-gram characters)

Tokenization is the fundamental step for any NLP task.

There are a lot of tokenization techniques that can be applied depending on the use case, and they are available in prebuilt Python packages like nltk. More details are explained in this article on Tokenization for Natural Language Processing.
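As a quick illustration, here is a hedged sketch using nltk for word, character and word-level n-gram tokens; the review string is an invented example.

```python
import nltk
from nltk.tokenize import word_tokenize
from nltk.util import ngrams

nltk.download("punkt")  # tokenizer models, one-time download

review = "Zoom installation was smooth, but the audio kept cutting out."

word_tokens = word_tokenize(review.lower())   # word-level tokens
char_tokens = list(review)                    # character-level tokens
word_bigrams = list(ngrams(word_tokens, 2))   # two-word phrases (bigrams)

print(word_tokens)
print(word_bigrams[:3])
```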

Word clouds — A visual representation of tokens.
Word Clouds are a popular text visualization technique used for almost every text analytics use case.
Separate word clouds can be created for different types of tokens to understand the underlying patterns in the corpus.

Word Cloud of the Zoom Review Corpus
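A figure like the one above can be reproduced with the third-party wordcloud package; this is a minimal sketch, and the toy reviews below are invented stand-ins for the Zoom corpus.

```python
from wordcloud import WordCloud   # third-party `wordcloud` package
import matplotlib.pyplot as plt

# stand-in for the cleaned Zoom review corpus
reviews = [
    "zoom installation was quick and easy",
    "audio keeps cutting out during zoom meetings",
    "easy to schedule and join meetings",
]

wc = WordCloud(width=800, height=400, background_color="white").generate(" ".join(reviews))

plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```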

Vectorization:

Vectorization is the process of converting tokens into a machine-readable format.
Machines are built to understand only the binary (base-2) number system, which consists of only 1s & 0s. Though most modern software applications abstract this away from the end user, this is the lowest level of understanding for any machine.

How can we convert this text into a numeric representation?

Term-Document Matrix (TDM)
The Term-Document Matrix is the most basic way of converting text into numbers.
After a corpus is tokenized, a simple frequency table can be built to determine how many times each token occurs in each document. This frequency table is called the term-document matrix (TDM).

In a Term-Document Matrix, rows correspond to the terms in the corpus, columns correspond to the documents, and cells hold the weights of the terms.

Image Source : https://livebook.manning.com/book/natural-language-processing-in-action/chapter-4/v-3/34

The values of a TDM can be as simple as 1s & 0s based on the presence or absence of the token in the document. Another possibility is the frequency of tokens in the sentence, also referred to as a Term Frequency Matrix.

TDM & DTM are interchangeably used, as they both represent document vectors in matrix form, there is only a slight difference in their representation.
But the Document Term matrix, is a transpose of the TDM. The rows correspond to the documents in the corpus and the columns correspond to the terms in the documents and the cells correspond to the weights of the terms. The only drawback with this Term Frequency approach, is it considers all the words equally, without giving weightage to important words. Eg. in the Zoom corpus, it is obvious that the word “zoom” is the most frequent word, but we don’t want the algorithm to pick that as a signal.
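As a concrete illustration, here is a minimal sketch using scikit-learn's CountVectorizer (one of several ways to build such a matrix); the toy reviews are invented stand-ins for the Zoom corpus, and get_feature_names_out assumes a recent scikit-learn version.

```python
from sklearn.feature_extraction.text import CountVectorizer

reviews = [
    "zoom installation was quick and easy",
    "audio keeps cutting out during zoom meetings",
    "easy to schedule and join meetings",
]

vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(reviews)   # document-term matrix: rows = documents, columns = terms
tdm = dtm.T                               # the term-document matrix is simply its transpose

print(vectorizer.get_feature_names_out())
print(dtm.toarray())                      # raw term-frequency counts
```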

Term Frequency — Inverse Document Frequency Matrix
This implementation has gained a lot of popularity. The basic idea is that a token’s frequency in a document should be weighted down by how common the token is across the whole corpus (its document frequency), so that words appearing everywhere carry less weight than words concentrated in a few documents.

Below is the TDM based on tf-idf for the Zoom reviews corpus. Instead of 0s and 1s there are decimal values, which are the normalized frequency values for each token across the documents.

Term Document Matrix — for Zoom Reviews
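A matrix like this can be produced with scikit-learn's TfidfVectorizer; this is a hedged sketch on the same invented toy reviews as the earlier snippet, not the actual Zoom corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

reviews = [
    "zoom installation was quick and easy",
    "audio keeps cutting out during zoom meetings",
    "easy to schedule and join meetings",
]

tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(reviews)

print(tfidf.get_feature_names_out())
print(weights.toarray().round(2))   # decimal tf-idf weights instead of raw counts
```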

The crucial step in any NLP use case is how we represent text in a numeric format. This is called Vectorization.

The vectorization methods mentioned so far are quite naïve, and cannot capture the nuances of the language. Given the complexity of language, capturing those nuances requires more complex architectures & huge data samples for training.

About a decade ago, researchers introduced the concept of pre-trained models, where models learn patterns from the numerous sources of data available on the web, using vast amounts of computational power. These pre-trained models can then be fine-tuned with domain-specific examples.
This has driven a paradigm shift in the text analytics space, with a lot of implementations available.

Paradigm shift in the text analytics space: pre-train models on the huge corpora available on the web, and then fine-tune on domain-specific examples.

Let's look at some advanced techniques for converting text into numbers based on these pre-trained models, and break them down into understandable steps.
The tech giants/digital-native companies have trained models on huge corpora of text to serve their users better, and later open-sourced them. Below are a few such pre-trained models you have probably heard of at least once.
- Word2vec from Google
- fastText from Facebook
- GloVe from Stanford

Word2vec is one of the most popular pre-trained word embeddings, developed by Google. It is a shallow neural network trained on the Google News dataset (about 100 billion words).
word2vec is not a single monolithic algorithm. In fact, word2vec contains two distinct models (CBOW and skip-gram), which are shallow two-layer neural networks having one input layer, one hidden layer and one output layer.
These models have different training methods (with/without negative sampling) and other variations (e.g. hierarchical softmax), which amount to a small “space” of algorithms, with a lot of scope for parameter tuning.

CBOW is trained to predict a target word t from the context words c that surround it. The model is trained in an online setting (one example at a time): at each step, the goal is to take a small step (mediated by the learning rate) that reduces the distance between the current vectors for t and c, and thereby increases the probability P(t|c). By repeating this process over the entire training set, vectors for words that habitually co-occur are nudged closer together, and by gradually lowering the learning rate, this process converges towards some final state of the vectors.

For skip-gram, the direction of the prediction is simply inverted: we now try to predict each surrounding context word from the target word, i.e. P(c|t). This turns out to learn finer-grained vectors when one trains over more data.

CBOW smooths over a lot of the distributional statistics by averaging over all context words, while skip-gram does not. With little data, this “regularizing” effect of CBOW turns out to be helpful, but skip-gram is able to extract more information when more data is available.
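The sg flag in gensim's Word2Vec switches between the two models. This is a minimal sketch assuming the gensim 4.x API; the tokenized sentences are invented examples, and in practice the input would be the tokenized corpus from the earlier steps.

```python
from gensim.models import Word2Vec   # gensim 4.x API

# tokenized reviews (invented examples)
sentences = [
    ["zoom", "installation", "was", "quick", "and", "easy"],
    ["audio", "keeps", "cutting", "out", "during", "zoom", "meetings"],
    ["easy", "to", "schedule", "and", "join", "meetings"],
]

cbow = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=0)       # CBOW
skip_gram = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)  # skip-gram

vector = cbow.wv["zoom"]                      # 100-dimensional vector for "zoom"
print(cbow.wv.most_similar("zoom", topn=3))   # nearest neighbours in the embedding space
```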

These shallow pre-trained models have a few limitations and haven't proved robust enough, in spite of their initial success:
1- There is a limit to the amount of information they can capture.
2- The order of the words in the sentence isn't accounted for.
3- They do not take the context of a word into account.

Let's take the word “bank” as an example. The same word has different meanings in different contexts (a river bank vs. a bank deposit), right? However, an embedding like Word2Vec will give the same vector for “bank” in both contexts.

This motivated the use of deeper and more complex language models using transformers, LSTMs, etc.

Let's shift the focus to the latest buzzword, BERT.
BERT stands for Bidirectional Encoder Representations from Transformers and is trained on a transformer architecture, a mechanism that learns contextual relations between words in a sentence.
This transformer architecture is the power behind this “deeply bidirectional” model: the embedding for a word is created taking into account the entire sentence, from both the left and the right side, during the training phase.

Unlike context-free embeddings, which generate the same vector for a word regardless of its context, BERT is smart enough to create different embeddings based on the context. And it is pre-trained on a large corpus of unlabeled text, including the entire Wikipedia (that’s 2,500 million words!) and Book Corpus (800 million words).
Another key thing to note is that BERT is capable of generating both word & sentence encodings, making it a go-to choice for a lot of text analytics use cases.
The outcome of a BERT model is a word or sentence embedding in the form of a vector of length 768 (for the BERT-base variant).
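A hedged sketch of extracting such embeddings with the Hugging Face transformers library and the bert-base-uncased checkpoint; the review sentence is invented, and the mean pooling at the end is just one simple way to get a sentence vector.

```python
import torch
from transformers import AutoTokenizer, AutoModel   # Hugging Face transformers

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("zoom keeps dropping my calls", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

token_embeddings = outputs.last_hidden_state        # shape: (1, num_tokens, 768)
sentence_embedding = token_embeddings.mean(dim=1)   # naive mean pooling -> (1, 768)
```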

Google Search is currently powered by an advanced BERT Architecture.

With the vast amount of research on NLP, there have been a lot of improvements on BERT, both in the quality of embeddings and in computational speed. A few noteworthy advancements are RoBERTa, DistilBERT, XLNet, etc.
There are different flavors of pre-trained models, and this space has been evolving at a rapid pace: models are trained on huge volumes of data available from the web, with huge volumes of compute power. The paper Recent Advances in Natural Language Processing via Large Pre-Trained Language Models details some recent work that uses these large language models to solve NLP tasks via pre-training and then fine-tuning.

So far, we have only converted text into a numeric format. After the vectorization step, we have a set of features (tokens) and embeddings for each token.
Now the text analytics problem is converted into a supervised or unsupervised learning problem. Depending on the use case, we can apply any classification or clustering algorithm. Follow Intuition Matters to get notified when a new article on Classification Algorithms by intuition is published.

Applying BERT alone isn't sufficient to build a good text classifier; the right choice of classification algorithm also matters!
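For instance, a plain classifier can sit on top of the embeddings. This is a minimal sketch with scikit-learn's LogisticRegression; the embeddings and sentiment labels here are random stand-ins, where in practice they would be the 768-dim BERT vectors and the human-tagged labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# stand-ins: one 768-dim embedding per review and its human-tagged sentiment
embeddings = np.random.rand(30, 768)
labels = np.random.choice(["positive", "neutral", "negative"], size=30)

X_train, X_test, y_train, y_test = train_test_split(
    embeddings, labels, test_size=0.3, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(clf.predict(X_test[:3]))     # predicted sentiment for three held-out reviews
print(clf.score(X_test, y_test))   # accuracy on the held-out set
```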

Summary — Steps in a text analytics use case.

Now you must be wondering why I haven’t touched upon a few more commonly used keywords in the arena of Text Analytics: LDA, NER, text summarization, sentiment analysis, etc.
In the previous section, I briefly touched upon pre-trained models and their value proposition. Let's talk about a few real-world implementations of pre-trained models. This article by Ajitesh Kumar explains a bunch of pre-trained models, with examples.

Named Entity Recognition (NER)
NER is an information extraction technique that automatically identifies named entities in a text and classifies them into predefined categories. The task at hand is to identify & categorize key information (entities) in text. An entity can be any word or series of words that consistently refers to the same thing. Every detected entity is classified into a predetermined category. To learn what is and is not a relevant entity and how to categorize them, a model requires training data. The more relevant that training data is to the task, the more accurate the model will be at completing it.

Open-source implementations of NER are trained on examples of what is and what isn't a relevant entity, giving them the ability to pick the correct signals and eliminate noise, which removes the need to pre-process the data.

NER in combination with a rule-based classifier can solve most business use cases that leverage text data, provided the domain knowledge is sufficient to capture the rules correctly.
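A minimal sketch using spaCy's small English pipeline, one popular open-source NER implementation; the example sentence is invented and the printed labels are indicative only.

```python
import spacy

# assumes the small English pipeline is installed:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("Zoom was founded by Eric Yuan in San Jose in 2011.")
for ent in doc.ents:
    print(ent.text, ent.label_)   # e.g. Eric Yuan PERSON, San Jose GPE, 2011 DATE
```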

Sentiment Analysis — pre-trained models that output the sentiment of the text passed to them.

All the steps listed above (pre-processing, tokenization, vectorization, classification) are trained on a huge corpus of text & consolidated into a single function. There are a good number of open-source implementations available for sentiment analysis.
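One such implementation is the Hugging Face sentiment-analysis pipeline; this is a hedged sketch, with invented review sentences and an indicative output.

```python
from transformers import pipeline

# downloads a default pre-trained sentiment model on first use
sentiment = pipeline("sentiment-analysis")

print(sentiment("Zoom makes remote meetings painless."))
print(sentiment("The audio keeps cutting out and support never responds."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}] for the first review
```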

Text Summarization — similar in spirit to NER, a pre-trained model tries to summarize the input text into a shorter version, in an efficient way that preserves the important information from the input text.
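A hedged sketch with the Hugging Face summarization pipeline; the long_text paragraph is an invented stand-in for a longer review, and the length parameters are illustrative.

```python
from transformers import pipeline

summarizer = pipeline("summarization")   # default pre-trained summarization model

long_text = (
    "Zoom has become the default tool for our remote meetings. Installation was quick, "
    "screen sharing works well, and scheduling integrates with our calendar. However, "
    "audio quality drops on poor connections and the free tier limits meeting length."
)
print(summarizer(long_text, max_length=40, min_length=10, do_sample=False))
```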

This brings me to another piece of jargon, LDA (Latent Dirichlet Allocation). It is impossible to talk about NLP without mentioning LDA.
A few years back, LDA was perceived as a synonym for NLP, given its wide adoption across text analytics use cases.

LDA is an attempt at combining vectorization & modelling together, using pre-defined tokens. LDA topic modelling discovers topics that are hidden (latent) in a set of text documents.
LDA is based on 2 primary assumptions:
1- Documents are a mixture of topics,
2- Topics are a mixture of tokens (or words)

LDA starts by creating a Document-Topic matrix and a Topic-Word matrix, and the end goal of LDA is to find the most optimal Document-Topic and Topic-Word distributions.
Inference in LDA is based on a Bayesian framework, which allows the model to infer topics from the observed data (words) through conditional probabilities.
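As a concrete illustration, here is a minimal sketch with scikit-learn's LatentDirichletAllocation on a few invented toy reviews; the number of topics and the top-term printing are illustrative choices, not a recommendation.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

reviews = [
    "zoom installation was quick and easy",
    "audio keeps cutting out during zoom meetings",
    "easy to schedule and join meetings",
    "screen sharing and chat work well",
    "video quality drops on poor connections",
]

vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(reviews)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(dtm)   # document-topic distribution
topic_word = lda.components_         # topic-word weights

terms = vectorizer.get_feature_names_out()
for i, weights in enumerate(topic_word):
    top_terms = [terms[j] for j in weights.argsort()[-5:][::-1]]
    print(f"Topic {i}: {top_terms}")
```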

Human language is too rich and subtle for computer languages to capture anywhere near the total amount of information “encoded” in it.

If you come across any useful resources on ongoing research in the field of NLP, like sarcasm detection, or have any perspectives to share, please drop them in the comments section.

About Intuition Matters:
Intuitive understanding can help everything else snap into place. Learning becomes difficult when we emphasize definitions over understanding. The modern definition is the most advanced step of thought, not necessarily the starting point. Intuition Matters in everything, and it matters the most!
