Natural Language Processing Concepts

Ekta Sharma
5 min read · May 25, 2020


What is Linguistics?

Linguistics is the scientific study of language. The part of Linguistics that is concerned with the structure of the language is subdivided into the following categories:

Related to Sound:

  • Phonetics — the study of speech sounds in their physical aspects.
  • Phonology — the study of speech sounds in their cognitive aspects.

Related to Formation:

  • Morphology — the study of the formation of words and their relationship to other words.
  • Syntax — the study of the formation of sentences.

Related to Meaning:

  • Semantics — the study of meaning
  • Pragmatics — the study of language use

What is Natural Language Processing?

Natural Language Processing, also called NLP, is a branch of Artificial Intelligence that deals with humans’ natural language, which can be in the form of text or speech. It draws from computer science and computational linguistics.

The goal of NLP is to read, understand, and decipher human language in such a way that Machine Learning algorithms can extract value from it.

Where is NLP Used?

  • Chatbots and personal assistant applications like Alexa, Siri, and Google Assistant (“OK Google”).
  • Call center IVR (Interactive Voice Response) applications use it to respond to users’ specific requests.
  • Language Translators like Google Translate.
  • For grammar correction by Word Processors like MS Word.
  • Plagiarism Detection.
  • Email Spam Detection.
  • Sentiment Analysis.
  • Content-based Recommender System.

Why is NLP a difficult problem in Machine Learning?

The reason NLP is a difficult problem is the very nature of human language, which does not follow a fixed set of rules. Understanding words and their intent is challenging. Although humans can master languages with relative ease, it is a challenge for machines to master something so ambiguous.

How does NLP Work?

NLP converts text- or audio-based natural language into numbers so that an algorithm can extract meaningful information from it. The goal is to take sentences and convert them into a matrix.

NLP Vocabulary:

Corpus: The entire array of the input text. Depending on the problem, it could be a collection of sentences, a collection of words, or even a single sentence or a single word.

Document: Each element (row) in the corpus. It could be a sentence or a word, depending on the problem we are working on.
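
In code terms, a corpus is simply a collection of Documents; this toy example (made up purely for illustration) shows the shape:

```python
# A toy corpus: each element (row) is one Document
corpus = [
    "the first document",   # Document 0
    "the second document",  # Document 1
]
```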

Converting a Document into a Matrix

  1. Tokenization: Splitting the Document into a list of word or character tokens (a sketch follows this list).
    a. Word Level: Words separated by spaces.
    b. Character Level: Single characters, or groups of characters (pairs, triplets, and so on).
    c. N-grams: On a word level, 1-gram is one word, 2-gram is a pair of words, 3-gram is a triplet of words, and so on. On a character level, 1-gram is a single character, 2-gram is a pair of characters, 3-gram is a triplet of characters, and so on.
  2. Stop Words: Words that are not useful in NLP, like a, an, and, is, I, it, its. There is a pre-compiled list; we can also add words manually, or create our own list of words that we think do not contribute towards solving our problem. These are words that we want to ignore. Stop Words can be dropped during Lemmatization or Vectorization. Dropping Stop Words helps us save space, avoid unnecessary computation, and, most importantly, decreases the chances of overfitting the model. This is also called Noise Removal (a sketch follows this list).
  3. Lexicon Normalization: Words with inflected endings are also textual noise. Normalizing words that appear different but are contextually the same helps build more accurate models. There are two techniques to normalize words (compared in a sketch after this list):
    Lemmatization: Removing the inflected endings of words using the vocabulary and morphological analysis of words, returning their dictionary or base form (called the lemma). WordNet is a common module used for Lemmatization. Example —
    Cats -> Cat
    Feet -> Foot
    Are -> Be (when treated as a verb)
    Stemming: This is another way of reducing inflected words to a common base form, but it is cruder and can lead to inaccurate results. Stemming chops off the ends of words without getting into their morphological meaning, which can often produce stems that lose the intended meaning. Example — a naive stemmer may return “hat” for all of “hat”, “hating”, and “hated”, even though the latter two derive from “hate”.
  4. Part of Speech Tagging: Determining the POS tag for each word. Languages can be ambiguous: a word like “book” can be a Noun as well as a Verb. Tagging the correct POS gives more meaning to each word. POS tagging falls under Syntactic Parsing, which analyzes the words in a sentence for grammar and for their arrangement, in a manner that shows the relationships between the words (a tagging and NER sketch follows this list).
  5. Named Entity Recognition: Detecting named entities, like person names, location names, etc., in the text. This is very useful when building automated chatbots.
  6. Vectorization: A feature extraction technique that converts a collection of text documents from word tokens into a matrix. This can be done in two ways (a sketch follows this list):
    Count Vectorization: Converts the collection of documents from word tokens to a matrix of token counts. It produces a sparse representation of counts that can be converted into a dense matrix (as needed by certain algorithms). Depending on the problem, the parameters can be set to analyze words or characters (n-grams), to ignore words that have a very low or very high frequency, to limit the vocabulary to the top max features, and so on.
    Term Frequency — Inverse Document Frequency (TF-IDF): This technique uses a term-weighting scheme. It scales down the impact of word tokens that appear very frequently in the corpus and usually do not add much value, being less informative than features that occur in a small fraction of the training corpus. Hence this technique can be more beneficial than the Count Vectorization method for certain problems.
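
A minimal sketch of step 1, showing word-level, character-level, and n-gram tokenization in plain Python (the sample sentence is made up for illustration):

```python
text = "NLP converts text into numbers"  # made-up sample Document

# Word-level tokenization: split on whitespace
words = text.split()

# Character-level tokenization
chars = list(text.replace(" ", ""))

def ngrams(tokens, n):
    """Return all contiguous n-grams over a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(words)                 # ['NLP', 'converts', 'text', 'into', 'numbers']
print(ngrams(words, 2))      # word-level 2-grams: [('NLP', 'converts'), ...]
print(ngrams(chars, 3)[:2])  # character-level 3-grams: [('N','L','P'), ('L','P','c')]
```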
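
For step 2, NLTK ships a pre-compiled stop-word list that can be extended manually. A sketch (the extra word and the sample tokens are made up):

```python
import nltk
nltk.download("stopwords", quiet=True)  # one-time download of the pre-compiled list
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))
stop_words.add("etc")  # manually add a word we want to ignore

tokens = ["it", "is", "a", "useful", "nlp", "article"]
filtered = [t for t in tokens if t not in stop_words]
print(filtered)  # ['useful', 'nlp', 'article'] -- noise removed
```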
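
Step 3 can be compared directly with NLTK’s WordNet lemmatizer and Porter stemmer. A sketch (note that the Porter stemmer actually maps “hating” to “hate”, so the classic “university”/“universal” collapse is shown instead):

```python
import nltk
nltk.download("wordnet", quiet=True)  # WordNet data for the lemmatizer
from nltk.stem import WordNetLemmatizer, PorterStemmer

lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

# Lemmatization: vocabulary-aware, returns real dictionary forms
print(lemmatizer.lemmatize("cats"))          # cat
print(lemmatizer.lemmatize("feet"))          # foot
print(lemmatizer.lemmatize("are", pos="v"))  # be (needs the verb POS hint)

# Stemming: crude suffix chopping, can produce non-words
print(stemmer.stem("studies"))     # studi -- not a dictionary word
print(stemmer.stem("university"))  # univers
print(stemmer.stem("universal"))   # univers -- distinct words collapse together
```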
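
Steps 4 and 5 sketched with NLTK’s built-in tagger and chunker (the sentence is made up; exact tags and resource names can vary across NLTK versions):

```python
import nltk
for pkg in ("punkt", "averaged_perceptron_tagger", "maxent_ne_chunker", "words"):
    nltk.download(pkg, quiet=True)  # one-time model downloads

sentence = "I want to book a flight to Paris"  # 'book' is a verb here, not a noun
tokens = nltk.word_tokenize(sentence)

tagged = nltk.pos_tag(tokens)
print(tagged)  # e.g. [('I', 'PRP'), ..., ('book', 'VB'), ..., ('Paris', 'NNP')]

tree = nltk.ne_chunk(tagged)  # named entity recognition over the POS tags
print(tree)  # 'Paris' is typically labelled GPE (geo-political entity)
```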
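
Step 6 with scikit-learn, which implements both schemes as CountVectorizer and TfidfVectorizer. A sketch over a made-up two-document corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [  # made-up Documents
    "the cat sat on the mat",
    "the dog sat on the log",
]

cv = CountVectorizer(
    ngram_range=(1, 2),  # analyze word unigrams and bigrams
    max_features=20,     # keep only the top max features
    min_df=1,            # ignore tokens below this document frequency
)
counts = cv.fit_transform(corpus)  # sparse matrix of token counts
print(cv.get_feature_names_out())
print(counts.toarray())            # dense view, for algorithms that need it

tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(corpus)  # frequent tokens like 'the' get scaled down
print(weights.toarray().round(2))
```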

Once the corpus text is cleaned, pre-processed, and vectorized, it is ready to be used by a Machine Learning algorithm to make predictions.

NLP USE CASE:

Content-based Recommendation System:

NLP is extensively used in the content-based recommendation systems that we come across in our day-to-day lives when searching for movies, groceries, jobs, books, and so on. It involves the following steps:

  1. Data Preprocessing & Tokenization — Dropping punctuation, white spaces, and stop words, lemmatizing, and converting the cleaned Documents into word tokens.
  2. Vectorization — Using a vectorization method to convert the word tokens into a matrix of frequencies or TF-IDF weights.
  3. Normalization — This is an important step that normalizes all Documents to unit norm. This gives all Documents (big or small) a length equal to one, which is important when trying to compare the Documents for similarities.
  4. Cosine Similarity: It measures the cosine of the angle between two token vectors, cosine(A, B) = (A · B) / (‖A‖ ‖B‖), and returns the most similar match. The smaller the angle, the higher the cosine similarity. Since we have normalized the Documents, the denominator of the cosine, the product of the two vector norms, equals one, and the cosine calculation reduces to the dot product of the two vectors. Once the cosines are calculated, the most similar Document, or a list of the top n most similar Documents, can be returned (a sketch of the whole pipeline follows this list).
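
A minimal end-to-end sketch of the four steps with scikit-learn (the item descriptions are made up; TfidfVectorizer handles preprocessing, tokenization, vectorization, and unit-norm scaling in one object):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

documents = [  # made-up item descriptions
    "action movie with car chases and explosions",
    "romantic comedy set in Paris",
    "high speed car racing action film",
]

# Steps 1-3: lowercase, tokenize, drop English stop words, compute TF-IDF
# weights; norm='l2' (the default) normalizes every row to unit length
vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(documents)

# Step 4: with unit-norm rows, cosine similarity reduces to a dot product,
# which linear_kernel computes directly
similarities = linear_kernel(matrix[0], matrix).flatten()
ranked = similarities.argsort()[::-1]
print(documents[ranked[1]])  # most similar Document to Document 0 (index 0 is itself)
```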
