Journey Through the World of NLP (Natural Language Processing): Part 1

Dinesh Kumar
Published in The Startup
7 min read · Jan 18, 2021

Imagine waking up and having a conversation with a digital assistant such as Alexa, Siri, or Google Assistant.

You: "Alexa! How does my schedule look today?"

Alexa: "You have 2 meetings scheduled today, one at 10:00 AM with John Doe and another at 3:00 PM with Jane Doe."

We use these smart assistants for many of our daily tasks, and we rely on them heavily. We talk to these assistants not in a programming language but in our natural language. Natural language is how humans have communicated with one another since the beginning. Teaching computers to analyse and understand human language is called Natural Language Processing (NLP).

Example Applications using NLP

Below are some real-world applications of natural language processing.

  • Social media analysis: Analysing social media feeds to understand the voice of the customer and spot trends.
  • Voice-based assistants: Interacting with the user, understanding user commands, and performing tasks based on them.
  • Email: Platforms from providers such as Google (Gmail) and Microsoft use NLP for various product features such as spam classification, message prioritization, calendar event creation, and auto-complete.
  • Machine translation: Translating text from one language to another (Google Translate, Microsoft Translator).
  • E-commerce: Understanding user reviews and extracting relevant information about products.
  • Automatic summary and report generation in domain-specific areas such as law and medicine.
  • Spelling and grammar correction tools.
  • Plagiarism detection.

Examples of NLP Tasks

  • Language Modelling: The task of predicting the next word based on the preceding words. Language models are heavily used in areas such as speech recognition, OCR, handwriting recognition, machine translation, and spelling correction.
  • Text Classification: Grouping text into a known set of categories based on its content.
  • Information Extraction: Extracting relevant information from text, such as scientific or medical journals and any other text data on the internet.
  • Information Retrieval: Finding documents relevant to a user query from a large collection of data. Most search engines today use this method to retrieve results.
  • Conversational Agent: Dialogue systems that can hold a conversation in human language, e.g. Alexa, Siri, and Google Assistant.
  • Text Summarization: Creating short summaries from longer documents while maintaining the core content and meaning of the document.
  • Question Answering: Systems that can automatically answer questions posed in natural human language.
  • Machine Translation: Converting text from one language to another while preserving its meaning.
  • Topic Modelling: Uncovering the topical structure of a large collection of documents.
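
To make the language modelling task from the list above a bit more concrete, here is a minimal sketch of a bigram model that predicts the next word from the single previous word. The corpus and the function names (`predict_next`, `next_word_probability`) are made up for illustration; real language models are trained on far larger corpora and look at much longer histories.

```python
from collections import Counter, defaultdict

# Toy corpus, invented for illustration only.
corpus = [
    "i like natural language processing",
    "i like machine translation",
    "i like natural language",
    "we study natural language processing",
]

# Count how often each word follows each previous word (bigram counts).
follow_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        follow_counts[prev][nxt] += 1


def predict_next(word):
    """Return the most frequent follower of `word`, or None if it has none."""
    followers = follow_counts.get(word)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


def next_word_probability(prev, nxt):
    """Maximum-likelihood estimate of P(nxt | prev) from the bigram counts."""
    total = sum(follow_counts[prev].values())
    return follow_counts[prev][nxt] / total if total else 0.0
```

On this toy corpus, `predict_next("like")` picks "natural", because "natural" follows "like" in two of the three sentences that contain it.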

Language

Before jumping into NLP algorithms and techniques, we need to understand the core concepts of language.

Language is a structured system of communication that involves a complex combination of components such as characters, words, and sentences.

At its core, human language has four major building blocks:

  • Phonemes: A phoneme is any of the abstract units of the phonetic system of a language that correspond to a set of similar speech sounds (such as the velar \k\ of cool and the palatal \k\ of keel) which are perceived to be a single distinctive sound in the language. Phonemes may not have any meaning by themselves, but they can induce meaning when combined with other phonemes. Speech-to-text, speaker identification, and text-to-speech are some NLP tasks that rely heavily on phonemes.
  • Morphemes & Lexemes: A morpheme is a meaningful unit of language that cannot be divided further. It is formed by a combination of phonemes. A lexeme is, roughly, the set of inflected forms of a single word. Lexemes are a foundation for many NLP tasks such as tokenization, word embedding, and POS tagging.
  • Syntax: The set of rules for constructing grammatically correct sentences out of the words and phrases of a language.
  • Context: Context is how different parts of a language come together to convey a particular meaning. The meaning of a sentence can change based on the context. Context is generally composed of semantics and pragmatics. Semantics is the direct meaning without external context. Pragmatics adds external context by bringing in world knowledge. Some tasks that rely heavily on context are sarcasm detection, text summarization, and topic modelling.
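
As a rough illustration of the morpheme and lexeme idea above, the sketch below splits a word into a stem and a suffix and then groups inflected forms under a shared stem. The suffix list and the helper names (`split_morphemes`, `group_by_lexeme`) are assumptions I made up for this example. It is deliberately naive: it would turn "running" into "runn", and real morphological analysis needs many more rules or a trained model.

```python
from collections import defaultdict

# Hypothetical, hand-picked suffix list, ordered from longest to shortest.
SUFFIXES = ("ing", "ed", "s")


def split_morphemes(word):
    """Split a word into (stem, suffix) using the first suffix that matches."""
    for suffix in SUFFIXES:
        # Require a stem of at least three letters so "is" or "red" stay whole.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)], suffix
    return word, ""


def group_by_lexeme(words):
    """Group inflected forms under their shared stem (a stand-in for the lexeme)."""
    lexemes = defaultdict(list)
    for word in words:
        stem, _ = split_morphemes(word)
        lexemes[stem].append(word)
    return dict(lexemes)
```

For example, `group_by_lexeme(["play", "plays", "played", "playing"])` puts all four forms under the single stem "play".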

Why is NLP challenging?

Ambiguity and creativity of human language are just two of the characteristics that make NLP challenging.

Ambiguity of Language: Ambiguity is the presence of two or more possible meanings in a single passage, that is, uncertainty of meaning. Most human languages are inherently ambiguous.

Take the sentence "He is as good as John Doe": how good he is depends entirely on how good John Doe is. Figurative language increases the ambiguity.

Creativity: Language has evolved over time. It is not only rule-based; there is also creativity to it. Various styles, dialects, and variations are used in all languages. Poetry is a very good example of this creativity.

One of the key challenges in NLP is how to encode all the things that are common knowledge to humans in a computational model.

Diversity across languages: In most cases there is no direct mapping between the vocabularies of any two languages. This hurdle makes NLP solutions harder to build. So either NLP models need to be language agnostic, or models need to be developed for each and every language. Both are a bit unrealistic.

Approaches to NLP

Below are some of the common approaches to solve NLP problems.

  • Rule based NLP
  • Machine Learning
  • Deep Learning
  • Reinforcement Learning

Rule based NLP

This approach involves building rules for the task at hand. Formulating the rules takes some expertise in the domain, and it also requires resources such as dictionaries and thesauruses. WordNet is one resource that aids rule-based NLP. WordNet® is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. It is a database of words and the semantic relations between them. It captures synonyms, hyponyms, and meronyms.

  • Synonyms: Words with similar meanings.
  • Hyponyms: Capture "is a type of" relationships (football, cricket, and tennis are hyponyms of sports).
  • Meronyms: Capture "is a part of" relationships (hands and legs are meronyms of body).
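
To show how such relations can drive a rule, here is a small sketch built on a hand-written dictionary. The `LEXICON` entries, its layout, and the helper names are all assumptions for illustration; this is not WordNet's data or its API, just the same three relations in miniature.

```python
# A tiny hand-built lexical database (hypothetical entries, not WordNet data).
LEXICON = {
    "synonyms": {"big": {"large", "huge"}, "happy": {"glad", "joyful"}},
    "type_of": {"football": "sport", "cricket": "sport", "tennis": "sport"},
    "part_of": {"hand": "body", "leg": "body"},
}


def are_synonyms(a, b):
    """True if either word lists the other as a synonym."""
    synonyms = LEXICON["synonyms"]
    return b in synonyms.get(a, set()) or a in synonyms.get(b, set())


def hyponyms_of(category):
    """Words that are a type of `category`, e.g. kinds of sport."""
    return sorted(w for w, parent in LEXICON["type_of"].items() if parent == category)


def meronyms_of(whole):
    """Words that name a part of `whole`, e.g. parts of the body."""
    return sorted(w for w, parent in LEXICON["part_of"].items() if parent == whole)


def mentions_category(text, category):
    """Rule: a text mentions a category if it contains one of its hyponyms."""
    known = set(hyponyms_of(category))
    return any(word in known for word in text.lower().split())
```

With this in place, a rule such as `mentions_category("I watched cricket yesterday", "sport")` fires even though the word "sport" never appears in the text. The price is that someone has to write and maintain every entry by hand.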

Machine Learning

Machine learning continues to be a method used by many researchers and organizations to perform NLP tasks.

To apply a machine learning algorithm to a given text corpus, the features of the text need to be extracted. There are several methods for extracting features; I will cover them in depth in the next article.

Once the features are extracted, they are used as a representation of the text to learn a model. The model is then evaluated and improved over time.

Some common machine learning algorithms for NLP tasks are Naive Bayes, SVM, logistic regression, hidden Markov models, and conditional random fields.
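
The workflow described above (extract features, learn a model, evaluate it) can be sketched in plain Python. The example below uses bag-of-words counts as features and a multinomial Naive Bayes classifier with add-one smoothing. The training and test sentences, the spam/ham labels, and the function names are invented for illustration; in practice you would most likely reach for a library implementation and a much larger dataset.

```python
import math
from collections import Counter

# Tiny labeled corpus, invented for illustration.
train = [
    ("win a free prize now", "spam"),
    ("free money claim your prize", "spam"),
    ("meeting scheduled for monday", "ham"),
    ("lunch meeting with the team", "ham"),
]


def extract_features(text):
    """Bag-of-words features: a count of each lowercase token."""
    return Counter(text.lower().split())


# Training: accumulate word counts per class and the number of documents per class.
class_docs = Counter()
word_counts = {}
for text, label in train:
    class_docs[label] += 1
    word_counts.setdefault(label, Counter()).update(extract_features(text))

vocabulary = {word for counts in word_counts.values() for word in counts}


def predict(text):
    """Return the class with the highest log-probability (add-one smoothing)."""
    features = extract_features(text)
    best_label, best_score = None, -math.inf
    for label in class_docs:
        total_words = sum(word_counts[label].values())
        score = math.log(class_docs[label] / len(train))  # log prior
        for word, count in features.items():
            likelihood = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += count * math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Evaluation on a (tiny, hypothetical) held-out set.
test_set = [("claim your free prize", "spam"), ("team meeting on monday", "ham")]
accuracy = sum(predict(text) == label for text, label in test_set) / len(test_set)
```

The smoothing term matters here: "on" never appears in the training data, and without the added one its probability would be zero and the logarithm would fail.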

Deep Learning

  • Long short-term memory (LSTM)
  • Convolutional neural networks (CNN)
  • Autoencoders
  • Transformers
  • Recurrent neural networks (RNN)

Problems with deep learning in NLP

  • Overfitting: Deep learning algorithms tend to overfit on small datasets.
  • Few-shot learning and synthetic data generation: Deep learning models still struggle to learn from very few training examples.
  • Domain adaptation: When we develop a large model on some common domains and apply the same model to a newer domain, the model yields poor performance. For example, a model trained on product reviews will not work well when applied to domains such as law or medicine.
  • Interpretable models: It is hard to interpret a DL model because, most of the time, DL models work like a black box. Many businesses demand an explanation of why the model arrived at a specific result.
  • Common sense and world knowledge: Even though we have achieved good results on benchmark NLP tasks using ML and DL models, language remains a big puzzle to researchers and scientists, mainly because human language and common sense have evolved enormously over time. Interpreting the exact meaning in context is therefore a big challenge.
  • Cost: Building DL-based NLP solutions can be pretty expensive because of the enormous number of parameters and the GPU cost.
  • On-device deployment: Some NLP solutions need to be deployed on an embedded device rather than in the cloud, because the model needs to work without relying on the internet.

Example NLP Problem : Conversational Agents

Voice-based conversational agents like Alexa, Siri, and Google Assistant are some existing applications of NLP.

Speech recognition and synthesis

Speech recognition converts speech signals to their phonemes, which are then transcribed as words. Synthesis is the reverse process, where text results are converted into spoken language.

Natural Language understanding

Once the words are received, the text is analysed further using different techniques depending on the use case, such as sentiment analysis, named entity recognition, and coreference resolution.

Dialog management

Once the useful information has been extracted from the user's speech, we need to find the intent behind the words.

Response management

Send the response back to the user using speech synthesis.
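
The text-handling stages above can be sketched as a small pipeline. The sketch assumes speech recognition has already produced a transcript and that a speech synthesizer will read the returned text aloud, so both audio steps are left out. The `CALENDAR` data, the keyword rules, and the function names are hypothetical, and the "understanding" step only tokenizes the transcript; a real assistant would use trained models at every stage.

```python
import re

# Hypothetical calendar data standing in for a real scheduling backend.
CALENDAR = [("10:00 AM", "John Doe"), ("3:00 PM", "Jane Doe")]

# Hypothetical keyword rules mapping words to an intent name.
INTENT_KEYWORDS = {
    "get_schedule": {"schedule", "meeting", "meetings", "calendar"},
    "greeting": {"hello", "hi"},
}


def understand(transcript):
    """Natural language understanding: normalize the transcript into tokens."""
    return re.findall(r"[a-z']+", transcript.lower())


def find_intent(tokens):
    """Dialog management: pick the first intent whose keywords appear."""
    for intent, keywords in INTENT_KEYWORDS.items():
        if keywords & set(tokens):
            return intent
    return "unknown"


def respond(intent):
    """Response management: build the text a speech synthesizer would read."""
    if intent == "get_schedule":
        items = [f"{time} with {person}" for time, person in CALENDAR]
        return (
            f"You have {len(CALENDAR)} meetings scheduled today: "
            + " and ".join(items)
            + "."
        )
    if intent == "greeting":
        return "Hello! How can I help?"
    return "Sorry, I did not understand that."


def handle(transcript):
    """Run a recognized utterance through the three text stages."""
    return respond(find_intent(understand(transcript)))
```

Calling `handle("Alexa! How does my schedule look today?")` walks the opening example of this article through the pipeline and returns the sentence the assistant would speak.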

Reference

  • Sowmya Vajjala, Bodhisattwa Majumder, Anuj Gupta, and Harshit Surana. "Practical Natural Language Processing".
  • Zachary C. Lipton and Jacob Steinhardt. "Troubling Trends in Machine Learning Scholarship" (2018).

Dinesh Kumar
The Startup

A passionate full-stack NLP engineer who is trying to uncover the grand mystery of language.