An Introduction to Natural Language Processing (NLP)

Megha Sinha
Jun 6, 2022


Table of contents

  1. What is NLP
  2. Difference between NLP, NLU and NLG
  3. Approaches to NLP
  4. Open source NLP libraries
  5. Phases or logical steps of NLP
  6. Linguistic Resources in Natural Language Processing
  7. Steps of Core NLP
  8. Some Applications of Natural Language Processing
  9. End Note

Language is a form of communication between humans. One human emits a message, which is a combination of symbols, grammar, signs, feelings, tone, and multiple senses of knowledge, that should enable the receiver to understand it. Everything humans say or write contains a huge amount of information about them and their behavior. Have you ever wondered how modern search engines work, and how chatbots understand our sentences and respond?

All of this is possible because of Natural Language Processing. Using NLP, you can make machines understand human speech and written text.

What is NLP?

Natural Language Processing is a field of artificial intelligence that gives machines the ability to read, understand, and derive meaning from human languages. NLP combines linguistics and computer science to understand language structure and build models that can extract significant details from text and speech. Understanding human language is considered a difficult task due to its complexity: a sentence can carry different meanings, senses, and emotions, and each level of analysis introduces its own ambiguity. So a good, complete NLP system should be able to handle all the complexities and ambiguities of a sentence.

Some examples of NLP-powered software that we use in our daily lives:

  • Personal assistants: Siri, Google Assistant, Cortana, and Alexa.
  • Spell checking: We use this feature everywhere: in the browser, in IDEs (e.g. Visual Studio), in desktop apps (e.g. Microsoft Word), and in email.
  • Search engines: Autocomplete in search engines (e.g. Google, Yahoo), as well as search features built into other software.
  • Machine Translation: Google Translate, Microsoft Translation API, IBM Watson Language Translator API.

Difference between NLP, NLU and NLG

Natural Language Processing (NLP), Natural Language Understanding (NLU), and Natural Language Generation (NLG) are all related topics. At a high level, NLU and NLG are components of NLP.

Let's look at each term individually.

Natural Language Processing systems look at language and figure out what ideas are being communicated.

Natural Language Understanding is a subfield of Natural Language Processing in artificial intelligence that deals with machine reading comprehension and focuses on a machine's ability to understand human language. NLU refers to how unstructured data is rearranged so that machines can understand and analyze it. Machine translation is one example of NLU.

Natural Language Generation is a subfield of Natural Language Processing in artificial intelligence (AI) that automatically generates text in a language as output, based on data as input. It is the process of creating insights in the form of phrases or sentences in natural language. In the age of digitalization and AI, consumers expect personalization, and NLG can provide it at scale. A chatbot is one example of NLG.

Approaches to NLP

There are two main approaches to NLP:

  1. Rule-based (using linguistics)
  2. Model-based (using machine learning and deep learning)

or a combination of the two: the hybrid approach.

Rule-based NLP: This approach is commonly used for structured data and context-free grammars. Rule-based NLP is essentially a system of rules built on linguistic structures, and the rule-based approach can also be used for multilingual problems. Grammar rules can be used very flexibly: for example, we can add synonym features, and the system can easily be updated with new functions, words, and data types without changes to the core system. A rule-based NLP system does not require a massive training corpus, unlike the machine-learning-based approach.

The most obvious disadvantage of the rule-based approach is that it requires a skilled linguist or knowledge engineer to manually encode each rule. Rules need to be manually crafted and maintained over time. Moreover, the system can become so complex that some rules start contradicting each other.
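
To make the idea concrete, here is a minimal sketch of a rule-based extractor in Python: each rule pairs a label with a hand-written pattern. The rules and labels below are illustrative assumptions, not taken from any particular system.

```python
import re

# Each rule is (label, hand-written regular expression).
# These three toy rules stand in for a linguist-authored rule set.
RULES = [
    ("DATE", re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")),
    ("EMAIL", re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")),
    ("MONEY", re.compile(r"\$\d+(?:\.\d{2})?")),
]

def extract_entities(text):
    """Apply every rule to the text and collect (label, match) pairs."""
    found = []
    for label, pattern in RULES:
        for match in pattern.finditer(text):
            found.append((label, match.group()))
    return found

print(extract_entities("Pay $20 to bob@example.com by 12/31/2022."))
```

Note the trade-off described above: each new entity type needs another hand-crafted rule, but no training corpus is required.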

Topics that come under linguistics:

  1. Phonetics: The study of individual speech sounds.
  2. Phonology: The study of how sounds are organized systematically.
  3. Morphology: The study of how words are constructed from primitive meaningful units (morphemes).
  4. Syntax: Determining the structural role of words in sentences and phrases.
  5. Semantics: The meaning of words and how words combine into meaningful phrases and sentences.
  6. Pragmatics: Using and understanding sentences in different situations, and how the situation affects interpretation.

Model-based NLP: This approach relies on algorithms that learn to understand language using machine-learning models, and it works on both structured and unstructured data. Using statistical methods, the system analyzes a training set (an annotated corpus) to build its own knowledge, producing its own rules and classifiers, and the results follow a probabilistic approach. Machine learning is good at tasks such as document classification, word clustering, and prediction. Machine-learning approaches can significantly speed up development when rich training and test sets are available. However, gathering meaningful data is a difficult task: the most obvious disadvantage of an ML-based system is that it requires a lot of training data, and a lack of data is a problem for this type of system.
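
As a sketch of this statistical approach, the following is a tiny Naive Bayes text classifier built from scratch on a toy annotated corpus. The corpus and class labels are made up for illustration; real systems train on large datasets, typically with a library such as scikit-learn.

```python
import math
from collections import Counter, defaultdict

# A toy annotated corpus: (text, label) pairs.
train = [
    ("free prize win money now", "spam"),
    ("win cash prize claim now", "spam"),
    ("meeting agenda attached please review", "ham"),
    ("please review the project report", "ham"),
]

class NaiveBayes:
    def fit(self, data):
        """Count words per class -- the 'knowledge' learned from the corpus."""
        self.word_counts = defaultdict(Counter)
        self.class_counts = Counter()
        for text, label in data:
            self.class_counts[label] += 1
            self.word_counts[label].update(text.split())
        self.vocab = {w for c in self.word_counts.values() for w in c}

    def predict(self, text):
        """Score each class: log prior + log likelihood (add-one smoothing)."""
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label in self.class_counts:
            score = math.log(self.class_counts[label] / total_docs)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for word in text.split():
                score += math.log((self.word_counts[label][word] + 1) / denom)
            scores[label] = score
        return max(scores, key=scores.get)

clf = NaiveBayes()
clf.fit(train)
print(clf.predict("claim your free prize"))  # -> "spam"
```

The classifier derives its own probabilistic "rules" from the annotated examples, which is exactly the contrast with the hand-crafted rules of the previous section.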

Hybrid approach: This is the most flexible and popular approach for complex systems. Both of the other approaches have their own limitations, so combining them can reduce those limitations.

For example, if a system's main goal is to help a database understand a query placed by the user in human language and translate it into a language the database is familiar with, we can use the formal grammar of the corresponding language.

On the other hand, for a problem such as classifying text on the basis of its content, the best solution is the model-based approach.

The hybrid approach combines both solutions and, for a complex problem, is easier than either approach alone. The process can be divided into three steps:

  1. Gather data and build a corpus.
  2. Understand the basic linguistic concepts and preprocess the data according to the NLP application.
  3. Extract features using the linguistic, statistical, and computational concepts of feature engineering.
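
The three steps can be sketched as follows, with a hand-written rule (masking numbers) as linguistic preprocessing and bag-of-words counts as features for a downstream model. The corpus and the masking rule are illustrative assumptions.

```python
import re
from collections import Counter

# Step 1: gather data and build a (toy) corpus.
corpus = ["Order 2 pizzas for $15", "Cancel my order please"]

def preprocess(text):
    """Step 2: rule-based preprocessing -- lowercase, then mask
    numbers/amounts with a <num> placeholder before tokenizing."""
    text = text.lower()
    text = re.sub(r"\$?\d+(?:\.\d+)?", "<num>", text)
    return re.findall(r"[a-z<>]+", text)

def features(tokens):
    """Step 3: extract bag-of-words count features for a statistical model."""
    return Counter(tokens)

for doc in corpus:
    print(features(preprocess(doc)))
```

Here the hand-crafted rule normalizes the input so the statistical model sees fewer distinct tokens, which is the usual division of labor in a hybrid system.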

Open source NLP libraries

These libraries provide the algorithmic building blocks of NLP in real-world applications.

  • Apache OpenNLP
  • Natural Language Toolkit (NLTK)
  • Stanford CoreNLP
  • MALLET
  • spaCy

Phases or logical steps of NLP

  1. Syntactic Analysis
  2. Semantic Analysis
  3. Pragmatic Analysis

Syntactic Analysis

The purpose of this phase is to analyze the grammatical structure of the text: syntactic analysis checks the text for well-formedness against the rules of a formal grammar.

The main roles of the parser include:

  • To report any syntax errors.
  • To recover from commonly occurring errors so that processing of the remainder of the input can continue.
  • To create a parse tree.
  • To create a symbol table.
  • To produce intermediate representations (IR).

Syntactic Analysis further includes some specific techniques:

  1. Lemmatization: Reducing the multiple inflected forms of a word to a single base form (lemma) for easier analysis
  2. Morphological segmentation: Dividing words into their smallest meaningful units, called morphemes
  3. Word segmentation: Dividing a large piece of continuous text into individual words
  4. Part-of-speech tagging: Identifying the part of speech of each word
  5. Parsing: Grammatical analysis of a given sentence
  6. Sentence breaking: Placing sentence boundaries in a large piece of text
  7. Stemming: Cutting inflected words down to their root form
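
Several of these techniques can be illustrated with plain Python. The suffix-stripping stemmer below is a deliberately crude sketch, not the Porter algorithm that real libraries such as NLTK use.

```python
import re

def sentence_break(text):
    """Naive sentence breaking: split after ., !, or ? followed by space."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

def tokenize(sentence):
    """Naive word segmentation: words and punctuation marks as tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)

def stem(word):
    """Crude suffix-stripping stemmer (illustrative only, not Porter)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

text = "Cats are playing. The dog barked!"
for sent in sentence_break(text):
    print([stem(t.lower()) for t in tokenize(sent)])
```

Running this prints `['cat', 'are', 'play', '.']` and `['the', 'dog', 'bark', '!']`, showing sentence breaking, word segmentation, and stemming chained together.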

Morphological segmentation is grounded in morphology, the study of:

  • The formation of words.
  • The origin of the words.
  • Grammatical forms of the words.
  • Use of prefixes and suffixes in the formation of words.
  • How parts-of-speech (PoS) of a language are formed.

E.g., we can break the word foxes into two parts, fox and -es. The word foxes is made up of two morphemes: one is fox and the other is -es.

Terms that come under morphological segmentation:

  • Lexicon (the list of stems with basic information about them, e.g. whether a stem is a noun stem or a verb stem)
  • Morphotactics (the model of morpheme ordering, e.g. the fact that the English plural morpheme follows the noun rather than preceding it)
  • Orthographic rules (spelling rules for combining morphemes, e.g. converting y to ie in words like city + -s = cities, not citys)
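
As a sketch, an orthographic rule like the city + -s = cities example can be encoded directly. The function below covers only a few regular English cases and is purely illustrative.

```python
def pluralize(noun):
    """Toy English pluralization applying orthographic rules:
    -y -> -ies after a consonant, -s/-x/-z/-ch/-sh -> -es, else -s."""
    vowels = "aeiou"
    if noun.endswith("y") and len(noun) > 1 and noun[-2] not in vowels:
        return noun[:-1] + "ies"          # city -> cities
    if noun.endswith(("s", "x", "z", "ch", "sh")):
        return noun + "es"                # fox -> foxes
    return noun + "s"                     # dog -> dogs

for word in ("city", "fox", "dog", "boy"):
    print(word, "->", pluralize(word))
```

Note how the rule is conditional on spelling context ("y after a consonant"), which is exactly what distinguishes orthographic rules from plain morpheme concatenation.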

Semantic Analysis

Semantics refers to the meaning and logic conveyed through a text. It involves implementing computer algorithms to work out the interpretation of words and the structure of sentences.

Here are some techniques in semantic analysis:

  • Named entity recognition (NER): Discovering the parts of a text that identify entities and classifying them into predetermined groups. Common examples include the names of places and people.
  • Word sense disambiguation: Determining which sense of a word is intended based on its context.
  • Natural language generation: Using structured data to derive semantic intentions and turn them into human language.
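
Word sense disambiguation can be sketched with a simplified Lesk algorithm, which picks the sense whose dictionary gloss shares the most words with the surrounding context. The glosses below are hand-written toys; real systems use WordNet glosses (e.g. via NLTK's `lesk`).

```python
# Hand-written toy sense inventory for the ambiguous word "bank".
SENSES = {
    "bank": {
        "financial": "institution that accepts deposits and lends money",
        "river": "sloping land beside a body of water",
    }
}

def lesk(word, context):
    """Pick the sense whose gloss overlaps most with the context words."""
    context_words = set(context.lower().split())
    best, best_overlap = None, -1
    for sense, gloss in SENSES[word].items():
        overlap = len(context_words & set(gloss.split()))
        if overlap > best_overlap:
            best, best_overlap = sense, overlap
    return best

print(lesk("bank", "I deposited money at the bank"))    # -> financial
print(lesk("bank", "We sat on the bank of the water"))  # -> river
```

The overlap count is a crude proxy for meaning, but it captures the core idea: context decides the sense.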

Pragmatic Analysis

Pragmatic Analysis deals with the overall communicative and social context and its effect on interpretation. It means abstracting or deriving the meaningful use of language in situations. In this analysis, the focus is always on reinterpreting what was said in terms of what was actually meant.

E.g., “Pruning a tree is a long process.”

Here, pruning a tree is a concept from computer science algorithm techniques, so the word pruning is not related to cutting a physical tree; we are talking about an algorithm. This is an ambiguous situation, and how to deal with such ambiguity is still an open area of research. Big tech companies use deep learning techniques for pragmatic analysis, trying to recover the accurate context of a sentence in order to build highly accurate NLP applications.

Linguistic Resources in Natural Language Processing

  1. Vocabulary : The entire set of terms used in a body of text.
  2. Out of Vocabulary : In NLP, data used to train our model consists of a finite number of vocabulary terms. Very often, we will encounter out of vocabulary terms when using our model for inference. Typically, a common placeholder is assigned for these terms.
  3. Documents: A document refers to a body of text; a collection of documents makes up a corpus. For instance, a movie review or an email is an example of a document.
  4. Corpus : A corpus is a collection of text written or audio spoken by a native of the language or dialect organized into datasets. A corpus can be made up of everything from newspapers, novels, recipes, radio broadcasts to television shows, movies, and tweets.
    In NLP, some corpora can be used for tasks such as semantic role labeling and provide useful information about words. Two well-known examples:
  • VerbNet (VN): VerbNet is a hierarchical, domain-independent lexical resource, the largest for English, that incorporates both semantic and syntactic information about its contents.
  • WordNet: WordNet, created at Princeton, is a lexical database for the English language. In WordNet, nouns, verbs, adjectives, and adverbs are grouped into sets of cognitive synonyms called synsets. All the synsets are linked by conceptual-semantic and lexical relations. This structure makes it very useful for natural language processing (NLP).
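
To illustrate how a WordNet-style resource is organized, here is a toy in-memory stand-in with synsets linked by hypernym ("is-a") relations. The entries are illustrative; real code would use NLTK's WordNet interface.

```python
# A toy stand-in for a WordNet-style lexical database:
# each synset has lemmas and a link to its hypernym (broader term).
SYNSETS = {
    "dog.n.01": {"lemmas": ["dog", "domestic_dog"], "hypernym": "canine.n.01"},
    "canine.n.01": {"lemmas": ["canine"], "hypernym": "mammal.n.01"},
    "mammal.n.01": {"lemmas": ["mammal"], "hypernym": None},
}

def hypernym_chain(synset_id):
    """Walk the hypernym links up to the root, WordNet-style."""
    chain = []
    while synset_id is not None:
        chain.append(synset_id)
        synset_id = SYNSETS[synset_id]["hypernym"]
    return chain

print(hypernym_chain("dog.n.01"))
# -> ['dog.n.01', 'canine.n.01', 'mammal.n.01']
```

Chains like this are what make WordNet useful for NLP: two words can be related by walking their hypernym paths toward a common ancestor.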

Steps of Core NLP

  1. Preprocessing
  2. Stop words
  3. Tokenization
  4. (Word) Embeddings
  5. n-grams
  6. POS tagging (part-of-speech tagging)
  7. Multi-word detection
  8. Number identification
  9. NER (named entity recognition)
  10. Constituency parsing
  11. Dependency parsing
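
The early steps of this pipeline can be sketched in plain Python. The stop-word list below is a small illustrative sample; later steps such as POS tagging, NER, and parsing require a trained model, e.g. spaCy.

```python
import re

# A small illustrative stop-word list (real lists are much longer).
STOP_WORDS = {"the", "is", "a", "an", "of", "and", "to"}

def pipeline(text, n=2):
    """Chain the first core-NLP steps: preprocessing, tokenization,
    stop-word removal, and n-gram extraction."""
    text = text.lower()                                   # 1. preprocessing
    tokens = re.findall(r"[a-z]+", text)                  # 3. tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]   # 2. stop words
    ngrams = list(zip(*(tokens[i:] for i in range(n))))   # 5. n-grams
    return tokens, ngrams

tokens, bigrams = pipeline("The quick brown fox jumps over the lazy dog")
print(tokens)
print(bigrams)
```

Each step consumes the previous step's output, which is why these are usually described as phases of a single pipeline rather than independent tools.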

Some Applications of Natural Language Processing

  • Language Translation (Tools such as Google Translate, Amazon Translate, etc. translate sentences from one language to another using NLP)
  • Search engine
  • Chatbots
  • Sentiment Analysis
  • Text Summarizers
  • Speech Recognition
  • Autocorrect (automatically corrects spelling and grammar mistakes)

End Note

This is just a tiny taste of what you can do with NLP. In future posts, we'll talk about each NLP concept in depth.

But until then, install the libraries and start playing around: if you are a Python user, install spaCy and use NLTK; if you are a Java user, try Stanford CoreNLP.


Megha Sinha

Working as a Senior Data Scientist at Whiz.ai | Ex-NOKIA | Alumnus of NIT Jamshedpur | Natural Language Processing