Natural Language Processing (NLP): A beginner’s guide

Reshma J
IETE SF MEC
Aug 29, 2021

The year was 1950. The scientist Alan Turing described a test: if someone asked a series of questions to a human and a machine simultaneously, and could find no noticeable difference between the answers, then the machine could be said to possess the ability to think. The exchange had to happen over a teleprinter, without the questioner knowing which respondent was which. Turing named it the Imitation Game, now known as the ‘Turing Test’. It was a breakthrough idea. Can machines really be made to ‘think’?

The ability to think is not unique to human beings, but it is most advanced in us. So is the ability to communicate through language. The languages we use are complex and sophisticated, with their own rules and vocabulary; even our closest relatives among species do not have this potential. The parts of the human brain primarily responsible for producing speech and for understanding it are Broca’s area and Wernicke’s area respectively, both located in the left hemisphere.

In 1952, Alan Hodgkin and Andrew Huxley published a mathematical model of how neurons generate and conduct electrical signals, work that later earned them the Nobel Prize in Physiology or Medicine in 1963. Developments like these paved the way for the evolution of computers, Artificial Intelligence and Natural Language Processing (NLP).

Artificial Intelligence (AI) is based on mimicking human intelligence, while NLP takes it a step further: it tries to enable computer programs to understand human language. This is a tough task, given that roughly 7,000 languages are spoken around the world. Add to that the differences in slang, grammar and dialect. Whoa!

Now let us address the elephant in the room. What is the need, or rather the motivation, to invest resources in this herculean task?

  1. Data, data everywhere- Machines generate data. Humans generate data. It is impossible for humans to handle all of this data effectively on their own.
  2. Unbiased intent- Humans are biased; our childhood, surroundings, culture and experiences shape us. These biases are not necessarily bad, but they do colour our decisions, and bringing in a machine can reduce that effect.
  3. Inclusivity- The aim of technology is to reach everyone, so that all can reap its benefits. Communicating with machines in natural language accommodates even the most novice users.

How does it work?

NLP works in two main phases: data preprocessing and algorithm development.

Data preprocessing

The text data received will contain a lot of detail. Some of it may be necessary and some unnecessary, so it is important to ‘sieve’ the data first. The methods employed are:

Tokenization: The given text is broken down into small chunks known as tokens. Smaller units are easier to work with than larger ones, and the chances of error are lower.

Removal of stop words: Common words that add little meaning, such as ‘the’, are removed.

Lemmatization and stemming: Both reduce a word to a base form. Stemming simply chops off word endings (for example, ‘playing’ becomes ‘play’), while lemmatization uses context and vocabulary to return the word’s dictionary form, or lemma (for example, ‘children’ becomes ‘child’).

Lower casing: All words are converted to lower case; otherwise ‘CASE’ and ‘case’ would be treated as two different words.

Part-of-speech tagging: Each word is tagged with its part of speech, i.e. whether it is a noun, a verb, and so on.
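
Here is a minimal sketch of these preprocessing steps using NLTK (one of the tools listed later). The example sentence is made up, and the exact NLTK resource names can vary slightly between versions.

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# One-time downloads; exact resource names may differ across NLTK versions.
for resource in ("punkt", "stopwords", "wordnet", "averaged_perceptron_tagger"):
    nltk.download(resource)

text = "The children were playing cricket in the park."

# Tokenization: break the text into word-level tokens.
tokens = nltk.word_tokenize(text)

# Lower casing: so 'CASE' and 'case' become the same token.
tokens = [t.lower() for t in tokens]

# Removal of stop words: drop very common words such as 'the' and 'in'.
stop_words = set(stopwords.words("english"))
content = [t for t in tokens if t.isalpha() and t not in stop_words]

# Stemming chops endings ('playing' -> 'play'); lemmatization returns the
# dictionary form ('children' -> 'child').
print([PorterStemmer().stem(t) for t in content])
print([WordNetLemmatizer().lemmatize(t) for t in content])

# Part-of-speech tagging: label each remaining token as noun, verb, etc.
print(nltk.pos_tag(content))
```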

Algorithm development

Rule-based system: Traditionally used, these systems are built on rules derived from linguistic structures, and they try to mimic how humans form sentences from those structures.

Machine learning based system: The system learns statistical patterns from a labelled training set and, in effect, builds its own rules, as the sketch below illustrates.
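
As a rough illustration of the machine learning approach, here is a sketch that trains a tiny text classifier with scikit-learn. scikit-learn is not among the tools listed later in this article, and the toy training sentences and labels are invented purely for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# A toy labelled training set: the model infers its own "rules" from these examples.
train_texts = [
    "I loved this movie, it was wonderful",
    "What a fantastic and enjoyable film",
    "This was a terrible, boring movie",
    "I hated the film, complete waste of time",
]
train_labels = ["positive", "positive", "negative", "negative"]

# Bag-of-words features plus a Naive Bayes classifier, a classic statistical baseline.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

print(model.predict(["an enjoyable and wonderful movie"]))  # -> ['positive']
```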

Nowadays, deep learning is used as it is more flexible and attempts to mimic the way a child learns a language.

Techniques and tools for NLP

Many techniques are used for NLP. The two main terms that stand out here are syntax and semantics.

Syntax deals with the grammar of a sentence, while semantics refers to its meaning. For example: ‘Ram plays Guitar.’ This sentence is both syntactically and semantically correct. Meanwhile, ‘Guitar plays Ram’ is syntactically correct but semantically wrong, as it doesn’t make any sense. Each technique can be classified as syntactic or semantic.

Syntactical analysis

Word segmentation and sentence breaking

Splitting text into words by locating whitespace, and splitting it into sentences by looking for the period ‘.’, are also part of syntactic analysis, and both are trickier than they sound, as the sketch below shows.
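
A small sketch, again with NLTK, of why sentence breaking needs more than hunting for periods; the example text is made up.

```python
import nltk
nltk.download("punkt")  # the resource name may differ slightly across NLTK versions

text = "Dr. Smith moved to the U.S. in 2019. He now teaches NLP."

# Naively splitting on '.' wrongly breaks abbreviations like 'Dr.' and 'U.S.'.
print(text.split("."))

# NLTK's trained sentence tokenizer handles such cases far better.
print(nltk.sent_tokenize(text))
# typically: ['Dr. Smith moved to the U.S. in 2019.', 'He now teaches NLP.']
```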

Syntactic analysis also has different levels, such as:

POS tagging or part-of-speech tagging, which classifies words based on their part of speech.

Constituency parsing: Noun phrases, verb phrases and prepositional phrases are some of the constituents of English. Replacing one constituent with another of the same type does not affect the grammaticality of the sentence, but, as the Ram-Guitar example above shows, the result need not be semantically correct (a code sketch follows this list).

Dependency parsing: A more advanced approach that extends well to other languages. It concentrates on the subject-predicate structure: most sentences have a subject and a predicate (verb + object), so a sentence conveys who the subject is, what it does (the verb), and to whom or what it is done (the object).
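
Below is a rough sketch of both parsing styles: a toy constituency parse using NLTK’s chart parser with a hand-written grammar, and a dependency parse using the spaCy library. spaCy is not in the tools list below, and its small English model has to be downloaded separately.

```python
import nltk
import spacy

# Constituency parsing with a tiny hand-written grammar (illustrative only).
grammar = nltk.CFG.fromstring("""
    S  -> NP VP
    VP -> V NP
    NP -> 'Ram' | 'Guitar'
    V  -> 'plays'
""")
parser = nltk.ChartParser(grammar)
for sentence in (["Ram", "plays", "Guitar"], ["Guitar", "plays", "Ram"]):
    for tree in parser.parse(sentence):
        # Both word orders parse, because the grammar checks structure, not meaning.
        print(tree)  # e.g. (S (NP Ram) (VP (V plays) (NP Guitar)))

# Dependency parsing with spaCy.
# Assumes: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
for token in nlp("Ram plays the guitar."):
    # Each word points to its head with a label, e.g. 'Ram' is the nominal
    # subject (nsubj) of 'plays', and 'guitar' is its object.
    print(f"{token.text:10} {token.dep_:10} -> {token.head.text}")
```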

Semantic analysis

Semantics play a very important role in understanding natural language.

Some semantic analysis techniques include:

1. Named Entity Recognition

A very popular semantic technique, NER scans text and labels named entities such as people, organizations, places and dates. It is used extensively by search engines. (A code sketch illustrating this and some of the techniques below follows the list.)

2. Sentiment Analysis/ Opinion Mining

As the name indicates, sentiment analysis tries to understand the emotion behind a piece of text. It is widely used for product reviews, online course reviews and the like.

3. Natural Language Generation/ Data storytelling

The opposite of natural language understanding, it converts large amounts of structured data into natural language for ease of understanding.

4. Topic Modelling

It is an unsupervised technique used to identify the topics present in a text. Algorithms used for topic modelling include Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA).
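
To make these techniques more concrete, here is a rough sketch of named entity recognition and sentiment analysis with NLTK, and topic modelling with Gensim (one of the tools listed below). The sample sentence, review and documents are invented for illustration, and NLTK resource names can vary by version.

```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
from gensim import corpora, models

# One-time downloads; exact resource names may differ across NLTK versions.
for resource in ("punkt", "averaged_perceptron_tagger",
                 "maxent_ne_chunker", "words", "vader_lexicon"):
    nltk.download(resource)

# Named Entity Recognition: tokenize, tag, then chunk named entities.
sentence = "Alan Turing worked at the University of Manchester."
print(nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sentence))))
# labels chunks such as PERSON and ORGANIZATION

# Sentiment analysis with NLTK's built-in VADER analyzer.
sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores("The course was excellent and very well paced."))

# Topic modelling with Latent Dirichlet Allocation (LDA) in Gensim.
docs = [
    ["guitar", "chords", "music", "song"],
    ["song", "album", "music", "band"],
    ["election", "vote", "policy", "government"],
    ["parliament", "policy", "election", "minister"],
]
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)
print(lda.print_topics())  # two word distributions, roughly 'music' vs 'politics'
```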

Information thus obtained from the text can be used in ML models or used directly.

Tools

Some of the most popular tools include:

  1. Natural Language Toolkit (NLTK) - a Python library
  2. Stanford CoreNLP - Stanford's NLP toolkit
  3. Google Cloud Natural Language API
  4. TextBlob - a Python library for processing textual data
  5. Amazon Comprehend - an NLP service that uses machine learning
  6. Gensim - an open-source Python library for topic modelling and related NLP tasks

Real-world applications of NLP

Some real world applications of NLP include:

  1. E-mail filters: classifying mails as primary, social and promotions, and separating spam from regular mail.
  2. Smart assistants: Apple's Siri and Amazon's Alexa recognize speech patterns and give appropriate responses. To learn more about how this works, check out our article on voice assistants.
  3. Text summarization, in applications like Inshorts.
  4. Text prediction and auto-correction while chatting or writing e-mails.
  5. Analyzing customer feedback; a few apps which use this include Uber and Zomato.
  6. Detecting plagiarism in written work.
  7. Academic and research purposes.

Conclusion

Natural Language Processing is a vast subject, advancing every day. As mentioned before, conquering the vast multitude of languages is a persistent, daunting task. At the same time, we need to acknowledge the extent to which NLP has already simplified our daily lives, which makes it a goal worth pursuing. After all, practice makes a ‘machine’ perfect!

Resources

https://www.analyticssteps.com/blogs/introduction-natural-language-processing-text-cleaning-preprocessing
