Natural Language Processing (NLP)
Natural Language Processing is the process of deriving meaningful, high-quality information from natural language text. The overall goal is essentially to turn text into data for analysis.
Introduction to NLP
Natural Language Processing (NLP) is a field of Artificial Intelligence that gives machines the ability to read, understand, and derive meaning from human languages. It is a part of Computer Science and AI that deals with human language, and it seeks to bridge the gap between human communication and computer understanding.
Why is NLP important?
1. Large Volume of textual data:
Natural language processing helps computers communicate with humans in their own language and scales other language-related tasks. For example, NLP makes it possible for computers to read text, hear speech, interpret it, measure sentiment and determine which parts are important.
2. Structuring Unstructured data:
Human language is astoundingly complex and diverse. We express ourselves in infinite ways, both verbally and in writing. Not only are there hundreds of languages and dialects, but within each language is a unique set of grammar and syntax rules, terms and slang. When we write, we often misspell or abbreviate words, or omit punctuation. When we speak, we have regional accents, and we mumble, stutter and borrow terms from other languages. NLP helps to structure the data.
Branches of NLP
- Natural Language Understanding (NLU): NLU is a branch of natural language processing (NLP) that helps computers understand and interpret human language by breaking speech down into its elemental pieces. In NLU, machine learning models improve over time as they learn to recognize syntax, context, language patterns, unique definitions, sentiment, and intent. NLU enables human-computer interaction.
- Natural Language Generation (NLG): NLG is the use of artificial intelligence (AI) programming to produce written or spoken narratives from a data set. For example, NLG enables a chatbot to interrogate data repositories, including integrated back-end systems and third-party databases, and to use that information in creating a response.
Applications of NLP
Search Autocorrect and Auto Complete:
Whenever we search for something on Google, it shows possible search terms after we type just 2–3 letters. And if we search for something with typos, it corrects them and still finds relevant results. It’s a wonderful application of natural language processing and a great example of how NLP affects millions of people around the world.
Machine Translation:
Have you ever used Google Translate to find out what a particular word or phrase is in a different language? I’m sure it’s a YES! The technique behind it is Machine Translation: the procedure of automatically converting text in one language to another language while keeping the meaning intact. These tools help numerous people and businesses break the language barrier.
Chatbots:
Customer service and experience are among the most important things for any company. Good service helps companies improve their products and keep customers satisfied. But interacting with every customer manually and resolving each problem can be a tedious task. This is where chatbots come into the picture. Chatbots help companies achieve a smooth customer experience. Today, many companies use chatbots in their apps and websites to answer customers’ basic queries. This not only makes the process easier for the companies but also saves customers from the frustration of waiting to reach customer call assistance. Additionally, it can reduce the cost of hiring call center representatives.
Voice Assistants:
I am sure you’ve already met Google Assistant, Apple Siri, and Amazon Alexa. All of these are voice assistants. A voice assistant is software that uses speech recognition, natural language understanding, and natural language processing to understand a user’s verbal commands and perform actions accordingly. Voice assistants can do much more than a chatbot can.
Ambiguities in NLP:
- Lexical Ambiguity: This type of ambiguity arises when a single word can have multiple meanings, either because two words share the same form or because one word has more than one sense. For example, in “She is looking for a match”, match may mean a cricket match, a life partner, and so on.
- Syntactic Ambiguity: This type of ambiguity arises when a sentence can be parsed into more than one syntactic structure, so that it has two or more distinct meanings as a result of how its words are arranged. For example, “The chicken is ready to eat” can mean that the chicken is ready to eat its food or that the chicken has been cooked and is ready to be eaten.
- Semantic Ambiguity: This type of ambiguity relates to the interpretation of a sentence. It arises when a sentence can be read in more than one way because a word form in it corresponds to more than one meaning. For example, the English word “organ” denotes both a body part and a musical instrument.
Six Major Components of NLP:
Tokenization:
The initial step in NLP is tokenization, the process of breaking a string, such as a complex sentence, into individual words (tokens). Consecutive tokens can also be grouped together:
- Bigrams: Tokens of two consecutive written words.
- Trigrams: Tokens of three consecutive written words.
- N-grams: Tokens of any number of consecutive written words.
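Tokenization and n-grams as described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the article's implementation: the `tokenize` and `ngrams` helpers are hypothetical names, and the regular expression is a simplifying assumption that treats any run of letters, digits, or apostrophes as a word.

```python
import re

def tokenize(text):
    """Split text into lowercase word tokens (letters, digits, apostrophes)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = tokenize("NLP turns raw text into data")
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(tokens)    # ['nlp', 'turns', 'raw', 'text', 'into', 'data']
print(bigrams)
print(trigrams)
```

A library tokenizer handles punctuation, contractions, and other edge cases far more carefully; the regex here only keeps the example self-contained.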
Stemming:
Stemming is the process of reducing a word to its stem by stripping affixes such as suffixes and prefixes. The resulting stem need not be a proper dictionary word. For example, Amusement, Amusing, and Amused all reduce to a common stem such as Amus.
Lemmatization:
Lemmatization relies on a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma. It is similar to stemming; the key difference is that the result of lemmatization is always a proper word. For example, Amusing and Amused both map to the lemma Amuse.
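The contrast between the two techniques can be seen in a toy sketch. Everything here is an assumption made for illustration: `toy_stem` is a crude suffix stripper (not the Porter algorithm or any NLTK stemmer), and `LEMMAS` is a tiny hand-written table standing in for the vocabulary a real lemmatizer would consult.

```python
# Suffixes the toy stemmer knows about, checked in this order.
SUFFIXES = ("ement", "ing", "ed", "es", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical lookup table; a real lemmatizer uses a full dictionary
# plus morphological analysis.
LEMMAS = {
    "amusing": "amuse",
    "amused": "amuse",
    "amuses": "amuse",
    "mice": "mouse",
}

def toy_lemmatize(word):
    """Return the dictionary form if known, otherwise the word unchanged."""
    word = word.lower()
    return LEMMAS.get(word, word)

for w in ["Amusement", "Amusing", "Amused"]:
    print(w, "->", toy_stem(w))       # each prints the stem "amus"

for w in ["Amusing", "Amused", "mice"]:
    print(w, "->", toy_lemmatize(w))  # amuse, amuse, mouse
```

The stemmer's output ("amus") is not a dictionary word, while the lemmatizer always returns one, which is the practical difference to weigh when choosing between them.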
Parts of Speech (POS) Tags:
POS tagging is the process of converting a sentence into a list of words and then into a list of tuples, where each tuple has the form (word, tag). The tag is a part-of-speech tag that signifies whether the word is a noun, adjective, verb, and so on.
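To make the (word, tag) output format concrete, here is a toy tagger driven by a small lookup table. The `LEXICON` and `toy_pos_tag` names are hypothetical, and the lookup approach is an assumption for illustration only; real taggers learn tags from annotated corpora and use context. The tag names follow the common Penn Treebank convention (DT determiner, JJ adjective, NN noun, VBZ verb, IN preposition).

```python
# Hypothetical mini-lexicon mapping words to Penn Treebank style tags.
LEXICON = {
    "the": "DT", "a": "DT",
    "quick": "JJ", "brown": "JJ", "lazy": "JJ",
    "fox": "NN", "dog": "NN",
    "jumps": "VBZ",
    "over": "IN",
}

def toy_pos_tag(words):
    """Return a list of (word, tag) tuples; unknown words default to NN."""
    return [(w, LEXICON.get(w.lower(), "NN")) for w in words]

words = "The quick brown fox jumps over the lazy dog".split()
tagged = toy_pos_tag(words)
print(tagged)
```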
Named Entity Recognition:
Named entity recognition is a natural language processing technique that can automatically scan entire articles, pull out fundamental entities in the text, and classify them into predefined categories. Entities may be:
- People’s names
- Company names
- Geographic locations (both physical and political)
- Product names
- Dates and times
- Amounts of money
- Names of events
- Percentages, and more.
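For the entity types above that follow a fixed surface pattern (amounts of money, percentages, dates), a rule-based sketch shows what "pull out and classify" means in practice. This is an assumption-laden toy: the `PATTERNS` table and `toy_ner` function are hypothetical, the date pattern only covers the YYYY-MM-DD form, and names of people, companies, or places cannot be captured this way; they generally require a trained model.

```python
import re

# Hypothetical patterns for three pattern-like entity categories.
PATTERNS = {
    "MONEY": re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?"),
    "PERCENT": re.compile(r"\d+(?:\.\d+)?%"),
    "DATE": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
}

def toy_ner(text):
    """Return (entity_text, label) pairs ordered by position in the text."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((match.start(), match.group(), label))
    return [(entity, label) for _, entity, label in sorted(found)]

text = "Revenue rose 12.5% to $1,200,000 on 2021-03-31."
entities = toy_ner(text)
print(entities)
# [('12.5%', 'PERCENT'), ('$1,200,000', 'MONEY'), ('2021-03-31', 'DATE')]
```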
Chunking:
Chunking in NLP is the process of taking small pieces of information and grouping them into larger units. Its primary use is forming groups of “noun phrases.” It adds structure to a sentence by applying regular expressions over the output of POS tagging.
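The combination of POS tags and regular expressions can be sketched as follows. The example assumes the sentence is already tagged and uses one hypothetical noun-phrase rule: an optional determiner (DT), any number of adjectives (JJ), then one or more nouns (NN). The `np_chunks` helper is illustrative, not a library function.

```python
import re

def np_chunks(tagged):
    """Group (word, tag) pairs matching DT? JJ* NN+ into noun-phrase chunks."""
    # Encode each tag as a single character so a regex can scan the tag sequence.
    codes = {"DT": "D", "JJ": "J", "NN": "N"}
    tag_string = "".join(codes.get(tag, "x") for _, tag in tagged)
    chunks = []
    for match in re.finditer(r"D?J*N+", tag_string):
        words = [word for word, _ in tagged[match.start():match.end()]]
        chunks.append(" ".join(words))
    return chunks

tagged = [
    ("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
    ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN"),
]
print(np_chunks(tagged))  # ['the little yellow dog', 'the cat']
```

Because each tag maps to exactly one character, match offsets in the tag string line up with positions in the tagged list, which is what lets the regex result be turned back into words.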
Implementing NLP on a Dataset:
Now that we have understood the concepts of NLP, let’s dive deeper and go step by step through implementing them on a real-world problem.
Fake News Classifier
Importing the Libraries:
The very first step of the implementation is to import the required libraries and download the NLTK resources they need.
NLTK: NLTK stands for Natural Language Toolkit. It contains text processing libraries for tokenization, parsing, classification, stemming, tagging, and semantic reasoning. It also includes graphical demonstrations and sample data sets, and it is accompanied by a cookbook and a book that explains the principles behind the language processing tasks NLTK supports.
Now we will preprocess the text. In this step we first remove all the stop words and then apply either stemming or lemmatization, depending on the business use case.
Stop words are a set of commonly used words in a language. Examples of stop words in English are “a”, “the”, “is”, and “are”. They are removed in Text Mining and Natural Language Processing (NLP) because they are so common that they carry very little useful information.
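Stop-word removal amounts to filtering tokens against a set. The sketch below uses a small hand-picked `STOP_WORDS` set, which is an assumption for illustration; NLTK ships a much longer English list in its stopwords corpus, and the sample sentence is made up.

```python
# A small hand-picked stop-word set, for illustration only.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "in", "to", "and"}

def remove_stop_words(tokens):
    """Keep only the tokens that are not stop words (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = "The president is in the capital to sign a bill".split()
filtered = remove_stop_words(tokens)
print(filtered)  # ['president', 'capital', 'sign', 'bill']
```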
For this use case, I used stemming to preprocess the news text.
Feature Extraction for the text
This step can be carried out using one of three methods, depending on the use case:
1. Bag of Words: A bag-of-words is a representation of text that describes the occurrence of words within a document. It involves two things:
a) A vocabulary of known words.
b) A measure of the presence of known words.
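The two ingredients of a bag-of-words, a vocabulary and a measure of presence, can be sketched with raw counts as the measure. The helper names and the two sample documents are assumptions for illustration; libraries such as scikit-learn provide equivalent, more complete vectorizers.

```python
from collections import Counter

def build_vocabulary(documents):
    """Sorted list of every distinct word across the tokenized documents."""
    return sorted({word for doc in documents for word in doc})

def bag_of_words(doc, vocabulary):
    """Count vector with one entry per vocabulary word, in vocabulary order."""
    counts = Counter(doc)
    return [counts[word] for word in vocabulary]

docs = [
    "the cat sat on the mat".split(),
    "the dog sat".split(),
]
vocab = build_vocabulary(docs)
vectors = [bag_of_words(doc, vocab) for doc in docs]

print(vocab)       # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(vectors[0])  # [1, 0, 1, 1, 1, 2]
print(vectors[1])  # [0, 1, 0, 0, 1, 1]
```

Note that word order is discarded: each document becomes a fixed-length vector whose length equals the vocabulary size.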
2. TF-IDF: TF-IDF (term frequency-inverse document frequency) is a statistical measure that evaluates how relevant a word is to a document in a collection of documents.
- Term Frequency: is a scoring of the frequency of the word in the current document.
- Inverse Document Frequency: is a scoring of how rare the word is across documents.
TF = (Number of occurrences of the word in a document) / (Total number of words in the document)
IDF = log[(Total number of documents) / (Number of documents containing the word)]
TF-IDF = TF * IDF
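The formulas above translate directly into code. This sketch assumes the natural logarithm (the article does not specify a base) and assumes the word occurs in at least one document, since otherwise the IDF denominator would be zero. The function names and sample documents are illustrative.

```python
import math

def tf(word, doc):
    """Term frequency: occurrences of the word divided by document length."""
    return doc.count(word) / len(doc)

def idf(word, docs):
    """Inverse document frequency; assumes the word appears in some document."""
    containing = sum(1 for doc in docs if word in doc)
    return math.log(len(docs) / containing)

def tf_idf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
]

# "the" appears in every document, so its IDF (and TF-IDF) is zero.
print(tf_idf("the", docs[0], docs))  # 0.0
# "cat" appears in only one document, so it receives a positive score.
print(tf_idf("cat", docs[0], docs))
```

Library implementations often add smoothing terms to the IDF, so their scores may differ from this plain version.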
3. Word2Vec:
In both the Bag of Words and TF-IDF approaches, the semantic meaning of words is not captured. There is also a chance of overfitting.
In Word2Vec, each word is represented as a vector of 32 or more dimensions instead of a single number. This preserves semantic information and the relationships between different words.
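One common way to read relationships out of word vectors is cosine similarity: related words should have vectors pointing in similar directions. The sketch below only illustrates that comparison. The three-dimensional `toy_vectors` are hand-made for the example and are NOT trained Word2Vec embeddings; real vectors come from training a model on a large corpus.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hand-made toy vectors, NOT real embeddings (assumption for illustration).
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

sim_related = cosine_similarity(toy_vectors["king"], toy_vectors["queen"])
sim_unrelated = cosine_similarity(toy_vectors["king"], toy_vectors["apple"])
print(sim_related > sim_unrelated)  # True
```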
For this use case, I used Bag of Words to extract features, but you can use any one of the above.
Now we will apply various machine learning models, predict the output, and measure the accuracy of our model.
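As one possible illustration of this final step, here is a minimal multinomial Naive Bayes classifier over word counts, written in plain Python. The article does not say which models were used, so the choice of Naive Bayes is an assumption, as are the function names. The headlines are made-up toy data, not the article's dataset, so the resulting accuracy says nothing about real-world performance.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs, labels):
    """Collect per-class document counts and word counts from tokenized docs."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)
    vocab = {word for doc in docs for word in doc}
    return class_counts, word_counts, vocab

def predict(model, doc):
    """Return the class with the highest log-probability for the document."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in doc:
            if word in vocab:  # ignore words never seen in training
                # Laplace (add-one) smoothing avoids zero probabilities.
                p = (word_counts[label][word] + 1) / (total_words + len(vocab))
                score += math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up toy headlines, for illustration only.
train_docs = [
    "scientists publish peer reviewed study".split(),
    "government releases official budget report".split(),
    "shocking miracle cure doctors hate".split(),
    "you will not believe this shocking secret".split(),
]
train_labels = ["real", "real", "fake", "fake"]
model = train_naive_bayes(train_docs, train_labels)

test_docs = ["official study report".split(), "shocking secret cure".split()]
test_labels = ["real", "fake"]
predictions = [predict(model, doc) for doc in test_docs]
accuracy = sum(p == t for p, t in zip(predictions, test_labels)) / len(test_labels)
print(predictions, accuracy)
```

In practice the same train, predict, and score loop would run on the stemmed, vectorized news text with a held-out test split.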
The source code for this is available on my GitHub repository.
With this, we come to the end of our article on Natural Language Processing. I hope it helps you learn Natural Language Processing (NLP) in a better manner.
For any queries, contact me on LinkedIn.