Natural Language Processing For Beginners

Nidhifab
Aug 5, 2020 · 6 min read


Natural Language Processing is a subfield of Artificial Intelligence that consists of systematic processes for analyzing, understanding, and deriving information from text data. NLP solves a wide range of problems such as automatic summarization, machine translation, named entity recognition, relationship extraction, sentiment analysis, speech recognition, and topic segmentation. Organizations today deal with a huge amount and wide variety of data: calls from customers, their emails, tweets, data from mobile applications, and more. It takes a lot of effort and time to make this data useful, and one of the core skills for extracting information from text data is Natural Language Processing (NLP).

Natural Language Processing plays a critical role in supporting machine-human interactions.

Benefits of NLP

NLP hosts benefits such as:

· Improved accuracy and efficiency of documentation.

· The ability to automatically generate a readable summary of a text.

· Useful for personal assistants such as Alexa.

· Allows an organization to use chatbots for customer support.

· Easier to perform sentiment analysis.

Techniques used in NLP:

Syntactic analysis and semantic analysis are the main techniques used to complete Natural Language Processing tasks.

1. Syntax

Syntax refers to the arrangement of words in a sentence such that they make grammatical sense.

In NLP, syntactic analysis is used to assess how the natural language aligns with the grammatical rules.

Here are some syntax techniques that can be used (a short NLTK sketch follows the list):

· Lemmatization: It entails reducing the various inflected forms of a word into a single form for easy analysis.

· Word segmentation: It involves dividing a large piece of continuous text into distinct units.

· Part-of-speech tagging: It involves identifying the part of speech for every word.

· Parsing: It involves undertaking grammatical analysis for the provided sentence.

· Sentence breaking: It involves placing sentence boundaries on a large piece of text.

· Stemming: It involves cutting the inflected words to their root form.
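As a quick illustration of two of these techniques, sentence breaking and word segmentation, here is a minimal sketch using NLTK's standard tokenizers (the example sentence is made up, and nltk.download('punkt') must be run once beforehand):

from nltk.tokenize import sent_tokenize, word_tokenize

# run nltk.download('punkt') once before using these tokenizers
text = "NLP is fun. It helps machines read text."

sentences = sent_tokenize(text)       # sentence breaking: split the text into sentences
words = word_tokenize(sentences[0])   # word segmentation: split a sentence into words

print(sentences)
print(words)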

2. Semantics

Semantics refers to the meaning that is conveyed by a text. Semantic analysis is one of the difficult aspects of Natural Language Processing that has not been fully resolved yet.

It involves applying computer algorithms to understand the meaning and interpretation of words and how sentences are structured.

Here are some techniques in semantic analysis:

· Named entity recognition (NER): It involves determining the parts of a text that can be identified and categorized into preset groups. Examples of such groups include names of people and names of places.

import spacy

# load the small English model (install it once with: python -m spacy download en_core_web_sm)
nlp = spacy.load('en_core_web_sm')

sentence = "Ram of Apple Inc. travelled to Sydney on 5th October 2017"
for token in nlp(sentence):
    print(token, token.ent_type_)

· Word sense disambiguation / Named entity disambiguation: It involves giving meaning to a word based on the context, since some words have multiple meanings. For example, in the sentence:

“Apple earned a revenue of 200 Billion USD in 2016”

It is the task of Named Entity Disambiguation to infer that Apple in the sentence is the company Apple and not a fruit.
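For word-level disambiguation, NLTK ships a simple implementation of the Lesk algorithm. The sketch below is only an illustration with a made-up sentence (run nltk.download('wordnet') and nltk.download('punkt') once beforehand); entity-level disambiguation, as in the Apple example above, usually relies on knowledge bases rather than a dictionary lookup.

from nltk.tokenize import word_tokenize
from nltk.wsd import lesk

# "bank" is ambiguous: a financial institution or the side of a river
sentence = "I went to the bank to deposit my money"
sense = lesk(word_tokenize(sentence), 'bank')
print(sense, '-', sense.definition())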

· Natural language generation: It involves using databases to derive semantic intentions and convert them into human language.
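As a toy illustration of the idea (real NLG systems are far more sophisticated), a fixed template can turn a structured record, such as a row pulled from a database, into a human-readable sentence; the record and template below are made up:

# hypothetical structured record, e.g. a row retrieved from a database
record = {"company": "Apple", "revenue": "200 Billion USD", "year": 2016}

# a fixed template that converts the record into natural language
template = "{company} earned a revenue of {revenue} in {year}."
print(template.format(**record))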

Stemming

Stemming is the process of reducing the words(generally modified or derived) to their word stem or root form.

For example, "good", "better" and "best" are stemmed to "good", "better" and "best" respectively; a stemmer only trims affixes, so it cannot relate these irregular forms to a single root (something lemmatisation, covered below, handles better).

Natural Language Toolkit (NLTK) is a library for NLP that helps with various tasks such as classification, tokenization, stemming, tagging, and parsing. NLTK is free and open source.

Some terms that are used in the article:

· Tokenization: the process of converting a text into tokens

· Tokens: words or entities present in the text

· Text object: a sentence, a phrase, a word, or an article

Installing nltk

pip install -U nltk

# importing required libraries
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer

# run nltk.download('wordnet') once before using the WordNet lemmatizer
lem = WordNetLemmatizer()
stem = PorterStemmer()

word = "multiplying"

print('\n\nStemming\n\n')
print(stem.stem(word))

We are using the Porter stemmer algorithm for stemming; it reduces "multiplying" to the stem "multipli".

Lemmatisation

Lemmatisation is the process of reducing a group of words to their lemma, or dictionary form. It does the same job as stemming but with some major differences: to resolve a word to its lemma, lemmatisation needs to know its part of speech, and it also takes into account the meaning of the word in the sentence and in nearby sentences before reducing it. As a result, lemmatisation gives a more reliable root form than stemming.

Below is the sample code that performs lemmatization and stemming using python’s popular library — NLTK.

from nltk.stem.wordnet import WordNetLemmatizer
lem = WordNetLemmatizer()

from nltk.stem.porter import PorterStemmer
stem = PorterStemmer()

word = "multiplying"

lem.lemmatize(word, "v")
>> "multiply"

stem.stem(word)

Part-Of-Speech Tagging

In simple terms, Part-Of-Speech Tagging is the process of marking up the words in a sentence as nouns, verbs, adjectives, adverbs, etc. For example, in the sentence:

Give me that red purse.

Give: verb

red: adjective

purse: noun

from nltk import word_tokenize, pos_tag

text = "I am learning Data Science"
tokens = word_tokenize(text)
print(pos_tag(tokens))
>>> [('I', 'PRP'), ('am', 'VBP'), ('learning', 'VBG'), ('Data', 'NNP'), ('Science', 'NNP')]

Statistical Features

Text data can also be quantified directly into numbers using several techniques described in this section:

Term Frequency — Inverse Document Frequency (TF — IDF)

Term Frequency (TF): TF for a term "t" is defined as the count of the term "t" in a document "D".

Inverse Document Frequency (IDF): IDF for a term "t" is defined as the logarithm of the ratio of the total number of documents in the corpus to the number of documents containing the term "t".

TF-IDF: multiplying the two, TF * IDF, gives the relative importance of a term in a corpus (a collection of documents).
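Before handing this over to a library, the formula can be checked by hand. The snippet below computes the plain TF and IDF defined above for one term in a small made-up corpus; note that scikit-learn's TfidfVectorizer (used next) applies a smoothed IDF and normalises the vectors, so its numbers will differ:

import math

# a tiny, made-up corpus of three "documents"
corpus = [
    "this is a sample document".split(),
    "another random document".split(),
    "third sample document text".split(),
]

term = "sample"
document = corpus[0]

tf = document.count(term)                            # term frequency in document 0
docs_with_term = sum(term in doc for doc in corpus)  # number of documents containing the term
idf = math.log(len(corpus) / docs_with_term)         # log(total docs / docs containing term)

print(tf * idf)   # relative importance of "sample" in document 0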

from sklearn.feature_extraction.text import TfidfVectorizer

obj = TfidfVectorizer()
corpus = ['This is sample document.', 'another random document.', 'third sample document text']
X = obj.fit_transform(corpus)
print(X)
>>> (0, 1)  0.345205016865
    (0, 4)  0.444514311537
    ...
    (2, 1)  0.345205016865
    (2, 4)  0.444514311537

Word Embedding (text vectors)

It is a method of representing words as vectors. Word embeddings are widely used in deep learning models such as Convolutional Neural Networks and Recurrent Neural Networks. Word2Vec and GloVe are two popular models for creating word embeddings; they take a text corpus as input and produce word vectors as output.

from gensim.models import Word2Vec

sentences = [['data', 'science'], ['science', 'data', 'analytics'], ['machine', 'learning'], ['deep', 'learning']]

# train the model on your corpus
model = Word2Vec(sentences, min_count=1)

# in recent gensim versions the trained vectors live under model.wv
print(model.wv.similarity('data', 'science'))
>>> 0.11222489293
print(model.wv['learning'])
>>> array([ 0.00459356  0.00303564 -0.00467622  0.00209638, ...])

Text Summarization

Text Summarization is the process of shortening a text by identifying its important points and creating a summary from them.

Here is how you can quickly summarize your text using the gensim package (note that the gensim.summarization module is only available in gensim versions before 4.0).

from gensim.summarization import summarize

sentence="Automatic summarization is the process of shortening a text document with software, in order to create a summary with the major points of the original document. Technologies that can make a coherent summary take into account variables such as length, writing style and syntax.Automatic data summarization is part of machine learning and data mining. The main idea of summarization is to find a subset of data which contains the information of the entire set. Such techniques are widely used in industry today. Search engines are an example; others include summarization of documents, image collections and videos. Document summarization tries to create a representative summary or abstract of the entire document, by finding the most informative sentences, while in image summarization the system finds the most representative and important (i.e. salient) images. For surveillance videos, one might want to extract the important events from the uneventful context.There are two general approaches to automatic summarization: extraction and abstraction. Extractive methods work by selecting a subset of existing words, phrases, or sentences in the original text to form the summary. In contrast, abstractive methods build an internal semantic representation and then use natural language generation techniques to create a summary that is closer to what a human might express. Such a summary might include verbal innovations. Research to date has focused primarily on extractive methods, which are appropriate for image collection summarization and video summarization."

print(summarize(sentence))
