Natural Language Processing (NLP) Steps

Dr. Vivek Singhal
3 min read · Feb 13, 2020


NLP is an amalgamation of computer science, linguistics, and machine learning.

The field of NLP is broadly divided into three parts:

· Speech Recognition

· Natural Language Understanding

· Natural Language Generation

Syntax & Semantics:

· Syntax is the grammatical structure of the text.

· Semantics is the meaning that is being conveyed.

1. Tokenization

Given a text sequence, tokenization is the task of breaking it into fragments, typically at whitespace boundaries. Often, certain characters such as punctuation, digits, and emoticons are removed. The fragments that get returned are what we call tokens.

tokens composed of one word -> unigrams

tokens composed of two consecutive words -> bi-grams

tokens composed of three consecutive words -> tri-grams

tokens composed of n consecutive words -> n-grams
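As a minimal sketch, here is how tokens and n-grams might be extracted with NLTK (assuming the library and its tokenizer data are installed; the example sentence is illustrative):

```python
# A sketch with NLTK; assumes: pip install nltk
import nltk
from nltk.tokenize import word_tokenize
from nltk.util import ngrams

nltk.download("punkt", quiet=True)  # tokenizer data; newer NLTK may need "punkt_tab"

text = "NLP combines computer science, linguistics, and machine learning."
tokens = word_tokenize(text)        # unigrams
bigrams = list(ngrams(tokens, 2))   # pairs of consecutive tokens
trigrams = list(ngrams(tokens, 3))  # triples of consecutive tokens

print(tokens[:4])   # ['NLP', 'combines', 'computer', 'science']
print(bigrams[:2])  # [('NLP', 'combines'), ('combines', 'computer')]
```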

2. Part-of-Speech (POS) Tagging

Traditionally, there are eight parts of speech:

Noun (N) — Ram, Raghu, Kinjal

Verb (V) — go, run, speak, play, eat

Adjective (ADJ) — big, small, happy, green, young

Adverb (ADV) — slowly, quickly, very, always

Preposition (P) — in, on, at, with

Conjunction (CON) — and, or, but, because, nor

Pronoun (PRO) — I, you, we, he, she, they

Interjection (INT) — Wow!, Yay!, Ouch!, Great!
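A minimal POS-tagging sketch with NLTK's default tagger (the sentence and the expected tags are illustrative):

```python
# A sketch with NLTK's averaged-perceptron tagger.
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)  # newer NLTK may use an "_eng" variant

tokens = nltk.word_tokenize("Ram quickly ate the big green apple.")
print(nltk.pos_tag(tokens))
# e.g. [('Ram', 'NNP'), ('quickly', 'RB'), ('ate', 'VBD'), ('the', 'DT'),
#       ('big', 'JJ'), ('green', 'JJ'), ('apple', 'NN'), ('.', '.')]
```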

3. Parsing (Chunking)

Parsing resolves a sentence into its component parts and describes their syntactic roles.

i) Parts of Speech
N = Noun
V = Verb
DT = Determiner

ii) Phrases
NP = Noun Phrase: “the man”, “the cake”
VP = Verb Phrase: “ate the cake”

iii) Relationship between POS and Phrases: a chunker groups POS-tagged words into phrases, as in the sketch below.
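A minimal chunking sketch with NLTK's RegexpParser; the NP rule here (optional determiner, any adjectives, then a noun) is an illustrative assumption, not the only possible grammar:

```python
# A sketch with NLTK's RegexpParser.
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

tagged = nltk.pos_tag(nltk.word_tokenize("The man ate the cake."))
grammar = "NP: {<DT>?<JJ>*<NN.*>}"  # NP = optional determiner + adjectives + noun
tree = nltk.RegexpParser(grammar).parse(tagged)
print(tree)
# (S (NP The/DT man/NN) ate/VBD (NP the/DT cake/NN) ./.)
```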

4. Named Entity Recognition (NER)

Given a text sequence, the task is to locate and classify words or phrases that belong to predefined categories, such as the name of a person, an organization, or a location. NER is quite handy in information extraction for seeking out and identifying entities.
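A minimal NER sketch, assuming spaCy and its small English model are installed:

```python
# A sketch with spaCy; assumes: pip install spacy
# and: python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Sundar Pichai is the CEO of Google, headquartered in California.")
for ent in doc.ents:
    print(ent.text, ent.label_)
# Sundar Pichai PERSON
# Google ORG
# California GPE
```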

5. Stemming

Stemming is the process of reducing inflected words to their root (stem) form by chopping off affixes; the resulting stem need not be a valid word.
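A minimal stemming sketch with NLTK's PorterStemmer:

```python
# A sketch with NLTK's PorterStemmer; note the stems need not be valid words.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["running", "flies", "studies", "happily"]:
    print(word, "->", stemmer.stem(word))
# running -> run, flies -> fli, studies -> studi, happily -> happili
```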

6. Lemmatization

Lemmatization reduces a word to its dictionary form (lemma), taking the word's POS tag into account before reducing it, so the output is always a valid word.
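A minimal lemmatization sketch with NLTK's WordNetLemmatizer, showing how the POS argument changes the result:

```python
# A sketch with NLTK's WordNetLemmatizer (assumes the "wordnet" data package).
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)
lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("better", pos="a"))   # good (adjective)
print(lemmatizer.lemmatize("running", pos="v"))  # run (verb)
print(lemmatizer.lemmatize("running", pos="n"))  # running (noun, unchanged)
```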

7. Removal of Stopwords

Stopwords are high-frequency words that carry little meaning on their own, such as ‘the’, ‘a’, ‘an’, ‘is’, and ‘as’; they are often removed before further processing.
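A minimal stopword-removal sketch using NLTK's built-in English stopword list:

```python
# A sketch using NLTK's English stopword list.
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
stop_words = set(stopwords.words("english"))

tokens = ["the", "cat", "is", "on", "a", "mat"]
print([t for t in tokens if t not in stop_words])  # ['cat', 'mat']
```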

8. TF-IDF

TF: The TF (Term Frequency) of a word is the frequency of the word (i.e., the number of times it appears) in a document.

TF = (number of times the term appears in the document) / (total number of terms in the document)

IDF: IDF (Inverse Document Frequency) is a measure of how significant a term is across the whole corpus/dataset.

IDF = log(total number of documents in the corpus / number of documents containing the term)

So, the TF-IDF score for a term, say “mango”, would be given as:

TF-IDF score = TF × IDF

The higher the TF-IDF score, the more distinctive the term is for that document: it occurs frequently in the document but rarely across the corpus.
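A minimal TF-IDF sketch with scikit-learn's TfidfVectorizer over a toy corpus (note that sklearn smooths the IDF term, so scores differ slightly from the plain log formula above):

```python
# A sketch with scikit-learn; assumes: pip install scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "mango is a sweet fruit",
    "apple is a fruit",
    "mango shake is sweet",
]
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(corpus)  # shape: (3 documents, N terms)

print(vectorizer.get_feature_names_out())
print(matrix.toarray().round(2))
```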

9. Word Embedding

A word embedding maps each word, via some dictionary (vocabulary), to a vector representation.

Different types of word embeddings:

· Frequency-based Embeddings

· Prediction-based Embeddings

Frequency-based Embeddings:

- Count Vector

- TF-IDF Vector

- Co-occurrence Vector

Count Vector:

The dictionary is created as the list of unique tokens (words) in the corpus.

Here, D = number of documents in the corpus and N = number of unique tokens.

Therefore, the count-vector matrix has shape D × N.
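A minimal count-vector sketch with scikit-learn's CountVectorizer; the two-document corpus below is illustrative:

```python
# A sketch with scikit-learn's CountVectorizer: D documents -> a D x N count matrix.
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "he is a lazy boy",
    "she is a lazy person",
]
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())  # the dictionary of unique tokens
print(counts.toarray())                    # shape: D x N
```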

Co-occurrence Vector:

‘Co-occurrence’: for a given corpus, the co-occurrence of a pair of words, say w1 and w2, is the number of times they appear together within a context window (specified by a window size and a direction).
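A minimal sketch of co-occurrence counting, assuming a forward context window of size 2 over a toy corpus:

```python
# A sketch of co-occurrence counting with a forward context window of size 2.
from collections import Counter

tokens = "he is not lazy he is intelligent he is smart".split()
window = 2
pairs = Counter()
for i, w1 in enumerate(tokens):
    for w2 in tokens[i + 1 : i + 1 + window]:  # words within the window
        pairs[tuple(sorted((w1, w2)))] += 1

print(pairs[("he", "is")])  # 4 in this toy corpus
```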

Prediction-based Embeddings:

Word2Vec

Word2vec is a combination of two techniques:

· Continuous Bag of Words (CBOW) — predicts the probability of a word given its context.

· Skip-gram — predicts the context given a word.

Difference between CBOW and Skip-gram:

· Skip-gram represents rare words and phrases better and works well with small training data.

· CBOW is faster to train and performs slightly better on frequent words.
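A minimal Word2vec sketch with gensim (the toy sentences are illustrative); sg=1 selects skip-gram, sg=0 selects CBOW:

```python
# A sketch with gensim; assumes: pip install gensim
from gensim.models import Word2Vec

sentences = [
    ["nlp", "combines", "linguistics", "and", "machine", "learning"],
    ["word2vec", "learns", "vector", "representations", "of", "words"],
]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(model.wv["nlp"].shape)                 # (50,) -- the learned embedding
print(model.wv.most_similar("nlp", topn=2))  # nearest words in vector space
```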
