Natural Language Processing (NLP) Steps
NLP is an amalgamation of computer science, linguistics, and machine learning.
The NLP field is broadly divided into three parts:
· Speech Recognition
· Natural Language Understanding
· Natural Language Generation
Syntax & Semantics:
· Syntax is the grammatical structure of the text.
· Semantics is the meaning that is being conveyed.
1. Tokenization
Given a text sequence, tokenization is the task of breaking it into fragments, typically separated by whitespace. Certain characters, such as punctuation, digits, and emoticons, are often removed in the process. The fragments that get returned are what we call tokens.
tokens composed of 1 word -> unigrams
tokens composed of 2 consecutive words -> bigrams
tokens composed of 3 consecutive words -> trigrams
tokens composed of n consecutive words -> n-grams
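As a quick sketch, tokenization and n-gram extraction can be done in plain Python (the regex-based `tokenize` and the `ngrams` helper below are illustrative choices, not a standard API):

```python
import re

def tokenize(text):
    # Lowercase and keep only alphabetic runs (drops punctuation and digits)
    return re.findall(r"[a-z]+", text.lower())

def ngrams(tokens, n):
    # Slide a window of size n over the token list
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = tokenize("NLP is fun, really fun!")
print(tokens)             # ['nlp', 'is', 'fun', 'really', 'fun']
print(ngrams(tokens, 2))  # [('nlp', 'is'), ('is', 'fun'), ('fun', 'really'), ('really', 'fun')]
```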
2. Parts of Speech (POS)-Tagging
There are eight parts of speech:
Noun(N) — Ram, Raghu, Kinjal
Verb(V) — go, run, speak, play, eat
Adjective(ADJ) — big, small, happy, green, young
Adverb(ADV) — slowly, quickly, very, always
Prepositions(P) — in, on, at, with
Conjunction(CON) — and, or, but, because
Pronoun(PRO) — I, you, we, he, she, they
Interjection(INT) — Wow!, Yay!, ouch!, great!
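In practice a trained tagger (e.g. NLTK's `pos_tag`) assigns these labels; the toy lookup tagger below only illustrates the idea of mapping each token to one of the tags above (the `LEXICON` dictionary is made up for this example, and real taggers also use context):

```python
# Toy lookup tagger; real POS taggers use trained models that consider context.
LEXICON = {
    "ram": "N", "go": "V", "big": "ADJ", "slowly": "ADV",
    "in": "P", "and": "CON", "they": "PRO", "wow": "INT",
}

def tag(tokens):
    # Unknown words default to Noun, a common fallback heuristic
    return [(w, LEXICON.get(w.lower(), "N")) for w in tokens]

print(tag(["They", "go", "slowly"]))  # [('They', 'PRO'), ('go', 'V'), ('slowly', 'ADV')]
```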
3. Parsing (Chunking)
Parsing is resolving a sentence into its component parts and describing their syntactic roles.
i) Parts of Speech
N = Noun
V = Verb
DT = Determiner/Preposition
ii) Phrases
NP = Noun Phrase: “the man”, “the cake”
VP = Verb Phrase: “ate the cake”
iii) Relationship between POS and Phrases
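A minimal noun-phrase chunker can be sketched over (word, tag) pairs; the DT? ADJ* N pattern and the `np_chunks` helper below are illustrative choices, not a standard grammar:

```python
def np_chunks(tagged):
    # Greedy scan for the pattern: optional determiner, any adjectives, a noun
    chunks, i = [], 0
    while i < len(tagged):
        j = i
        if j < len(tagged) and tagged[j][1] == "DT":
            j += 1
        while j < len(tagged) and tagged[j][1] == "ADJ":
            j += 1
        if j < len(tagged) and tagged[j][1] == "N":
            chunks.append(" ".join(w for w, _ in tagged[i:j + 1]))
            i = j + 1
        else:
            i += 1
    return chunks

tagged = [("the", "DT"), ("man", "N"), ("ate", "V"), ("the", "DT"), ("cake", "N")]
print(np_chunks(tagged))  # ['the man', 'the cake']
```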
4. Named Entity Recognition (NER)
Given a text sequence, the task is to locate and classify words or phrases that fall into predefined categories — such as names of people, organizations, or locations. NER is quite handy in information extraction for seeking and identifying entities.
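Real NER systems rely on trained sequence models; the capitalization heuristic below is only a toy sketch of the idea of locating entity spans (the `naive_ner` helper is made up for this example):

```python
def naive_ner(tokens):
    # Crude heuristic: group consecutive capitalized tokens, skipping the
    # sentence-initial position. Real NER uses trained sequence models.
    entities, current = [], []
    for i, tok in enumerate(tokens):
        if tok[:1].isupper() and i > 0:
            current.append(tok)
        else:
            if current:
                entities.append(" ".join(current))
            current = []
    if current:
        entities.append(" ".join(current))
    return entities

tokens = "Yesterday Ram visited New Delhi with Kinjal".split()
print(naive_ner(tokens))  # ['Ram', 'New Delhi', 'Kinjal']
```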
5. Stemming
Stemming is the process of reducing inflected words to their root (stem) form, usually by chopping off suffixes — e.g. “playing” → “play”. The resulting stem need not be a valid dictionary word.
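A crude suffix-stripping stemmer can be sketched in a few lines (real stemmers such as Porter's apply many ordered rules; `crude_stem` and its suffix list are illustrative):

```python
def crude_stem(word):
    # Strip a common suffix, but only if a reasonable stem remains
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([crude_stem(w) for w in ["playing", "played", "plays", "play"]])
# ['play', 'play', 'play', 'play']
```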
6. Lemmatization
Lemmatization is similar to stemming, but it considers the POS tag of the word and reduces it to a valid dictionary base form (the lemma) — e.g. “better” (adjective) → “good”.
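A toy lemmatizer can be sketched as a POS-aware dictionary lookup (the `LEMMAS` table is made up for this example; real lemmatizers such as the WordNet lemmatizer consult a full vocabulary):

```python
# Toy POS-aware lemma lookup; real lemmatizers use a full dictionary.
LEMMAS = {
    ("better", "ADJ"): "good",
    ("ate", "V"): "eat",
    ("feet", "N"): "foot",
}

def lemmatize(word, pos):
    # Fall back to the word itself when no lemma is known
    return LEMMAS.get((word, pos), word)

print(lemmatize("ate", "V"))       # eat
print(lemmatize("better", "ADJ"))  # good
```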
7. Removal of Stopwords
Stopwords are very common words that carry little meaning on their own — like ‘the’, ‘a’, ‘an’, ‘is’, ‘as’ — and are often removed before further processing.
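Stopword removal is a simple filter over the token list (the stopword set below is a small illustrative subset; libraries ship much larger lists):

```python
STOPWORDS = {"the", "a", "an", "is", "as"}  # small illustrative set

def remove_stopwords(tokens):
    # Keep only tokens that are not in the stopword set (case-insensitive)
    return [t for t in tokens if t.lower() not in STOPWORDS]

print(remove_stopwords(["The", "cat", "is", "on", "a", "mat"]))
# ['cat', 'on', 'mat']
```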
8. TF-IDF
TF: The TF (Term Frequency) of a word is the frequency of the word (i.e. the number of times it appears) in a document.
TF = (number of times the term occurs in the document) / (total number of terms in the document)
IDF: The IDF (Inverse Document Frequency) is a measure of how significant (i.e. how rare) the term is across the whole corpus/dataset.
IDF = log(total number of documents in the corpus / number of documents containing the term)
So, for a term such as “mango”, the TF-IDF score is given as,
TF-IDF Score = TF * IDF
A high TF-IDF score means the term is frequent within that document but rare across the corpus, which makes it a good discriminator; terms common to every document score low.
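The two formulas can be computed directly; the three-document corpus below is made up for this example:

```python
import math

docs = [
    ["mango", "is", "a", "fruit"],
    ["mango", "mango", "shake"],
    ["apple", "is", "a", "fruit"],
]

def tf(term, doc):
    # Fraction of the document's terms that are this term
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Log of (total documents / documents containing the term)
    containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# "mango" appears in 2 of 3 documents: TF = 2/3, IDF = log(3/2)
print(round(tf_idf("mango", docs[1], docs), 2))  # 0.27
```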
9. Word Embedding
A word embedding maps each word, via some dictionary/vocabulary, to a vector representation.
Different types of Word Embeddings-
· Frequency based Embeddings
· Prediction based Embeddings
Frequency based Embeddings:
- Count Vector
- TF-IDF Vector
- Co-occurrence Vector
Count Vector:
The dictionary is created as the list of unique tokens (words) in the corpus.
Here, D = no. of documents in the corpus
N = no. of unique tokens
Therefore, the Count Vector matrix will have shape D x N.
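A count-vector matrix can be built directly from the corpus vocabulary (the three-document corpus below is made up for this example):

```python
corpus = ["he is lazy", "she is smart", "he is smart"]

# Dictionary: sorted list of unique tokens across the corpus (N columns)
vocab = sorted({w for doc in corpus for w in doc.split()})

# D x N matrix of raw counts, one row per document
matrix = [[doc.split().count(w) for w in vocab] for doc in corpus]

print(vocab)      # ['he', 'is', 'lazy', 'she', 'smart']
print(matrix[0])  # [1, 1, 1, 0, 0]
```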
Co-occurrence Vector:
‘Co-occurrence’: for a given corpus, the co-occurrence of a pair of words, say w1 and w2, is the number of times they have occurred together within the Context Window (specified by a size and a direction).
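Co-occurrence counts within a symmetric context window can be sketched as follows (window size 2, counting both directions; the `cooccurrence` helper is illustrative):

```python
from collections import Counter

def cooccurrence(tokens, window=2):
    # Count (word, neighbour) pairs within a symmetric context window
    counts = Counter()
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if i != j:
                counts[(w, tokens[j])] += 1
    return counts

tokens = "he is lazy he is smart".split()
print(cooccurrence(tokens)[("he", "is")])  # 3
```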
Prediction based Embeddings:
Word2Vec
Word2vec comes in two model architectures —
· Continuous bag of words (CBOW)- It predicts the probability of a word given a context.
· Skip-gram model — The skip-gram predicts the context given a word.
Difference between CBOW and Skip-gram:
· CBOW is faster to train and represents frequent words slightly better.
· Skip-gram works well with small amounts of training data and represents rare words and phrases well.
(Handling OOV — out-of-vocabulary — tokens is a property of subword models such as fastText, not of plain skip-gram or CBOW.)
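Without training an actual model, the difference in training examples can be sketched: skip-gram consumes (center, context) pairs, while CBOW would use the surrounding context words to predict the center word (the `skipgram_pairs` helper below is illustrative):

```python
def skipgram_pairs(tokens, window=1):
    # (center, context) training pairs as consumed by the skip-gram model;
    # CBOW would instead pair the set of context words with the center word.
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if i != j:
                pairs.append((center, tokens[j]))
    return pairs

print(skipgram_pairs(["I", "like", "NLP"]))
# [('I', 'like'), ('like', 'I'), ('like', 'NLP'), ('NLP', 'like')]
```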