Basics of NLP — 2

Abhinav Rai
Dec 24, 2016 · 7 min read

I would recommend reading my previous post here. This post contains

  • Edit Distance
  • Weighted Edit distance (Confusion Matrix)
  • Language Modelling
  • Smoothing

Edit Distance

How similar are 2 strings ? Elphant | Elephnt | Elephant | Elepant etc.

Edit distance can help. Its the Minimum number of editing operations to convert one string to other.

Editing operations are Insertion, Deletion, Substitution with the cost of 1. In Levenshtein distance, substitution costs 2 and other operations cost 1.

Used in Spell Correction, Evaluating Machine translation and other comparing tasks.

Computing this is very easy with Dyanamic programming! Algorithms is as follows:

  1. If last characters of two strings are same, nothing much to do. Ignore last characters and get count for remaining strings. So we recur for lengths m-1 and n-1.
  2. Else (If last characters are not same), we consider all operations on ‘str1’, consider all three operations on last character of first string, recursively compute minimum cost for all three operations and take minimum of three values.
  • Insert: Recur for m and n-1
  • Remove: Recur for m-1 and n
  • Replace: Recur for m-1 and n-1. (With cost 2 if distance is Levenshtein)

For complete code see this link. We will get the edit distance table and we can backtrace from it and can get which symbol in string A corresponds to a symbol in string B.

Time: O(nm) ; Space : O(nm) ; Backtrace : O(n+m);

Weighted Edit Distance

Edit Distance can also be weighted. Some letters are more likely to be mistyped than others. Confusion matrix gives us probabilities of which symbols are more likely to be confused to which.

This is based on many factors such as their similarity of pronunciation, their position on keyboard, etc. We are performing operations on string X.

D(i,j) = min {D(i-1,j) + delete[X[i]] , D(i,j-1) + insert[Y[j]], D(i-1,j-1) + subs [X[i], Y[j] ]}

Also in Initialisation:

D(i,0)=D(i-1,0) + delete[X[i]];
D(0,j) = D(0,j-1) + insert[Y[j]];

Language Modelling

Goal of Language Modelling is to compute a probability of a sentence. This is widely used in Machine Translation (as high winds tonight > large winds tonight) , spell correction (elephant is large animal> elepant is large animal) , speech recognition (eyes awe of an < I saw a van), summarisation etc.

P(Wi | Wi-1Wi-2 …. W1) or P (W) — Compute anyone. We can get one from other.

P(W) = P(W1)*P(W2|W1)*P(W3|W1W2)* ….. * P(Wi | Wi-1Wi-2 …. W1)

Its very difficult to compute P(W5|W1W2W3W4) as we don’t have these many sentences in corpus to calculate the probability.

Markov’s Assumption: Instead of complete prefix for the word, just take few previous words.

P(Wi | W1W2 … Wi-1) is nearly equal to P(Wi | Wi-k … Wi-1) .

This is K-gram model. Other simpler models are Unigram, Bigram, Trigram models. More practically Trigram and 4-gram models solve the simple sentences for us but its very difficult when a sentence has large — distance dependencies. For ex: The mobile which I recently purchased for Rs 39,000 on my birthday is broken. Here mobile has long dependency with broken! But practically we often get away with Trigram and 4-gram models.

P(Wi | Wi-1) = count (Wi-1Wi) / Count (Wi-1)

The above is the simple formula for calculating Bigram probabilities.

To calculate the probability of the sentence — Multiply all the probabilities. Practically its better to use log to avoid underflow.

For Ex. (Bigram Model): W = I want to eat chinese food.


Here ‘s’ is starting symbol and ‘/s’ is ending symbol.

Practical implementation of Bigram model can be found here. There are many publicly available toolkit such as SRILM , Google N-gram corpus, Google books N-gram corpus.

Lets evaluate Language models now. Done by 2 ways —

  • Extrinsic Evaluation — Put models to task and run the evaluation. Whichever model has higher accuracy is better! But its sometimes time consuming.
  • Intrinsic Evaluation: Mostly when Training data is similar to test data. This intrinsic evaluation is called perplexity. Perplexity is probability of test set normalised by number of words. (Here we do reciprocal and take Nth root as in formula below).
Perplexity = P(W1W2W3 ….. Wn) ^ (-1/n).

Greater the probability of sentence => Lesser the perplexity!

Perplexity is the average branching factor. Intuitionally: How hard is it to recognize a digit from [0,1,2,3,4,5,6,7,8,9]. Its basically how many things could occur next! So answer for this is 10.

Training data must be related to test data. It would be very bad if we train our model on Shakespeare data and test it on wall street journal in Machine translation. Another thing to note here is what happens if something appears in test set which is not in training set?

So we need to Generalize. In the above case, the probability of that k-gram will be zero as it does not appear and thus perplexity will be Infinite! Thats bad :( .

Smoothing to the rescue!


The idea is to steal the probability mass and save it for the things we might see later.

Simplest way is Add one smoothing / Laplace smoothing. We pretend that we say each word one more time than we actually did. This is add-one estimator.

Probability lap= count(Wi-1Wi) + 1 / count (Wi-1) + V

More generally we can have Add-K smoothing. Add K to numerator and KV to the denominator. Put mv=K.

Add-K smoothing = count(Wi-1Wi) + m(1/V)/ count (Wi-1) + m

In unigram prior smoothing, we replace (1/V) with P(Wi). This includes Interpolation as we include Unigram result in Bigram calculation. (Read Interpolation below). This works well but still cannot we used for Language Modelling.

Add one smoothing makes massive changes in actual data. Its used in text-classification where the number of zeros are not large.

Backoff : Sometimes we don’t have enough data to trust our k-model , so we move to lower model Language model.

Interpolation: Mix Unigram, Bigram and Trigram.

Linear Interpolation: It is of 2 types.

  1. Simple Interpolation: L1 P(Wi) + L2 P(Wi|Wi-1) + L3 P(Wi|Wi-2Wi-1); L1+L2+L3 = 1;
  2. Lambdas conditional Interpolation: Lambdas depend on context.

The value of these lambdas can be calculated if we have a dev set/training data to set these lambdas to maximize the likelihood of these models.

Open vs Closed Vocabulary tasks: If we know all the words in advance then its a close Vocabulary. But many times we don’t know all words. In our test set there might occur OOV (Out of Vocabulary ) words. One way to deal with them is to create a special token called <UNK>. Any word which has low probability while training is changed to <UNK> and then the model is treated as if the word <UNK> is there. At decoding time, we change the unseen word to <UNK> and then compute its probability w.r.t the Language Model.

Smoothing for Large N-Grams (Web-scale-N-Grams): We use Stupid Backoff technique. P(Wi|Wi-k … Wi-1) = { simple probability formula } but if count(Wi-k …. Wi) is less than some threshold then we use k* P(Wi|Wi-k+1 …. Wi-1). Where k is some fraction and this can be best calculated by our dev set data. Stupid backoff produces scores rather than probabilities. And this works quite well for large scale N-Grams.

Add-K does not work good for Language modelling. Stupid backoff works well for Large N-Grams.

Lets start with Good Turing smoothing.

Good Turing Smoothing

Nc = Frequency of frequency c.

Example : Abhinav I am Abhinav am I this

this = 1 ; I = 2 ; am = 2 ; Abhinav 2

N1 = 1 ; N2 = 3

Read the intuition of Good Turing smoothing here. Done by leave one out validation and by taking a held set. Very well explained here (from 11th Minute).

P (new things occurring ) = N1 / N ;

Generalizing: c* = (c+1)*Nc+1 / Nc

And P = c* / N.

Therefore in case of new things, c* = 0 as we have not observed it earlier. After first few N’s namely N0,N1 … N10, we replace the N’s by a smooth function as there will be gaps. Say N127 can be zero. Therefore we replace with the best fit paralog function.

Kneser Ney Smoothing

First do absolute discounting (as a result of good turing smoothing). But we see the discounting factor is nearly equal to 0.75. So we do absolute discounting and then do interpolation — Discounting + ( L(Wi-1)*P(w) ).

Instead of P(w) (Unigram probability){How likely is w}, Pcontinuation(w){How likely is w to appear as a novel continuation} is a better estimate.

P continuation (w) = Words which precede w / Total word bigrams.

A frequent word like Fransisco will have low continuation probability as it only appears after San.

P kn (Wi | Wi-1) = (count (Wi-1Wi) — d)/count (Wi-1) + L(Wi-1) Pcont(Wi)

Kneser Ney Smoothing for N-Grams is as below:

Pkn( wi | wi-n+1 … wi-1) = [ max( countkn( wi-n+1… wi ) — d, 0) ] / [ countkn( wi-n+1… wi-1 ) ] + Θ( wi-n+1 …. wi-1 ) x Pkn( wi | wi-n+2….. wi-1 )


ckn(•) =the actual count(•) for the highest order n-gram

  • the actual count(•) for the highest order n-gram


  • continuation_count(•) for lower order n-gram

=> continuation_count = Number of unique single word contexts for •

Read Next blog on Spell correction here.

Abhinav Rai

Written by

Product Engineer at Go-Jek | Guitarist | Traveller | Entrepreneur

Welcome to a place where words matter. On Medium, smart voices and original ideas take center stage - with no ads in sight. Watch
Follow all the topics you care about, and we’ll deliver the best stories for you to your homepage and inbox. Explore
Get unlimited access to the best stories on Medium — and support writers while you’re at it. Just $5/month. Upgrade