Designing a Basic Spell Corrector

We see and use spell corrector everyday whenever we are using google or any mobile device. I always wondered how they correct my wrong spelling and recommend me the correct word.

It starts with checking of non-words and real-words.
Non-words do not appear in dictionary. They are the result of typo. The task of correcting such error involves error detection and substituting correct word. 
On other hand real-word error are hard to detected as they appear in dictionary and we have to check all the real-words for their correct word.

Correcting non-word errors

Step 1:
Find out the non-word in a sentence by checking each word whether it exists in the dictionary or not.
Suppose we have found that word “tha” as error word.

Step 2:
Since we can guess the correct word will be “the” or the second best word will be “that”. 
But what if the word is “acror”. Here we can guess the words like “actor”, “across” etc.
In general we need a set of real words that are similar to error.

Question: How to choose the words similar to error?

Since we know that error is not in dictionary.The basic approach is to use Minimum Edit Distance Algorithm.

Minimum Edit Distance Algorithm:

Input : Error word and dictionary
Output: a list of words with minimum edit distance.
What are edits?
There are 4 basic edit operation that can be considered.
Deletion, Insertion, Substitution and Transposition of two adjacent alphabets.
Note: Edit Distance is the number of the above operation on error word to get a real word.
In general the edit distance is the number of operations (insertions, deletions or substitutions) needed to transform one string in another.

Operations in details.

delete operation:
deleting a letter from error to get real word.
example:
“controal” — -> after deleting “a” becomes “control”
Here Edit Distance = 1

Insert operation:
We had to insert a letter in error to get a real word
example;
“contrl” → after inserting “o” we get “control”
Here Edit Distance = 1

Substitution operation:
It is obvious we have to substitute a letter to get a real word.
example:
“cuntrol” → after substitution of “o” becomes “control”
Here Edit Distance = 1

Transposition operation:
here we have to swap two consecutive letters.
example;
“contorl” → after trans-positioning “ro” we get “control”
Here Edit Distance = 1

NOTE: If any of the above operation occur once then Edit Distance = 1 and if twice on same error then Edit Distance = 2.
We can only use words with Edit Distance = 1 for sake of simplicity as around 70–80% of words have one error.
For more accuracy we can add words with Edit Distance 2 in our model and that is sufficient for the problem.

Step 2 continued:
Find words that are at Edit Distance 1 and 2 from error.
Now we have a set of real words that are similar to error.

Step 3:
Now our goal is to find the most appropriate word for the error. For the purpose we will use Maximum Likelihood Estimation.
Lets assume x is the typo and w be a word from list of similar word(W) and for each w we have to compute the following.

w_estimate = P(w|x) (using Bayes Rule)
= P(x|w) P(w)/P(x)

(we can remove P(x) as it is constant for all w)
= P(x|w) P(w)

where P(w) is the prior or the probability of w
and P(x|w) is the likelihood of x given w

Now the task is to compute the prior and the likelihood.

Prior:
Form any given corpus we can directly compute the probability of words(P(w))

Likelihood:

It is a bit complicated task to compute P(x|w). Since we have to measure that how likely the non-word occur given a real-word. In other word we can compute the probability of the operation. Example- In case of substitution of letter “m” in place of “n”, we can compute that how often they are substituted and similarly for other operations.

Result:

Since w_estimate = prior * likelihood
Now we can choose the real-word having the best estimate and our Spell Corrector is ready.

Please feel free to comment and ask questions. Stay tunes for more NLP stuff.