The Anatomy of Autocorrect
Have you ever wondered how your phone corrects your texts on the fly? Or how Google knows you misspelled a word before you do? This is the topic of automated spelling correction, or as it’s more widely known, autocorrect.
Spelling correction has improved since its inception decades ago, for two reasons. One, we’re collecting more data. Spelling correction employs machine learning, and the more data you have for your machine to learn from, the better it can perform its task. Two, there are now teams of dozens of researchers dedicated to solving natural language processing problems like this one. Interestingly enough, the underlying fundamentals of how a spelling corrector works have not changed much in the last 25 years. Sure, maybe the learning models are better, and maybe there are more relevant error factors, but what I’m about to show you, the basis, has stood the test of time.
Take a minute to think about what you’ll need to do to make a spelling corrector…
If we are correcting words, we need to have some sort of definition of ‘correctness’. This means that given a word we need to be able to figure out if it’s valid English. To do this we’ll need a corpus — a body of text so large that it represents the English language. We’ll also need some way of deciding which of the candidate words is our replacement. This involves some way of deducing which candidate words are more likely to occur in the English language.
For example, say your machine receives the error word ‘thej’ and considers two candidates as appropriate corrections, ‘they’ and ‘thew’. Both ‘they’ and ‘thew’ are pretty similar to ‘thej’, but think about which one you think is the right correction and why. ‘they’ is the right correction because ‘they’ is more likely to be typed than ‘thew’ — we rarely see the word ‘thew’.
Next, notice how similar ‘they’ and ‘thew’ are to ‘thej’. It makes sense that corrections are going to be similar to their error words, since errors are usually off by only one or two typos. This means we need some way of generating valid words that are similar to our error word.
Let’s recap what our spelling corrector needs.
- Corpus — a data source for the English language that tells our machine which words are valid English and how often each word is used.
- Language model — the ability to come up with a probability of a word occurring in the English language.
- Error model — the ability to compute the similarity between words and how likely a correct word is to become a given error word.
We need a way of quantifying similarity between words, and Damerau-Levenshtein edit distance is an algorithm that does this for us. It’s defined as the minimum number of operations we need to make to transform one string into another, where an operation can be an insertion of a character, a deletion of a character, a transposition (swap of adjacent characters), or a substitution of a character with a new one. Pretty simple.
Let’s say we’re generating all the words within 1 edit distance of a word with length n. There would be 26(n+1) possible insertions, n deletions, n-1 transpositions, and 26n substitutions, which makes for 54n+25 generated words. As you can imagine, this will generate a large number of words. For our purposes, we’ll be generating words at up to 2 edit distances, which makes for 2916n+1375 words (54(54n+25)+25). Notice that it still scales linearly despite the large coefficients.
We can actually write a function to generate all the words within 1 edit distance of a given word in under 15 lines of Python. Don’t focus too much on the code; if you have a good idea of what’s really going on, then that’s perfect.
def edits1(word):
    char_set = 'abcdefghijklmnopqrstuvwxyz'
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    # '@' marks the start of the word so every split has a preceding character
    padded = [('@' + a, b) for a, b in splits]
    inserts = [('ins', a[-1] + c, a[1:] + c + b)
               for a, b in padded for c in char_set]
    deletes = [('del', a[-1] + b[0], a[1:] + b[1:])
               for a, b in padded if b]
    substitutions = [('sub', b[0] + c, a + c + b[1:])
                     for a, b in splits if b for c in char_set]
    transposes = [('trans', b[0] + b[1], a + b[1] + b[0] + b[2:])
                  for a, b in splits if len(b) > 1]
    return set(deletes + transposes + substitutions + inserts)
Note that we aren’t just storing the generated words, but also the type of edit operation and the characters that were operated on in a tuple (this info will be useful later). In addition, ‘@’ represents the empty string at the beginning of a word. For example, with insertions: if we insert ‘t’ after ‘x’, we denote that as ‘xt’, so if we insert ‘s’ at the beginning of a word, that is denoted as ‘@s’.
Naturally we can recursively generate all the words at x edit distance.
def edits(word, dist):
    if dist == 1:
        return edits1(word)
    return set(e for operation, confusion_params, generated_word in
               edits(word, dist - 1) for e in edits1(generated_word))
We’ll start by forming a rudimentary, but seemingly powerful spelling corrector. Here’s our algorithm.
- Check if the error word is valid English; if so, return it, otherwise proceed.
- Find the valid word within 1 edit distance of the error word that occurs most often in the corpus and return it; if none can be found, proceed.
- Find the valid word within 2 edit distance of the error word that occurs most often in the corpus and return it; if none can be found, proceed.
- The spelling corrector has failed, return the error word.
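The four steps above can be sketched in a few lines. This is a self-contained toy, not the article’s actual code: the `counts` dictionary stands in for real corpus frequencies, and this `edits1` generates bare words without the edit-operation tuples described earlier.

```python
def edits1(word):
    # all strings within one Damerau-Levenshtein edit of `word`
    chars = 'abcdefghijklmnopqrstuvwxyz'
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    subs = [a + c + b[1:] for a, b in splits if b for c in chars]
    inserts = [a + c + b for a, b in splits for c in chars]
    return set(deletes + transposes + subs + inserts)

def correct(word, counts):
    # step 1: a known word is assumed correct
    if word in counts:
        return word
    # step 2: most frequent valid word within 1 edit
    c1 = [w for w in edits1(word) if w in counts]
    if c1:
        return max(c1, key=counts.get)
    # step 3: most frequent valid word within 2 edits
    c2 = {w for e in edits1(word) for w in edits1(e) if w in counts}
    if c2:
        return max(c2, key=counts.get)
    # step 4: give up and return the error word
    return word

counts = {'they': 1000, 'thew': 10}  # toy corpus frequencies
print(correct('thej', counts))  # 'they'
```

With the toy counts, ‘thej’ is one edit from both candidates, so frequency breaks the tie in favor of ‘they’, exactly as in the earlier example.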
We are making a couple of assumptions. First, we assume that if a word is in our corpus then it’s not an error. Next, we’re assuming that edit distance is the only factor affecting the error model. Finally, we assume that errors will only occur within 1 or 2 edit distance. This is not a bad assumption, as approximately 75% of errors are within 1 edit distance and nearly all of them are within 2 edit distance (based on training data of errors you’ll see later).
I created a web app that takes text and uses this algorithm to correct it. Interestingly, when I gave it to my friends, they perceived it to be quite powerful without seeing how simple the algorithm was. This first attempt is actually good enough to be considered a spelling corrector by humans.
Some rough code for this can be found here.
This is where things start to get fancy with basic probability. Let V be the set of all words in our corpus and x the given error word. Our corrected word is the word in our corpus that maximizes the probability of the candidate word given that we observed the error word:

w* = argmax over w in V of P(w|x)

Then apply Bayes’ Theorem:

P(w|x) = P(x|w) * P(w) / P(x)

The probability of the error word, P(x), is constant across candidate words w, so we can drop it and maximize P(x|w) * P(w).

We call P(x|w) the error model and P(w) the language model.
Let’s start with the language model — the probability P(w) of our candidate word. We’re going to use something called a unigram for our language model. A unigram is a model whereby the probability of a given word occurring is completely independent of all words before and after it; the probability is simply the count of all instances of the word in the corpus divided by the count of all words in the corpus. For example, if ‘they’ occurred 1000 times in our corpus and ‘thew’ occurred 10 times, then P(‘they’) would be 100 times greater than P(‘thew’).
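A unigram model can be sketched in a few lines. The one-sentence corpus here is a stand-in; a real corrector would train on millions of words.

```python
from collections import Counter
import re

def unigram_model(corpus_text):
    # P(w) = count(w) / total word count in the corpus
    words = re.findall(r"[a-z]+", corpus_text.lower())
    counts = Counter(words)
    total = sum(counts.values())
    return lambda w: counts[w] / total

# toy stand-in corpus (10 words: 'they' appears 3 times, 'thew' once)
P = unigram_model('they said they would bring tea but they brought thew')
print(P('they') > P('thew'))  # True
```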
Next is the error model — the probability P(x|w) that our candidate correct word became erroneous to produce our given error word. We compute this based on something called ‘the noisy channel model of spelling’. The idea is that we start with a correct word that undergoes some noisiness and becomes the error word. A correct word becomes noisy through character operations — the same operations that define Damerau-Levenshtein edit distance. This means that we can define P(x|w) as the probability of editing w such that it becomes x. As was mentioned earlier, some edits are more likely to occur than others. For example, we’re more likely to accidentally insert an extra ‘d’ after an ‘s’ than an extra ‘f’ after an ‘s’, because ‘d’ is closer to ‘s’ on the keyboard. This means we need a record of errors so that we can quantify how much more likely these different edits are compared to each other. We can express these errors in something called a confusion matrix, which in this case is a matrix where each cell gives us a count of how often a certain accidental edit occurred. For example, here is a confusion matrix for substitutions.
Now let’s say we consider all of the possible substitutions for our candidate word w, and in some of those possibilities we are substituting the ‘o’ in w for another letter. In one case, we are substituting the ‘o’ with ‘a’. According to our confusion matrix, ‘o’ was substituted with ‘a’ 76 times. Next, look at how many times any character incorrectly substituted ‘o’. This is the sum of the counts in the Y=‘o’ column. We can approximate that the probability of ‘o’ being substituted for ‘a’, given that the ‘o’ in w undergoes a substitution, is equal to the number of times ‘a’ substituted ‘o’ divided by the number of times any character substituted ‘o’. Let’s generalize.
The same goes for the other edit operations. We take the number of times the specific operation occurred and divide it by the number of the times the correct letter(s) could have gone through the operation with any of the characters in the character set.
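The substitution case can be sketched as follows. Only the o→a count of 76 comes from the example above; the other cells of `sub_counts` are made-up illustrative numbers, and a real matrix would be estimated from a training corpus of observed errors.

```python
def p_sub(typed, correct, sub_counts):
    # P(the `correct` letter was typed as `typed`, given it underwent
    # a substitution) ≈ count(correct -> typed) / count(correct -> anything)
    total = sum(n for (t, c), n in sub_counts.items() if c == correct)
    return sub_counts.get((typed, correct), 0) / total if total else 0.0

# tiny slice of a substitution confusion matrix, keyed (typed, correct);
# only the ('a', 'o') count of 76 is taken from the example above
sub_counts = {('a', 'o'): 76, ('e', 'o'): 93, ('i', 'o'): 29, ('u', 'o'): 2}
print(p_sub('a', 'o', sub_counts))  # 76 / 200 = 0.38
```

The same shape of ratio — specific edit count over total edit count for those characters — applies to insertions, deletions, and transpositions.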
So there we go. For each word, w, in the corpus, choose the w that maximizes P(x|w)*P(w).
We have already built a powerful spelling corrector based on both a language model and an error model, but we must ask ourselves if one is dominating the other. What if the way we calculate our language model produces innately greater values than our error model does, or vice versa? This is where we can add an α parameter: we exponentiate our language model by α, so that we are now finding the w that maximizes P(x|w)*P(w)^α. We can test various values of α and select the one that produces the maximum spelling correction accuracy.
Next, what if the suitable correction to our error word is at 2 edit distance, and the way we multiply the first edit probability by the second in our error model makes it so that we pretty much never select corrections at more than 1 edit distance? We can raise the second edit probability to β and test that to choose a β like we did for α.
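Combining the two tuning parameters, the scoring function might look like this sketch, where `edit_probs` holds the per-edit probabilities along the edit chain and the β exponent softens the penalty on the second edit, as described above.

```python
def score(p_w, edit_probs, alpha=1.0, beta=1.0):
    # noisy channel score: P(x|w) * P(w)^alpha, with the second edit's
    # probability raised to beta so distance-2 candidates aren't drowned out
    p_error = edit_probs[0]
    if len(edit_probs) > 1:
        p_error *= edit_probs[1] ** beta
    return p_error * (p_w ** alpha)

# with beta < 1, a distance-2 candidate scores higher than with beta = 1
print(score(0.001, [0.1, 0.01], beta=0.5) > score(0.001, [0.1, 0.01], beta=1.0))
```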
The rough code for this is here.
Let’s look at how our spelling correctors performed.
The development test is the test where I had access to the test data and so I was able to tune the parameters based on it, while the regular test is the test where I had no access to the test data and only ran it once at the end. Interestingly, there was a jump from attempt 1 to 2 and not so much from 2 to 3. This leads me to believe that adding the noisy channel model made a positive impact, whereas tuning the parameters didn’t as much. I had hoped to reach 85% accuracy by Attempt 3, but unfortunately we fell short by 10%. I predict that I will be able to reach this once I test the spelling corrector with the bigram (see below). In addition, 75% accuracy isn’t the worst with spelling correctors. Often when it comes to HCI, instead of automatically correcting the word we can take the top 3 most likely candidates and suggest those. A simple estimate with a 75% accuracy for one suggestion provides a 98.4% accuracy for 3 suggestions (100*(1-0.25³)). In addition, our algorithm became much slower after adding the noisy channel model, which makes sense since it’s constantly computing edit probabilities for all known generated words. Overall I’m happy with how it turned out.
Next we could give our spelling corrector a very basic level of context. This will require us to alter our language model from the last attempt. Before our language model was simply a probability for our candidate word occurring in our corpus, which means that our machine was only looking at the candidate word and none of the words before it to help in the decision. Let’s say we have 2 candidate corrections for the error word ‘teh’, ‘the’ and ‘tea’. With our previous attempt we would choose ‘the’ as our correction — ‘the’ is much more common in our corpus than ‘tea’, and ‘the’ is more likely to be misspelled as ‘teh’ than ‘tea’ is. However, let’s say we found our error after the already correct word ‘some’. The phrase ‘some tea’ makes much more sense than the phrase ‘some the’. Therefore, if instead of just looking at how common the candidate word is, but rather at how common the sequence of the previous word plus the candidate word is, we’ll get some rudimentary context. This language model where we look at the previous word and the candidate word is called a bigram. In fact, you can look at however many n previous words, meaning you can have trigrams, 4-grams, 5-grams, etc. These language model structures are called n-grams, and they’re based on the mathematics of Markov chains, which are basically processes of choice whereby all choices are assigned a probability based on the current state of the system.
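An unsmoothed bigram language model is a small extension of the unigram: count pairs of adjacent words instead of single words. The one-sentence corpus below is a toy stand-in.

```python
from collections import Counter
import re

def bigram_model(corpus_text):
    # P(w | prev) = count(prev w) / count(prev)   (unsmoothed sketch)
    words = re.findall(r"[a-z]+", corpus_text.lower())
    pairs = Counter(zip(words, words[1:]))
    unigrams = Counter(words)
    def p(w, prev):
        return pairs[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0
    return p

p = bigram_model('he drank some tea then poured some tea near the cat and the dog')
print(p('tea', 'some'), p('the', 'some'))  # 'some tea' is seen, 'some the' never
```

In practice a real corrector would smooth these counts (e.g. interpolating with the unigram) so unseen pairs don’t get probability zero.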
We could also extend it to a real word spelling correction system, as our current system doesn’t deal with real words well. If we get a sentence where all the words are valid there could still be an error. What we do is for each word in the sentence we pretend it’s an error word, generate all the candidates to replace it, and calculate P(x|w)*P(w)^α like previously. Then we can construct a web of possibilities for the sentence like so.
In this example, the chain of two →of →the had the greatest combined probability of all combinations, and so even though ‘thew’ is a word, the system corrects ‘two of thew’ to ‘two of the’. We also need some way of defining the probability of a word given itself. We might just assign this as a constant, like 0.95, which assumes that roughly one in every 20 typed words is a valid word that is nonetheless an error. That way, if the correction chain two →of →the makes enough sense (has a high enough probability), it will beat even two →of →thew.
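The chain search can be sketched with a brute-force product over per-position candidates. Everything here is a toy: the bigram table values are illustrative, and a real system would use dynamic programming (Viterbi) rather than enumerating every chain.

```python
import itertools

KEEP = 0.95  # assumed probability that a typed word is already correct

def best_chain(candidates, typed, p_next):
    # candidates: one candidate list per position (each typed word included);
    # p_next(w, prev): bigram probability, with '<s>' marking sentence start
    best, best_score = None, -1.0
    for chain in itertools.product(*candidates):
        score, prev = 1.0, '<s>'
        for w, orig in zip(chain, typed):
            channel = KEEP if w == orig else (1 - KEEP)
            score *= channel * p_next(w, prev)
            prev = w
        if score > best_score:
            best, best_score = chain, score
    return best

# illustrative bigram table, not from a real corpus
bigrams = {('<s>', 'two'): 0.5, ('two', 'of'): 0.5,
           ('of', 'the'): 0.4, ('of', 'thew'): 0.001}
p_next = lambda w, prev: bigrams.get((prev, w), 1e-6)
print(best_chain([['two'], ['of'], ['thew', 'the']],
                 typed=['two', 'of', 'thew'],
                 p_next=p_next))  # ('two', 'of', 'the')
```

Even though ‘thew’ keeps its 0.95 channel probability, the bigram ‘of the’ is so much more likely than ‘of thew’ that the corrected chain wins.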
In the future we can also combine dozens of different features, like phonetic edits or how verbs should be placed in sentences, and then use a machine learning classifier to take these model probabilities as input and assign probabilities to the candidates. Specifically, RNNs (a type of ANN with short-term memory via hidden-layer looping) have been highly effective in natural language processing tasks because they can ‘remember’ past data.
Spelling correction is a great problem to look at for several reasons. One, it’s damn useful — especially with how few people proofread writing these days (I most likely have a few typos in here). Two, working on the problem extensively causes one to end up learning a lot from various areas (language modeling, part of speech tagging, classification, big data storage, etc). I still enjoy working on this project, and I intend to consistently add to it every now and then. I hope you gained some value from this post, I’m going to try to write more of these.
All the resources for this project are available here.
A Spelling Correction Program Based on a Noisy Channel Model, Kernighan, Church, Gale, AT&T Bell Laboratories 1990.
CS124: From Languages to Information, Dan Jurafsky, Stanford 2016.