How to write your own spellchecker and autocorrect algorithm in under 80 lines of code

Spellcheckers and autocorrect can feel like magic. Even advanced engineers confess to not having a fully intuitive grasp of what goes on underneath the hood. Yet they’re at the core of everyday applications — our phones, office software, google search (you type in ‘incorect’ and google returns ‘Did you mean incorrect).

So how does this magic happen? And how can we build our own?

First let’s consider what a spellchecker does. If it’s useful, it takes in a misspelled word — for example ‘teh’ — and returns the best guess at the correct spelling — ‘the’

There are two key components to this

First, a body of language as it’s commonly used — literature but also contemporary text with common current words (like iPhone). From this we can produce a word count

{the: 85039, it: 68904, a: 68598 ... }

Second, we can usefully conceptualize every error as one of four types. So if the user types ‘teh’ they could have meant to:

  • add a letter: ‘teh’ -> ‘ateh, bteh, cteh, dteh,….,tehy, tehz’
  • remove a letter: ‘teh’ -> ‘eh, te, th’
  • substitute a letter: ‘teh’ -> ‘aeh, beh, ceh,…,tey, tez’
  • switch two adjacent letters: ‘teh’ -> ‘hte, the’

All the above words are said to be of error distance 1. A word of error distance 2 from ‘teh’ would be ‘too’ — ‘teh’ -> ‘toh’ -> ‘too’

There are then three steps to our particular implementation of a spellchecker. First, check whether the word is in our dictionary (our wordcount dictionary). If it is not, generate all the possible words that are of error distance 1 and 2 *and* that are also real words in our dictionary.

Finally we must consider the word count of the possible words to decide which word is the most likely correction. Here things get especially interesting.

Suppose we have the word ‘thene’ that is error distance 1 from ‘then’ but error distance 2 from ‘the’. It would seem that even though the error distance 2 word (‘the’) is far more common, the correction is more likely to be ‘then’.

So we must decide on a ratio at which we should accept an error distance 2 word (I’ve found somewhere between 10 and 100x works well). It’s also useful into account word length — shorter words are less likely to be an error distance of 2 from the intended word.

You can find the full code (in javascript) here

Will Sentance is CTO at Codesmith. He teaches core software engineering, computer science, backend engineering with Node, and prepares students for hiring.

One clap, two clap, three clap, forty?

By clapping more or less, you can signal to us which stories really stand out.