A closer look into the spell correction problem — Part 1

searchhub.io Dev Blog
4 min read · Aug 11, 2017


At searchhub.io our mission is to help search engines understand humans. You can think of search|hub as a platform-independent search query intelligence API that supercharges your existing search engine.

One important aspect of helping search engines understand humans is handling error-prone human input, a.k.a. spell correction. Let’s imagine an eCommerce context where users sometimes type the query jewlery when they’re looking for jewelry; sometimes they just accidentally hit an extra key and type dresss instead of dress. To make their online shopping experience successful, we need to identify and fix these mistakes and display search results for the canonical spelling of the actual intended query.

How big is the spell correction problem?

Let’s take search as an example and have a closer look at the distribution of the different spelling errors we see on our API, to get an idea of the size of the problem.

frequency of spelling errors by error type

As you can see from the graph, spelling errors are quite frequent and some of them have a significant influence on the value per search.

Does context matter?

But maybe an aggregated view on spelling errors is not enough. The likelihood of a spelling error and its type could also be related to context. One obvious example for a context variable would be the device type.

frequency of different spelling errors by device

Again we can directly see some differences between the distributions for desktop and mobile, especially when it comes to replaces. This makes perfect sense: the keyboard on a mobile device is so small that the probability of hitting the wrong key is significantly higher than on larger keyboards.

What about stemming, lemmatization and term decomposition?

But wait — maybe that’s still not detailed enough. As you can see from the graphs, we also have to take care of term decomposition, stemming and lemmatization. Most of these areas can be handled by dictionary approaches, but all of them fail miserably when they have to deal with a combination of a spelling error and one of these areas (for example “womensaparel” to “women’s apparel”).
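To make the decompounding failure concrete, here is a minimal sketch of a naive dictionary-based splitter (the dictionary and class name are purely illustrative, not our actual implementation). It happily splits the correctly spelled compound, but returns nothing as soon as a spelling error sneaks into one of the parts:

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;

public class NaiveDecompounder {

    // Tiny illustrative dictionary; a real one would hold the full vocabulary.
    private static final Set<String> DICTIONARY =
            Set.of("women", "womens", "apparel", "dress", "jewelry");

    /** Tries to split the input into two known dictionary words. */
    static Optional<List<String>> split(String input) {
        for (int i = 1; i < input.length(); i++) {
            String left = input.substring(0, i);
            String right = input.substring(i);
            if (DICTIONARY.contains(left) && DICTIONARY.contains(right)) {
                return Optional.of(List.of(left, right));
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(split("womensapparel")); // Optional[[womens, apparel]]: works
        System.out.println(split("womensaparel"));  // Optional.empty: fails, "aparel" is misspelled
    }
}
```

A dictionary lookup can only ever confirm what is already spelled correctly; combine it with a single typo and the whole decomposition collapses.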

It becomes pretty clear that any system (search engine, speech recognition, chatbot, etc.) that uses direct, unfiltered human input but fails to invest in best-in-class spell correction technology leaves a significant chunk of money on the table.

Defining the Problem

Now you could argue that the problem of spell correction is already solved, as it is well understood and a lot of work has been done on it over the last couple of decades. But we think it is not, or at least it wasn’t :-)

The problem is directly related to approximate string matching / string distance, which lets you look up a string in a list of strings and return the ones that are “most similar”. There are many different string distance metrics out there, such as Levenshtein, Damerau-Levenshtein, Hamming distance, Jaro-Winkler, Strike a Match and others, but in our humble opinion they are not good enough, mainly because they oversimplify the problem space. Therefore we asked ourselves the following fundamental question:

How can we measure similarity and is it a single dimensional problem or do we have to take multiple dimensions into account?

We think that similarity has to be multi-dimensional because the causes of spelling errors are multi-dimensional as well.

different error-types in spelling errors
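To make those dimensions a bit more tangible, here is a rough taxonomy of the most common error types, using this post’s own examples (the chart above may slice the categories slightly differently; this list is only illustrative):

```java
/**
 * Rough, illustrative taxonomy of spelling-error types; not necessarily the
 * exact categories used in the charts above.
 */
enum EditErrorType {
    INSERTION,      // "dresss"        -> "dress"            (an extra key was hit)
    DELETION,       // "jewlry"        -> "jewelry"          (a character is missing)
    SUBSTITUTION,   // "dresd"         -> "dress"            (wrong, often adjacent, key)
    TRANSPOSITION,  // "jewlery"       -> "jewelry"          (two characters swapped)
    SPLIT_OR_MERGE  // "womensapparel" -> "women's apparel"  (word boundary lost or added)
}
```

Each of these types comes with its own likelihood, and as the device comparison above shows, that likelihood shifts with context.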

Not only do you have to incorporate this multi-dimensionality, you should also use weighted string distance algorithms. It is not enough to use simple unweighted edit-distance algorithms to spell correct human input. There may be thousands of strings that share the same edit distance (button vs. butter) or even a smaller edit distance (mispeld vs. mislead) to the correct string, but are completely different in their meaning. Without edit weights based on real user error probabilities you’ll end up with wrong spell corrections and frustrated users.

when simple edit distance goes wrong
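Here is a minimal sketch of what we mean, assuming plain Levenshtein distance plus a handful of invented substitution costs (the numbers are made up for illustration, not taken from our data). Unweighted, the query “buttor” is equally close to “butter” and “button”; with weights that reflect how often users actually confuse certain characters, the tie disappears:

```java
import java.util.Map;

public class WeightedEditDistance {

    /** Plain (unweighted) Levenshtein distance: every edit costs 1. */
    static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int sub = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + sub);
            }
        }
        return d[a.length()][b.length()];
    }

    // Hypothetical substitution costs: cheaper for confusions that show up often
    // in real user data. The numbers are invented for this example.
    private static final Map<String, Double> SUBSTITUTION_COST = Map.of(
            "oe", 0.3,   // typed 'o' where 'e' was intended
            "rn", 0.9);  // typed 'r' where 'n' was intended

    static double substitutionCost(char typed, char intended) {
        if (typed == intended) return 0.0;
        return SUBSTITUTION_COST.getOrDefault("" + typed + intended, 1.0);
    }

    /** Levenshtein with per-character substitution weights. */
    static double weightedLevenshtein(String typed, String intended) {
        double[][] d = new double[typed.length() + 1][intended.length() + 1];
        for (int i = 0; i <= typed.length(); i++) d[i][0] = i;
        for (int j = 0; j <= intended.length(); j++) d[0][j] = j;
        for (int i = 1; i <= typed.length(); i++) {
            for (int j = 1; j <= intended.length(); j++) {
                double sub = substitutionCost(typed.charAt(i - 1), intended.charAt(j - 1));
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + sub);
            }
        }
        return d[typed.length()][intended.length()];
    }

    public static void main(String[] args) {
        // Unweighted: the query "buttor" is equally close to both candidates.
        System.out.println(levenshtein("buttor", "butter"));         // 1
        System.out.println(levenshtein("buttor", "button"));         // 1
        // Weighted: an o->e slip is far more likely than r->n, so "butter" wins.
        System.out.println(weightedLevenshtein("buttor", "butter")); // 0.3
        System.out.println(weightedLevenshtein("buttor", "button")); // 0.9
    }
}
```

The same idea extends to insertions, deletions and transpositions; the hard part is deriving realistic weights from actual user behaviour rather than guessing them.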

Manual mapping can’t fix misspellings at scale

We often see clients manually mapping misspelled words to the correctly spelled word by adding synonyms. But this means hours of manual effort and intervention. While this might be a fine solution for small-scale problems, it is highly ineffective if you have a large catalog with thousands of products and a very dynamic assortment, in which people search using complex compositions of words and symbols.

In our next blog post we’ll share some insights on how we tackled the spell correction problem at scale in search|hub with unmatched speed and accuracy.

We are hiring

If you’re excited about advancing our search|hub API and strive to enable companies to create meaningful search experiences, join us! We are actively hiring senior Java developers and data scientists to work on next-generation API & search technology.

www.searchhub.io proudly built by www.commerce-experts.com
