A closer look into the spell correction problem — Part 2 — introducing preDict

searchhub.io Dev Blog
Aug 22, 2017 · 5 min read


At searchhub.io we need to spell correct millions of unique queries per minute. This sounds pretty straightforward but is actually a very hard problem to solve. There are several ways to approach this challenge, so we wanted to share the experience we gathered over the last 18 months.

1. The data-driven deep-learning way

Building a spell checker using deep learning sounds like a great idea. After the recent hype around deep learning, we decided to give it a try and see how far we could get. To do so, we built our own character-based model for spell checking. We used our query test data set, which essentially consists of spelling errors and their corresponding corrections. The dataset is split into a training set and a dev (validation) set in a 90/10 proportion. An LSTM sequence-to-sequence model is then trained to predict a corrected query from the query with typos. If you are interested in more details about the model, read Tal Weiss's article on character-based models for spell checking. We used the same underlying approach and tweaked the model for higher accuracy.
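As a minimal illustration of the data preparation described above (file names and format are assumptions, not our actual pipeline), pairs of misspelled queries and their corrections can be shuffled and split 90/10 into training and dev sets:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public final class TrainDevSplit {

    public static void main(String[] args) throws IOException {
        // assumed format: one "misspelled<TAB>correction" pair per line
        List<String> pairs = Files.readAllLines(Paths.get("query-corrections.tsv"));
        Collections.shuffle(pairs, new Random(42)); // reproducible shuffle

        int cut = (int) (pairs.size() * 0.9);       // 90/10 proportion
        List<String> train = pairs.subList(0, cut);
        List<String> dev = pairs.subList(cut, pairs.size());

        Files.write(Paths.get("train.tsv"), train);
        Files.write(Paths.get("dev.tsv"), dev);
    }
}
```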

  • Accuracy results:

We trained the model on a Windows 10 machine with an Intel(R) Core(TM) i7-6700 CPU (2.60GHz) and Java(TM) 1.8.0_121. The best accuracy we were able to achieve was 90.71% (PreDict LSTM) after 8 days of training time. Overall, we were quite impressed with the achieved accuracy, even though we did not reach our goal of >95%.

  • Performance results:

Since query performance is crucial to our solution, we measured the throughput in corrections per millisecond to get an idea of the correction performance. On the same instance type, we were able to achieve up to 0.119 ± 0.057 ops/ms (corrections per millisecond) with PreDict LSTM. This result was very disappointing. We know that we could maybe double or triple the performance with some fine-tuning and by throwing more powerful and expensive hardware at it, but that is not our preferred solution. Therefore we searched for an algorithmic solution.

2. The algorithmic way

As already mentioned in Part 1, we are talking about approximate string search algorithms that allow looking up a string in a list of strings and returning those strings that are close according to a specific string metric. The best-of-breed algorithms all strive for the same goal: short lookup times. This is mainly achieved by reducing the number of lookups and comparisons (between words and/or artificially generated candidates), further reducing the number of full edit distance calculations, and finally reducing the computational complexity of the edit distance calculation itself, all without compromising accuracy.

We compared several algorithms that are considered state-of-the-art in this area (BK-trees, FSTs and SymSpell).

When we found SymSpell, we were really impressed by its internal simplicity, performance and accuracy. We thought it would be a good starting point to work on a solution with an even higher accuracy without losing much of its original performance.
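SymSpell's core trick, which preDict inherits, is to pre-generate delete-only variants of every dictionary term and to match them against delete variants of the query via hash lookups, so the expensive edit distance only has to be computed for colliding candidates. The following is a minimal sketch of that delete generation, not SymSpell's or preDict's actual code:

```java
import java.util.HashSet;
import java.util.Set;

public final class DeleteCandidates {

    /** Generate all variants of a term with up to maxEditDistance characters deleted. */
    static Set<String> deletes(String term, int maxEditDistance) {
        Set<String> result = new HashSet<>();
        collect(term, maxEditDistance, result);
        return result;
    }

    private static void collect(String term, int remainingDeletes, Set<String> result) {
        if (remainingDeletes == 0 || term.length() <= 1) return;
        for (int i = 0; i < term.length(); i++) {
            String shorter = term.substring(0, i) + term.substring(i + 1);
            if (result.add(shorter)) {
                collect(shorter, remainingDeletes - 1, result);
            }
        }
    }

    public static void main(String[] args) {
        // index time: store each delete variant -> "shirt" in a hash map;
        // query time: any overlap between the deletes of the query and the
        // indexed deletes yields a candidate for exact distance verification.
        System.out.println(deletes("shirt", 2));
    }
}
```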

preDict is our open-source, state-of-the-art spell correction library, based on SymSpell, that solves the spell correction problem at scale.

Here we describe in detail what we changed and where we see opportunities to expand its capabilities.

a. Refactoring

Originally we forked an existing Java port that looked like a crude one-to-one translation of the original SymSpell. No blame here: it worked pretty well as a proof of concept. But simply by restructuring it (it still does not fulfill many best-practice metrics), putting it into an object-oriented design (which improved readability) and applying some well-known Java tweaks, we were able to achieve better performance, as the numbers and the benchmark sketch below show.

  • PreDict CE thrpt 82.116 ± 1.149 ops/ms
  • Original SymSpell (Port) thrpt 68.105 ± 0.977 ops/ms
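The thrpt/± figures above are in the format JMH reports. As an illustration only (JMH usage and the `SpellCorrector` interface are assumptions, not our published benchmark code), a throughput benchmark in corrections per millisecond could look like this:

```java
import java.util.Collections;
import java.util.List;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.MILLISECONDS) // reports ops/ms, i.e. corrections per millisecond
public class CorrectionThroughputBenchmark {

    /** Hypothetical interface; plug in the actual library under test. */
    public interface SpellCorrector {
        List<String> correct(String query);
    }

    private SpellCorrector corrector;
    private String[] queries;
    private int next;

    @Setup
    public void setup() {
        // dummy stand-in so the sketch runs; replace with the indexed corrector under test
        corrector = query -> Collections.singletonList(query);
        queries = new String[] { "shrit", "tshirt", "jaket" }; // sample misspelled queries
    }

    @Benchmark
    public List<String> correctOneQuery() {
        // cycle through the query set so the JIT cannot constant-fold a single input
        String query = queries[next++ % queries.length];
        return corrector.correct(query);
    }
}
```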

b. Weighted Damerau-Levenshtein

We changed the Damerau-Levenshtein implementation to support weights for the single edit operations (insert, delete, replace, transpose). With these parameters we gained the ability to tune the library to specific test data with automated tests. This should be done carefully, however, since test data rarely represent real use cases, and heavy tuning is useless if the library has to handle unknown or broad data. This is why we finally settled on values near 1 (±0.2).
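For illustration, here is a minimal sketch of a weighted Damerau-Levenshtein distance (the restricted, optimal-string-alignment variant); it is not preDict's actual implementation, and the weights in the example are made up:

```java
// Weighted Damerau-Levenshtein sketch: each edit operation carries its own cost.
public final class WeightedDamerauLevenshtein {

    private final double insertWeight;
    private final double deleteWeight;
    private final double replaceWeight;
    private final double transposeWeight;

    public WeightedDamerauLevenshtein(double insert, double delete,
                                      double replace, double transpose) {
        this.insertWeight = insert;
        this.deleteWeight = delete;
        this.replaceWeight = replace;
        this.transposeWeight = transpose;
    }

    public double distance(String a, String b) {
        double[][] d = new double[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i * deleteWeight;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j * insertWeight;

        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                double substCost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : replaceWeight;
                double cost = Math.min(
                        d[i - 1][j - 1] + substCost,            // replace or match
                        Math.min(d[i - 1][j] + deleteWeight,    // delete from a
                                 d[i][j - 1] + insertWeight));  // insert into a
                // transposition of two adjacent characters
                if (i > 1 && j > 1
                        && a.charAt(i - 1) == b.charAt(j - 2)
                        && a.charAt(i - 2) == b.charAt(j - 1)) {
                    cost = Math.min(cost, d[i - 2][j - 2] + transposeWeight);
                }
                d[i][j] = cost;
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        // weights near 1 (±0.2), as suggested in the text; values are illustrative
        WeightedDamerauLevenshtein wdl =
                new WeightedDamerauLevenshtein(1.0, 1.0, 1.1, 0.9);
        System.out.println(wdl.distance("shirt", "shrit")); // one transposition -> 0.9
    }
}
```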

c. Keyboard Distance

Today, fuzzy errors most likely happen on mobile devices, where keyboards are small and people type on the go. There it happens quite often that a finger hits a nearby letter instead of the intended one. This is why we calculated a keyboard distance matrix and built it into our code. It is used in addition to the weighted Damerau-Levenshtein implementation when a replacement weight is requested: if the replacement is a nearby letter, it gets an even lower penalty.
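A minimal sketch of the idea (not preDict's actual code; the neighbor map is truncated and the 0.7 penalty is an assumed value): replacements between adjacent keys get a lower weight than replacements between distant keys.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public final class KeyboardProximity {

    // partial QWERTY adjacency map; a real distance matrix covers the whole layout
    private static final Map<Character, Set<Character>> NEIGHBORS = new HashMap<>();

    static {
        addNeighbors('q', "wa");
        addNeighbors('w', "qeas");
        addNeighbors('e', "wrsd");
        addNeighbors('r', "etdf");
        addNeighbors('s', "awedxz");
        // ... remaining keys omitted for brevity
    }

    private static void addNeighbors(char key, String neighbors) {
        Set<Character> set = new HashSet<>();
        for (char c : neighbors.toCharArray()) {
            set.add(c);
        }
        NEIGHBORS.put(key, set);
    }

    /** Replacement weight: nearby keys get a lower penalty than distant ones. */
    public static double replaceWeight(char intended, char typed) {
        if (intended == typed) {
            return 0.0;
        }
        Set<Character> near = NEIGHBORS.get(Character.toLowerCase(intended));
        if (near != null && near.contains(Character.toLowerCase(typed))) {
            return 0.7; // adjacent key: assumed reduced penalty
        }
        return 1.0;     // default replacement weight
    }

    public static void main(String[] args) {
        System.out.println(replaceWeight('e', 'w')); // 0.7 (adjacent keys)
        System.out.println(replaceWeight('e', 'p')); // 1.0 (far apart)
    }
}
```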

d. Refined proximity for top-k results

If the “accuracyLevel” (called “verbosity” in SymSpell) is set to “maximum” (2) and the “topK” value is set to a number greater than 1 (6 worked pretty well), a last modification kicks in to increase accuracy: we apply several proximity algorithms to these top-k results and combine them. The weighted proximity values are then used to reorder the final result.

We basically use SymSpell to pre-calculate several candidates and then prefer matches based on their phonetic proximity (Eudex), prefix proximity (Jaro-Winkler) and fragment proximity (Dice coefficient).
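A sketch of the re-ranking step (not preDict's actual scoring): combine several proximity measures into one weighted score and sort the top-k candidates by it. The Dice coefficient over character bigrams is implemented inline; the phonetic (Eudex) and prefix (Jaro-Winkler) proximities are passed in as functions, and the weights are assumptions:

```java
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.ToDoubleBiFunction;

public final class TopKReranker {

    /** Dice coefficient over character bigrams, in [0, 1]. */
    static double diceCoefficient(String a, String b) {
        Set<String> bigramsA = bigrams(a);
        Set<String> bigramsB = bigrams(b);
        if (bigramsA.isEmpty() && bigramsB.isEmpty()) return 1.0;
        Set<String> intersection = new HashSet<>(bigramsA);
        intersection.retainAll(bigramsB);
        return 2.0 * intersection.size() / (bigramsA.size() + bigramsB.size());
    }

    private static Set<String> bigrams(String s) {
        Set<String> result = new HashSet<>();
        for (int i = 0; i + 1 < s.length(); i++) result.add(s.substring(i, i + 2));
        return result;
    }

    /** Re-rank SymSpell's top-k candidates by a weighted combination of proximities. */
    static List<String> rerank(String query, List<String> candidates,
                               ToDoubleBiFunction<String, String> phonetic,
                               ToDoubleBiFunction<String, String> prefix) {
        // illustrative weights, not preDict's tuned values
        final double wPhonetic = 0.4, wPrefix = 0.3, wFragment = 0.3;
        candidates.sort(Comparator.comparingDouble((String c) ->
                wPhonetic * phonetic.applyAsDouble(query, c)
              + wPrefix   * prefix.applyAsDouble(query, c)
              + wFragment * diceCoefficient(query, c)).reversed());
        return candidates;
    }
}
```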

  • Accuracy results:

We benchmarked the implementation on a Windows 10 machine with an Intel(R) Core(TM) i7-6700 CPU (2.60GHz) and Java(TM) 1.8.0_121. The best accuracy we were able to achieve was 90.04% (PreDict CE) after 2 seconds of indexing. It is pretty amazing to see how good you can get even without a lot of data. In our Enterprise Edition (PreDict EE) we use the same PreDict library and extend it with our “secret sauce”, which solves decomposition and adds some more tricks, increasing accuracy by another 7% to 96.86%.

  • Performance results:

On the same instance type, we were able to achieve up to 86.019 ± 1.127 ops/ms (corrections per millisecond) with PreDict CE. This is almost three orders of magnitude faster than the LSTM approach.

3. Summary

Based on our measurements, it should be obvious that the algorithmic solution is the preferred one for our spell correction problem. Besides the pure performance and accuracy benefits, the algorithmic solution is much easier to deploy and operate, and it is more cost-efficient. But we wanted to go further and combine the best of the two worlds.

Bridging the gap between algorithmic and data-driven approaches.

Under the hood we built a hybrid solution to the spell correction problem: it uses the algorithmic solution at runtime but uses deep learning to compute the individual weights and parameters of our highly parameterized algorithmic model. By doing so we were able to achieve even higher accuracy levels without impacting throughput at all.
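To make the idea concrete (the file name, property keys and the WeightedDamerauLevenshtein class from the earlier sketch are all assumptions, not our actual setup), the learned parameters could be exported by the offline training job and loaded by the runtime component:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Properties;

public final class LearnedWeightsLoader {

    /** Load edit-operation weights produced offline by the learning job. */
    public static WeightedDamerauLevenshtein load(String path) throws IOException {
        Properties p = new Properties();
        try (InputStream in = Files.newInputStream(Paths.get(path))) {
            p.load(in);
        }
        return new WeightedDamerauLevenshtein(
                Double.parseDouble(p.getProperty("weight.insert", "1.0")),
                Double.parseDouble(p.getProperty("weight.delete", "1.0")),
                Double.parseDouble(p.getProperty("weight.replace", "1.0")),
                Double.parseDouble(p.getProperty("weight.transpose", "1.0")));
    }
}
```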

We are hiring

If you’re excited about advancing our search|hub API and strive to enable companies to create meaningful search experiences, join us! We are actively hiring Data Scientists to work on next-generation API & SEARCH technology.

www.searchhub.io proudly built by www.commerce-experts.com
