A closer look into the spell correction problem — Part 3 — the bells and whistles

searchhub.io Dev Blog
4 min read · Sep 25, 2017

--

At searchhub.io, query cleansing of human input (the user query) is the first strategy we apply to each and every search query we receive. In Parts 1 and 2 we already discussed some of the challenges of doing spell correction at scale and independently of language.

However, at search|hub we strive to help software systems understand humans. Therefore we not only have to cater for typos like “scatebord -> skateboard”. There are many more reasons why a search engine might not understand, or even worse misunderstand, a user query.

https://www.youtube.com/watch?v=cbtf1oyNg-8

1. Word Segmentation & Word Decomposition:

Since most search engines are still based on a token representation of words, we first have to identify the words in a user query. This might seem easy and obvious, but in quite a few cases it’s not. And I’m not only talking about languages that are traditionally written without inter-word spaces, like Chinese and Japanese, or about queries produced by some sort of speech recognition system.

Let’s take a real-world example: “Damenmotorradlederhandschuh”. Now you might think: WTF is this? It is a German compound word built by gluing together the following words (translated into English): “women + motorcycle + leather + glove”.

The dictionary approach:

The traditional approach to tackling such a query would be to use a decomposition dictionary: scan through the query and break it as soon as a sub-word from the dictionary is found. So let’s do that: “Damenmotorradlederhandschuh -> Damen motor rad leder hand schuh”. Again, for those who do not speak German: “women + engine + wheel + leather + hand + shoe”.
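
Here is a minimal sketch of such a greedy splitter. The dictionary and the shortest-match-first strategy are illustrative assumptions, not how search|hub actually decomposes queries:

```python
# A naive decomposition dictionary plus greedy shortest-match splitting.
# Both the dictionary contents and the strategy are illustrative only.
DICTIONARY = {"damen", "motor", "motorrad", "rad", "leder",
              "hand", "handschuh", "schuh"}

def greedy_decompose(compound: str) -> list[str]:
    """Split a compound by taking the first (shortest) dictionary match."""
    compound = compound.lower()
    parts, start = [], 0
    while start < len(compound):
        for end in range(start + 1, len(compound) + 1):
            candidate = compound[start:end]
            if candidate in DICTIONARY:
                parts.append(candidate)
                start = end
                break
        else:
            # no dictionary entry matches: keep the rest as-is and stop
            parts.append(compound[start:])
            break
    return parts

print(greedy_decompose("Damenmotorradlederhandschuh"))
# -> ['damen', 'motor', 'rad', 'leder', 'hand', 'schuh']
```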

Oh wait, what the heck happened here: by splitting the word we changed its meaning! Imagine the search result for such a query.

And what happens if I misspell the query?

“Damenmotoradlederhantschuh -> Damen motorad leder hantschuh”. In this query the user made two simple errors, and not even the mighty Google is able to guess what the user was looking for.

Word segmentation and word decomposition are vital parts of the query understanding process, and you can’t fix this part at scale through manual dictionary mappings and hand-crafted ambiguity handling.
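
A more robust baseline is to score candidate splits with word statistics mined from real queries or the product catalogue instead of accepting the first dictionary hit. The sketch below does this with a simple unigram model and dynamic programming; the counts are made up for illustration, and this is not a description of search|hub’s actual algorithm:

```python
# Frequency-scored segmentation: pick the split with the best total score.
import math
from functools import lru_cache

# hypothetical unigram counts, e.g. mined from query logs or a product catalogue
COUNTS = {"damen": 900, "motorrad": 700, "motor": 400, "rad": 500,
          "leder": 300, "hand": 600, "schuh": 450, "handschuh": 350}
TOTAL = sum(COUNTS.values())

def word_logprob(word: str) -> float:
    # known words get their relative frequency; unknown fragments get a
    # penalty that grows with their length, so known constituents are preferred
    if word in COUNTS:
        return math.log(COUNTS[word] / TOTAL)
    return math.log(0.01 / TOTAL) * len(word)

def best_segmentation(text: str) -> list[str]:
    """Choose the split with the highest total log-probability (dynamic programming)."""
    @lru_cache(maxsize=None)
    def best_from(start: int) -> tuple[float, tuple[str, ...]]:
        if start == len(text):
            return 0.0, ()
        options = []
        for end in range(start + 1, len(text) + 1):
            word = text[start:end]
            tail_score, tail_words = best_from(end)
            options.append((word_logprob(word) + tail_score, (word,) + tail_words))
        return max(options)

    return list(best_from(0)[1])

print(best_segmentation("damenmotorradlederhandschuh"))
# with these illustrative counts: ['damen', 'motorrad', 'leder', 'handschuh']
```

Because longer, more frequent constituents accumulate a better total score than many short fragments, “motorrad” and “handschuh” win over “motor rad” and “hand schuh”. Even so, ambiguity and misspellings like the example above still need additional signals, which is exactly the point being made here.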

2. Primary Word Detection:

Once you have segmented / decomposed the query into words, you’ll soon realize that there is another query dimension you have to take care of: the sequence of words. There are several cases where the order of words inside a user query changes the meaning of the query, or at least changes the stemming approach.

Let’s jump directly into another example: imagine the above user query “Damenmotorradlederhandschuhe” and a couple of other queries that represent the same intent/meaning -> “motorrad leder handschuhe damen”, “leder motorradhandschuh für damen”.

In this example, the meaning or intent is pretty much independent of the order of the words. However, as soon as you want to introduce stemming, you’d better make sure that you only stem the “primary word(s)”, in this case “handschuh(e)”.

But not every user query that contains the same words represents the same meaning or intent. The query “Armbanduhr” aka wristwatch vs. “Uhrarmband” aka watch bracelet is a perfect example: segmented or decomposed, both queries consist of exactly the same words but describe two different things. To solve this problem we first have to identify those user queries and then find the primary word in order to understand their meaning or intent.
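
For German compounds, a useful (though not sufficient) heuristic is that they are right-headed: the last constituent is the primary word that carries the core meaning, and it is the part that should be stemmed. Below is a minimal sketch; the hard-coded decompositions stand in for the segmentation step and are illustrative only:

```python
# Primary word detection for decomposed German compounds (right-headed).
# DECOMPOSITIONS is hard-coded for illustration; in practice these splits
# come from the segmentation / decomposition step above.
DECOMPOSITIONS = {
    "armbanduhr": ["armband", "uhr"],          # wristwatch
    "uhrarmband": ["uhr", "armband"],          # watch bracelet
    "damenmotorradlederhandschuhe": ["damen", "motorrad", "leder", "handschuhe"],
}

def primary_word(query: str) -> str:
    """Return the head of a decomposed compound: the last constituent."""
    parts = DECOMPOSITIONS.get(query.lower(), [query.lower()])
    return parts[-1]

for q in ("Armbanduhr", "Uhrarmband", "Damenmotorradlederhandschuhe"):
    print(q, "-> primary word:", primary_word(q))
# Armbanduhr -> uhr, Uhrarmband -> armband,
# Damenmotorradlederhandschuhe -> handschuhe
```

Free word-order queries like “leder motorradhandschuh für damen” need more than this positional rule, which is why primary word detection has to be treated as its own step in the query understanding process.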

3. Under- & Overstemming:

Grammatically correct stemming can be very tedious. Applying traditional stemmers like Porter or Snowball usually leads to a lot of overstemming or understemming — especially with short words which represent the majority of the query corpus.

Again, let’s take a real-world example: “babybetten -> babybetten”, “vans -> van” and “iphone5s -> iphone5”. In the first example the Porter stemmer was unable to stem “babybetten” to its root “babybett”, while in the second example the brand name “vans” was reduced to “van”, which changes its meaning.
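
These examples are easy to reproduce with an off-the-shelf stemmer, for instance NLTK’s Porter implementation. The snippet below only demonstrates the problem; it says nothing about how search|hub stems:

```python
# Reproducing the under-/overstemming examples with NLTK's Porter stemmer.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

for word in ("babybetten", "vans", "iphone5s"):
    print(word, "->", stemmer.stem(word))

# babybetten -> babybetten  (understemming: the root "babybett" is never reached)
# vans       -> van         (overstemming: the brand name becomes a different word)
# iphone5s   -> iphone5     (overstemming: the model name loses its meaning)
```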

But in order to retrieve relevant and meaningful search results, the search engine needs to understand the meaning of the query. While singular and plural forms normally represent the same meaning / intent, this might not be the case for automatically stemmed words.
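
One possible safeguard, sketched here purely for illustration (it is not search|hub’s method), is to accept an automatic stem only if the stemmed form is itself part of a known domain vocabulary, e.g. mined from the product catalogue or query logs:

```python
# Guarded stemming: only accept stems that stay inside a known vocabulary.
# KNOWN_WORDS is a hypothetical domain vocabulary used purely for illustration.
from nltk.stem import PorterStemmer

KNOWN_WORDS = {"skateboard", "babybett", "vans", "iphone5s"}
stemmer = PorterStemmer()

def safe_stem(token: str) -> str:
    stemmed = stemmer.stem(token)
    # "skateboards" -> "skateboard" is accepted (same meaning, known word),
    # "vans" -> "van" is rejected because "van" is not in the domain vocabulary
    return stemmed if stemmed in KNOWN_WORDS else token

for token in ("skateboards", "vans", "iphone5s"):
    print(token, "->", safe_stem(token))
# skateboards -> skateboard, vans -> vans, iphone5s -> iphone5s
```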

search|hub does all of this automatically

When we built search|hub, we solved all of these problem areas by combining domain knowledge, smart algorithms and machine learning models fueled by user data. We strongly believe that all of this is key to making search engines understand humans.

SEARCH IS THE PLACE WHERE THE USER IS TELLING YOU WHAT HE WANTS. IF YOUR SEARCH ENGINE SPEAKS THE SAME LANGUAGE AS YOUR USERS SEARCH BECOMES A CONVERSATION. SEARCH|HUB HAS SPECIFICALLY BEEN DESIGNED TO HELP YOUR EXISTING SEARCH ENGINE TO UNDERSTAND HUMANS AND DRIVE THESE CONVERSATIONS.

We are hiring

If you’re excited about advancing our search|hub API and strive to enable companies to create meaningful search experiences, join us! We are actively hiring Data Scientists to work on next-generation API & SEARCH technology.

www.searchhub.io proudly built by www.commerce-experts.com
