Autocomplete and Search for a Domain-Specific E-commerce Website

Aditya Prasad N
7 min read · Sep 8, 2016


Autocompleting user queries in the search bar enhances the browsing experience, converges users faster towards a vertical, shortens their active session time on the site, and even lets you apply business logic to boost GMV. For an e-commerce website selling products with names specific to a region, like “polki necklace set” (Indian), “pingguo” (Chinese), “palissandre” (French), or “paella” (Spanish), how do you implement autocomplete and a basic domain-specific search on your products?

I’ll give a brief overview of the product and its implementation. The following are the basic things expected of your autosuggestion:

Sub-100ms response for every letter the user types in, which implies a query time of <10ms on the database.

I’ve chosen to build it on Elasticsearch (ES), which has good documentation and fast full-text search, and scales with the number of products. Preliminary benchmarking shows ES searching 5 million records in under 10ms.

Query completion not just to the word but to a relevant sentence.

To complete the sentence, we either need to generate those sentences using an NLP library (not efficient for domain-specific words) or use a corpus of users’ search history, keeping queries searched at least twice. Most incoming user searches will already be in this corpus. But users’ search history has duplicate searches, spelling mistakes, domain-irrelevant words, overly long queries, and similar issues. We need to clean up our search history.

Spell-check the user query so it matches the text you have in your database.

As the problem statement is domain specific, a user might not know the product’s English name or spelling, so we need spell correction in our completion. Why? When a user selects a particular autocomplete suggestion, it reduces the complexity of the text search on the products DB, where the product name is stored the way you (or the uploading vendor) think it should be spelled. You also enhance the user’s experience and their trust in the website.

Identify the context (product vertical) in which your autocomplete sentence falls. Food? Clothing? Furniture?

Identify the context of the text search so you can display suggestions converging into a vertical, e.g. paella in Food, polki in Jewellery, palissandre in Furnishing, pingguo in Food. Identifying the product vertical lets you take the user to the corresponding vertical page, where they can see that vertical’s specific filters. We found that users who apply filters convert at a higher rate than those who don’t.

Autocompletion may include product names, vertical names, or simple sentences, depending on your domain.

If you’re running an online grocery firm, your user searches for a particular item/product, and it’s essential to show that product with the quantity they’d like to order. For an online fashion firm, you would instead suggest a vertical/sub-vertical/trend/brand/regional fashion; it’s unlikely a user searches for one specific product on a fashion website. How you autocomplete depends on your analysis of user search history.

Search scalable to millions of products in your DB.

When a user clicks on an autocomplete suggestion, you need to do a plain text search on your inventory DB. You can identify the adjectives, modifiers, and context in the search query and apply filters on your DB to fetch the appropriate products. For a marketplace-model, domain-specific website, your product names and their properties are at the mercy of vendors’ keyboards; doing a plain string search with synonyms/spell-corrected synonyms on products returns more products for a given search.

Search starts here

Implementation

Cleaning up Search History

If you don’t have any user search history (a cold start), seed it with some top-selling products, vertical names, or data crawled from competitor websites. If you have an analytics platform like Google Analytics/Mixpanel or server logs where you’ve captured users’ historic searches, get them in the following format (top_historic_searches_site.csv):

SearchString,         #times searched
Paella,               5620
Polki necklace set,   2304
Palissandre,          932
Pingguo,              22

Creating an auto_complete index on Elasticsearch

Once we get the historic searches (searched at least twice?!), we need to clean up the search data. I’ve written a Python script which reads every query, trims it, converts it to lower case, removes special characters, ignores search strings longer than 5 words, and spell-checks tokens against the inventory DB. Wherever the spell-check model suggests a correction, I take both the token and the suggested token.

Every query is split into tokens, and a spell-corrected token replaces the original if the confidence is > 0.8; if the confidence is between 0.5 and 0.8, the suggestion is eyeballed before the token is replaced. I’ve created a gate_keeper of all the tokens allowed into the ES auto_complete corpus; any query with a token not present in the gate_keeper doesn’t enter the final corpus. This data cleanup could be another post altogether.
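A minimal sketch of that cleanup pass (the gate_keeper vocabulary here is a toy, and Python’s difflib stands in for whatever spell-check model you actually train against your inventory DB):

import csv
import difflib
import re

# Toy vocabulary of tokens allowed into the corpus, built offline from
# inventory/product names. In practice this is much larger.
gate_keeper = {"polki", "necklace", "set", "paella", "palissandre", "pingguo"}

def spell_correct(token, vocab):
    """Return (suggested_token, confidence) for a token."""
    if token in vocab:
        return token, 1.0
    matches = difflib.get_close_matches(token, vocab, n=1, cutoff=0.5)
    if not matches:
        return token, 0.0
    return matches[0], difflib.SequenceMatcher(None, token, matches[0]).ratio()

def clean_query(query):
    """Trim, lowercase, strip special characters; None means 'reject'."""
    query = re.sub(r"[^a-z0-9\s]", " ", query.strip().lower())
    tokens = query.split()
    if not tokens or len(tokens) > 5:      # ignore empty/overly long searches
        return None
    cleaned = []
    for token in tokens:
        suggestion, confidence = spell_correct(token, gate_keeper)
        if confidence > 0.8:
            cleaned.append(suggestion)     # confident correction: auto-replace
        elif confidence >= 0.5:
            cleaned.append(suggestion)     # borderline: eyeball these in practice
        else:
            return None                    # token not in gate_keeper: drop query
    return " ".join(cleaned)

with open("top_historic_searches_site.csv") as f:
    reader = csv.reader(f)
    next(reader)                           # skip the header row
    for search_string, times_searched in reader:
        cleaned = clean_query(search_string)
        if cleaned:
            print(cleaned, times_searched.strip())

First I’ll create an index with a mapping for the _suggest API.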

PUT localhost:9200/auto_complete
{
  "settings": {
    "analysis": {
      "filter": {
        "nGram_filter": {
          "type": "nGram",
          "min_gram": 2,
          "max_gram": 20,
          "token_chars": ["letter", "digit", "punctuation", "symbol"]
        },
        "my_synonyms": {
          "type": "synonym",
          "synonyms": []
        }
      },
      "analyzer": {
        "nGram_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["lowercase", "asciifolding", "nGram_filter"]
        },
        "whitespace_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["lowercase", "asciifolding"]
        },
        "suggest_synonyms": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["my_synonyms"]
        }
      }
    }
  },
  "mappings": {
    "searches_type": {
      "_all": {
        "analyzer": "nGram_analyzer",
        "search_analyzer": "whitespace_analyzer"
      },
      "properties": {
        "historic_search_name": {
          "type": "string"
        },
        "search_suggest": {
          "type": "completion",
          "payloads": true,
          "preserve_position_increments": false,
          "preserve_separators": false
        }
      }
    }
  }
}

I’m going to insert the historic searches into ES. You can use ES’s _bulk API to insert in batches of 10k records; a sketch of that follows the single-document example below. For the suggest field, we have:

output: the text shown to the user if the record is a match.
input: the sentences this record can be matched against.
weight: an optional weight applied on top of the string-match _score of the record.
payload: an optional string returned along with the output, where we can include the vertical context.

POST localhost:9200/auto_complete/searches_type/
{
  "historic_search_name": "polki necklace set",   # the search
  "search_suggest": {
    "output": "polki necklace set",               # what output to show to the user
    "weight": 1.5,                                # weightage (or) score
    "input": [
      "polki set necklace",
      "necklace set polki",
      "necklace polki set",
      "set polki necklace",
      "set necklace polki",
      "polka necklaces"
    ],                                            # all the possible combinations of the search query
    "payload": "[{\"score\": 0.6, \"vertical_id\": 2, \"type\": \"context_suggest\", \"vertical_name\": \"Jewellery\"}]"
  }
}
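To load the whole corpus, the same documents can go through the _bulk API in batches. A rough sketch with the Python requests library (index/type names as above; the 10k batch size is the one mentioned earlier):

import json
import requests

ES_URL = "http://localhost:9200"
BATCH_SIZE = 10000

def bulk_insert(docs):
    """Index suggestion documents into auto_complete in batches via _bulk."""
    for start in range(0, len(docs), BATCH_SIZE):
        lines = []
        for doc in docs[start:start + BATCH_SIZE]:
            # _bulk takes newline-delimited JSON: an action line, then the document
            lines.append(json.dumps(
                {"index": {"_index": "auto_complete", "_type": "searches_type"}}))
            lines.append(json.dumps(doc))
        resp = requests.post(ES_URL + "/_bulk", data="\n".join(lines) + "\n")
        resp.raise_for_status()

Each element of docs is a dictionary shaped like the historic_search_name/search_suggest document above.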

Querying auto_complete corpus on ES

Once we have put all the historic searches into the auto_complete index, we can query the _suggest endpoint. “SearchSuggest” is the name we give the suggestion in the request; the response is keyed by it. To get spell correction within an edit distance of 2, we use the fuzzy parameter.

GET auto_complete/_suggest
{
  "text": "p",
  "SearchSuggest": {
    "completion": {
      "field": "search_suggest",
      "fuzzy": {
        "fuzziness": 2
      }
    }
  }
}

With a response:

{
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "SearchSuggest": [{
    "text": "p",
    "offset": 0,
    "length": 1,
    "options": [{
      "text": "paella",
      "score": 595.0
    }, {
      "text": "polki necklace set",
      "score": 231.0,
      "payload": "[{\"score\": 0.6, \"vertical_id\": 2, \"type\": \"context_suggest\", \"vertical_name\": \"Jewellery\"}]"
    }, {
      "text": "palissandre",
      "score": 132.0,
      "payload": "[{\"score\": 1.2, \"vertical_id\": 5, \"type\": \"context_suggest\", \"vertical_name\": \"Furnishing\"},{\"score\": 0.4, \"vertical_id\": 19, \"type\": \"context_suggest\", \"vertical_name\": \"Dining Tables\"}]"
    }, {
      "text": "pingguo",
      "score": 104.0
    }]
  }]
}

_suggest gave me a response in 6ms over a search corpus of 20k records. We can write a JS parser over this response and show the auto_complete suggestions to the user. An interesting thing to note is the payload: a stringified dictionary object where I’ve stored the different contexts (verticals/sub-verticals), computed by an offline script. Every day our analytics scripts can take the unique searches on our site, clean the data, and add new/trending searches to our corpus with updated weights.
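The parsing itself is a few lines; here it is in Python for illustration (on the site it would live in the front-end JS):

import json

def parse_suggestions(response):
    """Flatten a _suggest response into (text, score, contexts) tuples."""
    results = []
    for suggestion in response.get("SearchSuggest", []):
        for option in suggestion["options"]:
            # payload, when present, is a stringified list of context dicts
            contexts = json.loads(option.get("payload", "[]"))
            results.append((option["text"], option["score"], contexts))
    return results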

To implement trending searches, we can weight each query occurrence by an exponential decay of its age and use that score, instead of the raw count, in top_historic_searches_site.csv, as sketched below.
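A sketch of that weighting, assuming each logged search carries a unix timestamp; the 30-day half-life is an assumption to tune to how fast trends move in your domain:

import math
import time

HALF_LIFE_DAYS = 30.0  # assumption: tune per domain

def decayed_weight(searched_at, now=None):
    """Weight of one search occurrence, decaying exponentially with age."""
    now = now or time.time()
    age_days = (now - searched_at) / 86400.0
    return math.exp(-math.log(2) * age_days / HALF_LIFE_DAYS)

# Sum decayed_weight over all occurrences of a query and write that score,
# instead of the raw count, into top_historic_searches_site.csv.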

Text search on inventory DB

Once our user selects a particular suggestion, we take the string and do a plain string search on our inventory database. If you have a large number of products, chances are you’re already using ES/Solr to power your website for faster responses. If so, do a plain string search on the _all field.

If you’ve identified any adjectives or properties in the search string, apply them as filters in a filtered query with a bool must clause on Elasticsearch.

{
  "from": 0,
  "size": 40,
  "query": {
    "filtered": {
      "query": {
        "bool": {
          "must": [{
            "match": {
              "_all": {
                "operator": "and",
                "query": "polki necklace set",
                "type": "boolean"
              }
            }
          }, {
            "match": {
              "_all": {
                "query": "polki necklace set",
                "type": "phrase"
              }
            }
          }]
        }
      },
      "filter": {
        "bool": {
          "must": [{
            "term": {
              "vertical_id": 2
            }
          }]
        }
      }
    }
  }
}

As we’re doing a plain string search and we’re not sure about the quality of the text in our inventory DB, we can combine the _score of the above query with a product_score field on ES, using function_score or scripting, for better-quality results at run time.
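A sketch of that combination using function_score, wrapping a simplified version of the query above (the products index name and the neutral missing value are assumptions; product_score is the per-product quality field mentioned above, computed offline):

import json
import requests

query = {
    "from": 0,
    "size": 40,
    "query": {
        "function_score": {
            "query": {
                "match": {
                    "_all": {"query": "polki necklace set", "operator": "and"}
                }
            },
            "field_value_factor": {
                "field": "product_score",  # precomputed quality signal per product
                "missing": 1               # neutral multiplier if the field is absent
            },
            "boost_mode": "multiply"       # final score = text _score * product_score
        }
    }
}
resp = requests.post("http://localhost:9200/products/_search",
                     data=json.dumps(query))
print(resp.json()["hits"]["total"])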

Gists: https://gist.github.com/adityan619/7909c53281f31dd9d1f6e12a323c2a57

I haven’t covered tweaks to the above code, further optimization, benchmarks, or why _suggest; that’s for another time. Any suggestions welcome :)
