Semantic Search — Innovation at scale!
One of the coolest things about working for Myntra is having the opportunity and freedom to innovate. That not only nurtures creativity but also helps build cool products! I feel our latest work, an intelligent query pipeline, was in many ways an innovative way of improving the search experience for Myntra’s users, and thereby improving conversion and discoverability.
Before diving right into the solution, it will be helpful to get some context on how we arrived at it. So here’s a brief rundown:
When we started out, we observed a general sentiment that search was not optimal. Although an information retrieval (IR) system like Search may never handle every query well, in its state at that time there was room to tune it for both head and tail queries. There were plenty of anecdotal instances of this, but we wanted to establish the problem using a decent sample. To that end, we decided to dig deeper and find data to prove or disprove the hypothesis. We primarily did the following:
- Explored the query logs
- Collected feedback from customer care (CC)
- Crowdsourced evaluation to Squadrun* and got relevance scores**
- Conducted a Myntra-wide bug bash to collect feedback
These exercises helped us find common themes, and we clustered the problems into three main areas:
Lack of query understanding
- Failure to establish the intent behind a query: queries like “moto 360” should have been understood as “smart wearables”, which wasn’t the case
- Missing contextual understanding: queries like “nike shoes under 2000” were not understood
Insufficient query correction
- Spell correction didn’t handle common cases like “jins -> jeans” or “nuke -> nike”. Also, there was no correction (or expansion) for colloquial or transliterated terms (though these were a small percentage of queries)
No query substitution
- In cases of no-results, a probable substitution was not offered
As is evident, the problems above can broadly be classified as relating to either precision or recall. The two generally trade off against each other (as established via the Cranfield experiments***), so improving both at the same time is hard. Hence, we decided to focus more on improving precision. To that end, click-through rate (CTR) was chosen as the most important metric. Additionally, other key metrics like bounce rate, click depth and zero-result searches were also baselined.
This exercise enabled us to establish the problem clearly and also to quantify the projected benefits of fixing it.
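To make the metrics concrete, here is a minimal sketch of how precision, recall and CTR can be computed for a single query. The function names and sample data are illustrative assumptions, and CTR is taken here as clicks divided by result impressions, since the exact formula we used is not spelled out in this post.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall for one query, given result ids and relevant ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def click_through_rate(clicks, impressions):
    """Assumed definition: clicks divided by result impressions."""
    return clicks / impressions if impressions else 0.0


# Hypothetical example: 4 results shown, 2 of them relevant, 3 relevant items exist.
p, r = precision_recall(["a", "b", "c", "d"], ["a", "b", "e"])
```

Returning more results would likely pick up item "e" and raise recall, but every extra irrelevant result lowers precision, which is the trade-off described above.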
With that done, it was time to lay out the solution. One possible approach was to make iterative changes; however, it was important for us not to be bound by the limitations of the current design. Hence, we decided to fork a new branch and came up with a design from the ground up, reusing components where possible. The design looked something like the one below (for representation purposes only):
This is an illustration where “Moto 360” was searched. The first step corrects Moto to Motorola, then the intent recognition engine annotates Motorola as a brand, 360 as a smartwatch (in the context of Motorola) and also recognises the primary intent to be a wearable. These annotations are then used to fetch results from the catalog. If Moto 360 were not listed in Myntra’s catalog, the recognised intent and other attributes would be used to substitute the query with similar products from brands like Samsung (Gear) and Apple (Watch).
Sanity: runs first. It performs stop-word removal, spell correction, and stemming on the user query as needed. Our spell correction is two-pronged: it finds corrections in parallel using “refined-soundex” and an in-house implementation, qwerty-based Jaro-Winkler distance (jwd)ˆ. The output is passed to a noisy channel model, which uses a confusion matrix to pick a winner.
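The noisy-channel idea can be sketched in a few lines. This is not our production code: the vocabulary counts are made up, and a keyboard-aware edit distance stands in for both the candidate generators (refined-soundex and qwerty-based jwd) and the confusion matrix. Each candidate is scored as log P(candidate) plus a log-likelihood of the typo given that candidate, and the highest score wins.

```python
import math

# Hypothetical term frequencies standing in for counts mined from query logs.
TERM_COUNTS = {"nike": 900, "jeans": 1200, "puma": 700, "june": 50}
TOTAL = sum(TERM_COUNTS.values())

# Keys that sit next to each other on the same QWERTY row (a simplification).
QWERTY_ROWS = ("qwertyuiop", "asdfghjkl", "zxcvbnm")
ADJACENT = {}
for row in QWERTY_ROWS:
    for left, right in zip(row, row[1:]):
        ADJACENT.setdefault(left, set()).add(right)
        ADJACENT.setdefault(right, set()).add(left)


def keyboard_edit_distance(typed, intended):
    """Levenshtein distance where substituting a neighbouring key costs 0.5."""
    prev = [float(j) for j in range(len(intended) + 1)]
    for i, typed_char in enumerate(typed, 1):
        curr = [float(i)]
        for j, intended_char in enumerate(intended, 1):
            if typed_char == intended_char:
                sub = 0.0
            elif intended_char in ADJACENT.get(typed_char, ()):
                sub = 0.5
            else:
                sub = 1.0
            curr.append(min(prev[j] + 1.0,        # delete typed_char
                            curr[j - 1] + 1.0,    # insert intended_char
                            prev[j - 1] + sub))   # match or substitute
        prev = curr
    return prev[-1]


def correct(typed, error_weight=2.0):
    """Pick the vocabulary term maximising log prior + log channel score."""
    def score(candidate):
        log_prior = math.log(TERM_COUNTS[candidate] / TOTAL)
        log_channel = -error_weight * keyboard_edit_distance(typed, candidate)
        return log_prior + log_channel
    return max(TERM_COUNTS, key=score)
```

With this toy vocabulary, `correct("nuke")` picks "nike" (u and i are neighbouring keys) and `correct("jins")` picks "jeans". A real confusion matrix would replace the fixed 0.5 and 1.0 costs with error probabilities learned from observed typos.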
Query correction dictionary: a dictionary of query-to-query corrections. It is designed to house synonyms, translations, and transliteration corrections.
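As a rough sketch of how such a dictionary can be applied, the snippet below does a longest-phrase-first lookup over the query tokens. Only the “moto” to “motorola” mapping comes from the illustration above; the other entries, the function name and the phrase-length limit are hypothetical.

```python
# Hypothetical correction entries; only "moto" -> "motorola" is from the post.
CORRECTIONS = {
    "moto": "motorola",
    "tee": "t-shirt",             # synonym (assumed example)
    "chappal": "flip flops",      # transliterated term (assumed example)
    "track pant": "track pants",  # multi-word entry (assumed example)
}


def apply_corrections(query, corrections=CORRECTIONS, max_phrase_len=3):
    """Rewrite the query, preferring the longest matching phrase at each position."""
    tokens = query.lower().split()
    rewritten = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_phrase_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in corrections:
                rewritten.append(corrections[phrase])
                i += n
                break
        else:
            # No entry starts at this token, so keep it unchanged.
            rewritten.append(tokens[i])
            i += 1
    return " ".join(rewritten)
```

Here `apply_corrections("Moto 360")` gives "motorola 360", while tokens without an entry pass through untouched.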
Query understanding: houses the intent engine. We classify intent as either “explicit” or “implicit”. Here we take a three-step approach: first, we do named-entity recognition (NER) using SolrTextTaggerˆˆ, which tags named entities in log(nm) time. Then we run a Bayesian classifier to tag implicit intent if needed. In parallel, a rule engine is executed to interpret queries that match pre-defined rules (queries like “nike shoes under 4000”). Finally, all annotations are passed to the “Disambiguator”, which removes ambiguity and picks a winner.
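To give a feel for the rule-engine part, here is a minimal sketch of a single price rule. The regular expression, the tiny gazetteer (a stand-in for the tagger, not SolrTextTagger itself) and the annotation field names are all assumptions made for illustration.

```python
import re

# One hypothetical rule: "<under|below|above|over> [rs] <amount>".
PRICE_RULE = re.compile(
    r"\b(under|below|above|over)\s+(?:rs\.?\s*)?(\d+)\b", re.IGNORECASE
)

# Tiny stand-in gazetteer; a real tagger would look entities up in an index.
GAZETTEER = {"nike": "brand", "shoes": "article_type"}


def understand(query):
    """Return the annotations produced by the price rule and the gazetteer."""
    annotations = {"entities": [], "price": None}
    match = PRICE_RULE.search(query)
    if match:
        op = "lt" if match.group(1).lower() in ("under", "below") else "gt"
        annotations["price"] = {"op": op, "value": int(match.group(2))}
        # Remove the matched price phrase before tagging the remaining tokens.
        query = query[:match.start()] + " " + query[match.end():]
    for token in query.lower().split():
        if token in GAZETTEER:
            annotations["entities"].append((token, GAZETTEER[token]))
    return annotations
```

For “nike shoes under 4000” this yields a brand, an article type and a price filter of less than 4000, which downstream can be turned into catalog filters instead of being matched as free text.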
Query substitution: is invoked if the number of results is below a threshold. It picks the best candidate using a weighted graph, which was built by crawling query logs and finding terms that co-occur in a session. Terms that frequently occur together sit nearer to each other in the graph and are hence preferred as substitutes.
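The substitution graph can be sketched as below. This is a simplified take under stated assumptions: whole queries are used as nodes, an edge weight is simply the number of sessions in which two queries appear together, and `has_results` is a hypothetical callback that checks the catalog. The session data is invented.

```python
from collections import defaultdict
from itertools import combinations


def build_graph(sessions):
    """Edge weight = number of sessions in which two queries co-occur."""
    graph = defaultdict(lambda: defaultdict(int))
    for session in sessions:
        for a, b in combinations(sorted(set(session)), 2):
            graph[a][b] += 1
            graph[b][a] += 1
    return graph


def substitute(query, graph, has_results):
    """Return the most strongly connected neighbour that has results, if any."""
    neighbours = graph.get(query, {})
    candidates = [(weight, q) for q, weight in neighbours.items() if has_results(q)]
    if not candidates:
        return None
    # Ties on weight are broken alphabetically to keep the choice deterministic.
    return max(candidates)[1]


# Hypothetical sessions mined from query logs.
sessions = [
    ["moto 360", "samsung gear"],
    ["moto 360", "samsung gear", "apple watch"],
    ["moto 360", "apple watch"],
    ["moto 360", "samsung gear"],
]
graph = build_graph(sessions)
in_catalog = {"samsung gear", "apple watch"}
best = substitute("moto 360", graph, lambda q: q in in_catalog)
```

In this toy data “samsung gear” shares three sessions with “moto 360” against two for “apple watch”, so it is chosen as the substitute.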
Below are some indicative numbers for the improvements seen since launch:
This means improved conversion and better discovery of Myntra’s catalog. It also improves the customer experience, both in terms of finding relevant items in fewer clicks and in terms of fewer disappointments.
* “Squadrun” operates like Amazon’s Mechanical Turk, except that they are specialists in Search.
** Relevance score is the slope on a scatter plot of “search score (given by individuals)” vs “confidence score”
***As the level of recall rises the level of precision generally declines and vice versa — The Cranfield experiments (1957 & 1962)
ˆOpen sourced and can be found at https://github.com/nilaksh/qwerty-jaro-winkler