Yellow Couch: Tale of Two Solr Indexes

Gajendra Dadheech
Walmart Global Tech Blog
4 min read · Oct 27, 2020

Solving for Color and Size Queries in an Andromeda-Esque Search Index using yet another Solr as Forward-Index

Figure 1: A typical yellow chair from Walmart Catalog

In this post, we will talk about how we tackled the problem of availability, eligibility, and price discrepancies between the search page and the item page for color/size queries and filters.

Problem Statement

When a shopper comes with a color or size intent [e.g. yellow couch, king mattress, etc.], our results on the first page look quite relevant at the outset. But when we move to the item page, it is possible that some of those items are not available in the specific color or size we were looking for.

Why This Happens

Figure 2: 10,000-Foot Overview of Search Architecture

Figure 2 shows a typical search Query Path in any e-commerce search system.

A query first hits one of our microservices and goes through the usual search query steps [query understanding, reformulation, correction, etc.], and then finally hits the Solr index, which is populated by a continuously running indexing pipeline.

The culprit in our story is the Solr index, or rather our data model in Solr. We group all possible color and size combinations of an item into one document and store it in a denormalized format [shown below] inside this index. So if a couch in any color is available, we mark this document as available, but we don't know whether the yellow one is available. That is determined on the item page by another call to our real-time availability service API [a different service].

Denormalized format (for presentation purposes):
{id: 1, color: [red, blue, green], available: true}
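To make the information loss concrete, here is a minimal sketch of how such a roll-up behaves. The variant records, the `roll_up` helper, and the `matches_color_query` helper are hypothetical illustrations, not our indexing code; only the field names follow the format above.

```python
# Hypothetical variant records for one couch; field names mirror the post's examples.
variants = [
    {"id": 11, "color": "red", "available": True},
    {"id": 12, "color": "yellow", "available": False},
    {"id": 13, "color": "blue", "available": True},
]

def roll_up(parent_id, variants):
    """Collapse all variants into one grouped document (illustrative only)."""
    return {
        "id": parent_id,
        "color": sorted({v["color"] for v in variants}),
        # True if ANY variant is available; the per-color detail is lost here.
        "available": any(v["available"] for v in variants),
    }

def matches_color_query(doc, color):
    """All a grouped document can tell us: the color is listed and the group is available."""
    return color in doc["color"] and doc["available"]

doc = roll_up(1, variants)
print(doc)
# The grouped document matches "yellow" even though the yellow variant is unavailable.
print(matches_color_query(doc, "yellow"))
```

The grouped document answers "is some couch in this group available?", which is a different question from "is the yellow couch available?".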

Question: Why not store all the data for every item combination in one document inside the search index?

Ans: Walmart has more than 5000 stores across the US, and each store has its own availability, price, and eligibility information. Storing all that information inside one document is very complicated and would make the index explode in size.
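A rough back-of-the-envelope calculation shows the scale. The store count comes from the answer above (rounded down to 5000); the number of variants per item is a made-up figure chosen only for illustration.

```python
STORES = 5000            # "more than 5000 stores", treated here as a round figure
VARIANTS_PER_ITEM = 12   # hypothetical, e.g. 4 colors x 3 sizes
PER_STORE_FIELDS = 3     # availability, price, eligibility

def values_per_document(stores, variants, fields):
    """Number of store-level values one grouped document would have to carry."""
    return stores * variants * fields

total = values_per_document(STORES, VARIANTS_PER_ITEM, PER_STORE_FIELDS)
print(total)  # 180000
```

Under these assumptions, a single grouped document would carry on the order of 180,000 store-level values, each of which can change independently.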

Question: Why not explode the index and store each variant as a separate document?

Ans: This was something we tried, but it gave us huge latency spikes, and the index size also grew because of data duplication.

The System Design

We solved this problem using a forward index. Details follow.

Figure 3: What Is Better than 1 Solr Index? 2 Solr Indexes

Figure 3 shows the new search design. We have introduced a new data store [the forward index], and it is yet another Solr.

This forward index holds only the variant items, each as a separate document. Through our orchestration layer, we make a second call after the first call to the primary Solr to get the relevant item within a group. The primary Solr returns the most relevant groups; the second Solr just figures out the most relevant color/size item within each group.

Data format in the forward index:
{parent_id: 1, id: 11, color: red, available: true, price.....}
{parent_id: 1, id: 12, color: yellow, available: false, price... }
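The lookup against these variant documents can be sketched as follows. This is an in-memory stand-in, not a Solr call: `FORWARD_INDEX`, `best_variant`, and the second group (`parent_id: 2`) are hypothetical and exist only to show the idea of resolving a group to a matching variant.

```python
from typing import Optional

# Hypothetical in-memory stand-in for the forward index (one document per variant).
FORWARD_INDEX = [
    {"parent_id": 1, "id": 11, "color": "red", "available": True},
    {"parent_id": 1, "id": 12, "color": "yellow", "available": False},
    {"parent_id": 2, "id": 21, "color": "yellow", "available": True},
]

def best_variant(parent_id: int, color: str) -> Optional[dict]:
    """Return an available variant of the group in the requested color, or None."""
    for doc in FORWARD_INDEX:
        if (
            doc["parent_id"] == parent_id
            and doc["color"] == color
            and doc["available"]
        ):
            return doc
    return None

# Groups returned by the primary index for a "yellow couch" query (assumed).
primary_hits = [1, 2]
resolved = {pid: best_variant(pid, "yellow") for pid in primary_hits}
print(resolved)
```

Group 1 resolves to `None` because its yellow variant is unavailable, while group 2 resolves to variant 21.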

Our queries to the forward index are mainly filter queries [FQ] on store availability, eligibility, color, and size attributes, so we utilize the document cache and the filter cache to the fullest. The figures below show how we get the most out of the Solr caches.
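As an illustration of a cache-friendly request, the sketch below builds Solr select parameters with one `fq` per attribute. Solr caches each `fq` clause separately in the filter cache, so clauses such as `color:yellow` can be reused across many searches. The `available_stores` field, the `forward_index_params` helper, and the choice to put the per-query group list in `q` are assumptions made for this example, not our production schema.

```python
from urllib.parse import parse_qs, urlencode

def forward_index_params(parent_ids, color, store_id, rows=50):
    """Build select parameters with one fq per reusable filter (illustrative)."""
    # The group list differs on every search, so it goes in q rather than fq
    # to avoid filling the filter cache with one-off entries.
    group_clause = "parent_id:(" + " OR ".join(str(p) for p in parent_ids) + ")"
    return [
        ("q", group_clause),
        ("fq", f"color:{color}"),                # reusable across searches
        ("fq", "available:true"),                # reusable across searches
        ("fq", f"available_stores:{store_id}"),  # hypothetical field name
        ("fl", "id,parent_id,color,available"),
        ("rows", str(rows)),
    ]

query = urlencode(forward_index_params([1, 2], "yellow", 100))
print(query)
print(parse_qs(query)["fq"])
```

Keeping each attribute in its own `fq` rather than combining them into one clause is what lets the individual filters be shared between different queries.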

Figure 4: Filter Cache during the performance test
Figure 5: Document Cache during the performance test

The overall impact on search query latency is minimal because this call happens in parallel with another microservice call made during post-processing of documents, as shown in Figure 6.
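The effect of running the two calls side by side can be sketched with a thread pool. Both service functions here are hypothetical stand-ins with simulated latencies; the point is only that the combined wait is close to the slower call rather than the sum of the two.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def call_forward_index(parent_ids):
    """Hypothetical stand-in for the forward-index call."""
    time.sleep(0.15)  # simulated latency
    return {pid: True for pid in parent_ids}

def call_post_processing(parent_ids):
    """Hypothetical stand-in for the other post-processing microservice call."""
    time.sleep(0.20)  # simulated latency
    return {pid: {"title": f"item-{pid}"} for pid in parent_ids}

def enrich(parent_ids):
    """Issue both calls concurrently and wait for both results."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        variants = pool.submit(call_forward_index, parent_ids)
        extras = pool.submit(call_post_processing, parent_ids)
        return variants.result(), extras.result()

start = time.perf_counter()
variant_info, extra_info = enrich([1, 2])
elapsed = time.perf_counter() - start
print(f"elapsed: {elapsed:.2f}s")
```

Run sequentially, the two simulated calls would take about 0.35 seconds; run in parallel, the total stays near the slower of the two.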

Figure 6: Impact on Latency through various approaches

Conclusion

We now have two Solr indexes in our search path. One of them holds a subset of documents and is used to filter out non-relevant items and keep the relevant items at the top. We also have a separate indexing pipeline that keeps this forward index up to date. So for any search query, we first get the top N items from our primary index and then filter, boost, demote, or swap them based on the secondary call. All in all, color/size queries work fine now.
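One simple form of the demoting step can be sketched as a stable sort. The `rerank` function and the sample data are hypothetical, and this shows only demotion, not the full set of filtering, boosting, and swapping rules.

```python
def rerank(primary_hits, has_matching_variant):
    """Move groups with an available matching variant ahead of the rest.

    sorted() is stable, so the primary index's relevance order is preserved
    within each of the two partitions.
    """
    return sorted(
        primary_hits,
        key=lambda pid: not has_matching_variant.get(pid, False),
    )

# Top N group ids from the primary index, in relevance order (assumed).
primary_hits = [1, 2, 3, 4]
# Result of the secondary call: does the group have the requested variant available?
has_matching_variant = {1: False, 2: True, 3: True, 4: False}

print(rerank(primary_hits, has_matching_variant))  # [2, 3, 1, 4]
```

Groups 2 and 3 move to the top in their original order, and groups 1 and 4 are demoted rather than dropped.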

Solr Reference Document:

Solr: https://lucene.apache.org/solr/

We would love to know if you can think of any other innovative or more optimal solution to the above problem. Comments are welcome. :-)

Gajendra Dadheech
Professional Data-Dabbler | Search Engineering@Walmart, Ex-Flipkart | Co-founder of DataTurks