How we rebuilt the Walmart Autocomplete Backend

Shouvik Dutta
Walmart Global Tech Blog
10 min read · Jan 6, 2021

Table of Contents

  • Introduction
  • The Legacy
  • The Approach
  • The Solution
    - Handling multiple data sets
    - Cache Capacity estimation
    - Latency to call remote cache
  • The Architecture
  • What Next?
  • Conclusion

Introduction

Autocomplete is the first step into the search world. Instead of typing in a word/phrase, users can start typing the first few characters, and autocomplete would start suggesting relevant matches. It helps users quickly find what is most relevant and trending, and how to fine-tune their search.

This is particularly useful on mobile devices where users may not want to type a lot of characters but still get the relevant results easily. With the growing use of mobile devices and tablets, a good autocomplete service can drive customer experience and help find relevant suggestions quickly and efficiently.

Autocomplete also needs to be very quick in responding, since the goal is to suggest something on every keystroke. For the same reason, it is expected to receive a lot more traffic, compared to search traffic, since one request is made for every character entered by the user.

The Legacy

The traditional way of building an autocomplete service is by implementing a Trie. This approach works well in academia, interviews, and for faster time to market. In the real world, however, where we want to integrate more features or make it scalable, this solution can pose serious problems.

Walmart e-Commerce has existed on the internet for some time now, and the initial version of Autocomplete was built in much the same fashion as described above.

The legacy architecture

An offline job would build a data file with suggestions and scores (driven by factors like popularity and frequency of use). Once a day this file would be parsed and a TST (a ternary search tree, a variant of a Trie) would be generated in memory to serve traffic.

Over time, when ‘Online Grocery’ became a thing, a separate Autocomplete for OG was built by forking the original service. This was quick and dirty, increased tech-debt, and we had to maintain two separate codebases and deployments. Both these services were legacy applications running on-prem. With this architecture:

  1. The memory capacity of the server limited the total number of suggestions
  2. To increase the total number of suggestions, horizontal scaling was implemented using a reverse-proxy, which added network hops and increased latency
  3. Deployment/roll-back was difficult and complicated
  4. Mobile and web requests were unnecessarily separated via different routes

The Approach

We embarked on a journey to upgrade this whole infrastructure to:

  1. Unify the code base and service for e-Commerce and Grocery into a single multi-tenant backend service
  2. Containerize the application and deploy it on the cloud
  3. Make it scalable, without a reverse-proxy, so that the number of suggestions is not limited by the server’s memory footprint
  4. Make it easier to introduce new features without worrying about how the TST is performing, by moving the data to an external distributed cache

The fundamental idea is based on the concept of serializing a Trie. If we can store the Trie in the form of a Hash table for key-value lookups, we can achieve constant time complexity for retrieval. Also note that the time to insert into the Trie is not a concern here, as insertion can be done offline. The important aspect is how quickly we can fetch.

A Trie representation of the data

Let us take a step back and look at the Trie. To look up a prefix of length N (‘app’ → N = 3, ‘apri’ → N = 4, and so on), we need to go N levels deep, at O(N). Then we need to find all the complete suggestions under that node (assuming there are M of them, each with an average length of N + K), at O(MK). Assuming K has a fixed upper bound, this becomes O(M). Finally, we want to sort them to show the top S (≤ M), at O(M log M). The sorting is based on the score tied to every suggestion.

So irrespective of how much time we need to build a Trie, fetching in real-time takes O(N) + O(M) + O(M log M). As the data grows, this has both memory and latency impact.
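For intuition, here is a minimal Java sketch of what such a lookup roughly does; the class and method names are illustrative, not the legacy TST implementation:

```java
import java.util.*;

class TrieNode {
    Map<Character, TrieNode> children = new HashMap<>();
    Integer score; // non-null when this node terminates a complete suggestion
}

class TrieLookup {
    // Fetch the top S suggestions for a given prefix.
    static List<String> suggest(TrieNode root, String prefix, int topS) {
        // 1. Walk down N levels to the node for the prefix: O(N)
        TrieNode node = root;
        for (char c : prefix.toCharArray()) {
            node = node.children.get(c);
            if (node == null) return Collections.emptyList();
        }
        // 2. Collect all M completions under that node: O(M), with bounded suffix length K
        List<Map.Entry<String, Integer>> completions = new ArrayList<>();
        collect(node, new StringBuilder(prefix), completions);
        // 3. Sort by score and keep the top S: O(M log M)
        completions.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(topS, completions.size()); i++) {
            result.add(completions.get(i).getKey());
        }
        return result;
    }

    static void collect(TrieNode node, StringBuilder path,
                        List<Map.Entry<String, Integer>> out) {
        if (node.score != null) out.add(Map.entry(path.toString(), node.score));
        for (Map.Entry<Character, TrieNode> e : node.children.entrySet()) {
            path.append(e.getKey());
            collect(e.getValue(), path, out);
            path.deleteCharAt(path.length() - 1);
        }
    }
}
```

Every keystroke pays for steps 2 and 3, which is exactly the cost the redesign moves offline.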

In reality, on a webpage or mobile device, we only show the top 8 or 10 (S) suggestions. Also, people hardly use autocomplete after typing 20–30 (N) chars. If we can make these things constant, the only factor contributing to big-O is M.

Instead of storing all M suggestions and then sorting them for every fetch, can we presort and store only the top S? That would give us a fetch time of O(1). Looks like we can.

To proceed further we need to have an understanding of data structures like Array, HashTable, BinaryHeap, and PriorityQueue.

The Solution

Building the prefix hash tree

We read the data set which contains suggestions and their scores. In this example, these would be bank:10, bat:20, bag:40, and ball:30. Then we start building a HashTable whose keys are the possible prefixes (b, ba, bal, bat, …). The values mapped to these keys would be a PriorityQueue each, where the queue elements are the suggestions starting with the prefix and prioritized based on the scores. We represent this HashTable as HT<prefix, PQ>. The priority queue uses a MaxHeap underneath and provides data in sorted order when we fetch them one by one. For visualization purposes, the structure looks like the way it is represented in the diagram.
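As a rough illustration (not the production code), a first cut of this step in Java could look like the sketch below, using the four example suggestions; the Suggestion class and builder names are assumptions made for this sketch:

```java
import java.util.*;

class Suggestion {
    final String text;
    final int score;
    Suggestion(String text, int score) { this.text = text; this.score = score; }
}

class PrefixHashBuilder {
    // HT<prefix, PQ>: each prefix maps to a max-heap of suggestions ordered by score.
    static Map<String, PriorityQueue<Suggestion>> build(List<Suggestion> data) {
        Comparator<Suggestion> byScoreDesc = (a, b) -> Integer.compare(b.score, a.score);
        Map<String, PriorityQueue<Suggestion>> ht = new HashMap<>();
        for (Suggestion s : data) {
            for (int len = 1; len <= s.text.length(); len++) {
                String prefix = s.text.substring(0, len);
                ht.computeIfAbsent(prefix, k -> new PriorityQueue<>(byScoreDesc)).add(s);
            }
        }
        return ht;
    }

    public static void main(String[] args) {
        Map<String, PriorityQueue<Suggestion>> ht = build(List.of(
                new Suggestion("bank", 10), new Suggestion("bat", 20),
                new Suggestion("bag", 40), new Suggestion("ball", 30)));
        // "ba" now maps to a queue that yields bag(40), ball(30), bat(20), bank(10) in order.
        System.out.println(ht.get("ba").peek().text); // prints "bag"
    }
}
```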

We use a bounded PQ and only hold on to the top S suggestions we want to show. As we process each suggestion, anything with a lower score drops off the queue, which keeps the heapify time constant since the queue never grows beyond S elements. Also, when computing the prefixes, we only use the first N characters. So instead of generating k prefixes for a suggestion that is k characters long, we always generate min(N, k) prefixes. At this point, the size of the prefix hash tree depends only on the number of suggestions. Keep in mind, our goal is focused on fetching in constant time.
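Continuing the sketch above, one common way to keep only the top S per prefix is a size-limited min-heap that evicts the lowest score on overflow (each eviction is O(log S), so per-suggestion work stays effectively constant). The constants and helper below are illustrative assumptions, not the actual service code:

```java
import java.util.*;

class BoundedPrefixBuilder {
    static final int S = 8;   // assumed cap: top S suggestions kept per prefix
    static final int N = 30;  // assumed cap: longest prefix we bother indexing

    // Reuses the Suggestion class from the previous sketch.
    static void addSuggestion(Map<String, PriorityQueue<Suggestion>> ht, Suggestion s) {
        Comparator<Suggestion> byScoreAsc = Comparator.comparingInt(x -> x.score);
        int maxLen = Math.min(N, s.text.length());       // min(N, k) prefixes per suggestion
        for (int len = 1; len <= maxLen; len++) {
            String prefix = s.text.substring(0, len);
            PriorityQueue<Suggestion> pq =
                    ht.computeIfAbsent(prefix, k -> new PriorityQueue<>(byScoreAsc));
            pq.add(s);
            if (pq.size() > S) pq.poll();                // drop the lowest-scored suggestion
        }
    }
}
```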

Next, we convert HT<Prefix, PQ> to HT<Prefix, Array>. We convert each PriorityQueue to an Array by fetching its items one by one, which takes O(S log S). This fetch does not serve real-time traffic; it is done offline to build the array of top-S pre-sorted suggestions.

Lookup on a HashTable is O(1) time. Fetching a sub-array is also O(1) time. This leads to O(1) time for the lookup of the complete data structure. This HT<Prefix, Array> is better known as a prefix hash tree.
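A sketch of the conversion and the resulting lookup, continuing the bounded min-heap variant above (we reverse after draining so the highest score comes first); again, the names are illustrative:

```java
import java.util.*;

class PrefixHashTree {
    // Drain each priority queue into a pre-sorted array: O(S log S) per prefix,
    // done offline, never on the request path.
    static Map<String, List<String>> fromQueues(Map<String, PriorityQueue<Suggestion>> ht) {
        Map<String, List<String>> tree = new HashMap<>();
        for (Map.Entry<String, PriorityQueue<Suggestion>> e : ht.entrySet()) {
            List<String> sorted = new ArrayList<>();
            PriorityQueue<Suggestion> pq = e.getValue();
            while (!pq.isEmpty()) sorted.add(pq.poll().text); // ascending by score (min-heap)
            Collections.reverse(sorted);                      // highest score first
            tree.put(e.getKey(), sorted);
        }
        return tree;
    }

    // Real-time lookup: one hash lookup, O(1).
    static List<String> suggest(Map<String, List<String>> tree, String prefix) {
        return tree.getOrDefault(prefix, List.of());
    }
}
```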

Finally, we serialize this prefix hash tree into a distributed cache. For our use case, we used Memcached.
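As a hedged example of the write step, here is a minimal sketch using the open-source spymemcached client; the post does not say which client or serialization format the production job uses, so the delimited-string value below is purely illustrative (the real job JSONifies the structure):

```java
import java.net.InetSocketAddress;
import java.util.List;
import java.util.Map;
import net.spy.memcached.MemcachedClient;

class CacheWriter {
    public static void main(String[] args) throws Exception {
        // Hypothetical Memcached endpoint; the real deployment uses replicated instances.
        MemcachedClient client = new MemcachedClient(new InetSocketAddress("localhost", 11211));

        Map<String, List<String>> tree = Map.of("ba", List.of("bag", "ball", "bat", "bank"));

        for (Map.Entry<String, List<String>> e : tree.entrySet()) {
            // Store the pre-sorted top-S array as a simple delimited string (illustrative).
            String value = String.join("|", e.getValue());
            client.set(e.getKey(), 0, value).get();  // 0 = no expiry; wait for the write
        }

        System.out.println(client.get("ba"));        // bag|ball|bat|bank
        client.shutdown();
    }
}
```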

Handling multiple data sets

As mentioned above, we wanted to unify e-Commerce and Grocery and introduce Omni Search, and we realized that the suggestions would vary depending on the experience. For example, searching for ‘apple’ could lead to ‘apple juice’ in Grocery but to ‘apple iPhone’ in e-Commerce, and probably to both in Omni Search.

Similarly, we can have users searching under a specific category (like ‘Electronics’ or ‘Home Improvement’) or across all categories. For such cases, the suggestions would be different.

Lastly, there is the situation where we want to load a different corpus of data with different scores, so suggestions are ranked differently (or are different altogether), and see how these queries perform (AB testing). This would have been a big problem with the in-memory TST, since the memory footprint would nearly double and we would need to scale up the service pods. The legacy version initially didn’t support AB testing for these reasons (horizontal scaling solved the problem to some extent).

Example of a key-value pair from the cache

To solve these problems, we needed to prepend our prefixes with short codes that uniquely identify these combinations. For example, instead of storing the prefix ‘app’ (as in apple), we would store it as ‘aeE~app’ (a identifies the first variant in an AB test, e stands for e-Commerce, and E is for the Electronics category). Similarly, you could have ‘bg0~app’ (b identifies the second variant in an AB test, g stands for Grocery, and 0 is for all categories).
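A tiny helper shows how such composite keys might be assembled; the single-character codes below simply mirror the examples above, while the actual encoding scheme is internal to the service:

```java
class CacheKeyBuilder {
    // variant: AB-test bucket ('a', 'b', ...); tenant: 'e' = e-Commerce, 'g' = Grocery;
    // category: 'E' = Electronics, '0' = all categories.
    static String buildKey(char variant, char tenant, char category, String prefix) {
        return "" + variant + tenant + category + "~" + prefix;
    }

    public static void main(String[] args) {
        System.out.println(buildKey('a', 'e', 'E', "app")); // aeE~app
        System.out.println(buildKey('b', 'g', '0', "app")); // bg0~app
    }
}
```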

Cache Capacity Estimation

The initial proof of concept showed that the prefix hash tree would take more space than the Trie, and we were ready to make that trade-off for the sake of reduced latency and a simpler design.

Dense vs Sparse Tree

However, we observed that in the Trie implementation, space correlates to the density/sparseness of the Tree (a sparse tree would take more space for the same number of suggestions than a dense tree, due to the uniqueness of intermediate nodes). With the prefix hash tree, space correlates with the prefix-length and the number of suggestions.

As we increased the number of suggestions, due to their randomness, the sparseness grew and hence the prefix hash tree gave better space optimization compared to the Trie.

How much space are we talking about to store the HashTables on a distributed cache? We wanted to ensure we had enough provisioned. Since Memcached stores data as key-value pairs, we JSONified the HashTable to compute how much space it would take. We then provisioned triple that space, so we would have enough buffer even when running AB tests. In the future, we can allocate more space on the cache without any impact on the data-rendering side.

Approximately 65 million key-value pairs were stored in 8 GB, which translates to around 125 bytes per pair.
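As a rough sanity check on that figure: 8 GB ÷ 65 million pairs ≈ 8,000,000,000 bytes / 65,000,000 ≈ 123 bytes per pair, which lines up with the ~125-byte estimate above.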

Latency to call remote cache

The data is no longer in a Trie, and mathematically we can achieve O(1) fetch time. This sounds great on paper. In reality, we just moved the data to a remote cache service and introduced network latency!

Service Latency (cache read + network trip)

Anticipating this during our proof of concept, we wanted to ensure that the solution was not worse than the problem. Running performance tests with the data in a distributed cache, we found that the majority of requests were served in under 10 milliseconds, whereas in the past they would take 12–14 milliseconds.

There were a few outliers that took more time due to unrelated factors (like spell correction) and were not related to cache lookup. Diving into those is beyond the scope and context of this discussion.

The cache read was always under 1 millisecond (the network round trip accounted for the remaining 8–9 milliseconds). This meant that even with a network round trip to the cache, it was faster than looking up a huge Trie in memory.

Cache response times (cache read)

We created replicas for geo locality so that the customer-facing services and the distributed cache are co-located. We also ensured redundancy in every geographic location to avoid outage in case one cache instance went down.

The Architecture

The writing of data into the distributed cache is done by a Cron job and is decoupled from the read-service that renders the data to the Front End. This is a complete separation of responsibilities: the read-service can focus on serving data quickly and on introducing other features without worrying about their impact on the data, while the write-job can focus on updating the cache at a scheduled time of day.

The new Autocomplete is hosted on the cloud, running in containers managed by Kubernetes. It comprises two services:

The new Autocomplete Architecture

The writer service runs once a day and populates the cache from the data file. This loads data to only one replica of the cache, which then self replicates across different geographies (and redundancies).

The reader service is the Front End facing service that serves real-time traffic in near-constant time and pulls data from the nearest cache.
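Putting the pieces together, the reader’s hot path is essentially one key construction plus one cache get. The sketch below reuses the hypothetical helpers from the earlier sketches and leaves out everything the real service also handles (AB-test assignment, spell correction, fallbacks, and so on):

```java
import java.util.Arrays;
import java.util.List;
import net.spy.memcached.MemcachedClient;

class AutocompleteReader {
    private final MemcachedClient cache;

    AutocompleteReader(MemcachedClient cache) { this.cache = cache; }

    // Serve one keystroke: build the composite key, do a single O(1) cache get,
    // and split the pre-sorted value back into the top-S suggestions.
    List<String> suggest(char variant, char tenant, char category, String typed) {
        String key = CacheKeyBuilder.buildKey(variant, tenant, category, typed.toLowerCase());
        Object value = cache.get(key);               // sub-millisecond cache read
        if (value == null) return List.of();
        return Arrays.asList(((String) value).split("\\|"));
    }
}
```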

Other details like CDN caching, GSLB, and deployment infrastructure have been left out for simplicity.

What Next?

We can further optimize the data generation by skipping the intermediate file generation process. Data can be pulled from upstream and directly fed into the distributed cache. This can be done in real-time and on-demand.

We can also make the cache ‘location-aware’, meaning the suggestions we show can be based on the geographic location being served. For example, Alaska and Texas might have different requirements for ‘winter garments’.

Conclusion

With this re-architecture, we were able to scale the application without compromising on latency or performance; eliminating the reverse-proxy led to fewer network hops and fewer services.

We were also able to separate the data-rendering service from the data generation/insertion service, making it easier to introduce new features without memory implications and giving us a complete separation of responsibilities.
