How Tokopedia Ranks Millions of Products on the Search Page

It has been 11 years since Tokopedia was established, and we have evolved tremendously since then. One of the biggest evolutions was our search feature, which has grown from a simple database (DB) query into the complex system we have today. Every day, the rapidly growing number of products to search poses new challenges to keeping our search experience fast and relevant. In this article, we will share how we manage that; specifically, how we rank a product catalog whose size keeps growing over time.

A Brief Journey of Tokopedia Search Ranking

Starting from scratch: the “LIKE” syntax
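
The article doesn’t show the original query, but a first-generation search built directly on the DB typically boils down to a substring match. A minimal sketch in Go (the products table, name column, and Postgres driver are assumptions, not Tokopedia’s actual schema):

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // assumed Postgres driver; the actual engine is not stated here
)

// searchProducts sketches a first-generation "LIKE" search: a substring
// match pushed straight to the DB. A leading-wildcard LIKE cannot use a
// regular B-tree index and carries no notion of relevance, which is why
// this approach stops scaling as the catalog grows.
func searchProducts(db *sql.DB, keyword string) ([]string, error) {
	rows, err := db.Query(
		`SELECT name FROM products WHERE name LIKE '%' || $1 || '%' LIMIT 20`,
		keyword,
	)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	var names []string
	for rows.Next() {
		var name string
		if err := rows.Scan(&name); err != nil {
			return nil, err
		}
		names = append(names, name)
	}
	return names, rows.Err()
}

func main() {
	db, err := sql.Open("postgres", "postgres://localhost/shop?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	names, err := searchProducts(db, "iphone 12")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(names)
}
```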

Indexing FTW, but…

Apparently, keyword similarities are not sufficient for Tokopedia’s needs. Unlike the web pages a search engine crawls, Tokopedia has a lot of items (products) with identical names sold by countless sellers (e.g. thousands of sellers sell the iPhone 12 on Tokopedia). Facing this, the main question arises:

If multiple products have identical names, which one of them should be on the top?

This is an intricate question to answer, since we need to consider the buyer’s, the seller’s, and even Tokopedia’s perspectives to decide which product should get the spotlight and which should not. After a lot of brainstorming and discussion, we eventually agreed to create our own product scoring system dedicated to our organic search ranking. This gave us the capability to create a better search ranking experience for all of our stakeholders.
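
The exact formula isn’t disclosed here, but conceptually such a score blends signals representing each stakeholder. A minimal sketch with entirely hypothetical features and weights:

```go
package main

import "fmt"

// ProductSignals groups hypothetical per-product features; the real
// feature set and weights used at Tokopedia are not disclosed.
type ProductSignals struct {
	KeywordSimilarity float64 // relevance to the query, 0..1
	SalesVelocity     float64 // buyer perspective: proven demand, 0..1
	SellerQuality     float64 // seller perspective: service level, 0..1
	PlatformHealth    float64 // Tokopedia perspective: policy/quality, 0..1
}

// OrganicScore combines the signals into one ranking score so that
// products with identical names can still be ordered deterministically.
func OrganicScore(s ProductSignals) float64 {
	// Hypothetical weights; in practice these would be tuned or learned.
	return 0.40*s.KeywordSimilarity +
		0.25*s.SalesVelocity +
		0.20*s.SellerQuality +
		0.15*s.PlatformHealth
}

func main() {
	a := ProductSignals{KeywordSimilarity: 1.0, SalesVelocity: 0.9, SellerQuality: 0.7, PlatformHealth: 0.8}
	b := ProductSignals{KeywordSimilarity: 1.0, SalesVelocity: 0.4, SellerQuality: 0.9, PlatformHealth: 0.9}
	fmt.Printf("A: %.3f, B: %.3f\n", OrganicScore(a), OrganicScore(b)) // A: 0.885, B: 0.815
}
```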

Architecture

We designed this architecture because several features move swiftly (e.g. buyer behaviours, sales velocities, real-time user feedback), while others, in contrast, change over a (much) longer timescale. In general, we can visualize the system like this:

diagram 1. Tokopedia Ranking Services Architecture (the bold one)

Since offline features move much more slowly than real-time ones, we optimize our update cycles around this classification: offline features are updated less frequently than real-time features. This also reduces the traffic hitting the feature DBs, which are owned by other teams across Tokopedia. We want to prevent unwanted impact on those DBs from unnecessarily heavy traffic, especially during business hours.
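
As a rough illustration of the split, imagine each feature class carrying its own refresh cadence, so slow-moving features hit their owning team’s DB far less often. A hypothetical sketch (the feature names and cadences below are invented for illustration):

```go
package main

import (
	"fmt"
	"time"
)

// FeatureClass decides how often a feature is refreshed from its owning
// team's DB. All names and durations here are illustrative only.
type FeatureClass struct {
	Name    string
	Refresh time.Duration
}

var (
	realTimeFeatures = []FeatureClass{
		{"buyer_behaviour", 1 * time.Minute},
		{"sales_velocity", 5 * time.Minute},
		{"user_feedback", 1 * time.Minute},
	}
	offlineFeatures = []FeatureClass{
		{"shop_reputation", 24 * time.Hour}, // refreshed outside business hours
		{"catalog_quality", 24 * time.Hour},
	}
)

// schedule refreshes each feature on its own cadence: one bounded call per
// cycle keeps traffic to the owning team's DB predictable and low.
func schedule(features []FeatureClass, fetch func(name string)) {
	for _, f := range features {
		f := f
		go func() {
			for range time.Tick(f.Refresh) {
				fetch(f.Name)
			}
		}()
	}
}

func main() {
	fetch := func(name string) { fmt.Println("refreshing", name, "at", time.Now().Format(time.Kitchen)) }
	schedule(realTimeFeatures, fetch)
	schedule(offlineFeatures, fetch)
	select {} // block forever; in a real service this would be the server loop
}
```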

How We Survived Millions of Active Products Every Day

Surviving at this scale boiled down to two questions:

  1. How do we efficiently calculate millions of product scores with fast-moving features in real-time ranking?
  2. How do we efficiently calculate millions of product scores periodically, so the job doesn’t take too long or disrupt traffic to the feature DBs?

Efficient Real-Time Ranking

For this part, we have a very simple solution: reduce the data to a smaller size (yes, that’s it!). At Tokopedia, we first classify high-quality products, and we then focus on finding the ideal positions for them. Given the competitive nature of search result pages, identifying such products is crucial for optimizing our real-time ranking system. Even so, we are always looking for other ways to filter our products strategically, since the number of active products grows every day.
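
In other words, the real-time scorer only ever sees a pre-filtered, high-quality candidate set rather than every active product. A minimal sketch of that filtering idea (the quality signal and threshold are assumptions; the real classification is more involved):

```go
package main

import "fmt"

type Product struct {
	ID      int64
	Quality float64 // hypothetical precomputed quality signal, 0..1
}

// highQualityCandidates shrinks the data the real-time ranker must score.
// The threshold is illustrative; it is the shrinking itself that keeps
// real-time ranking fast as the active catalog grows.
func highQualityCandidates(all []Product, threshold float64) []Product {
	out := make([]Product, 0, len(all))
	for _, p := range all {
		if p.Quality >= threshold {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	all := []Product{{1, 0.95}, {2, 0.40}, {3, 0.81}, {4, 0.66}}
	fmt.Println(highQualityCandidates(all, 0.7)) // only products 1 and 3 reach real-time scoring
}
```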

Note: cutting the data size may not be this easy in other circumstances, and that is completely understandable, since every system has different needs, problems, and architectures.

Efficient Offline Ranking

Stage 1: Concurrent Calculation

We remember a time when we needed to scale up our offline ranking service to cater to some needs, and that is when disaster came in… When we applied the upscaling rules, the instances became “unstable”. Furthermore, because scale-ups and scale-downs happened randomly, the instance running the updates might accidentally be scaled down, forcing us to re-run the updates on another instance. Thankfully, we were able to continue the progress; imagine if we couldn’t, every day would have been monstrous for us 😦.

Stage 2: Worker-Based Systems

Seems perfect, so why is there a stage 3? If you’re wondering, you’re sharp! Despite looking perfect for our use case, it unfortunately still did not cater to our needs: the data size was far too big for our service to process. After diving deep into our code, we found that we had been careless when building stage 1. Earlier in this article, we emphasized that we only need to update as few products as possible (in this case, active products only), but in reality, that was not happening in stages 1 and 2.

When we design a concurrent/worker-based system, we should determine how to split the data into smaller chunks for the workers to process efficiently. In stage 1, we took a shortcut: we fetched the max product id in the DB and iteratively processed from id 1 up to that max id, creating a batch for every X product ids (X is configurable; to simplify the example, we will use 100).

The drawback of this approach is that we don’t actually create batches of 100 active products: among the 100 ids, the service doesn’t know which are active and which are inactive, so we may waste an entire DB query on a batch of inactive products!
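
To make the drawback concrete, here is roughly what that stage-1 job generation looked like (a simplified sketch, not the production code):

```go
package main

import "fmt"

// naiveBatches reproduces the stage-1 shortcut: walk ids from 1 up to the
// max id found in the DB, cutting a batch every batchSize ids. Nothing here
// knows whether an id is active, so a batch can consist entirely of
// inactive products and still cost a full DB query.
func naiveBatches(maxID, batchSize int64) [][2]int64 {
	var batches [][2]int64
	for start := int64(1); start <= maxID; start += batchSize {
		end := start + batchSize - 1
		if end > maxID {
			end = maxID
		}
		batches = append(batches, [2]int64{start, end}) // inclusive [start, end] id range
	}
	return batches
}

func main() {
	// With maxID 350 and batches of 100: [1,100] [101,200] [201,300] [301,350].
	// If ids 101..200 are all inactive, that whole query is wasted work.
	fmt.Println(naiveBatches(350, 100))
}
```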

Another challenge is that a product can have variants. In short, variants are aggregations of similar products that differ in attributes such as color, size, internal storage, etc. Since we don’t want to break the variant experience in our search results, we need logic that treats such products fairly.

To score variants fairly, we need to ensure a product and its variants land in the same batch during updates. However, because we iterate product ids sequentially, we have no information on whether a variant was already updated in a previous batch. For example, say id 1 has a variant with id 101. When processing the first batch, we pull id 101 into it. When the iteration later reaches id 101’s batch, we update id 101 again and pull id 1 into that batch too, because it is a variant of 101. In this scenario, variants can be updated multiple times during a single run, and that is also a waste of DB queries!
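
One simple guard against these repeated updates is to remember which ids have already been scheduled, so pulling a variant into an earlier batch removes it from later ones. A sketch of that idea (variantsOf is a hypothetical stand-in for however variant groups are actually resolved):

```go
package main

import "fmt"

// buildBatches walks product ids in order, pulls each product's variants
// into the same batch, and skips ids that an earlier batch already claimed,
// avoiding the duplicate variant updates described above.
func buildBatches(ids []int64, variantsOf func(int64) []int64, batchSize int) [][]int64 {
	seen := make(map[int64]bool)
	var batches [][]int64
	var cur []int64
	for _, id := range ids {
		if seen[id] {
			continue // already scheduled alongside a variant in an earlier batch
		}
		group := append([]int64{id}, variantsOf(id)...)
		for _, g := range group {
			if !seen[g] {
				seen[g] = true
				cur = append(cur, g)
			}
		}
		if len(cur) >= batchSize {
			batches = append(batches, cur)
			cur = nil
		}
	}
	if len(cur) > 0 {
		batches = append(batches, cur)
	}
	return batches
}

func main() {
	variantsOf := func(id int64) []int64 {
		switch id { // ids 1 and 101 are variants of each other, as in the example
		case 1:
			return []int64{101}
		case 101:
			return []int64{1}
		}
		return nil
	}
	ids := make([]int64, 0, 200)
	for i := int64(1); i <= 200; i++ {
		ids = append(ids, i)
	}
	batches := buildBatches(ids, variantsOf, 100)
	fmt.Println("batches:", len(batches)) // id 101 is scheduled exactly once
}
```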

These two drawbacks gave us a hard time: our latest updates with this method took up to 96 hours to finish! (including the time we needed to pause the updates because they were disrupting other teams during business hours)

Stage 3: Smarter Jobs Generation

diagram 2. Final architecture of our offline rank service

While researching this approach, we found that iterating over ids in Elasticsearch is possible with the Scroll API. However, our set of active products is so massive that using the Scroll API is risky, as stated in the documentation:

We no longer recommend using the scroll API for deep pagination. If you need to preserve the index state while paging through more than 10,000 hits, use the search_after parameter with a point in time (PIT).

Thus, we took that advice and used the Search After API instead. After a whole month of research, we were finally confident enough to deploy this solution to production, and the result was remarkable: updates that used to take up to 96 hours now finish in approximately 24 hours (4x faster). We were thrilled with this result, since the offline updates play an important role in our search ranking experience, despite covering the slow-moving features.
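
For reference, paging with search_after plus a point in time (PIT) boils down to a small loop of raw Elasticsearch calls: open a PIT, search with a sort, then repeat the search with the last hit’s sort values. The Go sketch below issues them over plain HTTP (the host, the products index, and the is_active and product_id fields are assumptions):

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

const es = "http://localhost:9200" // assumed cluster address

// post sends a JSON request to Elasticsearch and decodes the JSON response.
func post(path string, body, out any) error {
	var buf bytes.Buffer
	if body != nil {
		if err := json.NewEncoder(&buf).Encode(body); err != nil {
			return err
		}
	}
	resp, err := http.Post(es+path, "application/json", &buf)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("elasticsearch returned %s", resp.Status)
	}
	return json.NewDecoder(resp.Body).Decode(out)
}

func main() {
	// 1. Open a point in time so every page sees one consistent index state.
	var pit struct {
		ID string `json:"id"`
	}
	if err := post("/products/_pit?keep_alive=1m", nil, &pit); err != nil {
		log.Fatal(err)
	}

	var searchAfter []any
	for {
		// 2. Page through active products. PIT searches omit the index from
		//    the path; Elasticsearch adds an implicit _shard_doc tiebreaker.
		req := map[string]any{
			"size":  1000,
			"query": map[string]any{"term": map[string]any{"is_active": true}},
			"pit":   map[string]any{"id": pit.ID, "keep_alive": "1m"},
			"sort":  []any{map[string]any{"product_id": "asc"}},
		}
		if searchAfter != nil {
			// 3. Resume from the sort values of the previous page's last hit.
			req["search_after"] = searchAfter
		}
		var res struct {
			PitID string `json:"pit_id"`
			Hits  struct {
				Hits []struct {
					Sort []any `json:"sort"`
				} `json:"hits"`
			} `json:"hits"`
		}
		if err := post("/_search", req, &res); err != nil {
			log.Fatal(err)
		}
		if len(res.Hits.Hits) == 0 {
			break // no more active products
		}
		pit.ID = res.PitID // the PIT id may change between pages
		searchAfter = res.Hits.Hits[len(res.Hits.Hits)-1].Sort
		fmt.Println("dispatched a batch of", len(res.Hits.Hits), "active products")
	}
}
```

Unlike scrolling, this only ever touches the active products we actually need, which is what keeps job generation proportional to useful work.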

Lesson learned: always process important data only, and re-check what counts as important regularly, because what is important now might not be in the future.

Ongoing and Future Improvements

  • More iterative ranking models: we are experimenting with various models, such as counterfactual deep learning, where we optimize how features interact with each other.
  • Personalized features: we are exploring personalized features to create a unique search experience for each user.

A lot happens every hour in Tokopedia Search. If you’re interested in joining us to build the best ecosystem for our users, visit https://www.tokopedia.com/careers/jobs/ and let us know your interest!

References

Elasticsearch Scroll API (https://www.elastic.co/guide/en/elasticsearch/reference/current/scroll-api.html)

Elasticsearch Search After API (https://www.elastic.co/guide/en/elasticsearch/reference/current/paginate-search-results.html#search-after)
