It’s been 11 years since Tokopedia was established and we have had tremendous evolution since. One of the biggest evolutions was our search feature, where it has evolved from a simple database (DB) query to a complex system like we have today. Everyday, we face a new challenge with the rapidly growing number of products to search while keeping our search experience fast and relevant. In this article, we will share how we managed to keep our search fast and relevant, to be specific, how we rank our products which numbers keep growing overtime.
A Little Bit Journey of Tokopedia Search Ranking
Start from scratch: the “LIKE” syntax
Our initial search feature began with “LIKE” DB query syntax. Most of us already know it is probably the most straightforward way to search products from our DB. Yet with the growing number of products, DB queries will start to slow down to the point where we should consider implementing an index to our search.
Indexing FTW, but…
Jumping to recent years, we successfully implement index (Elasticsearch) as a core of our search service. Until now, we are incredibly helped by Elasticsearch features, it’s fast and had a lot of powerful queries to support our user needs. However, there was one problem that still bugged us even after the implementation.
Apparently, keyword similarities are not sufficient for Tokopedia needs. Unlike web crawlers, there are a lot of items (products) with the same name sold by countless sellers in Tokopedia (e.g. Iphone 12, thousands of sellers are selling it in Tokopedia). Facing this, the main question starts to rise:
If multiple products have identical names, which one of them should be on the top?
This is an intricate question to answer since we need to consider buyer, seller, and even Tokopedia’s perspectives, to decide which product should get the spotlight and which one should not. A lot of brainstorming and discussions happened. Eventually, we agreed to create our own product scoring system dedicated to our organic search ranking. This way, we had a capability to create a more comfortable search ranking experience for all of our stakeholders.
Equivalent with our Elasticsearch journey, our scoring system also had an astounding journey from a single programmatic function to multiple services. All of these services could be categorized into two, offline and real-time ranking. In simple terms, offline ranking service will calculate base-scores before we index products and real-time ranking will do post-processing after we index products into Elasticsearch.
We decided to create this architecture since there are several features that move swiftly, e.g. buyer behaviours, sales velocities, real-time user feedback, etc. While the others, in contrast, are moving in a (lot) longer duration. In general, we could visualize the system like this:
Since offline features are moving much slower than the real-time one, we optimize our update cycle through these classification where offline features are updated less frequently compared to real time features. This is also a way to lessen the traffic coming into features DBs, which are owned by other teams across Tokopedia. We want to prevent unwanted impacts to the DBs due to unnecessarily huge traffic coming in, especially during business hours.
How We Survived Millions of Active Products Everyday
That being said, the architecture is still not answering the title of this article. We still need to figure out:
- How to efficiently calculate millions of product scores with rapid-moving features on real-time ranking?
- How to efficiently calculate millions of product scores periodically so it won’t take too much time and ended up bothering DBs traffic?
Efficient Real-Time Ranking
We, engineers, should be aware that no matter how fast your programs are, if they process humongous data, they will still take a lot of time to finish (relatively, of course). As we want these scores to be real-time, we don’t want it and we need to find a workaround for this.
For this part, we have a very simple solution, which is to reduce the data size into smaller one (yes, that’s it!). In Tokopedia, we first try to classify high quality products where we will then focus on finding the ideal position for them. With the competing nature of search result pages, identifying such products is crucial for optimizing our real-time ranking system. Despite that, we are always looking for other ways to filter our product strategically, since active products are expanding in numbers each day.
Note: However, we bet that it’s not that easy to cut data size in other circumstances and that is completely understandable, since each system has different needs, problems, and architectures.
Efficient Offline Ranking
This part is honestly a tougher challenge than the real-time one. Even though offline rankings move with a slower pace, the growing number of products don’t. Due to some circumstances, unlike the real-time rankings, we can’t filter the number of data here. The major reason for this is some of the offline features are mandatory to be updated every certain period, where we risk misrepresenting the product quality otherwise. To cater this need, we have a solution which we will explain in three stages throughout our experiences
Stage 1: Concurrent Calculation
The first stage of our development for this periodical update was to do concurrent calculations. Since our service was built in Golang, we could easily implement concurrent processes with the help of channels. At our initial stage, we do all of these calculations in a single instance (yes, what a strong will it has, until we realize it doesn’t, 😆). We started to realize that this solution did not work in less than 6 months.
We remembered the time in the past where we needed to upscale our offline ranking service to cater some needs, and thus the disaster coming in… When we apply the upscale rules, the instances become “unstable”. Furthermore, because upscale and downscale happened randomly, the instance that was running the updates might accidentally downscaled and we need to re-run the updates on another instance. Thankfully we were able to continue the progress, imagine that we can’t; every day will be monstrous for us 😦.
Stage 2: Worker-Based Systems
To tackle this serious problem, we focus on how we can distribute the jobs to multiple servers, without annihilating the resume capability. To support these requirements, we considered creating a worker-based system by utilizing messaging platforms (e.g. Kafka). The initial idea was sending the jobs as a form of message and let all of the instances consume it as workers. In the end, we ended up utilizing Redis to create our own worker-based system (see diagram 2 below), because it’s more convenient for our condition.
Seems perfect, why is there stage 3 then? If you’re wondering, you’re sharp! Upon its perfection for our use case, unfortunately, it still does not cater our needs. The data size is way too big for our service to process. After we dive deep into our code, we found that we were careless when building stage 1. In this article, we emphasized that we only need to update as few products as possible (in this case, update active products only) but in reality, that was not happening in the stage 1 and 2.
When we design a concurrent/worker based system, we should determine how we will split the data into smaller chunks for the workers to process efficiently. At stage 1, we took a shortcut on it by getting the max product id in DB, and iteratively processed from id 1 until that max id, creating a batch of every X (it’s a configurable number, to simplify the example, we will use 100) product ids.
The drawback of this system is that we don’t create a batch of 100 active products. Because among the 100 ids, the service doesn’t know which one are active and which one are inactive. So there’s a possibility that we waste a DB query with a batch of inactive products!
Another challenge is that a product could have variants. In short, variants are aggregation of similar products containing different features such as colors, size, internal storage, etc. Since we don’t want to break the variants’ experience in our search results, we need to create logic that gives fairness to such products.
As a way to measure fairness between variants, we need to ensure that products and variants are in the same batch during updates. However, we have no information if a variant is already updated in previous batches or not because we iterate product ids sequentially. For example: id 1 has a variant with id 101. We insert id 101 to the first batch. In the next batch, we will keep updating id 101 and insert id 1 in this batch because it’s a variant of 101. With this scenario, there’s a possibility that variants will be updated multiple times during the updates, and that’s also a waste of DB queries!
These two drawbacks apparently give us a hard time because our latest updates with this method took up to 96 hours to finish! (including the time we need to pause the updates because it’s bothering other teams’ business hours)
Stage 3: Smarter Jobs Generation
Here in stage 3, we figured out this misbehaviour and started to find a way to filter out variants and maximize the batches usage. Since we only need to update active products, we suddenly realized that we already stored the data somewhere… that’s our Elasticsearch! Our Elasticsearch could guarantee that all products are active. Thus, we could utilize it to get a list of active product ids and split it correctly into batches without worrying which id is inactive. Fortunately, we also store variants information in our Elasticsearch, so we could remove duplication of variant updates among batches.
While doing research on this approach, we found out that iterating ids in Elasticsearch is possible with Scroll API. However, our active products are so massive that using Scroll API is dangerous as stated in the documentation:
We no longer recommend using the scroll API for deep pagination. If you need to preserve the index state while paging through more than 10,000 hits, use the search_after parameter with a point in time (PIT).
Thus, we take that solution and use the Search After API instead. After a whole month of researching this approach, we were finally confident enough to deploy this solution into production and the result was hilarious! Our updates that were taking up to 96 hours to finish now have become approximately 24 hours only (4x faster) to finish. We were thrilled to achieve this result, since the offline updates play an important role in our search ranking experiences despite the slow-paced features.
Lesson learned: always process important data only. Check the data importance consistently because what important now might not be anymore in the future
Ongoing and Future Improvements
This whole system is only the beginning of some more exhilarating improvements which we are actively reviewing and experimenting with right now. Some of the improvements are:
- More Iterative Ranking Model, we are experimenting with various models, such as counterfactual deep learning, where we optimize how features would interact with each other.
- Personalized Features, we are exploring personalized features to create unique search experiences for each user.
A lot of things happened in every hour of our daily lives in Tokopedia Search. If you’re interested in joining us to develop the best ecosystem for our users, visit us on https://www.tokopedia.com/careers/jobs/ and let us know your interest!
Golang Channels (https://tour.golang.org/concurrency/2)
Elasticsearch Scroll API (https://www.elastic.co/guide/en/elasticsearch/reference/current/scroll-api.html)
Elasticsearch Search After API (https://www.elastic.co/guide/en/elasticsearch/reference/current/paginate-search-results.html#search-after)