Secrets of Tokopedia’s Ads Engine’s Success in WIB Event — Part 1
Waktu Indonesia Belanja or WIB is a monthly shopping festival that Tokopedia organizes offering many discounts and flash sales across the site. Always on the lookout for greater successes, Tokopedia’s Browse Ads team took up the challenge to face the WIB event.
We show ad recommendations on many browse pages across the Tokopedia platform. But providing a best-in-class experience to a high number of concurrent users is a challenging task. We decided to make multiple improvements and optimizations in our system to be able to serve that amount of traffic!
A look at the improvements we achieved:
- Reduced ElasticSearch latency and CPU usage to half (yes, you read that right. To half!)
- Increased the Nginx cache hit ratio from 15% to 95%
- Reduced Redis resource utilization
We also made a few more tweaks to our system, such as in-memory caching and changes to the Nginx caching configuration. This allowed us to minimize resource upscaling and maximize output from our existing setup!
In this article, I will spill the beans on how we optimized our ElasticSearch queries to improve CPU usage and reduce query latency!
For other improvements, look out for Part 2 coming out soon!!
ElasticSearch Query Optimizations
Do you use ElasticSearch as your database?
Will it interest you to know how small improvements can give massive results?
If your answer to the above questions is yes, hop on the bandwagon, my friend!
ElasticSearch (ES) is the main database that we use to fetch all ad content, with Nginx as a reverse proxy server. Thus, optimizing the performance of ES queries becomes our foremost priority.
To take on the upcoming challenge, we have done several improvements to reduce the latency of ES queries and the ES CPU usage while also improving the Nginx caching of the queries. Let’s go through those one-by-one.
1. Change field type to keyword
Did you know we can cut the ES query response time to almost half with just a type change?
Then let me take you through the deets!!
There are many fields in our ES documents which are of the type integer or long but which are used only for term queries.
ES indexes numeric data types in a structure that is optimized for range queries, so a term query on a numeric field is slower than a direct term match on a keyword field. If we need some fields only for term queries, we can change their type to keyword and achieve great performance improvements. Yes, it's that simple!
How to change the field type to keyword easily?
ES provides the mapping parameter “fields” which is useful to index the same field in different ways for different purposes. Utilizing this feature we changed the type of all fields which are only used in term queries to keyword type in our documents.
Let’s take an example:
Here, we have a field id which is of type long. We have also used the fields parameter to add a sub-field, optimized, of type keyword.
Now, we can query based on id.optimized instead of on id.
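The mapping and query described above can be sketched as follows. This is a minimal illustration, not our production mapping: it assumes the typeless mapping layout of ES 7 and later, and the helper name term_query_on_keyword and the sample value 12345 are made up for the example.

```python
import json

# Illustrative mapping (assumed ES 7+ typeless layout): `id` stays a long,
# and the multi-field `id.optimized` indexes the same value as a keyword.
mapping = {
    "mappings": {
        "properties": {
            "id": {
                "type": "long",
                "fields": {
                    "optimized": {"type": "keyword"}
                },
            }
        }
    }
}


def term_query_on_keyword(field: str, value: int) -> dict:
    """Build a term query against the keyword sub-field of `field`.

    Hypothetical helper: the `.optimized` suffix matches the sub-field
    name used in the mapping above.
    """
    return {"query": {"term": {f"{field}.optimized": str(value)}}}


query = term_query_on_keyword("id", 12345)
print(json.dumps(mapping, indent=2))
print(json.dumps(query))
```

Documents that were indexed before the sub-field was added need to be re-indexed before id.optimized returns matches for them.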
2. Adding Query Routing
What is optional but has loads of benefits if used properly? Query Routing!!!
Our ES index documents are being routed on ad_id. Custom routing can reduce the impact of searches. Instead of having to fan out a search request to all the shards in an index, the request can be sent to the shard that matches the specific routing value.
What are the changes needed?
We added routing to all queries which filter based on ad_id. This reduced the number of shards the ES query is being sent to, thereby improving the performance of the ES query.
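As a rough sketch of what this looks like on the request side, the search API accepts a routing query parameter on the _search endpoint. The base URL, the index name "ads", and both helper names below are assumptions made for illustration only.

```python
from typing import Optional
from urllib.parse import urlencode


def search_url(base_url: str, index: str, ad_id: Optional[int] = None) -> str:
    """Return the _search URL, adding ?routing=<ad_id> when one is given.

    Hypothetical helper: without an ad_id the request fans out to all shards.
    """
    url = f"{base_url}/{index}/_search"
    if ad_id is not None:
        url += "?" + urlencode({"routing": str(ad_id)})
    return url


def ad_query(ad_id: int) -> dict:
    """Routing only narrows the shards searched, so the filter is still needed."""
    return {"query": {"bool": {"filter": [{"term": {"ad_id": ad_id}}]}}}


print(search_url("http://localhost:9200", "ads", ad_id=42))
print(ad_query(42))
```

The routing value only selects which shard receives the request. Other documents can live on the same shard, which is why the query body keeps its ad_id filter.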
Too much technical detail, bored already?
Jump to the results at the end of this article and come back if you find them convincing!
3. Adding Constant Query Timeout
How to deal with variations in queries while caching?
Baby steps, I would say..
Many of our ES queries have varying timeout values. ES queries are being cached on Nginx as a string. This leads to a poor Nginx cache hit ratio because each query has a different timeout.
What to do (I think you already know!)
To improve this, we set the timeout field to the same constant value in all queries. Once every query carries the same timeout value, two queries that are otherwise identical produce the same string, which improves the Nginx cache hit ratio.
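The effect on caching can be illustrated with a small sketch. The cache_key function below is only a stand-in for however Nginx derives its key from the request body, and the constant "100ms" is a made-up value, not the one we use.

```python
import hashlib
import json

CONSTANT_TIMEOUT = "100ms"  # hypothetical value chosen for the example


def with_constant_timeout(query: dict) -> dict:
    """Return a copy of the query with the timeout forced to one constant."""
    normalized = dict(query)
    normalized["timeout"] = CONSTANT_TIMEOUT
    return normalized


def cache_key(query: dict) -> str:
    """Stand-in for a cache key computed from the serialized request body."""
    body = json.dumps(query, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


# Same filters, different timeouts: two distinct cache entries.
q1 = {"timeout": "35ms", "query": {"term": {"ad_id": 7}}}
q2 = {"timeout": "80ms", "query": {"term": {"ad_id": 7}}}
print(cache_key(q1) == cache_key(q2))

# After normalizing the timeout, both map to the same cache entry.
print(cache_key(with_constant_timeout(q1)) == cache_key(with_constant_timeout(q2)))
```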
4. Removing “_name” Parameter In Queries
Tiny things we might miss but which can have big impacts!
Ads are being shown on many browse pages, each using a different combination of algorithms to fetch ads. These algorithms can be common across different pages. We create each ES query with the “_name” parameter to identify the page and algorithm name it belongs to.
Suppose the page name is “pageA”.
Suppose the algorithm name is “algoB”.
Then the “_name” parameter for its query becomes “pageA_algoB”
This limited the caching of this query on Nginx, since even if page A uses the same algorithm as page B, its query becomes different because of the _name parameter.
What did we do to improve this?
To improve the Nginx cache hit ratio, we need to ensure that as many of our queries with the same filters as possible are identical. For this, we removed the _name parameter from the queries. This ensures that queries coming from a common algorithm look the same, irrespective of the page they correspond to.
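A minimal sketch of the idea, assuming the name is attached in the long form of a term query: the helper strip_name and the sample queries are hypothetical, and in practice it may be simpler to never add _name when building the query.

```python
import json


def strip_name(node):
    """Recursively drop every "_name" key from a query body (hypothetical helper)."""
    if isinstance(node, dict):
        return {k: strip_name(v) for k, v in node.items() if k != "_name"}
    if isinstance(node, list):
        return [strip_name(item) for item in node]
    return node


def make_query(page: str, algo: str) -> dict:
    """Build the same filter for any page, tagged with a page-specific _name."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"ad_id": {"value": 7, "_name": f"{page}_{algo}"}}}
                ]
            }
        }
    }


page_a = make_query("pageA", "algoB")
page_c = make_query("pageC", "algoB")

# Different strings before stripping, identical strings after.
print(json.dumps(page_a) == json.dumps(page_c))
print(json.dumps(strip_name(page_a)) == json.dumps(strip_name(page_c)))
```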
Now time for the big results!!
ES Query latency decreased from avg 25ms to avg 13ms.
ES CPU Usage decreased from avg 30–40% to avg 11–13%.
We achieved these improvements in ElasticSearch query latency and CPU usage with just a few small tweaks.
What did we learn? Our system itself has the answer to many of our problems!
Hopefully, our learning will contribute to improvements in your system as well.
But this is not the end, my friend! We have Part 2 coming up shortly, where I will explain how we increased the Nginx cache hit ratio to 95% and also improved resource utilization. Stay tuned!!