Seeing is believing. With that in mind, we always want to know exactly what is going on in our applications before we settle on the best solution for our domain. "Min-maxing" an application consumes a lot of tinkering time. That is where the challenge of software craftsmanship comes in: we often build a product without much preparation for the scalability involved, usually in the early stages.
Scaling will always be a classic problem for any tech company. Tokopedia, as one of the leading marketplaces in Indonesia, embraces scaling as a daily dose for its engineers. In early 2018 we were given a scaling challenge: we had to stretch our service's capacity to support peak usage during our big promo, Ramadhan Ekstra. Yes, in less than 3 months we had to prepare to handle roughly 10 times our normal traffic!
And how did we do it? Every team has its own story, but I will tell you how we managed it in Tokopedia's Discussion Feature.
Discussion Feature Overview
In Tokopedia, tons of products are sold by a vast number of sellers. The Discussion Feature lets a buyer communicate with a seller about a product that interests them by creating a Talk. The seller can respond by adding a comment to the buyer's question. All of a buyer's and seller's ongoing discussions are listed on the Inbox Talk page.
Keeping fast-growing records of gazillions of discussions, each with all of its comments, was never a small problem. After running for years, the feature developed an ache. Sometimes it felt like forever just to load a discussion. At worst, the Discussion Feature fell into downtime.
Thorough Problem Analysis
Before taking any big step, we first analyzed what was wrong with the service behind the Discussion Feature. What was blocking the service from performing at its best? Was it the queries, the infrastructure, or the code?
Of course, random guessing would not take us anywhere; that is not how you get to know the limits of a service. We are not barbarians: we used measured methods to assess the reliability of the service, tying them back to our most basic problems.
The first problem to solve was that users are impatient. Loading time has to be short enough that users are not annoyed. We applied Open Tracing to track latency for each API, breaking down the duration of each process to make sure no bottleneck or sluggish function kept the service from performing well.
As for the second problem, remember what Thanos said? "This universe is finite. Its resources, finite." The same applies to our server resources. They are limited and have to be used efficiently; server usage has to stay at its cost-wise optimum.
Thankfully, Golang comes with profiling tools that helped us break down our service's resource usage. Our most common use of Golang profiling was to measure which processes consumed most of the server's CPU.
Even though the basic profiling tools would do, our Test Engineers helped us by providing automated profiling at a relatively short interval. (You are the best, guys!)
Houston, We’ve Got Problems!
The moment we applied Open Tracing and ran Golang profiling, we gasped: our service was at risk.
We found multiple processes taking a toll on the CPU, keeping the server load high nearly all the time. We figured the slow APIs were caused by that high CPU usage. Our database administrator also confirmed that the sheer mass of data made queries slow even with optimized indexes. And heck, some legacy code used low-performance logic we had not noticed at first glance.
We had a bunch of options to solve those issues. Graph databases and NoSQL tempted us to remake our database. Sadly, revamping the service to suit a new database was a large undertaking and would of course take time, so much time. So we had no choice but to maximize what we already had, combined with creativity.
Redis to the Rescue!
It is a no-brainer that a service should cache its most frequently used data. A cache prevents requests from fetching data directly from the database.
Use cache for the most common requests
Since Redis has limited storage, Discussion only stores the most essential data in it. For example, we use it on the Product Detail Page (PDP) Talk page. The cache stores the list of Talk IDs belonging to a Product ID and is maintained whenever a Talk is added or deleted.
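The maintenance logic can be sketched with an in-memory map standing in for the Redis structure (the type and key layout here are hypothetical, for illustration only):

```go
package main

import "fmt"

// talkCache maps a Product ID to its cached Talk IDs,
// standing in for the Redis list keyed by product.
type talkCache struct {
	byProduct map[int64][]int64
}

func newTalkCache() *talkCache {
	return &talkCache{byProduct: map[int64][]int64{}}
}

// AddTalk keeps the cache consistent when a new Talk is created.
func (c *talkCache) AddTalk(productID, talkID int64) {
	c.byProduct[productID] = append(c.byProduct[productID], talkID)
}

// DeleteTalk keeps the cache consistent when a Talk is removed.
func (c *talkCache) DeleteTalk(productID, talkID int64) {
	ids := c.byProduct[productID]
	out := ids[:0]
	for _, id := range ids {
		if id != talkID {
			out = append(out, id)
		}
	}
	c.byProduct[productID] = out
}

func main() {
	c := newTalkCache()
	c.AddTalk(1, 100)
	c.AddTalk(1, 101)
	c.DeleteTalk(1, 100)
	fmt.Println(c.byProduct[1]) // [101]
}
```

Updating the cache on every write, rather than invalidating it wholesale, keeps PDP Talk reads hot without extra database round trips.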
Eliminate Wasted Resources
In the first version of the cache, we held the Talk IDs for each user, to be loaded into their Inbox Talk. Through Open Tracing, we found a logical fallacy. A request should return 11 Talks, but frequently the service found fewer than 11 of them in Redis. Since it could not find all of them, it ran a query to fetch all 11 from the database.
OMG, how could we have missed that? No wonder the API was so slow. Every time the API found fewer Talks than expected, it queried the database for all 11.
Instead of querying for all 11 Talks, the service should only query for the missing ones, not every single one. Re-fetching everything meant the data already obtained from Redis was discarded and wasted.
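A sketch of the fix, with maps standing in for Redis and the database (the function and names are hypothetical): keep what the cache already returned, and query only for the IDs it missed.

```go
package main

import "fmt"

// loadTalks returns the content for ids, reading the cache first and
// querying the (mock) database only for the IDs the cache missed.
func loadTalks(ids []int64, cache, db map[int64]string) (map[int64]string, []int64) {
	result := map[int64]string{}
	var missing []int64
	for _, id := range ids {
		if v, ok := cache[id]; ok {
			result[id] = v
		} else {
			missing = append(missing, id)
		}
	}
	// One query for just the missing IDs, instead of re-fetching all of them.
	for _, id := range missing {
		result[id] = db[id]
		cache[id] = db[id] // backfill the cache for next time
	}
	return result, missing
}

func main() {
	cache := map[int64]string{1: "talk-1", 2: "talk-2"}
	db := map[int64]string{1: "talk-1", 2: "talk-2", 3: "talk-3"}
	got, missing := loadTalks([]int64{1, 2, 3}, cache, db)
	fmt.Println(len(got), missing) // 3 [3]
}
```

With this shape, a cache that is 90% warm turns into a database query for only the remaining 10%.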
The Talk IDs from the PDP Talk cache are not enough to fulfill the request; where is the content? Another cache holds that kind of data. Each Talk has a hash cache containing its message, status, and so on. Since there are multiple Talk IDs, the service needs to retrieve the hash corresponding to each one. To do so, it issues a series of queries to Redis: get the hash for Talk ID 1, get the hash for Talk ID 2, and so on.
Each query makes a round trip to Redis that takes a short time to complete. One round trip seems small, but since it happens millions of times per day, it cannot be ignored. Imagine the resources depleted if a round trip takes 1 ms and is invoked 1 million times: almost 17 minutes of time wasted!
In Discussion, a PDP Talk cache can hold up to 11 Talk IDs. Each Talk also needs its Product and User Creator information from the cache. So with a single Product ID, there are at least 11 Talks and 11 Talk creators: at minimum, 22 round trips to Redis just to load one PDP Talk page. Meanwhile, PDP Talk is accessed roughly 15 million times a day. How much time and resource could be saved? You do the math.
So, the idea was: instead of making 11 round trips, why not collect them into a single round trip?
The good news: Redis lets us do exactly that with pipelining. This feature exists for the sole purpose of reducing round-trip time, with the extra perk of allowing more total operations on the Redis server. That is achievable because, with a pipeline, Redis does not have to perform as much socket I/O.
Since we chose pipelining, latency was vastly reduced, and the server could also handle more requests per second, which brought us closer to the ultimate goal.
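To make the saving concrete, here is a toy model (not our real client; a real Go service would typically use something like go-redis's `Pipelined`) that counts round trips for per-key GETs versus one batched call:

```go
package main

import "fmt"

// fakeRedis counts round trips so the cost difference is visible.
type fakeRedis struct {
	data       map[string]string
	roundTrips int
}

// Get fetches one key: one round trip per call.
func (r *fakeRedis) Get(key string) string {
	r.roundTrips++
	return r.data[key]
}

// Pipeline fetches many keys in a single round trip,
// mimicking what a Redis pipeline buys us.
func (r *fakeRedis) Pipeline(keys []string) []string {
	r.roundTrips++
	out := make([]string, len(keys))
	for i, k := range keys {
		out[i] = r.data[k]
	}
	return out
}

func main() {
	data := map[string]string{}
	keys := make([]string, 11)
	for i := range keys {
		keys[i] = fmt.Sprintf("talk:%d", i)
		data[keys[i]] = "hash"
	}

	naive := &fakeRedis{data: data}
	for _, k := range keys {
		naive.Get(k)
	}

	piped := &fakeRedis{data: data}
	piped.Pipeline(keys)

	fmt.Println(naive.roundTrips, piped.roundTrips) // 11 1
}
```

At 1 ms per round trip and 15 million page loads a day, collapsing 22 trips into one or two is the difference between hours and minutes of accumulated wait time.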
Compress CPU Usage
Ensuring low latency was not enough. Bad code can still infect a service and exhaust it under a huge volume of requests. The next step was to reduce the CPU load and make it light enough. A simple and fast algorithm lets your server solve the same problem at a lower load, right?
Eliminate slow functions
If a slow query can be tricked with database indexes or a cache in front of the query, a slow function can be tricked with a better algorithm or some other, better approach.
By reading the Golang profiles, we knew which slow functions to improve. From the sampled profile, we figured out that a function with RegEx invocations was taking a toll on the CPU. Improving such a function can be tricky: either improve it a bit while still using RegEx, or use some other approach that yields the same result.
Either way, a code benchmark is necessary, especially if the function will be invoked a huge number of times. Guess what: benchmarking is built right into Golang!
By benchmarking, we could tell whether a revamped function was better or worse than the original. For example, the results showed that the revamped function was much faster on shorter sentences but much slower on longer ones. So the revamp could not simply be called better, only partially better, though long sentences are rarely found in the Discussion Feature.
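The actual pattern we optimized is not shown here; the sketch below just illustrates that kind of comparison, pitting a RegEx check against a plain `strings` alternative using `testing.Benchmark`, which lets a benchmark run outside `go test`:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
	"testing"
)

var badWord = regexp.MustCompile(`(?i)spam`)

// containsRegex and containsPlain answer the same question;
// only their cost differs.
func containsRegex(s string) bool { return badWord.MatchString(s) }
func containsPlain(s string) bool { return strings.Contains(strings.ToLower(s), "spam") }

func main() {
	sentence := "is this product original or spam?"
	fmt.Println(containsRegex(sentence), containsPlain(sentence))

	rx := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			containsRegex(sentence)
		}
	})
	plain := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			containsPlain(sentence)
		}
	})
	fmt.Println("regex:", rx.NsPerOp(), "ns/op, plain:", plain.NsPerOp(), "ns/op")
}
```

The usual form is a `BenchmarkXxx` function in a `_test.go` file run with `go test -bench=.`; either way, the numbers decide, not intuition.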
Proxy Cache on Web Server
By learning how users behave in the Discussion Feature, we could come up with better solutions for their needs. Take the behavior of app users as an example.
It is no wonder that the app is the source of most of Tokopedia's traffic. Since most users access the PDP to check on things they want to buy, the Talk requests naturally come mostly from PDP Talk.
We also learned that some requests in the Discussion Feature hit the same API and returned the same response nearly every time. Knowing that, we implemented a proxy cache on that API so those requests would not even reach our service: for a time, the service does nothing when a similar request is made. As a backup, we kept an application cache ready in case the proxy cache expired. More about NGINX caching can be found here: https://www.nginx.com/blog/nginx-caching-guide/
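A minimal sketch of what such an NGINX proxy cache can look like (the paths, zone name, upstream, and TTLs below are hypothetical, not our production values):

```nginx
# Cache zone on disk: 10 MB of keys, 256 MB of cached bodies.
proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=talk_cache:10m
                 max_size=256m inactive=10m;

server {
    location /v1/talk/product {
        proxy_cache       talk_cache;
        proxy_cache_key   "$request_uri";
        proxy_cache_valid 200 1m;   # serve cached 200s for one minute
        proxy_pass        http://discussion_service;
    }
}
```

While a cached entry is fresh, NGINX answers on its own and the Go service never sees the request at all.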
Fewer but More Powerful Servers Are Better
At one time we had 16 servers running simultaneously, and we had a problem maintaining balanced load: some servers carried a higher load than the rest. Instead of trying to balance 16 servers, we consolidated them into 4 upgraded servers. So instead of, for example, 16 servers with 1 core and 1 GB of RAM each, we had 4 servers with 4 cores and 8 GB of RAM each.
This reduced server count pays off mostly in OS overhead. With the extra CPU headroom, the Discussion Feature could handle more requests per second. The best part of this change for us: we no longer have to watch over 16 servers.
After months of discussion, lots of changes and improvements, and days of load testing, we finally reached a moment worth celebrating: the Discussion Feature achieved some great results in scalability.
The first notable result showed in the average latency we measure weekly: we successfully cut latency to 50% of what it was before. That means Inbox Talk and PDP Talk render faster, satisfying the impatient users of the Discussion Feature.
The second notable result: we reached the ultimate goal of 10x traffic on D-Day, when the big promo ran. The service operated with increased latency due to the rapid access from overexcited buyers, and needed a small NGINX adjustment at the first spike in traffic. Admittedly it was slower than usual, but there was no downtime.
It was a daunting task to scale a service full of legacy code and poorly written logic that slowed down entire APIs, not to mention the short time available to do it all. But it gets easier when you have the right tools to find the root of the problem.
We were given plenty of tool options to solve the scaling challenge. In the end, we chose to stick with the resources we already had and use them as optimally as possible, with a bit more creativity. I hope you can learn a thing or two from this scaling experience!