Scalability and Reliability at Tokopedia
I was taking a bath in the morning before going to work, thinking about things currently going on in my company. I work at Tokopedia, one of the biggest tech companies in Indonesia, whose main business model is e-commerce.
I’ve spent my whole career there since graduating, almost all of it in the Promotion tribe (team). One of the interesting parts of this tribe is that you have to handle high traffic during big events. These events vary: online shopping day, Tokopedia’s birthday, or our fantastic biggest e-commerce day called Semarak Ramadan Ekstra, held every year during Ramadan, because Ramadan is the time when Indonesians buy many things to prepare for their holy day.
Okay, going back to those morning-bath thoughts. Tokopedia just held the Ramadan Ekstra event in May, and in August we are going to celebrate our 10th anniversary! Yes, this means that I, as the EM of the Promotion tribe, along with some other folks, need to keep our system scalable and reliable during the big event.
Seems fun, right? Okay, now the goal is to make everything go fine on the day: no downtime, no bugs, no crashes, no chat from your boss complaining about your system, and all your projects clean and well tested before the day. That seems like a lot to do, especially at a startup with agile-style development, where you can imagine things moving so fast that the requirements can change at any time, and even your event date could change.
Looking at those problems, we need to work out some strategy before we can say the big event will be successful. In this article I will focus on how we can verify that our system is scalable and reliable. Based on our experience, here are the outlines:
- How to measure the correct RPS and funnel
- How to do proper load-testing
- How we can improve our performance
Set the Target!
One of the most classic questions in the tech forum whenever a big event is coming up is: “Hi Mr. VPE, could you tell us the event’s RPS target?”
Oh crap! No one in this universe knows how to predict the RPS. We are not oracles! Just kidding :)
Okay, what is RPS anyway? We really like to use the term “RPS” when reasoning about whether our system is reliable. It stands for Requests Per Second, roughly equivalent to QPS or TPS: the average number of requests per our smallest time unit, the second.
Why is this question so hard to answer? Because it’s very hard to predict user behaviour on our product, especially if many new projects will be launched for the event. We’re also not really sure how many people will join Tokopedia when the event starts, and we don’t really have historical data of our past RPS on each funnel during previous events, which makes the question even more complicated to answer.
The next question will be: “Hi Mr. Head of Tech, what is the key hero of this event?”
This one is pretty easy to answer, but from the other side it’s one of the scariest questions, because it determines which team will have to reach a super high, exceptional RPS number for the sake of the big event’s success.
Okay, with these two questions as our problem statement, what can we actually do to tackle this? #MakeItHappen
1. Set up The funnel
We need to analyse what the user journey will actually be when the event starts. Usually the Product team already has the rundown of the show, especially if the show will interact a lot with our apps.
Starting from the key hero, which I mentioned earlier, we can really focus on that funnel and set up a precise user journey for that use-case. Remember, PRECISE: in a load-test scenario, slight differences in the scenario can really affect performance and make the real and simulated results differ wildly. After we set up this key hero user journey, the other main funnels also need to be put on the list, from the most critical down to the lowest priority.
A funnel actually consists of the list of APIs that will be called every time a user goes through a specific use-case. For example, we have a use-case where the user needs to:
- Get a Push notification on their apps
- Land on the Egg Screen (/game/egg_page, /user/points)
- Crack the Egg (/egg/crack)
- Click on the Check my Prize button
- Eventually be redirected to a page (/coupon/list)
- End
Each step of this funnel represents a state that users interact with. This separation gives people thinking time before they proceed to the next page/funnel. It is important because it determines which APIs will be called concurrently at the same moment, and which will be spread out by thinking time and system response time.
2. Set The RPS Target
We have all the funnels set already; now how about the RPS target?
Let’s take the example from the above use-case, Crack the Egg, which we usually call the TapTap module. One of the best approaches is to look at the current BAU (business-as-usual) traffic of TapTap. We can check this number through the monitoring systems we use, either Grafana or Datadog.
Why BAU traffic? Because from this we can at least measure how high the traffic of a specific feature is, which will most likely correlate with the big event. For example, the TapTap module could reach up to 1000 RPS in daily traffic, which is already high for a funnel at Tokopedia. With this number we can say the RPS during the big event will be high as well, and we can project the event RPS based on it.
CCU, or concurrent users, is the term we use to count how many active users are currently playing around on Tokopedia. This number can be predicted based on the size of our marketing channels. For example, on a usual day Tokopedia handles 100k CCU. On D-day, the marketing channels are blasted to 1M users, and CCU is predicted at 500k users; hence we have 5 times the usual traffic.
With this scale number we can multiply our daily-traffic RPS, which in the TapTap situation could go up to 5000 RPS max. This approach can also be applied if we have historical data from our last/latest event.
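The projection above boils down to simple arithmetic. Here is a minimal Go sketch; the numbers are the illustrative ones from the text, and the function name is mine:

```go
package main

import "fmt"

// projectEventRPS scales a feature's business-as-usual RPS by the
// expected growth in concurrent users (CCU). All inputs are estimates.
func projectEventRPS(bauRPS, bauCCU, eventCCU float64) float64 {
	scale := eventCCU / bauCCU
	return bauRPS * scale
}

func main() {
	// TapTap does ~1000 RPS on a normal day with ~100k CCU;
	// marketing predicts ~500k CCU on the event day.
	fmt.Printf("projected event RPS: %.0f\n", projectEventRPS(1000, 100_000, 500_000))
}
```

The same function works with historical event data: feed in the last event's CCU and RPS instead of BAU numbers.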
3. What Affects the RPS
RPS is significantly affected by user urgency to hit a specific endpoint, because urgency triggers users to hit the API as soon as they get the chance, together with all the other users at almost the same time. This mechanism will choke the system, which is exactly where an RPS exercise is needed. In a campaign mechanism with a long time window and no urgency, the RPS will most likely not be that high and tends to be distributed throughout the window.
The position in the funnel also affects how an API’s RPS behaves compared to others in the same user journey. In the TapTap model above, the go-to-my-coupon-list API sits at the very end of the funnel. Given this, the coupon page will most likely be hit the least, because we can expect some noise preventing users from reaching the last funnel. The noise could consist of users’ knowledge of the feature, rate limiting in the front system, or validations that sit before the end of the funnel (blacklisted users, out of quota, etc.).
This doesn’t really affect the Crack API, even though it sits third in the API order, because it’s the main highlight of the system, and the mechanism will obviously make the RPS spike as users crack the egg as fast as they can.
There are still many X factors that affect the RPS estimation, such as:
- Users’ knowledge of or interest in the feature
- Not enough historical data to project the RPS
4. Improve our Projection
Okay guys! We have discussed how to set the RPS and funnel, but we also need to make our lives easier next time.
We need a system that captures all the event logs and data, so we can analyse them for the next event. Right now we have to get the data and logs manually, if we don’t forget to do it amid the excitement or tiredness after the big event. We also create a new spreadsheet to fill in all the funnels and load-test/event results, and you know how lazy humans are about filling data into a spreadsheet :)
With this, we can analyse which funnel actually gets the highest traffic, check whether our RPS estimation was close, see which funnel is under-utilised, and most importantly project the next event’s RPS by comparing against the previous event.
Or the system could think for itself and determine how much RPS we will get and how much scale we need to prepare. Oh my, how cool would that be! #AIFirst #DataFirst
Load Testing
Test Engineer (TE) is the role we have to help us do load tests. Load testing is the job of ensuring we can reach the RPS target, by using tools to hit our services’ APIs based on the agreed scenario/funnel, with a configurable number of RPS and users.
Previously it was very hard to do load tests, but thank God we now have tools that make load testing easy and collect the load-test data in an integrated dashboard.
In this section I will discuss a couple of things we need to prepare and be concerned about whenever we are going to load-test our services. Let’s go!
1. Prepare the Strategy
First things first: gather your Test Engineer with your squad lead or another engineer to deeply understand the product and the feature, and learn about the funnel and how users will interact with the apps throughout the whole scenario.
This phase is really critical, because usually the TE doesn’t fully understand the user journey and will therefore create a false script.
It’s also important for the TE to keep intense communication with the team, to ensure that any scenario changes are adapted into the script.
2. Prepare the Script
After the TE and the team agree on and understand the journey, the TE will create a script to simulate the discussed scenario and upload it to the load-testing tools.
The script consists of the logic to simulate the user scenario, which includes:
- The number of users to be used for the load test
- The list of APIs to be hit by the system
- The method of user interaction
Usually the script begins by initiating a list of users, which will be logged in by hitting the Account services. This list of users is important because it affects the cache hit ratio of the system: the more varied the users, the lower the cache hit ratio. It also adds complexity to the system, e.g. the amount of storage used.
This list of users also determines who the users actually are when the end services receive the request. Sometimes the end services need to whitelist these users, to skip some part of the code or to give special treatment to their requests; for example, we don’t want load-test users to get into our blacklist checking. Usually the system can keep a pool of whitelisted users exactly matching the load-test user pool, but that is hard to maintain if the load-test users grow or change over time. Instead, we can use an additional attribute to mark these users as load-test users so the end service can recognise them.
We sometimes also need to prepare data for this list of users. For example, when testing the get-coupon-list endpoint, imagine the users have no data at all: the result will be absurd, because nothing will be fetched from the DB.
One of the more complex things in load testing is creating the interaction between one funnel and the next. Usually we use an interval before moving to the next funnel that represents the user’s thinking time plus the system’s response time.
We can also use a percentage of users that would move into the next funnel. Let’s say in the TapTap module we have the egg image on the homepage, and clicking it lands the user on the egg page. We can set the egg page conversion rate at 50%. This percentage can be based on current traffic data or historical event data. We have had two TV events on TapTap, and this strategy worked to estimate the conversion rate for each funnel.
This conversion filtering is important to avoid setting the target too high compared to the real incoming RPS. One common misunderstanding is that we have to hold the same amount of RPS across all funnels, which is incorrect. It’s also incorrect to report the RPS result to your boss as an accumulated number when the APIs are actually grouped in one funnel.
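A rough sketch of how per-funnel targets fall out of an entry RPS and conversion rates; the endpoints are from the TapTap example above, the rates are illustrative guesses that would normally come from BAU traffic or past event data:

```go
package main

import "fmt"

// funnelStep pairs an endpoint with the fraction of users expected to
// reach it from the previous step (0.0 to 1.0).
type funnelStep struct {
	endpoint   string
	conversion float64
}

// perFunnelRPS walks the funnel and derives each step's RPS from the
// entry RPS, instead of wrongly assuming every step gets the full load.
func perFunnelRPS(entryRPS float64, steps []funnelStep) map[string]float64 {
	out := make(map[string]float64)
	rps := entryRPS
	for _, s := range steps {
		rps *= s.conversion
		out[s.endpoint] = rps
	}
	return out
}

func main() {
	steps := []funnelStep{
		{"/game/egg_page", 0.5}, // 50% of homepage impressions land here
		{"/egg/crack", 0.9},
		{"/coupon/list", 0.6},
	}
	res := perFunnelRPS(5000, steps)
	for _, s := range steps {
		fmt.Printf("%s -> %.0f RPS\n", s.endpoint, res[s.endpoint])
	}
}
```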
3. Execution
Execution, yay, finally! Just click the button and see the result, right?
Not that easy, Ferguso. To be honest, every load-testing step demands really detailed, critical analysis and execution.
Whenever the TE says the script is ready, usually they will have tested it in the local environment; after that it will be used against production. Here are some things that need to be prepared before we go to execution.
Set the Time
In a growing company like Tokopedia, it’s really not acceptable to take the system down, even for 3 seconds.
One thing about load testing is that you are pressured to get results fast for the sake of the event deadline, but you can’t just run it any time. We have had a couple of downtimes due to load tests, because the system choked and couldn’t serve production requests, and then comes the chat from your boss; you can tell the rest of the story.
So there are some things we need to do to handle this. First, we need to set a proper time to do the load test. If you think the system is critical enough for company continuity, consider doing it at night, or whenever the system is least likely to be used.
The other purpose of setting the right time is so other engineers know when you are load testing. Usually our funnel doesn’t stand alone without dependencies on other teams’ services. The other teams need to know, so they can stand by to observe the load test and keep their systems up.
Priority
Okay, we have set the time for our execution, but what about priority? For the grand design we usually have a script covering the complete funnel of the overall user journey, but trust me, it’s not that easy to load-test the whole funnel from the start.
There are some reasons for this. One is that it’s harder to coordinate across all funnel stakeholders. Let’s say in the TapTap module, cracking the egg calls three other services (blacklist, prize, and promo code services); you then need to coordinate between those three teams.
We also need to first test the part we think is most critical to the user journey. This makes the engineers aware that they really have to analyse the bottleneck in the main function.
New projects that needed fresh development are also important to prioritise. Compared to other funnels, these endpoints are the most unstable and unreliable, because we have never had data on their performance.
Debugging
As I mentioned before, for our result to be good we need to be precise. It’s actually hard to mirror production traffic in our script and environment.
Debugging is very important, both for testing the script and for finding bottlenecks in the system. This step needs repeated debugging to really ensure that what we test is right, and that the bottleneck we find is real. We don’t want to spend effort solving a bottleneck that would never appear in production.
When you hit the button to release the swarm, we usually start with a low RPS and gradually increase it. This approach makes sure the system isn’t choked right away, so we can analyse its behaviour bit by bit.
On the other hand, this approach doesn’t really work if we want to test the system with a spiking RPS from the start, because a spike gives the system no time to warm up; it has to serve the requests immediately. Fit the approach to your funnel model, or to keeping your system stable.
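The gradual ramp-up can be sketched as a simple schedule of RPS targets; numbers and function name are illustrative, and a spike test would instead jump straight to the target on step one:

```go
package main

import "fmt"

// rampSchedule returns the RPS target for each step of a gradual
// ramp-up, going from start to target in equal increments.
// steps must be at least 2.
func rampSchedule(start, target float64, steps int) []float64 {
	out := make([]float64, steps)
	inc := (target - start) / float64(steps-1)
	for i := range out {
		out[i] = start + inc*float64(i)
	}
	return out
}

func main() {
	// Ramp from 500 RPS to the 5000 RPS target in 10 steps, so we can
	// watch the system's behaviour bit by bit at each level.
	fmt.Println(rampSchedule(500, 5000, 10))
}
```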
Monitoring
There are a couple of things to do to observe the system’s condition during the load test. First, the TE who executes the script can watch the load-test dashboard for the failure rate. They can manually stop the test if the error rate keeps climbing, or decrease the RPS to find the most stable RPS for the service. Some services cannot recover quickly once they fail to serve traffic, so stopping the load test is highly recommended in that case. Usually, as the load test is about to end, the TE will capture the load-test data so we can analyse it later.
The SE can keep an eye on the service metrics, such as:
- Server CPU Usage
- Database Condition
- Server Error Logs
- Response time for the API
- Success Rate of the API
CPU usage is one of the most common things to observe when the load test starts. The app server’s CPU usage is usually the easiest to predict, because it mostly depends on how heavy the process is and grows linearly with the number of incoming requests. We can also easily pre-scale the servers by adding instances; horizontal scaling will pretty much always work here.
We also have an auto-scale mechanism that grows the number of service VM instances to help serve requests. Auto-scaling triggers whenever the CPU usage of the services in the cluster exceeds a certain threshold, e.g. 60%. We usually avoid relying on auto-scale for the big event, because it leaves room for instances not spawning fast enough.
The database is usually the most common bottleneck of the system, since it is hard to scale horizontally. It’s very important to really watch the DB’s condition, since it can spike quickly, and when it slows down you usually get total downtime of the service.
Error logs are very important for tracing any error that happens in your system; you can log any information based on the logic in the code. One thing to remember is to keep the logs from being so verbose that they make your head dizzy.
API response time and success rate are the easiest metrics for judging whether our services are still healthy. These metrics reflect the service’s condition as experienced directly by end users.
One common problem here is that engineers haven’t pushed their metrics or logs into the monitoring system yet, making it very difficult to see the actual bottleneck, so we need to re-test after the instrumented code is released.
4. Reporting
It’s time for your boss to know the result of your heroic work!
The calendar was set, the room was booked, and all the stakeholders were ready to hear the result.
As I mentioned earlier, the load-test result is not that easy to explain, for several reasons. Prepare the result that is closest to the real user journey. Your result might not be perfect yet, but you can explain what was actually load-tested and which part is missing.
Distinguish between the total RPS result and the per-endpoint result. The total RPS accumulates the RPS across endpoints, which is sometimes misleading, because the number of requests coming into a funnel does not accumulate with the number of APIs being hit.
Highlight the end-funnel result rather than the starting funnel. This is a worst-case approach, where we judge system performance by the last funnel, under the assumption that all CCU in the starting funnel go all the way through to the last funnel at the same time.
Describe how the system has improved compared to the last load test, what actually went wrong, and what can be improved in the next trial. This list of action items is important so that others know what you need to do.
You can also explain if the RPS target no longer makes sense; this is important to avoid your API’s condition always being code red.
#MakeItBetter
To me this is the most interesting part, where you can unleash your engineering skills!
But back then, improving our system’s performance was not as easy as we thought. Minimal knowledge, little experience, scarce expert resources, and infrastructure that wasn’t really ready: all of these made it painful whenever we wanted to boost our system’s RPS. All we knew was that our RPS was stuck at some number, without knowing what to do next.
Above all those factors, experience is what taught us. We learned bit by bit with each event we ran. I remember the first time my tribe load-tested the TokoPoints (loyalty system) services; even when we got 1000 RPS, we felt very happy. We still had so many blind spots about the reliability and scalability world, and we kept learning through our latest event.
Our Ramadan event in 2018 was totally broken; we had a lot of downtime even though we had load-tested every single service we had, every day. The traffic that came in was very unexpected. I remember watching my Grafana (a GUI for Prometheus metrics) when the event started: RPS went up to 8k in seconds. Things went down, and it actually looked terrifying.
Lesson learned. After that event we became very cautious whenever a big event was coming. Intense load tests were triggered practically every day, and we realised load testing is part of our day-to-day job; simply put, we need to always maintain our reliability and scalability.
The question is: how do we improve our system performance to reach our high target?
Actually, looking at most of our tech stack, I can say that most of our bottlenecks are the same, and so are the solutions.
In Tokopedia we use these:
- Golang for the code
- Redis for our cache system
- PostgreSQL for our database
- NSQ for handling async message queues
- Cloud-based hosting and Linux VMs for the application servers
- Microservices hosted in the same data center
- A gateway proxy using GraphQL that forwards requests to the end services
- Consul as a key-value config handler
Based on the above, I will explain some of the cases we usually tackle. Since I’m not a tech architect, or let’s say someone really deep in tech skills, I will just explain from a high-level perspective, based purely on our experience.
Reducing I/O calls
Opening an I/O call obviously adds weight and time to the system, which we want to avoid when we can. Usually our calls happen whenever the main application (Golang) interacts with Redis or the PostgreSQL database to read/write data. Besides that, we also make HTTP connections to other microservices to hit their APIs.
The damage from these calls is more severe if connections are never closed, or if many connections sit waiting because the other system responds slowly, which leaks memory and causes “too many open files”. For HTTP calls to other services, we can put a timeout on them, as well as a circuit breaker in case the dependency frequently fails to respond.
We can also use a Consul config to turn off a dependency, if it’s not too critical to be turned off. For example, when a promo code is used, we can skip fraud checking if the fraud system is down, so the promo service doesn’t add more burden making the call.
Redis comes in to cache everything, from your database results to other APIs’ results. With this cache we obviously get data faster, so the processing in our system is also lighter.
Still, using Redis has more complexity than using an in-process memcache. With a memcache we can cache all the static data (master data), whose values are the same across instances. We usually run a cron inside each service instance to clear it whenever we want. With this memcache approach the data is fetched almost instantly, and all you have to do is scale the instances horizontally, which is easy.
Redis FTW
Redis is almost a godly tool that we use to cater for everything. Since it uses RAM, the performance is really good for getting and putting data. Whenever we do a normal data fetch, the flow is usually the following:
Get “key” from Redis → if not exist → get “key” data from Database → re-fill the value of “key” to Redis
With this pattern, every time the key does not exist, we jump straight to fetching data from the DB. As we know, querying the PostgreSQL DB is sometimes not reliable enough; it can be affected by many factors, and a complex query makes it impossible to handle high-traffic situations. In our case the database is very fragile, can spike really fast, and is not something we can depend on. The DB is also harder to scale horizontally than Redis.
I bring this up because, in some cases, most hits initially go to the DB, which actually makes Redis seem useless. Imagine 100k unique users hitting the system with none of their data cached: the requests all reach the DB and make it heavy.
To overcome this, we can set up the mechanism like this:
Get “key” from Redis → if not exist → return empty response → insert to DB async mode if needed
The idea is to retain the Redis data as long as we can, so whenever a request comes it always hits Redis and never reaches the DB. The DB is only written when we want to insert initial data, in case the user’s data needs to be stored in the DB on the first attempt.
The DB also acts as a backup in case the Redis data is ruined. The side effect of this method is that we need to be really careful not to ruin the Redis data, and never delete a Redis key: we alter the value instead of deleting it, since deleting gives the DB and Redis a chance to get out of sync. We usually have a Consul toggle to switch into this mode, which we call Beast Mode, and turn it off when we don’t need it for BAU, which reduces the risk of data corruption.
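The Beast Mode flow above can be sketched like this, with an in-memory stand-in for Redis so the example is self-contained; real code would use a Redis client and publish the async insert to NSQ:

```go
package main

import "fmt"

// Cache abstracts Redis so this sketch runs without a server.
type Cache interface {
	Get(key string) (string, bool)
	Set(key, value string)
}

type memCache map[string]string

func (m memCache) Get(k string) (string, bool) { v, ok := m[k]; return v, ok }
func (m memCache) Set(k, v string)             { m[k] = v }

// getBeastMode is the event-day flow: never fall back to the DB on a
// cache miss; return an empty response and (optionally) enqueue an
// async DB insert, keeping the DB as a backup copy only.
func getBeastMode(c Cache, key string, enqueueInsert func(key string)) string {
	if v, ok := c.Get(key); ok {
		return v
	}
	if enqueueInsert != nil {
		enqueueInsert(key) // e.g. publish to NSQ; the DB is written asynchronously
	}
	return "" // empty response instead of a synchronous DB query
}

func main() {
	c := memCache{"user:42:points": "1200"}
	fmt.Println(getBeastMode(c, "user:42:points", nil)) // cache hit
	fmt.Println(getBeastMode(c, "user:7:points", nil))  // miss: empty, no DB read
}
```

Contrast this with the normal cache-aside flow shown earlier, where a miss goes straight to the DB.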
Databases
If we’ve already stored our frequently fetched data in Redis, the database’s work becomes mostly storing data, which means insert operations will be executed more.
Insertion is not really that simple: if too many INSERT commands hit the DB, the database can choke, and CPU usage usually spikes.
To handle this, we need to cap the number of concurrent insert commands the database handles. Usually we use NSQ to do the inserts, because NSQ takes the job asynchronously so the main request doesn’t have to wait, and we can set the max number of messages processed concurrently to keep the database from choking.
We can also use batch inserts, populating some data before executing it all together against the DB. With this method we minimise the number of connections to the database.
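A sketch of the batching idea; the flush callback stands in for a single multi-row INSERT, and the batch size is illustrative (a real version would also flush on a timer so rows never wait too long):

```go
package main

import "fmt"

// batcher accumulates rows and flushes them in one multi-row INSERT,
// reducing the number of round-trips and connections to the database.
type batcher struct {
	rows  []string
	size  int
	flush func(rows []string) // stand-in for one multi-value INSERT
}

func (b *batcher) Add(row string) {
	b.rows = append(b.rows, row)
	if len(b.rows) >= b.size {
		b.Flush()
	}
}

func (b *batcher) Flush() {
	if len(b.rows) == 0 {
		return
	}
	b.flush(b.rows)
	b.rows = nil
}

func main() {
	b := &batcher{size: 100, flush: func(rows []string) {
		fmt.Printf("INSERT ... VALUES (... x%d rows)\n", len(rows))
	}}
	for i := 0; i < 250; i++ {
		b.Add(fmt.Sprintf("user_%d", i))
	}
	b.Flush() // flush the 50 leftover rows
}
```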
Pre-Scale!
Apart from all the things we’ve done, we also need to keep our machines strong. By scaling our servers, we hope that the additional servers will get us to a higher RPS.
Horizontal scaling is easy, especially for the app servers, because an app server’s task is just to execute our binary, and it doesn’t need to synchronise anything between instances.
For Redis, we can use something called Redis Cluster, or Twemproxy. With Redis Cluster we can easily add/remove instances in our Redis cluster without affecting the data in the existing instances; the cluster rebalances across instances and keeps the data retained.
With Twemproxy we can also load-balance our Redis data, but the painful part is that every time we need to add instances we have to restart the Redis instances, so maintenance is very hard.
Redis only uses one core of your CPU, so whenever a Redis instance’s CPU usage approaches 1/n (n being the number of cores), that instance might be about to go out of service. It’s better to have a larger number of instances than one huge-spec single Redis instance.
Hotspots are another Redis problem: one Redis instance’s usage is much higher than the others’. This can happen when a key stored on one instance receives a huge number of requests while the other instances don’t get the same traffic. To tackle this, we can change how we store the key by distributing it among the instances, which is called sharding.
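The sharding idea boils down to hashing the key to pick an instance. FNV here is just for the sketch (Redis Cluster itself assigns keys to CRC16-based hash slots), and the key names are illustrative:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardFor picks a Redis instance for a key by hashing it, spreading a
// hot keyspace across instances instead of hammering a single one.
func shardFor(key string, instances int) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32()) % instances
}

func main() {
	// A hot key family like "taptap:egg:<user_id>" now spreads across
	// instances rather than landing on one.
	for _, k := range []string{"taptap:egg:1", "taptap:egg:2", "taptap:egg:3"} {
		fmt.Println(k, "-> instance", shardFor(k, 4))
	}
}
```

Note this only helps when the heat is spread over many distinct keys; a single super-hot key still lands on one instance and needs a different trick, such as replicating it under several key names.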
For the database, usually we can just add more slaves and upgrade the master’s spec so it performs better. There is not much more we can do for the DB, since horizontally scaling the master isn’t technically possible.
Conclusion
A high-performance system is something every company demands. The problem is that not every engineer, architect, or even manager knows how to tackle this from planning to execution. Sometimes we just focus on building the product, not on making it reliable and scalable enough. I hope that with this article, we know how to reduce the pain of big-event preparation and sleep better. #mulaiAjaDulu
Namaste.