How to survive 1111: an engineer’s point of view

Tao-Sheng Chen
ShopBack Tech Blog
10 min read · Dec 15, 2019


In the past few years, Nov 11 (a.k.a. 1111, Singles’ Day) has become the biggest online shopping event of the year. Some e-commerce companies generate 5x, 10x, or even more of their average daily GMV in those 24 hours. Obviously, a time-limited campaign drives huge traffic to the infrastructure, and the system needs to handle it. For some high-growth companies, the campaign may generate traffic the system has never seen before.

ShopBack has had the same experience. In each of the past four years, 1111 produced another record-breaking spike. That means we regularly have to handle traffic that is higher than estimated and hard to simulate before it actually arrives. But this year, ShopBack engineers achieved zero downtime and maintained fast responses throughout the whole event day. Here I would like to share, from a technical point of view, how ShopBack engineers made the 1111 event go perfectly.

Simply put, a good plan built on systematic thinking, plus good execution, is the basic but also the critical factor in winning this.

But before diving into the details, we need to understand one critical concept: money can’t buy everything!

Money can’t buy everything

Paying more won’t work

Most new e-commerce companies leverage a public cloud provider (AWS, GCP, Azure…) to build their infrastructure foundation. Since public cloud providers let you add more virtual machines and other on-demand resources (a.k.a. scaling), it is easy for non-technical people or junior engineers to believe that all we need to do is spend more money to occupy more resources (VMs, network, etc.) and then everything will be fine. That is not a solution for 1111, because money can’t buy high reliability and high performance if all you do is spend it on resources.

Why? The first and most critical reason is that we need to know how to balance the load and where to scale out, and no cloud provider can do that for us. Imagine three people who want to travel somewhere together: it is simple to just call a taxi. But if a company of 350 people wants to travel by taxi, even if we are willing to pay more for Uber/Grab, we probably won’t be able to hire 100 cars at short notice. Obviously, just paying more money without understanding our own limitations won’t work. Of course, if we know the purpose of what we spend and how it impacts the infrastructure, then money talks.

This graph shows an example of memory usage in a service cluster; it obviously looks like a memory leak! The cluster ran on 32 VM instances and worked perfectly in an auto-scaling group when the scale was 3 to 32 VMs. However, once the group scaled to 40 or more, every VM showed what looked like a memory leak, and when the scale dropped below 32, the “leak” magically disappeared. If we can’t identify the root cause and understand the resource-balance issue behind it, spending more money on resources is not only a waste but can actually damage the whole infrastructure.
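
As a minimal sketch of the kind of check that catches this pattern early, assuming hypothetical per-instance memory samples rather than our real metrics pipeline:

```python
# Hedged sketch (not ShopBack's actual tooling): flag instances whose memory
# usage keeps climbing across consecutive samples, so a "looks like a memory
# leak" pattern can be compared across different auto-scaling settings.
from statistics import mean

def looks_like_leak(samples: list[float], min_growth_pct: float = 5.0) -> bool:
    """Return True if memory usage grows steadily across the sampling window."""
    if len(samples) < 2:
        return False
    # Compare the average of the first and last thirds of the window.
    third = max(1, len(samples) // 3)
    start, end = mean(samples[:third]), mean(samples[-third:])
    return end - start > min_growth_pct

# Hypothetical per-instance memory usage (% of RAM) sampled every 5 minutes.
fleet = {
    "i-0a1": [62, 64, 67, 71, 75, 80],   # steady climb: suspicious
    "i-0b2": [55, 58, 54, 57, 56, 55],   # flat: fine
}
for instance_id, usage in fleet.items():
    if looks_like_leak(usage):
        print(f"{instance_id}: memory keeps growing, check before scaling further")
```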

Preparation/Planning

In many growing e-commerce companies, the whole infrastructure becomes bigger and more complicated. To make sure we provide the best service for customers, most companies running large-scale internet services need to fully understand how to operate their system and how to improve it. That is not an easy task, since (a) a good e-commerce company grows fast, and so do system complexity, entropy, and technical debt, and (b) engineers still have to satisfy business requirements and keep adding new variables to the system.

Therefore, it is actually systems thinking that determines reliability and performance, not the specific tools or platforms we use. The system itself includes computers, humans, and the environment. To prepare for 1111 properly, we can’t just look at each component individually; we need to see the system as a whole and then improve each part of it. Maybe a service or a program needs improvement, but to do so we may first need to improve the human process.

The target is, of course, a web/app that serves high traffic with zero downtime and with every API endpoint responding within 500 ms. Every action we take has to consider both the whole-system impact and the impact on individual modules. Therefore, the actual system scope covers (a) most of the engineers, (b) the process of how we improve and change the system, (c) the services and modules, meaning the code itself, and (d) the infrastructure, including third-party services, especially the cloud provider.

Once we identified the scope with systems thinking, we realized we had to start working on the plan in July, four months before the event! This is because (a) of the complexity of the system, (b) we needed to improve both performance and reliability, (c) in the meantime, engineers were still developing and deploying new features to production, and (d) in the past two years our system had issues when a high spike came in. Users could understand the hiccups, but they were still something we had to improve.

The preparation actually covered a few things:

(1) Control variables

In a system, to ensure stability, we need to reach a point where all input can be consumed without building up a backlog (mainly in DB I/O) that we can’t handle. At the same time, we also need to know that all services are healthy and functioning well. One of the key points is to control the variables in the system and make sure engineers know when new variables come in.

(a) Deployment freeze: obviously, a new feature going live is very likely to introduce new variables. We set a deployment freeze five days before the event to make sure we had time to observe and manage new variables. During the freeze period, we could also run another round of load tests to identify bottlenecks and system limits, and to learn where the risks might be. One important thing here: the deployment freeze covers not only code changes but also configuration changes, third-party library changes, and process changes. In other words, the whole system should be under strict variable control.
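
As a minimal sketch of how such a freeze can be enforced rather than just announced (the dates and the CI step are hypothetical, not our actual pipeline), a small gate script can fail the deploy job during the freeze window:

```python
# Hedged sketch: a CI gate that fails deployments during the freeze window.
# The window dates and the exit-code convention are hypothetical; a real
# pipeline would also need an explicit emergency-override path.
import sys
from datetime import date

FREEZE_START = date(2019, 11, 6)   # 5 days before the event
FREEZE_END = date(2019, 11, 12)    # reopen the day after 1111

def deploys_allowed(today: date) -> bool:
    return not (FREEZE_START <= today <= FREEZE_END)

if __name__ == "__main__":
    if deploys_allowed(date.today()):
        print("Outside the freeze window: deploy may proceed.")
        sys.exit(0)
    print("Deployment freeze is active (code, config, and library changes).")
    sys.exit(1)   # non-zero exit fails the CI deploy job
```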

root cause tracing map

(b) Incident root cause review: according to the 1:29:300 rule, every incident should be taken seriously and resolved completely. We did several things for root cause analysis: 5 Whys, a dedicated SRE to review all cases, and a categorization of incident types that also defines who should take care of which type of incident.

(c) Production load test: real traffic behavior is almost impossible to simulate exactly and entirely, because we can’t 100% foresee human behavior before it happens. However, we still need to load-test to understand each cluster’s capacity, to know the limits of scaling up and out, and to see whether another bottleneck appears after we resolve an existing one. The next section explains this in more detail.

(2) Simulation

ShopBack engineers run two kinds of simulation: the production load test and the plan walk-through.

For the production load test, we use tools to simulate a huge number of concurrent web/app users hitting our system. Because it will surely impact real users, we can only do it from 2 am to 5 am (depending on the timezone). The load test treats the whole infrastructure as one system, so we can see the bottlenecks and adjust them between runs. For example, maybe scaling up a certain service is better than scaling it out.
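
The article doesn’t name the tool we used, so as a hedged illustration only, here is what such a test could look like with an open-source tool like Locust; the endpoints, weights, and user counts are hypothetical:

```python
# Hedged sketch of a production-style load test using Locust (https://locust.io).
# The endpoints, wait times, and task weights are hypothetical, not ShopBack's
# real traffic model.
from locust import HttpUser, task, between

class ShopperUser(HttpUser):
    # Simulated users pause 1-3 seconds between actions.
    wait_time = between(1, 3)

    @task(5)
    def browse_deals(self):
        self.client.get("/api/deals")          # hypothetical endpoint

    @task(2)
    def view_store(self):
        self.client.get("/api/stores/123")     # hypothetical endpoint

    @task(1)
    def track_click(self):
        self.client.post("/api/clicks", json={"storeId": 123})

# Run (for example) during the 2-5 am window:
#   locust -f loadtest.py --host https://example.shopback.test \
#          --users 5000 --spawn-rate 200 --run-time 30m --headless
```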

The plan walk-through is a non-technical practice: at the end of a preparation meeting, someone simply speaks out loud what will be done and what the expected result is. Considering that ShopBack had engineering hubs in three different countries, we had to make sure all engineers were aligned on the same expectations in detail. Two things to remember: first, we don’t even need to explain this practice to all engineers; we just need to make sure the person hosting the discussion or meeting does it. Second, a round-table check, meaning asking everyone to confirm, is always necessary.

(3) Learning from other events

Before Nov 11, many e-commerce players run other smaller but still important events, for example on Aug 8, Sept 9, and Oct 10, which also carry big campaigns. We not only record most of what happens on those days but also ask ourselves during the retrospective meeting: what can we do to make sure this doesn’t happen again on 1111?

A part of the wiki page links

Ultimately, we end up with a longer and longer checklist. This kind of internal knowledge should become organizational knowledge. SRE should have a method to make sure it doesn’t live only in somebody’s head but in a knowledge base that sits in every SRE’s toolbox. The simplest way is a wiki page.

(4) Find the top few bottlenecks to improve

During the simulation, we could definitely see bottlenecks, especially (a) some API endpoints that could not handle high traffic well, (b) some APIs that consumed too many resources, and (c) some services that were not yet stable and had bugs.

It is easy to understand that we need to improve those; the key point is how. First, every service runs its own load test, and SRE gathers and evaluates the separate results. Of course, an isolated module load test can’t really show whole-infrastructure performance. Therefore, second, an overall production-level load test should be done (as in section (2)); we arranged roughly five midnight load tests against the production site to really identify the bottlenecks. Finally, for some risky services, we established an ad-hoc team to work solely on quality improvement.
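
A minimal sketch of how load-test results can be ranked to surface the top few bottlenecks against the 500 ms target; the file name and column names are hypothetical, and our real analysis lives in the load-test tooling and dashboards:

```python
# Hedged sketch: rank endpoints by p95 latency from a load-test results CSV
# and flag the ones that miss the 500 ms target. The file name and columns
# ("endpoint", "response_time_ms") are hypothetical.
import csv
from collections import defaultdict

TARGET_MS = 500

def p95(values: list[float]) -> float:
    # Nearest-rank approximation; good enough for a triage list.
    values = sorted(values)
    return values[int(0.95 * (len(values) - 1))]

latencies = defaultdict(list)
with open("loadtest_results.csv", newline="") as f:
    for row in csv.DictReader(f):
        latencies[row["endpoint"]].append(float(row["response_time_ms"]))

# Worst endpoints first: these are the "top few bottlenecks" to improve.
for endpoint, samples in sorted(latencies.items(), key=lambda kv: -p95(kv[1])):
    flag = "OVER TARGET" if p95(samples) > TARGET_MS else "ok"
    print(f"{endpoint:30s} p95={p95(samples):7.1f} ms  {flag}")
```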

(5) Plan for the unexpected

Donald Rumsfeld: “…But there are also unknown unknowns — the ones we don’t know we don’t know.”

How do we prepare for the unknown unknowns? The key is to make sure all the knowns and known unknowns are crystal clear, and then shift our focus to watching for the signals of those potential risks (the unknown unknowns) and being mentally prepared to respond. In practice, we have senior engineers on duty to expect the unknown and to respond quickly based on their own experience. This is part of the D-Day plan but should be considered earlier.

For example, we scheduled the deployment freeze to start five days before Nov 11. The reason is not only to reduce the system’s variables but also to make sure we have time to spot unexpected things and time to respond.

Another example: we set up an ad-hoc team for D-Day (1111). Everyone inside the team got a specific mission, for example monitoring RDS usage or monitoring the throughput of a specific service.
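
As a hedged sketch of what such a mission can look like in practice (the instance identifier, region, thresholds, and polling window are assumptions, not our real setup), an engineer watching RDS might poll CloudWatch like this:

```python
# Hedged sketch: poll CloudWatch for an RDS instance's CPU and connection
# count during the event window. Instance name, region, and thresholds are
# hypothetical; real monitoring would normally live in dashboards and alerts.
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="ap-southeast-1")

def latest_average(metric_name: str, db_instance: str) -> float:
    now = datetime.now(timezone.utc)
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName=metric_name,
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": db_instance}],
        StartTime=now - timedelta(minutes=10),
        EndTime=now,
        Period=300,
        Statistics=["Average"],
    )
    points = sorted(resp["Datapoints"], key=lambda p: p["Timestamp"])
    return points[-1]["Average"] if points else 0.0

cpu = latest_average("CPUUtilization", "prod-main-db")        # hypothetical DB id
conns = latest_average("DatabaseConnections", "prod-main-db")
print(f"RDS CPU {cpu:.1f}%  connections {conns:.0f}")
if cpu > 80:
    print("CPU above 80%: escalate to the on-duty SRE")
```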

(6) D-Day Plan

Even with the code improvements and other preparation done well, we still need a concrete plan for that specific day. Just like launching a spaceship, everything on that day should be easy to control, and engineers should be able to react quickly to accidents. Our plan covers D-1, D-Day, and D+1.

(a) On-duty shift: 1111 is 24 hours long, and it is not possible to ask engineers to stay alert for 24 hours straight. However, as a start-up, we don’t have enough engineers for a 3x8-hour rotation. So over those two days (D-1 and D-Day), DevOps engineers rotated on roughly 16-hour + 8-hour shifts across 48 hours. From 9 pm on Nov 11 to 2 am on Nov 12, we also arranged a few extra engineers to join the event specifically to monitor the risky services identified during load testing or from previous infrastructure incidents.

(b) Run manual checks on D-1: most complex systems have alerting and monitoring that send messages to engineers, and public cloud providers (AWS, GCP, Azure, etc.) offer tools and services to help. However, the whole preparation period will of course surface a few potential risks. We make a list of human checks and run through it the day before 1111, so that every potential risk is re-evaluated and anything that looks wrong can be mitigated immediately. The manual check does not serve the same purpose as automatic health checks and can NOT be replaced by them: it is a check that helps, and pushes, SRE to re-think the current situation. At our scale we applied it only to the services with potential risks, even though we couldn’t be sure whether those risks would materialize.
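
As a minimal sketch of that idea (the endpoints and checklist items are hypothetical), a small script can probe a few services and still require a human confirmation for each item, since the point is re-thinking, not automation:

```python
# Hedged sketch: a D-1 manual checklist runner. It probes hypothetical health
# endpoints, then still asks the on-duty SRE to confirm each item by hand.
import urllib.request

CHECKLIST = [
    ("Deals API health", "https://example.shopback.test/api/deals/health"),
    ("Clicks API health", "https://example.shopback.test/api/clicks/health"),
]

def probe(url: str) -> str:
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return f"HTTP {resp.status}"
    except Exception as exc:
        return f"FAILED ({exc})"

for name, url in CHECKLIST:
    print(f"[{name}] {probe(url)}")
    answer = input("  Re-checked and acceptable? (y/n) ")
    if answer.strip().lower() != "y":
        print("  -> raise this in the D-Day channel before proceeding")
```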

(c) One-spokesman policy on D-Day: as a multinational company, employees will of course use messaging tools to ask questions, raise issues, and discuss details in text channels. However, just like NASA’s mission control, we need a single communication channel to avoid confusion about actions and decisions, since some of them need to be made and aligned quickly. Our head of engineering was the spokesman that day; only he answered questions from business and marketing colleagues, unless he tagged an engineer to provide the answer.

(d) Retrospective on D+1: it is very important to learn from the experience. There might not be a big incident during the event, but there will be things we can improve. So after the event day, good or bad, we at least hold a quick retrospective meeting.

The D-Day

As long as we prepared well, we didn’t actually need to do much on D-Day; we just followed the plan. Engineers in three countries followed the on-duty timetable, kept monitoring the dashboards, and waited for the traffic to come.

Although the on-duty engineers were a bit nervous, the result from 9 pm on Nov 10 to 2 pm on Nov 11 was actually good. Our infrastructure did receive another record-breaking wave of traffic, but no critical issue happened. Thanks to those talented and hardworking unsung heroes who made this happen!
