Tales from the trenches: getting EmpathyBroker ready for Black Friday

Photo by Nicole De Khors from Burst

Black Friday is no joke in the eCommerce world, especially in the engineering domain. It’s the busiest day of the year for almost any eCommerce brand out there, so it creates a series of technological and human challenges like no other event throughout the year.

In this post I’m going to share how we as a team get ready for such an important event, one that is critical not only for us as a company but also for our customers.

Learning from the past

The first step in preparing properly for Black Friday was to come up with traffic estimates for the peak times. The operations team crunched some numbers, using monitoring and log data to analyse daily traffic and big peaks over the years.

How did they arrive at a reliable estimate? They studied the trends in day-to-day traffic and during the peaks, built an estimation model from them, and then backtested that model against past events such as previous Black Fridays, Summer Sales and Singles’ Days.
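To make the idea of backtesting concrete, here is a minimal sketch. It is not the operations team’s actual model: the estimator (baseline traffic multiplied by the average peak-to-baseline ratio of earlier events), the function names and every number in HISTORY are assumptions made up for illustration.

```python
from statistics import mean

# Hypothetical history: (event, baseline requests/s on a normal day, observed peak requests/s).
# All values are invented for illustration only.
HISTORY = [
    ("Black Friday Y1", 100.0, 700.0),
    ("Summer Sales Y2", 150.0, 600.0),
    ("Black Friday Y2", 200.0, 1800.0),
    ("Black Friday Y3", 300.0, 3000.0),
]


def estimate_peak(past_events, baseline):
    """Predict a peak as the baseline times the mean peak/baseline ratio of past events."""
    ratio = mean(peak / base for _, base, peak in past_events)
    return baseline * ratio


def backtest(history):
    """Walk forward through history, predicting each event from the events before it."""
    results = []
    for i in range(1, len(history)):
        name, baseline, actual = history[i]
        predicted = estimate_peak(history[:i], baseline)
        error = (predicted - actual) / actual  # negative means the model under-estimated
        results.append((name, predicted, actual, error))
    return results


if __name__ == "__main__":
    for name, predicted, actual, error in backtest(HISTORY):
        print(f"{name}: predicted {predicted:.0f}, actual {actual:.0f}, error {error:+.0%}")
```

If the relative errors stay small across past events, the model can be trusted a little more for the next one. If they do not, the errors show in which direction the estimate needs padding.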

Embrace Failure

The motto for this kind of event is to work towards the best-case scenario but prepare for the worst. So, from support to operations, the teams prepared for a situation where everything crashed and burned.

In fact, there are two sides to embracing failure: a human side and a technical side.

The human side

The first important aspect is that of empathy. This is the biggest moment of the year for our customers, so our support team worked on a special communication protocol for Black Friday, where transparency and proactive communications were key as we wanted to make sure we kept them updated at all times.

One of the elements that made the communication protocol special was that they sent reports to customers on how the APIs were performing and what the error rate was. The team created several templates so that they could handle this kind of communication as smoothly as possible.

The technical side

From the services standpoint, resilience and fault tolerance play a key role in how services are implemented at EmpathyBroker (thanks in part to Netflix). To keep resilience as a core value, the company’s fail-fast and best-effort response strategies were revisited to stop any failure from spreading and to prevent downstream failures from taking down services because of memory management problems or a lack of back pressure.
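As a rough illustration of those two strategies, the sketch below shows a bounded queue that rejects work immediately when it is full (fail fast, which also acts as a simple form of back pressure) and a wrapper that returns a degraded answer when a dependency fails (best effort). This is a generic sketch, not EmpathyBroker’s implementation; BoundedWorker, Overloaded and best_effort are hypothetical names.

```python
import queue


class Overloaded(Exception):
    """Raised when the service refuses new work instead of queueing it without limit."""


class BoundedWorker:
    """Accepts requests only while there is room, so memory use stays bounded."""

    def __init__(self, max_pending):
        self._pending = queue.Queue(maxsize=max_pending)

    def submit(self, request):
        try:
            # put_nowait never blocks: a full queue fails immediately.
            self._pending.put_nowait(request)
        except queue.Full:
            raise Overloaded("too many pending requests") from None

    def take(self):
        """Return the oldest pending request."""
        return self._pending.get_nowait()


def best_effort(primary, fallback_result):
    """Wrap a call so that a failing dependency yields a degraded answer, not an error."""
    def wrapped(*args, **kwargs):
        try:
            return primary(*args, **kwargs)
        except Exception:
            return fallback_result
    return wrapped
```

Rejecting early keeps an overloaded component from accumulating requests it will never serve in time, and the caller can decide whether to retry, shed load or answer with a fallback.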

As with airplane instruments, ops set up alternative systems that provide the same information, so that if one system crashed another would still be reporting it. This is really valuable for things like tp99s, availability, error rates, and so on.
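For readers less familiar with these metrics, here is a small sketch of how a tp99 and an error rate can be computed from raw samples. The nearest-rank percentile method and the choice to count only 5xx responses as errors are assumptions of this example; monitoring systems differ in both.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def error_rate(status_codes):
    """Fraction of responses that are server errors (HTTP 5xx)."""
    if not status_codes:
        raise ValueError("no samples")
    errors = sum(1 for code in status_codes if code >= 500)
    return errors / len(status_codes)


def tp99(latencies_ms):
    """Latency that 99% of requests stay at or below."""
    return percentile(latencies_ms, 99)
```

Computing the same figure in two independent systems from the same kind of samples is what lets one act as a backup when the other is down.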

Benchmark like there’s no tomorrow

Our search and ops teams worked together for weeks tuning different parameters, detecting and improving poorly performing code paths, and modifying data storage queries. For testing, they replayed organic traffic patterns from production and pushed concurrency through the roof to get as close as possible to the actual workload observed during Black Friday.

During the tests they simulated several failure scenarios and measured how these could impact performance metrics and whether any downstream issue could spread to upstream components.

Since EmpathyBroker’s main workload is read-heavy, the search team came up with the idea of prewarming the various cache layers using actual traffic data: user queries from an average day and from past Black Fridays.
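The prewarming idea can be sketched in a few lines. This is a simplified single-layer example under stated assumptions: CachedSearch is a hypothetical in-memory cache in front of a search backend, and prewarm replays the most frequent logged queries before the peak. The real system has several cache layers and its own eviction rules, which are not modelled here.

```python
from collections import Counter


class CachedSearch:
    """Hypothetical search front end with a plain in-memory result cache."""

    def __init__(self, backend):
        self._backend = backend
        self._cache = {}
        self.backend_calls = 0  # counts cache misses that reached the backend

    def search(self, query):
        if query not in self._cache:
            self.backend_calls += 1
            self._cache[query] = self._backend(query)
        return self._cache[query]


def prewarm(service, logged_queries, top_n):
    """Replay the top_n most frequent logged queries so they are cached before the peak."""
    most_common = Counter(logged_queries).most_common(top_n)
    for query, _count in most_common:
        service.search(query)
    return [query for query, _count in most_common]
```

After prewarming, the first real users asking for the most popular queries are served from the cache instead of all hitting the backend at the same moment.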

Scale out!

Both the technical infrastructure and the human team were scaled for the event. Usually, there are three people on call duty (1st level) every day, but for such a special event, the entire engineering, ops and support teams were on call, so they were actively monitoring every relevant metric to respond as fast as possible.

In past years we’ve observed traffic peaks of 7x to 10x during the Black Friday opening. Not only does the number of concurrent users grow 10x, but user engagement and the number of actions per session also increase 4x. To increase flexibility and improve scaling capacity, the search and ops teams implemented a new version of the sharding strategy. With it in place, the ops team created several backend clusters so that they could group customers together based on performance and QPS. This way, they could switch a customer’s routing from one cluster to another on the fly, depending on how that customer’s traffic was performing.
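A routing table of that kind might look like the sketch below. It only illustrates the concept: the ClusterRouter class, the QPS thresholds and the cluster names are all hypothetical, and the grouping here uses expected QPS alone, whereas the real strategy also takes performance into account.

```python
class ClusterRouter:
    """Hypothetical routing table that maps customers to backend clusters."""

    def __init__(self, thresholds):
        # thresholds: (minimum expected QPS, cluster name) pairs, checked from highest to lowest.
        self._thresholds = sorted(thresholds, reverse=True)
        self._routes = {}

    def assign(self, customer, expected_qps):
        """Place a customer in the first cluster whose minimum QPS it reaches."""
        for min_qps, cluster in self._thresholds:
            if expected_qps >= min_qps:
                self._routes[customer] = cluster
                return cluster
        raise ValueError(f"no cluster accepts {expected_qps} QPS")

    def move(self, customer, cluster):
        """Re-route an already assigned customer at runtime."""
        if customer not in self._routes:
            raise KeyError(customer)
        self._routes[customer] = cluster

    def route(self, customer):
        """Return the cluster currently serving this customer."""
        return self._routes[customer]
```

Because the lookup goes through one table, moving a customer is a single update and needs no redeployment, which is the property that makes on-the-fly changes possible.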

With all the benchmark reporting at hand, the engineering and ops teams figured out which infrastructure slices had to scale 8x, which needed just 2x growth and which would do fine with the current provision of resources.

Show time!

Since every backend engineer and every ops and support member needed to be on call to support our customers, everyone decided to get together at our HQ in Gijon, and we set up an area on one of the floors as a situation room to keep everything closely monitored and under control.

A good part of the team planned to be at the office, so why not everybody? The rest of the team wanted to experience a Black Friday release, see first-hand how the different teams manage and collaborate to control the situation and, well, offer moral support. So everyone joined the teams in Gijon on the Thursday evening, when most of the Black Friday sales opened, to live the experience together.

Takeaways

Here are a few things the teams learned along the way about preparing for and dealing with big events like this.

When estimating the traffic for a special date like Black Friday, don’t be fooled by hourly averages, especially during the opening, because the big hit happens in the first 5–10 minutes.
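A quick calculation shows how much an hourly average can hide. The figures used in the note after the code are invented: an opening hour with 60,000 requests per minute for the first 10 minutes and 12,000 per minute afterwards. The function names are likewise only for this sketch.

```python
def average_rps(requests_per_minute):
    """Average requests/s over the whole period."""
    if not requests_per_minute:
        raise ValueError("no samples")
    return sum(requests_per_minute) / (len(requests_per_minute) * 60)


def peak_window_rps(requests_per_minute, window_minutes):
    """Highest average requests/s over any run of window_minutes consecutive minutes."""
    if not 0 < window_minutes <= len(requests_per_minute):
        raise ValueError("window must fit inside the sampled period")
    best = 0.0
    for start in range(len(requests_per_minute) - window_minutes + 1):
        window = requests_per_minute[start:start + window_minutes]
        best = max(best, sum(window) / (window_minutes * 60))
    return best
```

With the invented opening hour above, the hourly average works out to roughly 333 requests per second, while the busiest 5 minutes run at 1,000 requests per second. Capacity sized from the average would be about three times too small for the opening.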

Every benchmark, every test and every assumption tested has to be documented somewhere. It doesn’t have to be fancy; it just has to be somewhere it can be found and read again next year.

While battle testing your APIs, don’t focus solely on RPS: concurrent connections and queued operations are two important metrics that could take your services down.
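One way to see why RPS alone is not enough is Little’s law: the average number of requests in flight equals the arrival rate multiplied by the average time each request spends in the system. The sketch below applies it; the function names and the connection limit in the note are illustrative assumptions, not measured values.

```python
def in_flight_requests(rps, mean_latency_seconds):
    """Little's law: average concurrency = arrival rate x average time in the system."""
    return rps * mean_latency_seconds


def exceeds_capacity(rps, mean_latency_seconds, max_connections):
    """True when the expected concurrency is above the connection limit."""
    return in_flight_requests(rps, mean_latency_seconds) > max_connections
```

At 1,000 RPS and 50 ms average latency there are about 50 requests in flight. If a slow dependency pushes latency to 500 ms, the same 1,000 RPS means about 500 concurrent requests, which would exhaust a hypothetical limit of 200 connections even though RPS never changed.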

Communicate to your customers proactively, not just when there is an issue going on but even when everything is working as expected.

After the event, save every performance metric you have at hand. It will be very valuable when preparing for the next battle and will save your team from crunching the numbers all over again.

Know your customers’ schedule and don’t focus only on the opening time. Many shoppers hunt for last-minute deals, and your customers may send last-minute newsletters to encourage this. So pay attention to all possible peak times, as these may also appear at closing time.

Wrapping up!

It’s been rough and it took a lot of effort, but thanks to all the teams’ hard work, careful planning and preparation, everything went off without a hitch! So now we can look ahead to the next big event.