The Truth Behind Black Friday Madness in the B2B E-Commerce Industry
Behind the Screens: Insights into Teamwork and Challenges during Black Friday Chaos
I bet you love Black Friday. As a customer, it’s a great opportunity to purchase wishlist items at a significant discount. Nowadays, you can do it from the sofa, which is even more comfortable. As you can guess, you’re not the only one who has discovered that trick - your family, friends, and everyone else are doing the same. While you and your buddies are enjoying those sales, someone on the other side is paying the price. E-commerce sites handle two or three times the traffic of a normal day. To a shopper, everything may look and feel like any other day, but there’s a big challenge under the hood. Engineering teams put in a lot of effort to ensure that the website won’t get stuck and that you’ll have a great shopping experience.
Despite sounding like a win-win situation, if you’ve been working in the e-commerce industry for a while, you might find it frustrating. While it should be the most profitable day of the year, there’s a high probability that your website will be out of service. Too many people rush to buy your goods at the same time, and your system probably won’t handle that. It has a bottleneck somewhere, no matter how much effort you invest to make it auto-scaling. The problem is that we usually discover it when it’s too late. Over the years, we’ve created a checklist with manual preparations to ensure our readiness.
Having worked at Dynamic Yield for a few years now and survived many Black Fridays, I’ll share the procedures we follow to deliver the best for our clients. Dynamic Yield is a B2B company that serves hundreds of e-commerce businesses every day. In other words, we collect, process, and serve the data of our customers’ customers (end users), handling millions of online users a day, who generate tens of thousands of events every second.
Black Friday is the time of year that we work for and wait for. It’s a moment of truth for every engineer who has spent time trying to build a state-of-the-art system that can handle such a scale.
While serving millions of users daily is challenging enough, handling double or triple the number of users and their requests over a single weekend is even harder. Here is what it takes from an engineering perspective:
Black Friday Preparations
- R&D leaders meeting. We gather to make sure everyone is aware and aligned. Over the last few years, our R&D department has grown rapidly: many new employees joined the ride, and many new components were added. We need to make sure that everyone understands the scale of this event and that nothing is missed.
- Mapping all the services. We maintain a list of all our services, so every team can derive action items as the next step. In addition, we keep a list of all shared resources, such as databases. Some are managed services (such as Redis, MSK, and Scylla), and others are self-managed (Elasticsearch, Prometheus, etc.).
- Estimations, based on:
  - Typical loads, utilization, traffic, etc.
  - Previous years' Black Friday trends.
  - Expectations (compared to the daily routine) that we get from our customers.
  - Whether the service is already elastic (Horizontal Pod Autoscaler in Kubernetes, Auto Scaling group in EC2, etc.) or not.
  - Cost consequences.

  We then decide on the target numbers and tune the (temporary) limits.
- Scale-out party. We assign scale-out tasks to engineers, for instance:
  - Increase Elasticsearch coordinating nodes by 150%.
  - Add another replica shard for all Elasticsearch indices, for better availability.
  - Set a new (temporary) maximum threshold for pod/instance replicas in each of our services.
- Contact our third-party providers. We share estimation details such as expected traffic and required resources (compute, memory, storage, etc.), ask them to increase those resources one week ahead (most Black Friday sales tend to start a few days early), and reconfigure what we can ourselves (like maximum throughput in Redis).
- Polish our monitoring dashboards and ensure that we have visibility into everything. We use Grafana with a variety of data sources such as Prometheus and Graphite, alongside other observability tools like OpenSearch and Tempo. With so many code contributors, it’s hard to keep that many Grafana dashboards maintained, so one month before Black Friday is a great opportunity to review, clean up, and update all of them, making sure they are current and focused on what matters.
- Reviewing and challenging our disaster recovery plans (DRP). Over the years we learned (the hard way) what can go wrong and how to bypass the problem. Although we aim to build a robust and resilient system, we want to make sure that if one service is out of order there’s no domino effect and we can live without a single feature for a short time rather than being completely out of service.
- Code freeze. Starting a few weeks ahead of Black Friday, we avoid deploying new features and deploy only bug fixes or patches. During that period, most of our engineers develop new features locally or in remote development and staging environments, holding off on production deployment until December.
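To make the estimation and scale-out steps a bit more concrete, here is a minimal Python sketch of how a temporary replica ceiling could be derived from a typical peak and an expected traffic multiplier. It is an illustration, not our actual tooling: the service names, the per-replica throughput figures, and the 25% headroom are all made-up assumptions.

```python
import math
from dataclasses import dataclass


@dataclass
class ServiceLoad:
    """Hypothetical per-service inputs for the estimation step."""
    name: str
    typical_peak_rps: float    # peak requests/second on a normal day
    rps_per_replica: float     # load one replica can sustain (assumed, e.g. from load tests)
    current_max_replicas: int  # current autoscaler ceiling


def target_max_replicas(svc: ServiceLoad, traffic_multiplier: float,
                        headroom: float = 0.25) -> int:
    """Replica ceiling needed for the expected peak plus a safety margin."""
    expected_peak = svc.typical_peak_rps * traffic_multiplier
    needed = math.ceil(expected_peak * (1 + headroom) / svc.rps_per_replica)
    # Preparation should only ever raise a ceiling, never lower it.
    return max(needed, svc.current_max_replicas)


# Illustrative numbers only; these are not real services or measurements.
services = [
    ServiceLoad("events-collector", 20_000, 500, 60),
    ServiceLoad("recommendations-api", 3_000, 200, 40),
]

for svc in services:
    new_max = target_max_replicas(svc, traffic_multiplier=3.0)
    print(f"{svc.name}: max replicas {svc.current_max_replicas} -> {new_max}")
```

The multiplier of 3 mirrors the "two or three times the traffic" rule of thumb. In practice each team would plug in its own measurements, and the resulting number would still be weighed against the cost consequences before anyone touches an autoscaler limit.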
During Friday
On Friday morning/noon, we open a war room - a shared, large meeting space with a representative engineer/leader from each team. That way we can quickly tackle any unplanned scenario, collaborate effectively, and get the most accurate up-to-date picture of what’s going on.
The atmosphere during that evening has improved over the years. As we grew and gained experience, the atmosphere shifted from tension, alertness, and awareness to excitement and anticipation of achieving new traffic records.
Imagine one large table (expanding over the years) surrounded by laptop-armed software engineers, with chargers and power splitters scattered around and at least two huge screens presenting real-time graphs.
The event is also indulgent, featuring a tremendous variety of snacks, beers, and pizzas.
We usually turn off the lights a few hours after peak time in both the US and Europe, and brief our on-call engineers in case something goes wrong early in the morning.
Lastly, our engineering leader sends a proud email with some geeky screenshots of the record-high traffic spikes.
Post Black Friday
One convenient fact about Black Friday is that it comes right before Cyber Monday, another big e-commerce event, so we keep the extra resources for a few more days (until Cyber Monday is over) and only then scale back in and down.
Then we conduct a postmortem: we discuss what went well and what went wrong, derive action items, and prioritize them to ensure we learn and improve from year to year.
We also examine the new high achieved during those days and make sure that our system thresholds are tuned accordingly. We also summarize it for reference in the coming year.
A few days later, when the data is available, we perform a cost analysis to see how much we spent during this holiday period compared to a typical day.
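As a rough illustration of that comparison, the sketch below measures the holiday period against an average typical day. The function name and all cost figures are invented for the example; real numbers would come from the cloud provider's billing export.

```python
from statistics import mean


def cost_overhead(baseline_daily_costs, event_daily_costs):
    """Return (extra spend over the event, average event day / typical day)."""
    baseline = mean(baseline_daily_costs)
    # Extra spend: what each event day cost beyond a typical day, summed up.
    extra = sum(cost - baseline for cost in event_daily_costs)
    ratio = mean(event_daily_costs) / baseline
    return extra, ratio


# Made-up daily costs, in arbitrary currency units.
typical_days = [1000.0, 1100.0, 900.0]
holiday_days = [1500.0, 2000.0, 1800.0, 1700.0]

extra, ratio = cost_overhead(typical_days, holiday_days)
print(f"Extra spend: {extra:.0f}, average holiday day = {ratio:.2f}x a typical day")
```

Keeping both numbers is useful: the ratio is easy to compare from year to year, while the absolute extra spend is what ends up in the budget discussion.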
Conclusions
There’s no doubt that Black Friday is an exciting, sparkling, and shiny event, with many discounts and presents that bring joy to everyone. This story emphasizes the extensive work behind the scenes to ensure a smooth operation, making Black Friday feel like just another day. Countless efforts were made around the clock to deliver a seamless experience!
I hope you enjoyed this post, and I would love to learn how you prepare for Black Friday. Please leave a comment below :)