The Challenges of Scale, Black Friday and Cyber Monday

Tom Christ
Simon Systems
Jan 20, 2022 · 4 min read

The Meaning of Black Friday, literally and figuratively

Traditionally, Black Friday has held great significance for the retail world. The name comes from an old accounting practice of recording losses in red and profits in black. The Friday after Thanksgiving was the biggest shopping day of the year, the day when retailers could expect their books to show black balances. Many retailers, including the partners we work with at Simon Data, therefore use this day to kick off sales and promotions in hopes of boosting their end-of-year sales.

However, the shopping window has expanded as more shopping has moved online. In 2005, the National Retail Federation coined the term Cyber Monday to refer to the deals that often run on the Monday after Thanksgiving. Today, the boundaries are not as clear-cut. Sales may run from Black Friday through Cyber Monday and beyond, which makes that entire week essential to building awareness of deals and driving overall company revenue.

What are the challenges, and how do we address them?

Traffic and Events

Due to the high levels of anticipation and the deals to be had, many clients see elevated traffic on their websites during the Black Friday and Cyber Monday (BF/CM) timeframe. With this traffic comes elevated browsing and transactional event volume. In our data, incoming events increase by up to three times compared with the Wednesdays before and after BF/CM, and they stay elevated over the weekend between the two days.

These events can be considered a proxy for the increase in resource requirements placed upon a system. As any engineer can tell you, an increase in throughput of up to 3x at peak can significantly impact application and data platforms.

Marketing Blasts

A common use case is reactivating historical clients to raise awareness of BF/CM deals. These campaigns can involve messaging a very large population of contacts. Simon Data has extensive experience handling surges in throughput (large marketing email campaigns typically produce a spike in event volume that then decays exponentially). During BF/CM, these surges can cause issues with data readiness and with orchestration across providers. Because these large-scale campaigns are so common, we reviewed the capacity of the systems that cannot scale automatically (and many that can!). We then adjusted them based on historical increases in traffic, plus headroom for unexpected growth.
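To make the sizing step concrete, here is a minimal sketch of how a fixed-capacity system might be provisioned from a historical peak plus headroom. The function name, the 25% headroom, and the per-worker throughput are assumptions chosen for illustration. Only the "up to 3x" multiplier comes from the traffic pattern described earlier, and none of this is Simon Data's actual tooling.

```python
import math


def required_workers(baseline_eps, peak_multiplier=3.0,
                     headroom=0.25, worker_eps=500.0):
    """Estimate how many workers a non-autoscaling system needs.

    baseline_eps    -- normal events per second (hypothetical input)
    peak_multiplier -- historical peak relative to baseline (up to 3x here)
    headroom        -- extra fraction reserved for unexpected growth (assumed)
    worker_eps      -- events per second one worker can sustain (assumed)
    """
    if baseline_eps < 0 or worker_eps <= 0:
        raise ValueError("baseline must be >= 0 and worker_eps must be > 0")
    expected_peak = baseline_eps * peak_multiplier
    target = expected_peak * (1.0 + headroom)
    # Round up: a fractional worker still has to be provisioned as a whole one.
    return math.ceil(target / worker_eps)


if __name__ == "__main__":
    print(required_workers(1000))
```

Rounding up and keeping the headroom as an explicit parameter makes it easy to revisit the estimate each year as the historical peak changes.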

Scale

For the reasons stated above, and various others, BF/CM is a period of increased usage and throughput. Simon Data applies industry-standard approaches to scale our infrastructure. Our data processing pipeline involves various scaling strategies, depending on the technology and the usage pattern. Some methods include sharding, leveraging AWS auto-scaling configurations, rate limiting, batching, escape hatches, and all the other approaches in an engineer’s toolbox.
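Two of the methods named above, rate limiting and batching, can be sketched in a few lines. This is a generic illustration under stated assumptions, not Simon Data's implementation: the `TokenBucket` class, the `chunked` helper, and all parameter values are hypothetical names introduced here.

```python
import time


class TokenBucket:
    """Token-bucket rate limiter: allows bursts up to `capacity`,
    then refills at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._clock = clock          # injectable so the limiter can be tested
        self._tokens = float(capacity)
        self._last = clock()

    def allow(self, cost=1.0):
        now = self._clock()
        # Refill in proportion to elapsed time, never above capacity.
        self._tokens = min(self.capacity,
                           self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


def chunked(items, size):
    """Yield lists of at most `size` items so downstream calls are batched."""
    if size <= 0:
        raise ValueError("size must be positive")
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch
```

A caller would typically check `allow()` before each outbound request and send events in groups produced by `chunked`, which trades a little latency for fewer, larger calls.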

Sometimes, a piece of the pipeline isn’t amenable to approaches that scale dynamically. The cause could be a piece of underlying technology, a lack of reliable signaling, or something else. A variety of problems can arise when this is the case, including long-running batch jobs, delays in data refreshes, and latency when processing events. These issues can be especially damaging in streaming event processing, because event queues are inherently opaque. A faltering link in the chain can cause soft failures (e.g., a single shard in an event stream becoming overloaded, issues with routing keys, or intermittent failures in a stream lookup resource), which cause backpressure all the way back to event ingestion. Symptoms of these soft failures can include extensive buffering, too many open resources, or spikes in resource utilization as the pipeline drains.
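The backpressure described here can be illustrated with a bounded queue between two stages. This is a simplified single-process sketch built on Python's standard `queue` module. The `BoundedStage` class and its method names are hypothetical, and a real streaming pipeline would rely on its broker's own mechanisms instead.

```python
import queue


class BoundedStage:
    """A pipeline stage with a fixed-size buffer.

    When the buffer is full, `offer` refuses the event instead of buffering
    without limit, so the upstream producer can slow down or shed load.
    """

    def __init__(self, maxsize):
        self._q = queue.Queue(maxsize=maxsize)
        self.rejected = 0            # counter that makes soft failures visible

    def offer(self, event):
        try:
            self._q.put_nowait(event)
            return True
        except queue.Full:
            self.rejected += 1
            return False

    def drain(self, max_items):
        """Remove and return up to `max_items` events, oldest first."""
        out = []
        while len(out) < max_items:
            try:
                out.append(self._q.get_nowait())
            except queue.Empty:
                break
        return out

    def depth(self):
        """Current buffer depth, a useful metric to alert on."""
        return self._q.qsize()
```

Exposing `depth()` and `rejected` as metrics is one way to make an otherwise opaque queue observable before buffering turns into resource exhaustion.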

Coordination

In addition to the inherent challenges of data processing, Simon Data’s platform sits at the center of the marketing machine for our clients, which means integrating closely with various partners and providers. Each of them has different operational characteristics and behaves differently. These differences show up under load: different API status codes, different strategies for delivering larger-than-expected datasets, and different ways of handling the failed parts of a batch. This means that we face a set of challenges that is the cross-product of our partners and their networks of providers.
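One way to cope with providers that report failures differently is to normalize their responses into a small set of outcomes before deciding what to retry. The sketch below assumes a provider returns one HTTP status code per item in a batch. The function names and the specific set of retryable codes are illustrative choices, not a description of any particular partner's API.

```python
# Status codes commonly treated as transient; the exact set is an assumption
# and would be tuned per provider.
RETRYABLE = frozenset({429, 500, 502, 503, 504})


def classify(status):
    """Map an HTTP status code to 'ok', 'retry', or 'failed'."""
    if 200 <= status < 300:
        return "ok"
    if status in RETRYABLE:
        return "retry"
    return "failed"


def split_batch(results):
    """Group item ids by outcome.

    results -- dict mapping a hypothetical item id to its status code
    """
    buckets = {"ok": [], "retry": [], "failed": []}
    for item_id, status in results.items():
        buckets[classify(status)].append(item_id)
    return buckets


if __name__ == "__main__":
    print(split_batch({"a": 200, "b": 429, "c": 400}))
```

Only the `retry` bucket would be resubmitted, ideally with a delay, while the `failed` bucket is surfaced for investigation rather than retried indefinitely.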

All of the above makes BF/CM a challenging (but important!) time.

Given all that, how did we do this year?

We helped our partners launch marketing campaigns that typically have a very high ROI. Many of these were bespoke for the launch of the holiday shopping season. We experienced no major outages and helped address the unforeseen issues that arose. Some of these involved catching content problems in campaigns about to be deployed and validation errors in datasets created for BF/CM campaigns. Robust validation and error detection are a core competency at Simon Data. We can provide this level of assistance because of our position as both a CDP and an orchestrator of marketing technologies.

We’ve run into issues with underlying infrastructure in years past, specifically an AWS Kinesis and Lambda outage in our primary region. The resiliency of our architecture kept us from losing data and allowed us to recover rapidly once the outage had run its course.

Over the course of the BF/CM period, we ingested over 750 million streaming events, messaged hundreds of millions of contacts, and took just over 1 billion messaging actions! These actions include sending emails, syncing contacts to and from a list, and the variety of other marketing actions that customers can execute on Simon Data.
