The Holidays at Bluecore: Helping Santa Deliver an Unprecedented Volume of Gifts, One Email at a Time
The holiday shopping season in November and December produces the highest revenue for Bluecore’s customers: retailers and brands. The peak of the season is Black Friday and Cyber Monday. With large spikes in traffic, retailers’ websites are pushed to new extremes. Each year there are reports of retailers’ sites experiencing outages because of the increased site activity. Every minute offline can mean large losses in revenue.
Retailers depend on Bluecore to drive revenue from highly personalized email campaigns during this time of year. The volume of data we ingest follows our customers’ traffic, so we expect and prepare for a massive spike in activity. The amount of traffic leaving our system also increases greatly. We need a system resilient enough to support our customers without the cost of heavily over-provisioning resources.
To prepare for higher traffic during the holidays, we’ve found four practices highly effective: reviewing the prior year’s results, forecasting traffic changes, load testing, and fixing issues or coming up with “creative” short-term solutions.
The Engineering team kicks off preparations three months before Thanksgiving. The technical leads for each team meet for a high-level review of last year’s planning process and brainstorm the specific technical investments we know are necessary for this year. This year’s list of items needing work or investigation included:
- Checking if our load testing tools from last year can simulate the expected load for this year.
- Coordinating with third-party partners that could be affected by additional load.
- Determining if we will need to provision additional resources from our underlying cloud computing provider, Google Cloud Platform.
- Auditing the new features we have built this year that could be affected by higher load.
After this kickoff meeting, we met every other week to prioritize what we needed to do to be holiday ready and to prepare for and review results of load tests of our system.
These meetings provide us with LOTS of ideas and issues. In order to prioritize what to focus on first we go through the following questions:
- How likely is this? Has this ever happened before?
- What is the impact if it does happen?
- Is there a workaround if it does happen? (e.g. can we pause something or turn it off?)
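The three questions above can be turned into a rough ordering. The following is a minimal sketch, not our actual process: the 1 to 5 scales, the rule that a tested workaround halves the urgency, and the example risks are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Risk:
    name: str
    likelihood: int       # 1 (never seen before) to 5 (happened last year)
    impact: int           # 1 (barely noticeable) to 5 (customer-facing outage)
    has_workaround: bool  # can we pause it or turn it off?


def priority(risk: Risk) -> float:
    """Higher score means look at it sooner (hypothetical scoring rule)."""
    score = float(risk.likelihood * risk.impact)
    # Assumption: a known workaround halves the urgency of a pre-holiday fix.
    return score / 2 if risk.has_workaround else score


def prioritize(risks: List[Risk]) -> List[Risk]:
    """Return the risks sorted from most to least urgent."""
    return sorted(risks, key=priority, reverse=True)


# Hypothetical examples, not real findings.
risks = [
    Risk("slow internal dashboard", likelihood=3, impact=2, has_workaround=True),
    Risk("ingestion queue backlog", likelihood=4, impact=5, has_workaround=False),
    Risk("third-party rate limit", likelihood=2, impact=5, has_workaround=True),
]

for risk in prioritize(risks):
    print(f"{priority(risk):5.1f}  {risk.name}")
```

A score like this only gives a starting order for discussion; the judgment calls still happen in the meeting.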
When we find high-priority issues that need pre-holiday fixes but would require time-consuming or high-risk changes, we try to see if we can implement some of the following “creative” solutions for the short term:
- Over-provisioning: Throw money and computing resources at the problem to avoid spending engineering time.
- Documentation/runbooks: Test and document how to turn something off or implement a workaround.
- Disabling: Turn off something that isn’t needed right now.
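As an illustration of the "turn something off" option, here is a minimal sketch of a kill switch read from an environment variable. The `FEATURE_` naming scheme, the `nightly_report` feature, and the `handle_request` function are hypothetical and only show the shape of the idea.

```python
import os

# Values that count as "off" for a switch (an assumption for this sketch).
OFF_VALUES = {"0", "off", "false", "no"}


def feature_enabled(name: str, default: bool = True) -> bool:
    """Return False when FEATURE_<NAME> is set to an 'off' value."""
    raw = os.environ.get(f"FEATURE_{name.upper()}")
    if raw is None:
        return default
    return raw.strip().lower() not in OFF_VALUES


def handle_request(payload: dict) -> dict:
    """Do the essential work, and the optional work only if it is enabled."""
    result = {"processed": True}
    # Non-essential work is skipped while the switch is off.
    if feature_enabled("nightly_report"):
        result["report_queued"] = True
    return result
```

With a switch like this, the runbook entry can be as short as "set `FEATURE_NIGHTLY_REPORT=off` and restart", which is easy to test ahead of time.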
Our main concerns when it comes to the holidays are:
- The volume of events ingested from shoppers on our clients’ websites.
- The number of emails leaving our system.
Based on previous years, we estimated that our event ingestion volume this year would be about 4x our peak load on a normal day. Our outgoing emails are less predictable as we have increased the ability of our clients to set up their own batch emails. However, we used previous years as well as our understanding of how clients use email during the holiday season to estimate that we would get at most 10x peak load.
We wanted to ensure we were prepared to handle the increase in traffic. The only way to demonstrate to ourselves, our leadership and, most importantly, our customers that we could handle the increased volumes was to simulate them.
A well-known tool in resilience engineering is the Gameday: an exercise that tests the response of a system to some simulated event (failures, extreme volume shifts, degradation of system resources). At Bluecore, we use this to simulate a situation (e.g. 4x peak traffic in event ingestion) over the course of a few hours. In order to get a meaningful understanding of our system, we break our simulation into progressive steps (e.g. 2x, 3x, 4x peak traffic). As part of our simulation, we take a step, validate our assumptions on how the system would respond, note any irregularities, and if nothing is broken, we take the next step.
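The step-by-step procedure can be sketched as a small loop. This is an illustrative outline under stated assumptions, not our tooling: `apply_load` and `is_healthy` are hypothetical callbacks standing in for the load generator and the monitoring checks.

```python
from typing import Callable, Iterable, List, Tuple


def run_gameday(
    multipliers: Iterable[float],
    apply_load: Callable[[float], None],
    is_healthy: Callable[[float], bool],
) -> Tuple[List[float], List[str]]:
    """Step through load multipliers, stopping at the first unhealthy step."""
    completed: List[float] = []
    notes: List[str] = []
    try:
        for m in multipliers:
            apply_load(m)              # e.g. replay traffic at m times peak
            if not is_healthy(m):      # validate assumptions before moving on
                notes.append(f"stopped at {m}x: health check failed")
                break
            completed.append(m)
            notes.append(f"{m}x ok")
    finally:
        apply_load(0)                  # always turn the simulated load off
    return completed, notes


# Stand-in system that copes with up to 3x peak (hypothetical).
applied: List[float] = []
completed, notes = run_gameday([2, 3, 4], applied.append, lambda m: m <= 3)
print(completed)
print(notes)
```

Stopping at the first failed step keeps the blast radius small and tells us which multiplier to investigate before the next gameday.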
For each of our holiday readiness gamedays, we held a brainstorming session for customer-impacting questions we wanted to answer and then figured out how to monitor and test each question. Some examples were:
- Can we handle 8x event ingestion traffic (twice the 4x peak we expected, to allow for some margin of error)?
- Can we handle 10x email sending volume for highly personalized emails?
- Will the added load cause our analytics pipelines to slow down?
We then identified conditions that the system needed to meet in the situation:
- Event ingestion processing throughput stays high.
- Throughput of personalized email campaigns stays high.
- Analytics data stays up-to-date.
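Conditions like these are easier to check during a gameday when each one has a number attached. The sketch below assumes hypothetical metric names and thresholds; the real values depend on the system and were not part of this write-up.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Snapshot:
    events_per_sec: float     # observed event ingestion processing throughput
    emails_per_sec: float     # observed personalized email throughput
    analytics_lag_sec: float  # age of the newest analytics data


@dataclass
class Targets:
    min_events_per_sec: float
    min_emails_per_sec: float
    max_analytics_lag_sec: float


def failed_conditions(snapshot: Snapshot, targets: Targets) -> List[str]:
    """Return the names of the conditions the system is not meeting."""
    failures: List[str] = []
    if snapshot.events_per_sec < targets.min_events_per_sec:
        failures.append("event ingestion throughput")
    if snapshot.emails_per_sec < targets.min_emails_per_sec:
        failures.append("personalized email throughput")
    if snapshot.analytics_lag_sec > targets.max_analytics_lag_sec:
        failures.append("analytics freshness")
    return failures


# Hypothetical thresholds and readings, for illustration only.
targets = Targets(min_events_per_sec=1000, min_emails_per_sec=200,
                  max_analytics_lag_sec=300)
print(failed_conditions(Snapshot(1200, 250, 60), targets))
print(failed_conditions(Snapshot(900, 250, 600), targets))
```

An empty list at a given step means it is safe to take the next one.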
Finally, we identified any tools we would need in order to test our system. These tools were configurable so that we could turn a test on and off and change the test criteria.
For the event ingestion scenario we created two tools:
- A load testing tool that took real event data, stripped out personal information and replayed it to our test client endpoint.
- A stubbed customer service that would mimic whether customers opened/clicked emails.
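The core of a replay tool like the first one might look like the sketch below. It is an illustration only: the field names in `PII_FIELDS`, the event shape, and the `send` callback (standing in for a request to the test client endpoint) are assumptions.

```python
from typing import Any, Callable, Iterable

# Hypothetical names of fields that hold personal information.
PII_FIELDS = {"email", "first_name", "last_name", "ip_address"}


def scrub(value: Any) -> Any:
    """Return a copy of the value with personal fields removed at any depth."""
    if isinstance(value, dict):
        return {k: scrub(v) for k, v in value.items() if k not in PII_FIELDS}
    if isinstance(value, list):
        return [scrub(v) for v in value]
    return value


def replay(
    events: Iterable[dict],
    send: Callable[[dict], None],
    multiplier: int = 1,
) -> int:
    """Send each scrubbed event `multiplier` times; return how many were sent."""
    sent = 0
    for event in events:
        clean = scrub(event)
        for _ in range(multiplier):
            send(clean)
            sent += 1
    return sent


# Example run against an in-memory list instead of a real endpoint.
recorded = [
    {"type": "view", "email": "shopper@example.com",
     "product": {"id": 1, "title": "scarf"},
     "session": {"ip_address": "203.0.113.7", "page": "/scarves"}},
]
outbox: list = []
count = replay(recorded, outbox.append, multiplier=2)
print(count, outbox[0])
```

Scrubbing before sending means the personal fields never leave the process that reads the recorded data.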
For the email sending scenario we created the following tool:
- A tool that took real event data, stripped out personal information and added it to a fake client. This enabled us to set up email campaigns using the fake data and make sure they could make it through our system.
With these questions, conditions, and tools we began our gamedays. After each gameday we would take the results, fix any issues that we saw, then rinse and repeat!
How we did
We made it through the 2018 holidays without any outages! Our preparations showed us where we had weak points, so we could fix them where possible or work around them while we plan the longer-term fixes that will allow us to continue to scale in the coming year. It was a massive group effort across not only our Engineering team but also the entire company!