The following graph illustrates why the fourth quarter is significant for many e-commerce companies. It’s a typical daily order creation graph for part of the fourth quarter. Even without any labels, you can clearly pick out Black Friday and Cyber Monday. At Groupon, the Product and Engineering organization, in partnership with Business stakeholders, runs a peak readiness program in the fourth quarter to ensure our platform and services can scale to handle increased demand and continue to deliver great experiences for Groupon users, both consumers and merchants.
This year has been anything but normal. In addition, we are in the midst of a cloud migration. Together, these pose some interesting challenges for peak readiness planning. This article provides a glimpse of how we are tackling these challenges and getting ready for this year’s peak period.
As a local experiences marketplace, Groupon’s goal is to seamlessly connect consumers with merchants. To that end, our business teams have specific goals around how we surface what merchants have to offer to meet increased consumer demand during the peak period. On the engineering side, this translates into two key program goals:
- Achieve or exceed platform availability and latency SLOs, where the platform comprises all underlying services responsible for delivering user experiences through Groupon sites and mobile applications.
- Execute marketing campaigns successfully.
The overall platform availability SLO is 99.95% and we have specific latency SLOs for consumer experiences such as Home Page, Deal Page, Browse, and Checkout. In addition, we have specific goals around MTTR as well as the Change Control process. How we calculate the overall platform availability is a topic for another blog post.
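To put the 99.95% target in perspective, the short sketch below converts an availability SLO into a downtime budget for a given window. It assumes a simple time-based definition of availability (fraction of time the platform is up), which may differ from how Groupon actually calculates it. The function name `downtime_budget` and the 8-day window (Tuesday before Black Friday through Tuesday after Cyber Monday, counted inclusively) are illustrative assumptions.

```python
from datetime import timedelta


def downtime_budget(slo: float, window: timedelta) -> timedelta:
    """Return the maximum downtime allowed in `window` while still meeting `slo`.

    Assumes a purely time-based availability definition (illustrative only).
    """
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    return window * (1.0 - slo)


PLATFORM_SLO = 0.9995  # the 99.95% availability SLO quoted in the article

for label, window in [
    ("30 days", timedelta(days=30)),
    ("8-day peak week", timedelta(days=8)),  # assumed inclusive day count
]:
    minutes = downtime_budget(PLATFORM_SLO, window).total_seconds() / 60
    print(f"{label}: about {minutes:.2f} minutes of downtime budget")
```

Under this assumption, 99.95% leaves roughly 21.6 minutes of downtime over 30 days and under 6 minutes over the peak week, which is why fast recovery matters as much as prevention.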
Hitting these goals is an interesting challenge at Groupon’s scale: two brands (Groupon and LivingSocial), 15 countries, ~70 million users*, ~230 million daily deal impressions, multiple channels (Health-Beauty-Wellness, Food & Drink, Goods, Things-To-Do, Travel, Coupons, and more), four application platforms (desktop, mobile web, iOS, Android), 600+ microservices, and an API platform that handles millions of requests per minute (RPM).
* Number of users who visited Groupon sites/apps in the last 30 days
We have had our share of challenges in previous years where user experience was impacted during peak periods. We have learned from those experiences and taken corrective actions. As a result, we have been able to consistently achieve peak period engineering goals for the last couple of years. One of the key changes has been ‘continuous operational readiness’ where the emphasis is on all services being operationally ready at all times, not just in the fourth quarter. We have built automation via internal tools like Service Portal to track continuous operational readiness.
This year’s peak readiness program builds on the progress to date and aims to make it more efficient by simplifying the overall program structure and empowering teams to do more while holding the same quality bar.
The team empowerment and program simplification strategy is influenced by the following factors:
- There is a greater need than ever to innovate faster
- Engineering teams have more limited bandwidth than ever
- There is a need to counterbalance the added short-term complexity introduced by the cloud migration
With more power comes more responsibility. As teams are empowered with a decentralized decision-making model, it’s also implied that teams are not only responsible for achieving SLOs for their services but also for not adversely impacting SLOs of dependent services.
Cloud elasticity will be a huge plus for the peak readiness effort in the long run. This year, however, we are in the midst of the migration, which adds more complexity to the mix. Right now we are operating in hybrid mode: some services are in the cloud, others are still on-prem, and some have footprints in both. We are addressing this complexity in different ways:
- In order to minimize latencies due to calls across on-prem and cloud, we are staging cloud migration in logical groups of different layers that constitute the overall platform stack. In the first phase, we migrated all frontend applications and the API layer.
- We have built tools and services that allow us to quickly ramp the traffic flowing through the cloud up or down.
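One common way to build such a traffic ramp is a percentage dial with deterministic bucketing, so that a given user stays on the same side as the dial moves up. The sketch below is not Groupon’s tooling; `TrafficDial`, the `"cloud"`/`"on_prem"` labels, and the choice of hashing a user ID are all assumptions made for illustration.

```python
import hashlib


class TrafficDial:
    """Hypothetical dial that sends a percentage of users to the cloud stack."""

    def __init__(self, cloud_percent: int = 0) -> None:
        self.set_cloud_percent(cloud_percent)

    def set_cloud_percent(self, percent: int) -> None:
        """Ramp cloud traffic up or down; takes effect on the next routing decision."""
        if not 0 <= percent <= 100:
            raise ValueError("percent must be between 0 and 100")
        self.cloud_percent = percent

    def route(self, user_id: str) -> str:
        """Map a user to a stable bucket in [0, 100) and compare it to the dial."""
        digest = hashlib.sha256(user_id.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "big") % 100
        return "cloud" if bucket < self.cloud_percent else "on_prem"


dial = TrafficDial(cloud_percent=10)
print(dial.route("user-42"))   # stable answer for this user at 10%
dial.set_cloud_percent(0)      # emergency ramp-down: everything goes on-prem
print(dial.route("user-42"))   # now always "on_prem"
```

Because the bucket depends only on the user ID, raising the dial from 25% to 50% only adds users to the cloud side and never moves anyone back, which keeps individual sessions consistent during a ramp.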
We divide the peak readiness effort into multiple tracks such as Performance Testing, Marketing, Operations, Customer Support, Mobile Applications, Merchandising, and Business Reporting. Each has a track captain who leads the overall effort for a set of outcomes defined as track deliverables.
Performance Testing is one of the most critical tracks. We update the overall traffic model using multiple data points, including last year’s peak traffic, current traffic patterns, and planned marketing campaigns. The updated traffic model forms the basis for the current year’s load test targets. Through a combination of individual service load tests and aggregated load tests that span on-prem and cloud, we then simulate expected peak user traffic patterns across Groupon sites and mobile applications to validate overall operational readiness. The aggregated load tests run on a regular cadence, which helps us catch and fix regressions quickly.
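As a rough illustration of how those three inputs could combine into a load test target, the sketch below scales last year’s peak by year-over-year growth, an expected campaign uplift, and a safety margin. The formula, the field names, and every number in the example are assumptions for illustration; the article does not describe the actual traffic model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrafficModelInputs:
    """Hypothetical inputs; all rates are in requests per minute (RPM)."""

    last_year_peak_rpm: float
    last_year_baseline_rpm: float   # typical traffic before last year's peak
    current_baseline_rpm: float     # typical traffic in the same period this year
    campaign_uplift: float = 1.0    # assumed multiplier for planned marketing campaigns
    safety_margin: float = 1.2      # assumed headroom on top of the forecast


def load_test_target_rpm(inputs: TrafficModelInputs) -> float:
    """Estimate a load test target from last year's peak and current trends."""
    if inputs.last_year_baseline_rpm <= 0:
        raise ValueError("last_year_baseline_rpm must be positive")
    # Year-over-year growth inferred from the change in baseline traffic.
    growth = inputs.current_baseline_rpm / inputs.last_year_baseline_rpm
    forecast_peak = inputs.last_year_peak_rpm * growth * inputs.campaign_uplift
    return forecast_peak * inputs.safety_margin


example = TrafficModelInputs(
    last_year_peak_rpm=1_000_000,      # made-up numbers, not Groupon data
    last_year_baseline_rpm=400_000,
    current_baseline_rpm=500_000,
    campaign_uplift=1.1,
)
print(f"Load test target: {load_test_target_rpm(example):,.0f} RPM")
```

With these made-up inputs the target comes out at roughly 1.65 million RPM. In practice a model like this would likely be computed per service or per traffic type rather than as a single platform-wide number.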
Other key deliverables include:
- Site levers: These include mechanisms like rate limiting, circuit breakers, and the ability to dial cache levels up or down across different layers within the platform stack. Site levers are like an insurance policy: we hope we don’t have to use them, but if we run into major issues, we have the means to minimize the overall impact.
- Mobile applications stability and scalability: This includes, among other things, auditing the application startup behavior for any regressions and inefficiencies as well as validating that overall latencies are within an acceptable range at high load conditions.
- Fallback plans for third-party service dependencies: It may not be practical to have fallback plans for all third-party service providers. It boils down to risk categorization and mitigation plans. For example, it’s crucial to have a backup payment provider to route payments through in case of issues with the primary payment provider service.
- Ability to dynamically change the site-mix: This involves tweaking relevance algorithms to align with merchandising plans. It provides flexibility to dynamically change the site-mix, that is, the combination of deal types (local, goods, travel) to be surfaced and promoted.
- ‘Run of Show’ for all tasks to be performed during peak week (i.e. Tuesday before Black Friday through Tuesday after Cyber Monday).
- Graphs, dashboards, and reports to track various metrics.
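As one illustration of how a site lever and a fallback plan from the list above can work together, the sketch below wraps a primary payment provider in a minimal circuit breaker and routes payments to a backup provider while the breaker is open. This is a generic sketch of the circuit breaker pattern, not Groupon’s implementation; `CircuitBreaker`, `charge`, the thresholds, and the provider callables are all hypothetical.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Minimal single-threaded circuit breaker (illustrative only)."""

    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means the breaker is closed

    def allow(self) -> bool:
        """Closed: always allow. Open: allow a trial call once the timeout has elapsed."""
        if self._opened_at is None:
            return True
        return self._clock() - self._opened_at >= self.reset_timeout

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()  # open, or re-open after a failed trial


def charge(
    amount_cents: int,
    primary: Callable[[int], str],
    backup: Callable[[int], str],
    breaker: CircuitBreaker,
) -> str:
    """Try the primary provider unless the breaker is open; otherwise use the backup."""
    if breaker.allow():
        try:
            result = primary(amount_cents)
        except Exception:
            breaker.record_failure()
        else:
            breaker.record_success()
            return result
    return backup(amount_cents)
```

A real payment flow would also need idempotency safeguards so that a request that failed ambiguously on the primary provider is not charged twice when it is retried on the backup.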
Ultimately, peak readiness is everyone’s responsibility: engineers, product managers, PMO, operations, and architects all contribute. Even VPs are on call and work in shifts during the crucial peak week. The Groupon engineering culture, where teams take pride in building scalable and reliable solutions to meet user needs, is certainly a big contributing factor to the success we have seen over the last few years.
Although there is still a lot to be done and we are certainly not at the Chaos Monkey maturity level, the progress we have made to date through automation and process improvements allows us to take the next step of program simplification and team empowerment, which makes the overall peak readiness program more efficient.
These improvements, coupled with diligent planning and execution, give us the confidence to take on the new challenges this year has posed and hit the peak readiness program goals. We look forward to a successful 2020 peak period. We will then reflect on the results, learn from them, and improve further next year!