Peak Readiness Program @ Groupon

Sanjeev Nagvekar
Nov 6, 2020 · 6 min read

The following graph illustrates why the fourth quarter is significant for many e-commerce companies. It’s a typical daily order creation graph for a part of the fourth quarter. Looking at the graph, even without any labels, you can clearly call out Black Friday and Cyber Monday. We here at the Groupon Product and Engineering organization, in partnership with Business stakeholders, run a peak readiness program in the fourth quarter to ensure our platform and services can scale to handle increased demand and continue to deliver great experiences for Groupon users — consumers and merchants.

Daily order creation in the fourth quarter

This year has been anything but normal. In addition, we are in the midst of cloud migration. This poses some interesting challenges for peak readiness planning. This article provides a glimpse of how we are tackling these challenges and getting ready for this year’s peak period.

Program Goals

  • Achieve or exceed the platform availability and latency SLOs, where the platform comprises all underlying services responsible for delivering user experiences through Groupon sites and mobile applications.
  • Execute marketing campaigns successfully.

The overall platform availability SLO is 99.95% and we have specific latency SLOs for consumer experiences such as Home Page, Deal Page, Browse, and Checkout. In addition, we have specific goals around MTTR as well as the Change Control process. How we calculate the overall platform availability is a topic for another blog post.
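To put the 99.95% figure in perspective, here is a minimal sketch of the downtime it leaves room for. It assumes availability is measured as simple uptime over a 30-day window; both the time-based definition and the window length are illustrative assumptions, not a description of how Groupon calculates platform availability.

```python
def downtime_budget_minutes(slo_percent: float, window_days: int) -> float:
    """Minutes of full unavailability an availability SLO permits over a window.

    Assumes time-based availability (uptime / total time). This is a
    simplification chosen for illustration only.
    """
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - slo_percent / 100)


if __name__ == "__main__":
    # 99.95% over an assumed 30-day window leaves roughly 21.6 minutes.
    print(f"{downtime_budget_minutes(99.95, 30):.1f} minutes")
```

Under these assumptions, a single half-hour outage would use up more than the whole monthly budget, which is why the goals around MTTR matter as much as the availability target itself.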

Considering Groupon’s scale, hitting these goals becomes an interesting challenge: two brands (Groupon and LivingSocial), 15 countries, ~70 million users*, ~230 million daily deal impressions, multiple channels (Health-Beauty-Wellness, Food & Drink, Goods, Things-To-Do, Travel, Coupons, and more), four application platforms (desktop, mobile web, iOS, Android), 600+ microservices, and an API platform that handles millions of requests per minute (RPM).

* Number of users who visited Groupon sites/apps in the last 30 days

Strategy

This year’s peak readiness program builds on the progress to date and aims to be more efficient by simplifying the overall program structure and empowering teams to do more while holding the same quality bar.

Program Strategy

The team empowerment and program simplification strategy is influenced by the following factors:

  • There is a greater need than ever to innovate faster
  • Engineering teams have more limited bandwidth than ever
  • There is a need to counterbalance the added short-term complexity introduced by cloud migration

With more power comes more responsibility. As teams are empowered with a decentralized decision-making model, it’s also implied that teams are not only responsible for achieving SLOs for their services but also for not adversely impacting SLOs of dependent services.

Cloud elasticity will be a huge plus for the peak readiness effort in the long run. This year, though, we are in the midst of the migration, which adds complexity to the mix. Right now we are operating in hybrid mode: some services are in the cloud, others are still on-prem, and some have footprints in both. We are addressing this complexity in different ways:

  • In order to minimize latencies due to calls across on-prem and cloud, we are staging cloud migration in logical groups of different layers that constitute the overall platform stack. In the first phase, we migrated all frontend applications and the API layer.
  • We have built tools and services that allow us to quickly ramp up and down the traffic flowing through the cloud.
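As a rough illustration of what a traffic ramp dial can look like, the sketch below routes a configurable percentage of requests to the cloud using a stable hash of a request key. It is not Groupon’s tooling; the names `bucket_for`, `route`, and `cloud_percent`, and the choice of hash-based bucketing, are assumptions made for the example.

```python
import hashlib


def bucket_for(key: str, buckets: int = 100) -> int:
    """Map a key (e.g. a user or session id) to a stable bucket in [0, buckets)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def route(key: str, cloud_percent: int) -> str:
    """Send the key to 'cloud' if its bucket falls under the current dial setting."""
    if not 0 <= cloud_percent <= 100:
        raise ValueError("cloud_percent must be between 0 and 100")
    return "cloud" if bucket_for(key) < cloud_percent else "on_prem"


if __name__ == "__main__":
    keys = [f"user-{i}" for i in range(10_000)]  # hypothetical request keys
    for percent in (0, 10, 50, 100):
        share = sum(route(k, percent) == "cloud" for k in keys) / len(keys)
        print(f"dial={percent:3d}%  observed cloud share={share:.1%}")
```

Because each key always lands in the same bucket, raising the dial only moves additional keys to the cloud and lowering it moves them back, so a given user does not bounce between environments while the ramp is in progress.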

Execution

Performance testing is one of the most critical tracks. We update the overall traffic model using multiple data points, including last year’s peak traffic, current traffic patterns, and planned marketing campaigns. The updated traffic model forms the basis for the current year’s load test targets. Through a combination of individual service load tests and aggregated load tests that span on-prem and cloud, we then simulate expected peak user traffic patterns across Groupon sites and mobile applications to validate overall operational readiness. The aggregated load tests are run on a regular cadence, which helps us catch and fix any regressions quickly.
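One simple way to turn those data points into a load test target is to scale last year’s peak by observed growth and expected campaign uplift, then add headroom. The sketch below does exactly that; the multiplicative model, the field names, the 25% headroom, and all numbers are hypothetical and only meant to show the shape of the calculation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrafficModel:
    """Hypothetical inputs for deriving a load test target."""
    last_year_peak_rpm: float   # peak requests per minute observed last year
    yoy_growth: float           # current traffic relative to last year, e.g. 1.10
    campaign_uplift: float      # expected lift from planned campaigns, e.g. 1.20
    headroom: float = 1.25      # safety margin on top of the expected peak

    def expected_peak_rpm(self) -> float:
        return self.last_year_peak_rpm * self.yoy_growth * self.campaign_uplift

    def load_test_target_rpm(self) -> float:
        return self.expected_peak_rpm() * self.headroom


if __name__ == "__main__":
    # Illustrative numbers only, not Groupon data.
    model = TrafficModel(last_year_peak_rpm=1_000_000, yoy_growth=1.10, campaign_uplift=1.20)
    print(f"expected peak:    {model.expected_peak_rpm():,.0f} RPM")
    print(f"load test target: {model.load_test_target_rpm():,.0f} RPM")
```

Keeping the inputs explicit makes it easy to rerun the calculation when a marketing plan changes and to see which assumption moved the target.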

Other key deliverables include:

  • Site levers: These include mechanisms like rate limiting, circuit breakers, and the ability to dial cache levels up or down across different layers within the platform stack. Site levers are like an insurance policy: we hope we don’t have to use them, but if we run into major issues we have the means to minimize the overall impact.
  • Mobile applications stability and scalability: This includes, among other things, auditing the application startup behavior for any regressions and inefficiencies as well as validating that overall latencies are within an acceptable range at high load conditions.
  • Fallback plans for third-party service dependencies: It may not be practical to have fallback plans for all third-party service providers, so it boils down to risk categorization and mitigation plans. For example, it’s crucial to have a backup payment provider to route payments through in case of issues with the primary payment provider.
  • Ability to dynamically change the site-mix: This involves tweaking relevance algorithms to align with merchandising plans. It provides the flexibility to change which combination of deal types (local, goods, travel) is surfaced and promoted.
  • ‘Run of Show’ for all tasks to be performed during peak week (i.e. Tuesday before Black Friday through Tuesday after Cyber Monday).
  • Graphs, dashboards, and reports to track various metrics:
Metrics
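Two of the deliverables listed above, circuit breakers as a site lever and a backup payment provider as a fallback plan, can be illustrated together. The following is a minimal sketch under stated assumptions, not Groupon’s implementation: `CircuitBreaker`, `charge`, the thresholds, and the provider callables are all hypothetical names introduced for the example.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Opens after consecutive failures; allows a trial call after a timeout."""

    def __init__(self, failure_threshold: int = 3, reset_timeout_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means the circuit is closed

    def allow(self) -> bool:
        if self._opened_at is None:
            return True
        # Half-open: once the timeout has elapsed, let a trial call through.
        return self._clock() - self._opened_at >= self.reset_timeout_s

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()  # open (or re-open) the circuit


def charge(amount_cents: int,
           primary: Callable[[int], str],
           backup: Callable[[int], str],
           breaker: CircuitBreaker) -> str:
    """Try the primary provider unless its circuit is open; otherwise use the backup."""
    if breaker.allow():
        try:
            result = primary(amount_cents)
            breaker.record_success()
            return result
        except Exception:
            breaker.record_failure()
    return backup(amount_cents)
```

While the circuit is open, calls skip the struggling primary provider entirely, which protects both checkout latency and the provider itself. A real payment flow would also need idempotency safeguards so that a request that timed out on the primary is not charged twice through the backup; that is out of scope for this sketch.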

Ultimately, peak readiness is everyone’s responsibility: engineers, product managers, PMO, operations, and architects all contribute. Even VPs are on call and work in shifts during the crucial peak week. The Groupon engineering culture, where teams take pride in building scalable and reliable solutions to meet user needs, is certainly a big contributing factor to the success we have seen over the last few years.

Continuous Improvement

These improvements, coupled with diligent planning and execution, give us the confidence to take on the new challenges this year has posed and hit the peak readiness program goals. We look forward to a successful 2020 peak period. We will then reflect on the results, learn from them, and improve further next year!

Groupon Product and Engineering

All things technology from Groupon staff
