From a reactive service mindset to a reliability driven consumer experience — How we realized stability across all digital channels

Vikalp Yadav
Published in adidoescode
4 min read · Oct 21, 2021

On a regular day, over a million consumers across the globe browse and purchase products from the adidas online store. With consumers’ lives becoming more and more digitally centered, online is evolving from being just another distribution channel to completely redefining the consumer-brand relationship. Running a global eCom business at scale has its challenges. Yet our consumers enjoy one of the smoothest and most reliable online experiences while shopping with us. Behind the scenes is a team of Site Reliability Engineers (SRE) who enable this stable end-to-end shopper journey by orchestrating a highly complex system and reducing friction points across our digital services before consumers even experience them. Our heroes, the Site Reliability Engineers, are setting the foundation for growth as we scale our eCom business to €9bn revenue by 2025.

Learn here how we approached the journey from a reactive service mindset to a stability-driven consumer experience, and hopefully draw inspiration for your own engineering challenges.

Rising Complexity:

As digital services grow in capability and number, the complexity of the digital landscape rises with them. Below the surface of our online store and apps lies a huge mechanism of small wheels operating as one. Because our digital platform is built on a microservices-based architecture, we need to ensure operational excellence while handling:

  • 1.5 million requests per second
  • More than 3,000 orders per minute
  • 22,000 K8s
  • A complex mix of 200-plus payment methods
  • 450 million lines of code changing continuously with multiple deployments per day
  • 3 billion logs per day to be processed for smart alerting

“Everything is connected,” and that is what creates the smooth experience our consumers have when they shop with us.

Our vision is clear and simple:

Ensure a frictionless and premium shopping experience for our consumers across all digital channels.

The Imperative for Change:

Our complex microservice-based ecosystem has evolved over multiple years. This, coupled with a transformation to a product-led organization, brings challenges of its own.

  • Many feature teams run agile development in a fast-paced environment. Teams tend to drift away from the end-to-end consumer experience, often focusing on individual silos.
  • In a high-revenue-growth environment, stability and reliability backlog items often get de-prioritized in favor of new and sexy features.
  • The technical skills of a traditional support engineer no longer suffice. There’s a need for a Superman who knows it all.

Realizing these challenges is the first step and requires an honest reflection on the current set-up. Moving to a stability-driven approach is a journey, and it starts here.

Rising to the Challenge With SRE:

To overcome these challenges, we have taken a leaf out of Google’s playbook and adopted an industry-leading practice, tailored to our specific needs: Site Reliability Engineering (SRE).

The idea of SRE is to put a system in place that enables us to create scalable and highly reliable software systems and sites. It’s less about reacting to issues when they occur and more about proactively preventing them from happening in the first place. Our Site Reliability Engineers work to make sure that potential incidents never see the light of day.

Key Success Factors

We learned by doing and, as a result, established a successful process that automatically measures and evaluates the dependencies between different aspects of the consumer journey. Along the way, we identified these essential success factors:

Observability: It is of utmost importance to have a plan in place to move from reactive detection to AIOps-driven predictive detection.

Resilience: Set up failovers and graceful degradation for critical aspects of the online shopping experience before they have an outage. Think in terms of scenarios and prepare solutions for them.
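As a minimal sketch of what graceful degradation can look like in code, the snippet below wraps a primary call with a fallback. Everything here is hypothetical and for illustration only: `with_fallback`, `personalised_recommendations` and `bestsellers` are made-up names, not part of the adidas platform.

```python
from typing import Callable, List


def with_fallback(primary: Callable[[], List[str]],
                  fallback: Callable[[], List[str]]) -> List[str]:
    """Return the primary result, or the fallback result if the primary call fails."""
    try:
        return primary()
    except Exception:
        # Degrade gracefully: a less personalised page beats an error page.
        return fallback()


def personalised_recommendations() -> List[str]:
    # Hypothetical call to a recommendation service that is currently down.
    raise TimeoutError("recommendation service unavailable")


def bestsellers() -> List[str]:
    # Hypothetical static fallback, for example served from a cache.
    return ["bestseller-1", "bestseller-2"]


products = with_fallback(personalised_recommendations, bestsellers)
print(products)
```

The shopper still sees products even though one dependency is failing. A real set-up would typically add timeouts and logging around the primary call so the failure is detected rather than silently hidden.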

Security: Cybersecurity is non-negotiable. Bots are a menace for products with high brand heat. Plan to protect hype drops (special and limited-edition product launches).

Release excellence: The majority of bugs are introduced after a release, whether internal or from third parties. Have a transparent, error-budget-driven process to drive excellence.
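To illustrate the error budget idea, here is a rough sketch of how a budget could be derived from an availability target and used as a release gate. The SLO value, the request counts, the 10% safety margin and the function names are all assumptions chosen for the example, not figures or rules from our process.

```python
def error_budget_remaining(slo_target: float, total_requests: int,
                           failed_requests: int) -> float:
    """Return the unspent share of the error budget (1.0 = untouched, below 0 = overspent)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    # The budget is the number of failures the SLO tolerates in this window.
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures <= 0:
        raise ValueError("no traffic in the window, so no budget to evaluate")
    return 1.0 - failed_requests / allowed_failures


def release_allowed(remaining_budget: float, safety_margin: float = 0.1) -> bool:
    """Allow feature releases only while more than `safety_margin` of the budget is left."""
    return remaining_budget > safety_margin


# Illustrative numbers: a 99.9% SLO tolerates about 1,000 failures per million requests.
remaining = error_budget_remaining(slo_target=0.999,
                                   total_requests=1_000_000,
                                   failed_requests=400)
print(f"budget left: {remaining:.0%}, release allowed: {release_allowed(remaining)}")
```

With 400 failures against a budget of roughly 1,000, about 60% of the budget is left and releases continue. Once the budget is spent, the same rule shifts the team's focus from features to stability, which is what makes the process transparent.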

The secret ingredient — A killer KPI

Above all, the secret ingredient is to identify a KPI that connects the business and technical views in one holistic objective. For us, the KPI “% revenue bleed / net sales” (the killer KPI) brought all our teams to the same table.
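As a simple worked example, the ratio can be computed as lost revenue divided by net sales. How the revenue lost per incident is estimated is not covered in this article, so the sketch below simply takes it as an input. The function name and all figures are hypothetical.

```python
from typing import Sequence


def revenue_bleed_pct(incident_losses: Sequence[float], net_sales: float) -> float:
    """Revenue lost to incidents as a percentage of net sales for the same period."""
    if net_sales <= 0:
        raise ValueError("net_sales must be positive")
    return 100.0 * sum(incident_losses) / net_sales


# Made-up figures: two incidents in a period with 1.5 million in net sales.
bleed = revenue_bleed_pct([12_000.0, 3_000.0], net_sales=1_500_000.0)
print(f"revenue bleed: {bleed:.2f}% of net sales")
```

Because both sides of the ratio are expressed in money, an outage in any service can be compared with any other, whichever team owns it.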

Evolving from traditional KPIs, such as % system availability and the number of P1s within a 95% SLA, to more advanced measures such as mean time to detect (MTTD) and mean time to restore (MTTR) was not enough. Developing and adopting the killer KPI ensured that all these supporting KPIs came together. The killer KPI now acts as a single currency of prioritization for bottom-line-driven initiatives and backlogs.
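For readers less familiar with MTTD and MTTR, the sketch below shows one common way to compute them from incident timestamps. It assumes both are measured from the moment the fault began (some teams measure MTTR from detection instead), and the `Incident` record and sample data are invented for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Incident:
    started: datetime   # when the fault began
    detected: datetime  # when monitoring or a person noticed it
    restored: datetime  # when service was back to normal


def mean_delta(deltas: List[timedelta]) -> timedelta:
    """Arithmetic mean of a non-empty list of durations."""
    return sum(deltas, timedelta()) / len(deltas)


def mttd(incidents: List[Incident]) -> timedelta:
    """Mean time to detect: average of (detected - started)."""
    return mean_delta([i.detected - i.started for i in incidents])


def mttr(incidents: List[Incident]) -> timedelta:
    """Mean time to restore: average of (restored - started)."""
    return mean_delta([i.restored - i.started for i in incidents])


# Invented sample data: two incidents on the same day.
incidents = [
    Incident(datetime(2021, 10, 1, 10, 0), datetime(2021, 10, 1, 10, 4),
             datetime(2021, 10, 1, 10, 30)),
    Incident(datetime(2021, 10, 1, 14, 0), datetime(2021, 10, 1, 14, 10),
             datetime(2021, 10, 1, 14, 50)),
]
print("MTTD:", mttd(incidents), "MTTR:", mttr(incidents))
```

For this sample, detection takes 7 minutes on average and restoration 40 minutes. These numbers describe how fast a team reacts, but not how much an incident costs the business, which is the gap the killer KPI fills.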

Not quite the end:

Today, we are among the top ten percent of organizations practicing such an advanced level of SRE maturity. The killer KPI has dropped by more than half over the last two years. Site reliability is now intrinsic to our ways of working and will help drive us to €9 billion in 2025.

The journey is not over yet, but the most important achievement is the change in mindset that ingrains reliability as a way of life.

This is not an official project report. This article is the author’s personal view and does not represent the opinion, strategy or goals of the company.
