The Staging Dichotomy: Part Two

A two-part series on how eBay turned a staging environment that impeded developers into its biggest asset for developer productivity.


Infrastructure

Unlike the data track, the infrastructure track was more objective, primarily because the needs were apparent: it came down to two things, reliability and predictable performance. Stability was the key here. The staging infrastructure should always be available, but more than that, it has to be consistently performant. We were determined to avoid scenarios where one part of the environment (a zone, a shard or a stack) is fast and another is unpredictable. We also did not want variance in how each application performed. The goal was a setup in which every application scaled based on its usage and delivered consistent performance numbers. We did a couple of things to achieve that.

Hardware Refresh

A complete staging hardware refresh was long overdue, and this was the perfect opportunity.

Monitoring and Remediation

The idea here was pretty simple. We have state-of-the-art monitoring and remediation systems in production; the goal was to bring them over to staging. A few tools were already available, but lack of maintenance and firewall misconfigurations made them unreliable. The core team set out to identify all these gaps and develop solutions to address them. Some were straightforward, others more nuanced. Most of the complexity came from the zoning differences between the production and staging environments.
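As a rough illustration of the kind of gap-finding involved, here is a minimal sketch of a connectivity audit across staging zones. The zone names, hosts and ports are made up for the example and do not reflect eBay's actual topology or tooling.

```python
# Hypothetical sketch: probe monitoring endpoints in each staging zone and
# flag connectivity gaps (e.g. firewall misconfigurations) for follow-up.
# Zone names, hosts and ports are illustrative only.
import socket

STAGING_ZONES = {
    "zone-a": ["metrics.zone-a.staging.example.com:9090",
               "alerts.zone-a.staging.example.com:9093"],
    "zone-b": ["metrics.zone-b.staging.example.com:9090",
               "alerts.zone-b.staging.example.com:9093"],
}

def is_reachable(endpoint: str, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    host, port = endpoint.rsplit(":", 1)
    try:
        with socket.create_connection((host, int(port)), timeout=timeout):
            return True
    except OSError:
        return False

def audit_zones() -> dict[str, list[str]]:
    """Map each zone to the monitoring endpoints it cannot reach."""
    return {
        zone: [ep for ep in endpoints if not is_reachable(ep)]
        for zone, endpoints in STAGING_ZONES.items()
    }

if __name__ == "__main__":
    for zone, gaps in audit_zones().items():
        status = "OK" if not gaps else f"unreachable: {', '.join(gaps)}"
        print(f"{zone}: {status}")
```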

Auxiliary Systems Parity With Production

For an online marketplace to be fully functional, it is not just the application servers and databases that need to be up and running. There is a whole set of auxiliary components that are pivotal for end-to-end operations. These include systems like Hadoop, Kafka, Pronto, Rheos and Elasticsearch, to name a few. These systems lacked parity between staging and production, and in some cases did not exist in staging at all. As a part of the initiative, we secured resourcing from the system owners and worked with them to address the gaps. We were deliberate about bringing parity only where it made sense rather than blindly replicating production in staging. For instance, not all Hadoop pipelines are required in staging; some are based on customer behavioral data, which is not helpful in a testing environment. Thanks to all our partners, we have attained sufficient parity for developers to be productive in staging again. However, it is not 100% done yet and continues to be an ongoing activity. Hopefully, in the future, this will be a non-issue, with systems built first in staging and then ported to production. We already see that happening for new systems.
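As a simplified illustration of how such a parity audit can be reasoned about, here is a small sketch; the component inventories and the exclusion list are made up for the example and are not eBay's actual systems.

```python
# Hypothetical parity audit: compare auxiliary components present in production
# against staging, and skip the ones intentionally excluded (e.g. pipelines
# built on customer behavioral data). Inventories are illustrative only.
PRODUCTION = {"kafka", "elasticsearch", "pronto", "rheos",
              "hadoop-behavioral-pipeline", "hadoop-reporting-pipeline"}
STAGING = {"kafka", "elasticsearch", "pronto"}
INTENTIONALLY_EXCLUDED = {"hadoop-behavioral-pipeline"}  # not useful for testing

def parity_gaps(prod: set[str], staging: set[str], excluded: set[str]) -> set[str]:
    """Components present in production, absent in staging, and worth replicating."""
    return prod - staging - excluded

print(sorted(parity_gaps(PRODUCTION, STAGING, INTENTIONALLY_EXCLUDED)))
# ['hadoop-reporting-pipeline', 'rheos']
```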

Fighting Off Regression To The Mean

It is hard to build great systems but even harder to keep them great.

It is indeed one of the hardest things to do. Now that there is high-quality data and a stable infrastructure, we could potentially solve the chicken-and-egg problem, i.e., ensuring application teams always keep their functionality up and running in staging. But there is an even more important question: How do we make sure that, a couple of years from now, the staging environment has not regressed to the state it was in when we started this initiative? That is what we mean by fighting off regression to the mean. We tackle it with a multi-step approach.

First Step

Fixing the gap uncertainty problem was the first step. To do that, we built a system called Smoke, which generates staging traffic 24/7. It does that by executing priority one (P1) and priority two (P2) integration test cases every 15 minutes, across all applications, around the clock. This is how it works.

  • Teams onboard the application to the Smoke system, which is part of our cloud console UI. For new apps, the onboarding happens during the app creation process itself. In this step, the system tries to detect all the dependencies of the application, including upstream dependencies, databases, auxiliary systems and underlying internal infrastructure like GitHub, Maven, npm, etc. App owners can tweak the dependencies as needed to ensure correctness. Identifying the correct dependencies is vital to avoid false alerts (more on that below).
  • Once dependencies are sorted out, the Smoke system runs a series of validity checks before traffic generation. These include verifying pager settings, so alerts go to the right teams, and verifying that the tests take less than five minutes to run, since anything longer would burden the 15-minute cadence. If a suite has to run longer than five minutes, the system recommends splitting it into multiple jobs for the same app. A few other health checks like these follow suit.
  • The final action is to enable traffic generation. Since the tests are integration tests, this is essentially hitting HTTP endpoints for services or calling a grid for UI automation. As outlined above, these tests run every 15 minutes, 24/7 (a simplified sketch of this loop follows the list).
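To make the cadence concrete, here is a minimal sketch of what such a traffic-generation loop could look like. The app registry, the test commands and the run_suite helper are hypothetical stand-ins, not the actual Smoke implementation; only the 15-minute cycle and the five-minute per-suite budget come from the description above.

```python
# Minimal sketch of a Smoke-style traffic generator under assumed interfaces.
# Every 15 minutes it fires each app's P1/P2 integration suite and flags
# suites that exceed the five-minute budget.
import time
import subprocess
from concurrent.futures import ThreadPoolExecutor

CYCLE_SECONDS = 15 * 60          # run the whole fleet every 15 minutes
SUITE_BUDGET_SECONDS = 5 * 60    # each suite must finish within five minutes

# Hypothetical registry: app name -> command that runs its P1/P2 integration tests.
APPS = {
    "checkout-svc": ["pytest", "tests/integration", "-m", "p1 or p2"],
    "search-ui":    ["npm", "run", "test:smoke"],
}

def run_suite(app: str, command: list[str]) -> bool:
    """Run one app's suite, enforcing the per-suite time budget."""
    started = time.monotonic()
    try:
        result = subprocess.run(command, timeout=SUITE_BUDGET_SECONDS)
        passed = result.returncode == 0
    except subprocess.TimeoutExpired:
        print(f"[{app}] exceeded {SUITE_BUDGET_SECONDS}s budget; consider splitting the job")
        return False
    except FileNotFoundError:
        print(f"[{app}] test command not found; check onboarding configuration")
        return False
    print(f"[{app}] {'PASS' if passed else 'FAIL'} in {time.monotonic() - started:.0f}s")
    return passed

def run_cycle() -> None:
    """Kick off every onboarded app's suite for one 15-minute cycle."""
    with ThreadPoolExecutor(max_workers=len(APPS)) as pool:
        pool.map(lambda item: run_suite(*item), APPS.items())

if __name__ == "__main__":
    while True:  # 24/7 loop; a real system would use a scheduler plus alerting
        cycle_started = time.monotonic()
        run_cycle()
        time.sleep(max(0, CYCLE_SECONDS - (time.monotonic() - cycle_started)))
```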
A representation of the Smoke system generating traffic 24/7
Triage workflow when an integration test fails
  • A pager alert is triggered to the on-call person of the identified service. The expectation here is to acknowledge the incident; whether they act immediately depends on the situation. In production, it is all hands on deck. In staging, if the alert comes during off-hours, the team can decide to just acknowledge it and fix it once back online. We want to avoid staging burnout while also not diluting the urgency of a production alert.
  • If a team has not acknowledged the incident within two hours, our technical duty officers (TDOs) get involved. We dedicated three TDOs, in three time zones, to staging monitoring alone to cover a 24-hour cycle. They immediately start the investigation process, leveraging the renewed tooling, and reach out to the appropriate teams. The TDOs were one of the biggest differentiators in the triage process; they approach it holistically and hence proactively prevent many incidents.
  • The triage ends with the owner(s) applying the fix. App owners need to debug the test failures and fix them; infra owners analyze why the associated health checks failed (which the Smoke system also monitors continuously) and address it. If a team acknowledges an incident but does not resolve it within 10 hours, the TDOs get on a call with them and work together to bring back availability. These timelines strike a good balance between giving teams enough time to fix issues and avoiding slack. A simplified sketch of this escalation timeline follows.
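This sketch only illustrates the escalation timeline described in the list above: the Incident type and the action strings are hypothetical, while the two-hour acknowledgement and 10-hour resolution thresholds come from the workflow itself.

```python
# Hypothetical sketch of the staging escalation timeline described above.
from dataclasses import dataclass

ACK_DEADLINE_HOURS = 2       # unacknowledged incidents escalate to the TDOs
RESOLVE_DEADLINE_HOURS = 10  # acknowledged-but-unresolved incidents trigger a joint call

@dataclass
class Incident:
    app: str
    hours_open: float
    acknowledged: bool
    resolved: bool

def next_action(incident: Incident) -> str:
    """Decide the next triage step for a staging incident."""
    if incident.resolved:
        return "close incident"
    if not incident.acknowledged:
        return ("escalate to TDO" if incident.hours_open >= ACK_DEADLINE_HOURS
                else "wait for on-call acknowledgement")
    if incident.hours_open >= RESOLVE_DEADLINE_HOURS:
        return "TDO joins a call with the owning team"
    return "owning team debugs and fixes"

print(next_action(Incident("checkout-svc", hours_open=3, acknowledged=False, resolved=False)))
# escalate to TDO
```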

Second Step

Creating the traffic-generator system was the first step in fighting regression. The second step was to instill a sense of accountability, progress and a north star for application teams to keep improving. For that, we set a 99% staging functional availability goal for each application. Though it does not guarantee that the overall staging availability will be 99% (that would require each domain to be at roughly 99.89%), it was still an audacious goal. In other words, a domain can only have a downtime of approximately seven hours a month. Now that we have round-the-clock traffic, we collect and report these numbers during operational reviews with leadership. The intention of these reviews is not to question teams on the goals but to provide an opportunity to highlight obstacles that leadership can help remove and to share best practices. It also gives teams visibility and motivates ownership of their availability numbers.
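For intuition, here is the back-of-the-envelope arithmetic behind those figures. The count of nine domains is an illustrative assumption that happens to reproduce the 99.89% number; it is not an official tally.

```python
# Back-of-the-envelope check of the availability math, assuming (for
# illustration only) nine independent domains that must all be up for
# staging to count as available overall.
DOMAINS = 9                      # illustrative count, not an official figure
overall_target = 0.99

# Per-domain availability needed so the product across all domains reaches 99%.
per_domain = overall_target ** (1 / DOMAINS)
print(f"per-domain availability needed: {per_domain:.2%}")            # ~99.89%

# Downtime budget for a single domain held to 99% over a 30-day month.
hours_in_month = 30 * 24
print(f"allowed downtime at 99%: {(1 - 0.99) * hours_in_month:.1f} h")  # ~7.2 hours
```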

The staging availability was approximately 55% in August 2020. Today it is at 96%.
Domain-wise availability numbers for the last three months.

Virtuous Cycle

Our steps to resist complacency turned the original vicious-cycle problem into a virtuous cycle. Now that staging works the way it ought to, developers are seeing a boost in productivity. What was once a frustrating and time-consuming experience has now become idyllic. The overall effort provided the right tooling for rapid product development and instilled confidence that was once lacking. The 99% goal was a starting point for teams to get serious. They are slowly thinking beyond the goal and are even more motivated to keep staging up and running, which was our intention all along. App owners see how their upkeep benefits other applications, and vice versa.

The virtuous cycle of the staging ecosystem

If a system just works, it blends in and becomes invisible. That is our ultimate goal for staging.

Wrapping Up

Today, 90% of all automated integration testing happens in staging. The pass rate is 95%, compared to only 70% in 2020. Flaky tests are a big frustration point in software development, and even a minor improvement can have a multiplier effect; we saw that with the jump in the pass rate. In the past, teams pushed code to production with less confidence and ran a considerable number of sanity tests directly in production to validate functionality. That reluctance is now gone: only around 5% of sanity testing happens in production, with the rest running as integration tests in staging. As for release velocity, we reduced our native app (iOS and Android) release cycles from three weeks to one week, and staging was a key enabler in achieving that.
