From Peak to Peak: A Tale from the Trenches

Joep Piscaer · Published in Glasnostic · Mar 19, 2020 · 4 min read

Online shopping sites are sensitive to periodic and seasonal changes, and it is not uncommon for them to go down during an influx of traffic, as in this Robinhood example. Regardless of whether the failure is temporary or longer-lasting, it is often the happenstance architecture and a lack of operational control that lead to a domino effect across the service landscape.

I have experienced first-hand how severe the impact of a lack of control over an organically growing (micro)service landscape can be. The online supermarket I used to work for was almost entirely siloed into two “blood types”: a more traditional Supply Chain and Enterprise IT group, and the developers working on the e-commerce side of the business.

Architecture

On the e-commerce side of our business, the architecture grew organically and rather uncontrollably. Changes were pushed daily and by many teams simultaneously. Mistakes were made, rolled back or amended. Our service landscape evolved continually and regularly clashed with the rules and processes of the supply chain and enterprise IT group, which was used to more traditional IT Service Management and waterfall-based Change Control processes.

The architecture of the supply chain and enterprise IT group was closely monitored and controlled by a specialized group of architects. Changes to that architecture or any operational aspect of its environment underwent a set of stringent change management processes.

Especially when the e-commerce folks needed to make a change in, or integrate with, a supply chain or enterprise IT environment, the differences in decision-making speed and hierarchy became apparent. The faster-moving e-commerce group didn’t understand the need for the control processes required by the more traditional supply chain application landscape. Conversely, the supply chain folks considered the e-commerce group, with its DevOps-oriented infrastructure and agile, cloud-native development practices, reckless and irresponsible.

Demand Peaks

A few times each year, the company would rally to prepare for a predictable peak, such as the onslaught of Christmas shopping. During the weeks leading up to those peaks, code and change freezes were instituted to maximize stability, and many technical folks were on standby to react quickly to any issues or even outages. Nevertheless, the growing popularity of online grocery shopping and unpredictable customer behavior regularly led to unforeseen problems.

Conway’s Law

One of the remarkable effects of our heavily siloed organization was that the fault lines in the architecture mimicked the organizational boundaries.

Architecture follows organization. (“Organizational Charts” comic by Manu Cornet, https://bonkersworld.net/organizational-charts, CC BY-SA 3.0)

In our case, these fault lines ran between the back-end e-commerce engine and the back-end of the Android and iOS mobile apps, as well as between the e-commerce engine and the supply chain core. During peak moments, it was not uncommon for these integration points to break, causing a cascading failure throughout the e-commerce landscape.

Failure Begets Failure

These domino effects were difficult to diagnose and troubleshoot. They changed characteristics often and took different paths through the service landscape almost every time. Without a way to control them, it was “all hands on deck” for those on call to stop the bleeding while scrambling to isolate their underlying causes.

Most of the time, this required a herculean effort. And, unfortunately, this pressure cooker of high-stakes firefighting gave rise to a new problem: the emergence of individual “superhero” firefighters, engineers who alone possessed the ability and commanded the knowledge needed to diagnose and remediate issues.

Fixing failures was regarded as a success. After all, fixing one problem increased our chances of coping with the next peak, right? Ironically, this attitude also set us up for the inevitable next failure, waiting to wreak havoc on us during the next peak. Instead of investing in our ability to remediate, we tricked ourselves into believing that there would be no next failure. Nothing could be further from the truth, of course, and combined with ever-growing traffic, the eventual next failure invariably turned out to be more severe than the previous one.

Mission Control

This is why, even when our process of diagnosing and remediating failures did save us during one peak, the chances of it working during subsequent peaks were slim.

In our increasingly complex environments, manual debugging is not a feasible long-term approach. Humans are bad at diagnosing underlying causes, in particular at a scale beyond one or two “superheroes.” Humans can’t sift through heaps of telemetry, logs and distributed traces quickly enough to correlate signals across the landscape and spot the complex emergent behaviors that lead to failure.

Instead, teams need real-time visibility into the complex interactions between systems and the ability to control those behaviors with predictable and effective operational patterns like backpressure, circuit breaking and bulkheads. This is the essence of Mission Control.
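To make one of these patterns concrete, here is a minimal circuit breaker sketch in Go. It is illustrative only: the names (CircuitBreaker, Call), the failure threshold and the cooldown are assumptions, not code we actually ran; in practice you would reach for a service mesh or a proven resilience library rather than hand-rolling this.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// ErrOpen is returned while the breaker refuses calls to a failing dependency.
var ErrOpen = errors.New("circuit open: failing fast")

// CircuitBreaker trips after maxFailures consecutive errors and fails fast
// until the cooldown elapses, shielding callers from a struggling dependency.
type CircuitBreaker struct {
	mu          sync.Mutex
	failures    int
	maxFailures int
	cooldown    time.Duration
	openedAt    time.Time
}

func NewCircuitBreaker(maxFailures int, cooldown time.Duration) *CircuitBreaker {
	return &CircuitBreaker{maxFailures: maxFailures, cooldown: cooldown}
}

// Call runs fn unless the breaker is open. A success resets the failure
// count; a failure increments it and may (re)trip the breaker.
func (cb *CircuitBreaker) Call(fn func() error) error {
	cb.mu.Lock()
	if cb.failures >= cb.maxFailures && time.Since(cb.openedAt) < cb.cooldown {
		cb.mu.Unlock()
		return ErrOpen
	}
	cb.mu.Unlock()

	err := fn()

	cb.mu.Lock()
	defer cb.mu.Unlock()
	if err != nil {
		cb.failures++
		if cb.failures >= cb.maxFailures {
			cb.openedAt = time.Now()
		}
		return err
	}
	cb.failures = 0
	return nil
}

func main() {
	// Hypothetical call into a supply chain service; here it always times out.
	flakyDependency := func() error { return errors.New("supply chain timeout") }

	cb := NewCircuitBreaker(3, 30*time.Second)
	for i := 0; i < 6; i++ {
		if err := cb.Call(flakyDependency); err != nil {
			fmt.Printf("request %d: %v\n", i, err)
		}
	}
}
```

The same idea extends to bulkheads, which cap the concurrency a single dependency may consume so that one slow integration point cannot exhaust every thread or connection in the fleet, and to backpressure, which sheds load upstream instead of letting queues grow without bound.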

The ability to apply such operational patterns would have been a game-changer for our operation. It would have contained, if not prevented, most of the outages we scrambled to address during those peaks, keeping customers happy and letting them complete their orders.
