When the canary stops singing…
Ensuring Availability in AWS back in 2016
This is the story of when I just had to open my big mouth in front of a Senior VP at Amazon, and the journey that impulsive decision put me on.
Charlie Bell was an absolutely legendary figure at Amazon for nearly 24 years. You may have heard of him because he recently ruffled some feathers with his high-profile move to Microsoft. Charlie was an S-team member, part of the elite few who had direct access to Jeff Bezos and large influence over company-wide decision making. His list of accomplishments is impressive. But I personally knew him for one thing.
Every Wednesday, Charlie led a grueling two-hour deep dive into every aspect of AWS operations. This was an amazingly open meeting: any of the ten thousand or so AWS engineers was free to join. We routinely had people from every remote engineering outpost in the world dialing in, from Cape Town to Edinburgh to Bangalore. I happened to work in the building where the meeting was hosted in Seattle, so I made it a tradition to attend in person. I wouldn’t miss it, rain or shine. It truly was the Greatest Show on Earth. I was constantly star-struck, as this giant conference room was packed to the brim with some of my favorite Senior Principals and Distinguished Engineers, Directors and VPs at Amazon. Charlie was a high-ranking executive with thousands of engineers under him, but he still had an uncanny ability to drop down to incredibly low-level technical details during the meeting and unceremoniously challenge anybody who did-not-know-their-stuff or tried to give an evasive answer.
Since AWS was too big to cover in a single meeting, there was a handmade wheel of fortune listing all the AWS services, spun to select whichever poor soul was to be scrutinized by Charlie that day. Eventually we had too many AWS services, so we replaced it with a digital version.
For every interesting operational issue, a COE (Correction of Error) document was written, and Charlie used the Wednesday morning meetings to dissect it. The COE process helped ensure that teams understood root causes, that issues had been reviewed in a consistent way, and that they had been addressed correctly. Charlie leveraged the lessons learned to make sure everybody in AWS cared about quality and operational excellence as much as he did.
I distinctly remember one of these Wednesday morning meetings, back in 2016, where Charlie was very vocally expressing his displeasure at a particularly egregious operational issue that should have been caught by better monitoring. We needed better canaries.
Canaries, the cute yellow bird?
A century or so ago, canaries were used in coal mines to detect the presence of carbon monoxide. Given their high metabolism, the birds would die from carbon monoxide poisoning before humans were affected. If you were a miner, as long as you could hear the bird singing, you were ok. When it stopped singing, you knew there was trouble ahead, so it was time to get out of the mine.
That inspired the word “canary” in the context of software engineering, although it’s a bit overloaded in the industry.
Companies like Netflix coined the term to describe the process of deploying code to a test stage before officially pushing it to all of production, and shifting a small percentage of production traffic to it for a while to analyze whether any key metrics were negatively impacted by the change.
At Amazon, there was an entirely different thing we called “canary.” A canary was a test continuously running against a production environment, validating that a critical user scenario was still operational. Say it was business-critical to ensure that you could purchase a specific toothbrush on amazon.com: you would write a test to impersonate a customer doing that, and you would run it every minute to ensure that even at 3am, customers could buy the toothbrush. The last step was to tie your test results to the paging system so that you would be alerted when the user scenario broke.
Testing against test stages before you push code to production is paramount, of course, but even after safely pushing perfectly valid code to production, things can go wrong — for example, if a critical dependency is having issues. So continuous testing against production is important too. Sure, you could just depend on the alarms you’ve already set on your regular production traffic. But production traffic is non-deterministic, so you have no guarantee that all your critical scenarios are always being exercised by customers. If something is important, you write a test and you validate it 24x7.
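The toothbrush example can be sketched in a few lines. Everything below is illustrative, assuming hypothetical names: `buy_toothbrush` and `run_canary` are stand-ins, and a real canary would emit metrics and page on-call rather than return a list.

```python
import time

def buy_toothbrush() -> bool:
    """Impersonate a customer purchasing a specific toothbrush.
    A real implementation would drive the site or API end to end."""
    return True

def run_canary(scenario, interval_seconds=60, iterations=None):
    """Run `scenario` forever (or `iterations` times), once per interval,
    recording a (passed, latency_seconds) "chirp" for each run."""
    results = []
    runs = 0
    while iterations is None or runs < iterations:
        start = time.monotonic()
        try:
            passed = bool(scenario())
        except Exception:
            passed = False  # an exception counts as a failed chirp
        latency = time.monotonic() - start
        results.append((passed, latency))
        runs += 1
        if iterations is None or runs < iterations:
            # sleep off the remainder of the interval before the next run
            time.sleep(max(0.0, interval_seconds - latency))
    return results
```

Tying the failed results to an alerting system is what turns this loop into a canary rather than just a test.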
I never quite understood how Amazon and Netflix had historically ended up with two entirely different things with the same name, but I suppose the high-level concept of ‘canary’ was true in both: the canary emitted chirps of some sort to indicate things were ok, and the absence of positive chirps indicated a problem.
At the beginning of that AWS Ops meeting, Charlie sort of unofficially decreed that all critical business scenarios in AWS needed to have canaries validating the scenario worked at all times. Problem is, we didn’t have great tooling to do that. He specified a destination, but it was up to the individual engineers to figure out how to get there.
What we needed was infrastructure that could bootstrap a piece of test code and execute it in a number of networks (our internal corporate and production networks, as well as public AWS regions). That code needed to execute forever, at some desired throughput (such as once per minute). And it needed to emit metrics such as pass/fail and latency so that engineers could dashboard them and alarm on them. This sort of thing would be relatively trivial to do today in 2022, with a quick AWS Lambda function, but back then it was harder to make it all work.
As Charlie spoke, I had a sudden epiphany. I had a tool with that functionality! I had created the load and performance testing platform that Amazon used to ensure thousands of its services were ready to handle peak traffic (“TPSGenerator”). I had never envisioned it as a canary execution platform, but once you thought abstractly about what functionality a canary platform needed, it was surprisingly similar. Ultimately, it was a platform to execute tests at a desired rate. I was using it to run millions of tests per second on potentially thousands of machines, but there was nothing preventing us from running one test per minute. The platform had all the right hooks to bootstrap tests in all networks, and posted the right metrics to Amazon’s internal monitoring system (our internal version of AWS CloudWatch), so you got things like dashboards and alarming for free — all desirable features for a canary platform! And engineers already knew how to run TPSGenerator for load testing, as most AWS services were already using it, so there wouldn’t be a significant onboarding cost (no need to learn yet-another-tool).
I needed just a few small changes. A canary had to run forever, whereas the load test platform had been optimized to run for a finite amount of time. That should be an easy change! I thought. I should make it right now! All of a sudden, Charlie’s voice faded into the background as I opened my laptop, booted up my IDE (Eclipse) and started making the code changes, right there, entirely oblivious to whatever was happening in the meeting, just engulfed in my little new technical challenge.
The meeting was 2 hours long. I frantically wrote code for about an hour. I added a lightweight daemon to monitor the health of the main app, and kill it and restart it if it was unhealthy. I made some changes to the main() to run in an infinite loop. The idea seemed solid, and my proof of concept was promising. I was giddy.
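The two changes above can be sketched roughly like this: the infinite-loop main() is simply the canary process never exiting, and the lightweight daemon restarts it whenever it dies. The names and structure are mine for illustration, not TPSGenerator’s actual code; a real watchdog would also probe a heartbeat so it could kill a wedged-but-running process before restarting it.

```python
import subprocess
import sys
import time

def watchdog(cmd, poll_seconds=5.0, max_restarts=None):
    """Keep `cmd` running forever; if it exits, launch a fresh copy.
    `max_restarts` exists only so this sketch can terminate in tests."""
    restarts = 0
    proc = subprocess.Popen(cmd)
    while True:
        time.sleep(poll_seconds)
        if proc.poll() is None:
            continue  # main app is still alive
        if max_restarts is not None and restarts >= max_restarts:
            return restarts
        proc = subprocess.Popen(cmd)  # dead: replace it immediately
        restarts += 1
```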
I closed my laptop and turned my attention back to the meeting. There were about five minutes left, and Charlie was forcefully reiterating his strong expectation that AWS teams needed to spend time creating canaries, while some high-ranking VP was pushing back on his desired timing because the infrastructure to execute canaries and alarm on them was non-existent.
I coyly raised my hand, defying every fiber of my being that was telling me NOT to sign myself up for what could be more than I had bargained for. After all, the idea to leverage the load and performance testing infrastructure for canaries was somewhat impulsive. The prudent thing would have been to wait a bit more, flesh out my design, test it on one or two teams, and then introduce it more broadly. But I am impatient and bullish about things when I’m excited. Sometimes you need to strike while the iron is hot. It was now or never.
“Um, Charlie, I have modified TPSGenerator to be able to run canaries, and it can handle the alarming part too,” I said nervously, my mouth dry.
Charlie raised his eyebrows. “OK, present it next week!” he barked with a smile, got up and left.
And then just like that the meeting was over and people started filing out of the conference room to run to their next meeting, before I could say anything else.
What the hell did I just sign myself up to do, I facepalmed in the now-empty room.
The hard work begins now…
I went back to my desk, opened my laptop again, and got back to the IDE. That day I realized something that should have been extremely obvious to me had my judgment not been clouded by a haze of excitement during that meeting. There’s a huge difference between a quick-and-dirty proof of concept written in one hour and a production-worthy system that a billion-dollar business depends on. But I had already opened my big mouth in front of a Senior VP. It took months and many more engineers to take that crappy hack I pulled off in a frantic coding spell and turn it into something dependable.
In designing any system, you often face tradeoffs. You of course want precision in your system. If your load test is supposed to generate 100 transactions per second (“TPS”), then you want it to generate exactly 100, not 99 or 101. But in a company with the scale of Amazon, you also need a load generator that can produce up to millions of TPS in some cases, so scalability is critical. In a distributed system, can you scale infinitely and always be precise? No. At some point you have to be willing to sacrifice a little bit of one to get the other. In designing TPSGenerator, I had made a number of architectural tradeoffs that favored scalability over precision. When you were trying to run a load test at a million TPS, plus or minus a few was ok. It mattered more that you could generate immense amounts of load. However, when you were running a canary at one transaction per minute, that exactness mattered: it couldn’t be 59 seconds or 61 seconds. So canaries needed precision over scalability.
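To make the precision requirement concrete, here is one general technique for drift-free scheduling (a sketch under my own assumptions, not what TPSGenerator actually did): anchor each run to an absolute timeline instead of sleeping a fixed interval after each test, so that test latency cannot accumulate into drift.

```python
import time

def fixed_rate_ticks(interval, count, clock=time.monotonic, sleep=time.sleep):
    """Yield `count` tick times spaced exactly `interval` seconds apart.
    Each tick targets start + n * interval on an absolute schedule, so
    slow work in between delays a tick but never shifts the timeline."""
    start = clock()
    for n in range(count):
        target = start + n * interval
        delay = target - clock()
        if delay > 0:
            sleep(delay)  # wait out the remainder of this slot
        yield target
```

Sleeping `interval` seconds *after* each test would instead space runs at interval-plus-latency, which is exactly the 61-second problem described above.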
Here’s another tradeoff. In a load test where potentially thousands of machines are coordinating to generate a desired load, a machine or two dying in the middle of your test is not a big deal. Machines are essentially cattle, not pets. In fact, my architecture had assumed that at scale, at least some machines were pretty much guaranteed to fail in the middle of the test, so others would pick up the work eventually. A canary is low traffic, so you don’t need thousands of machines, you need one; but if that machine dies, it needs to be replaced immediately. So it’s a pet, not cattle. We made some changes so that in canary mode we had a host on standby for resilience, which got us into the business of leader election (a bit of complexity I had managed to evade in the load test architecture).
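The standby-host mechanism can be sketched with a lease-based election: the active host periodically renews a time-bounded claim, and the standby takes over only when that claim expires. The in-memory `LeaseLock` below is purely illustrative (the real system used internal Amazon infrastructure; a production version would store the lease in a shared lock service).

```python
import time

class LeaseLock:
    """A toy lease: one holder at a time, and the lease must be renewed
    within `ttl_seconds` or another host may claim leadership."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.holder = None
        self.expires = 0.0

    def try_acquire(self, host):
        now = self.clock()
        if self.holder is None or self.holder == host or now >= self.expires:
            self.holder = host
            self.expires = now + self.ttl  # acquire, or renew, the lease
            return True
        return False  # somebody else still holds a live lease
```

The active host calls `try_acquire` on every heartbeat; the standby does the same and only succeeds once the primary has missed enough renewals for the lease to lapse.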
One lesson learned during this time was that, even when two products look identical, there’s nuance to the choices you’ve made in your architecture to optimize for one case or another. Optimizing for scale over exactness, and for cattle over pets, were the right architectural choices for a load test, but turned out to be fundamental architectural challenges for canaries.
So, at the end of the day, was leveraging the infrastructure that I had built for load testing for canaries a good choice?
On the technical side, with 20/20 hindsight, no, because I frankensteined together two things that didn’t necessarily belong with each other and that, although they seemed functionally similar on the surface, had entirely different non-functional requirements. There are a few forks in the code (if load-test do X, if canary do Y) that I deeply dislike. If-statements littering your code like that are a huge anti-pattern, and a testing nightmare.
But also on the technical side, we were able to provide a huge amount of new functionality with very little additional code, and leverage a lot of the expertise that we had built over years of operating the platform, so there were technical wins too.
And on the business side, it was good. We were able to provide immediate business value to hundreds (eventually thousands) of services that really, really needed it asap. And we did it with minimal engineering toil and minimal cognitive effort on their side, since it was a tool they already knew and used frequently.
In terms of career growth, my work on canaries and availability got me invited to a lot of meetings with Andy Jassy (Amazon’s CEO now, back then the CEO of AWS), Charlie Bell and other high ranking executives. It was a much more intimate, smaller crowd, and that first-hand exposure to senior leaders was hugely influential in who I am today.
While leveraging the load testing platform for canaries was not perfect, I tend to favor pragmatism over perfectionism. Today, in the year 2022, a surprisingly large majority of the original code I frantically wrote in front of a Senior VP during a two-hour meeting in 2016 is still running strong in production — and that makes me smile. The world works in mysterious ways; sometimes, you just need to be spontaneous and embrace serendipity.