Computers Are Hard: bugs and incidents with Charity Majors
Everything’s failing all the time so we’re gonna embrace that and lean into it instead of being afraid.
One of the most memorable moments from my time at an enterprise software company was seeing a very senior engineering manager stumble between desks, laptop in one hand and phone in the other, screaming: ‘Fuck, fuck, fuck! We’re going down!’. She then barged into a conference room permanently booked for the incident response team and started paging people in an office halfway across the globe. It was still the middle of the night for them, but when you’re on call, you’re on call.
The first minutes of an emerging outage are frantic. Alerts start pinging, the number of support tickets goes through the roof, and engineers and customer service scramble to assemble the response team. But then it settles down. Tech companies write runbooks to make sure a service failure is addressed quickly and efficiently. In action, the process looks like a NASA launch center: a bunch of people staring at screens and responding to what they see in perfect synchronization, everybody knowing exactly what to do.
But this raises the question: why did tech companies need to operationalize every step of incident response? There is no other industry so accustomed to its products breaking that it’s considered part of the daily routine. Think about it. If one day every Prius in the world refused to start for a couple of hours, Toyota would face international scrutiny and be forced to recall hundreds of thousands of cars. If an app isn’t working, you just google whether it’s down for everyone or just for you and come back later.
What makes software so special that we have to live with the continuous fixing of bugs and the emergence of new ones? To find out, I reached out to Charity Majors, co-founder and CTO of Honeycomb, a company building tools for developers to debug and better understand the systems they work with. We talked about how and why software breaks, what engineers do when that happens, and about Charity’s own adventures in debugging routers in Romania.
Wojtek Borowicz: Say you run an online service, be it a store, or an app, or something else. And then one day it’s down. How can an outage happen? What are some of the causes?
Charity Majors: We’re shipping software all the time. There’s far more to do, and fix, and build than we have time for in our entire lives. So we’re constantly pushing changes and every time we change something, we introduce risk to the system. There are edge cases, things you didn’t test for, and then there are subtler things. Not directly something you changed, but the interaction of two, or three, or four, or five, or more systems. This is really hard to anticipate and test for in advance. That’s why we’re increasingly thinking less about how do we prevent failure and more about how we can build our systems so that lots of things have to fail before the users ever notice.
It’s mind-boggling to sit back and actually think about how complex these systems are. What amazes me is not that things fail but that, more or less, things work.
What does it mean that you’re constantly introducing changes? Do developers deploy new code once a week? Daily? Every hour?
It depends on the system. But your system is built on top of another system, built on top of another system… so it could be someone introducing a change you have no visibility into and no control over. Timing doesn’t matter. You should just assume that changes are happening literally all the time. That’s the only way to plan for risk.
So it’s entirely plausible that your service went down even though you didn’t make a change yourself?
Oh yeah! And changes are not just code. They can be other components, too. Like if a piece of hardware fails. Or there’s a storm on the East Coast. Or you need to add more capacity to handle increased load.
On the user’s end, most of those failures look the same though. Twitter’s Fail Whale is perhaps the most popular example…
These are the easy ones! The ones that tell you: hey, I failed — those are the lucky ones. Most failures are not that graceful. Most failures don’t give you a Fail Whale. Suddenly, things are just not working the way you expect them to. And sometimes you will never figure it out or even notice it.
When the system is down, how do you recognize which component failed?
This speaks to exactly what I’m doing with my life right now. Because we’re in the middle of this great shift, from the monolith to microservices or from one to many. The old world was one where you had The App. You deployed The App and you had The Database. You could kind of fit the whole system in your head and visualize it. You could see where requests were going and you could reason about it. And we monitored those systems and found thresholds. As long as some metric was between this and that, we’d call it good.
That whole model is starting to completely disintegrate. Now, instead of The App and The Database you have tens, dozens, hundreds. You’re depending on all those loosely coupled, far-flung services that aren’t even yours, yet you’re still responsible for your availability. So increasingly instead of monitoring the problems, you really just need to focus on building visibility, so that while you’re shipping code you can look through the lens of your instrumentation and see: am I shipping what I think I am? Did I build what I wanted to? Is it behaving the way I expected?
If engineers just developed the muscle memory of pushing to master, looking at it as it was being deployed, and asking themselves those questions, 80–90% of problems would be caught before users notice. But it’s scary to people because it’s very open-ended and exploratory. There are no answers and no dashboards that say: there’s the problem. The problem is usually like half a dozen impossible things combined.
Does moving to this model of tens or hundreds of entangled services also make it tens to hundreds of times more difficult for engineers to diagnose problems?
More than that. It’s exponentially harder. The hardest problem is not fixing the bug, it’s finding where in the system the piece of code that you need to debug lives… well, I should backtrack. It’s very hard when you’re using the tools we’ve had for the past two decades. It is not harder when you get used to it. But there’s a learning curve. It’s a shift from the mindset of control to the mindset of: everything’s failing all the time so we’re gonna embrace that and lean into it instead of being afraid. We’re gonna get our hands wet every single day, looking at prod, interacting with prod, and not gonna be scared.
I come from Ops and Ops are notorious for telling developers to get out of our way. Stay out of production, we don’t trust you, it’s scary here. And that’s a huge mistake that we’re just beginning to make up for. Instead of building this terrifying glass castle, we should have built a playground with bumpers and safety guards. Your kid should be able to run around in prod, get a bloody nose, eat a lot of dirt, but not kill themselves. It shouldn’t be scary. Engineers should grow up learning how to conduct themselves in production.
Is it fair to say then that software has become so complex that you cannot build it in such a way to prevent failure?
You should build it with the assumption that it’s failing all the time and that’s mostly fine. Instead of getting too hung up on failures, we need to define SLOs — Service Level Objectives. It’s like a contract everyone in the organization makes with our users. We’re saying that this is the level of service that’s acceptable and you’re paying us to provide it. So like, 0.5% failure rate or whatever. Anything better than that we don’t have to obsess about. We can go and build product features until that threshold starts to be threatened, in which case it’s all hands on deck. This is really liberating. It’s a number we’ve all agreed on, so it has the potential to ease a lot of frustrations that many teams have had for years and years.
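The arithmetic behind an SLO like this is often described as an error budget: the failures you are allowed before the contract is threatened. A minimal sketch, assuming a hypothetical 0.5% failure-rate budget (the target, function name, and request counts are illustrative, not any particular tool's API):

```python
# Hypothetical SLO: 99.5% of requests must succeed (0.5% error budget).
SLO_TARGET = 0.995

def error_budget_remaining(total_requests: int, failed_requests: int) -> float:
    """Return the fraction of the error budget still unspent.

    1.0 means no budget used; 0.0 or below means the SLO is threatened
    and it's all hands on deck.
    """
    allowed_failures = total_requests * (1 - SLO_TARGET)
    if allowed_failures == 0:
        return 1.0 if failed_requests == 0 else 0.0
    return 1 - failed_requests / allowed_failures

# 10 million requests this month, 30,000 of them failed:
# 50,000 failures were allowed, so 60% of the budget is spent.
print(error_budget_remaining(10_000_000, 30_000))  # 0.4
```

While the remaining budget is positive, the team ships features; as it approaches zero, reliability work takes priority.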
Let’s go back to outages. Now that we identified one, how do we go about fixing it?
Well, step one is: figure out what the outage actually is. When it started, what the scope is, who it impacts, any dependencies… this is harder than you might think. Fixing the problem is usually trivial compared to figuring out precisely what is happening. Because of this, we have a tendency to jump straight into fix mode and just start blindly doing things that we have seen fix prior outages. This is terrible! You can easily work yourself into a much worse state than the one you started in.
So the first step to fixing it is truly understanding it, and making sure that you understand it, and communicating it to other stakeholders in case there’s something they know that you don’t know. How you fix it depends entirely on what it is.
Is it common in tech companies that when a service is down, people who respond to the outage are not the people who built the element that failed?
It is common but I think it’s changing. It’s hard to build on-call rotations and feedback loops that are tight and virtuous. One thing I like to do with my teams is anytime we get an alert for a deploy that’s just gone out, we page the person who merged the diff. It’s super simple and 95% of the time it is that person’s responsibility. And they want to know when it’s still fresh in their head. They don’t want to find out five hours later when it’s gone through all the escalation points. Everyone wins! But it’s hard to do and it’s not something anyone gets right on the first try. Microservices help with this because in theory you’ll only get alerts for the stuff that you own.
Now, there’s always going to be a tier of system specialists who literally specialize in the system as a whole — often called Site Reliability Engineers (SRE). But they don’t want to be the ones who get all the pages either. They want to be escalated to when it’s clear that the problem is bigger than any one component.
This gets to the bottom of something super important: the idea of ownership. We don’t just write code and fling it over a wall. We own it. Some people think of it as scary, something that won’t let them sleep at night. It’s not. Ownership means you care. You deeply care about the quality of your work — it’s your craft, right? You want to build something well and you want your users to be happy. We all have that desire but it has been squeezed out of many of us by shitty on-call rotations and frustrating times when you’re responsible for something but don’t have the tools or the authority to make the change that needs to be made. That’s just a recipe for frustration.
Have you ever been in a situation when you were on call or responsible for a problem, looked into it, and had no idea what was broken? What’s your next step when that happens?
You start digging. Using my own tooling, Honeycomb, the right thing to do is generally start at the edge and start bisecting. Start following the trail of breadcrumbs until you find something. This is hard to explain to people who are used to the dashboard style of debugging, where you kind of have to form a hypothesis in your mind and then go flipping through dashboards to verify it. That’s kind of how debugging has worked… but that’s not debugging. That’s not science. That’s magic, gut intuition, and pattern matching.
It is very different when you have instrumentation for your entire stack and you just start at the edge, start slicing and dicing… for example, instead of going ‘I see a spike in errors. It smells like the time Memcached was out. I’m gonna look at some Memcached dashboards’, you would be like: ‘There’s a spike. Let’s slice by errors. And now slice by endpoints. Which endpoints are erroring? Looks like it’s the ones that write to databases. Are all of the write endpoints erroring? No, only some of them. What do they have in common?’
On every step of the way, you examine the result and take another small step. And there are no leaps of faith there — you just follow the data.
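The slice-by-field workflow she describes can be sketched over raw request events. This is a toy illustration with hypothetical event records, not Honeycomb's actual query engine; the field names and data are invented:

```python
from collections import Counter

# Hypothetical request events; in practice these come from instrumentation.
events = [
    {"endpoint": "/write/orders",  "db": "primary",  "error": True},
    {"endpoint": "/write/orders",  "db": "primary",  "error": True},
    {"endpoint": "/write/users",   "db": "replica2", "error": False},
    {"endpoint": "/read/orders",   "db": None,       "error": False},
    {"endpoint": "/write/billing", "db": "primary",  "error": True},
]

def slice_by(events, field):
    """Count the errors grouped by one field -- a single 'slice' step."""
    return Counter(e[field] for e in events if e["error"])

# Step 1: which endpoints are erroring?
print(slice_by(events, "endpoint"))
# Step 2: what do the erroring requests have in common?
print(slice_by(events, "db"))  # every error hit the 'primary' database
```

Each slice narrows the hypothesis space by one dimension, which is the "small step, examine, take another small step" loop in miniature.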
Memcached: an open-source technology for caching data. It allows web applications to take memory from parts of the system with plenty available and make it accessible to parts that are short on memory. This way engineers can use the system’s resources more efficiently.
From what you’re saying it sounds like this is a novel approach. Would you say that most companies still base their incident response on gut reactions and pattern matching?
Yes, absolutely. These are the dark old days and I’m trying to get people to see it can be so much better.
Is there such a thing as a completely unpredictable outage?
Yeah, absolutely. Let me give you an example. When I was working at Parse, one day the support team told us push notifications were down. I was like: ‘push notifications are definitely not down’. They were in the queue and I was receiving push notifications myself, so they couldn’t be down. Two or three days passed and the support team came back telling us people were really upset because pushes were down. I went to look into it. Android devices used to have to hold a socket open to the server to subscribe to pushes. We made a change that caused the DNS response to exceed the UDP packet size. Which is fine. Usually, DNS would just fall back to TCP. And it did: for everyone except one router in eastern Romania.
You can’t predict this stuff. You shouldn’t even try. You should just have the instruments and be good at debugging your system. That’s all you can do.
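The TCP fallback in that story follows from how DNS handles oversized answers: a response that doesn't fit in a UDP packet is truncated, and the client is expected to retry over TCP. A simplified sketch of that rule (512 bytes is the classic pre-EDNS0 UDP payload limit; the function itself is illustrative):

```python
# Classic DNS-over-UDP payload limit without EDNS0 extensions.
MAX_UDP_DNS = 512

def deliver_dns_response(payload: bytes) -> str:
    """Decide how a DNS response of this size reaches the client."""
    if len(payload) <= MAX_UDP_DNS:
        return "sent over UDP"
    # Too big: the server sets the TC (truncated) bit in the UDP reply,
    # and the client is expected to retry the query over TCP. A router
    # that drops that TCP retry breaks exactly this path.
    return "truncated over UDP; client retries over TCP"

print(deliver_dns_response(b"x" * 100))  # sent over UDP
print(deliver_dns_response(b"x" * 600))  # truncated over UDP; client retries over TCP
```

A single middlebox mishandling the TCP retry is enough to break resolution for everyone behind it, which is why this class of failure is so hard to anticipate.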
How much of a factor in bugs and outages is human error? How much responsibility can you assign to one person or one team?
I really don’t like the phrase human error. It’s never a single thing. People who do this for a living, resilience engineers, always stress that there are many contributing factors. Even if a human was the last link in a chain, that is still a long chain that led them to think something was the right thing to do. No one is maliciously doing it. Everyone’s doing their best. And when you try to pinpoint humans as the source of the problem, people just freeze. They start pointing fingers and stop being willing to share what they know. Then you’re not gonna make any progress whatsoever. People have to feel emotionally safe, they have to feel supported, they have to know they’re not gonna get fired because we’re all in this together.
I like to think of computers as socio-technical problems. It’s not just social, it’s not just technical. You can rarely solve a problem just by looking at the tools and you can rarely solve a problem just by looking at humans. They need to work in concert with each other.
Why do some outages take much longer to fix than others? What has to happen for an incident to be so catastrophic that a service stays down for days?
Usually it comes down to data. Data has gravity and mass — that’s how I like to think of it. Everything gets scarier, and longer, and more permanent the closer you get to disk. You never want to be in a situation where you only have one copy of the data, because you could go out of business in the blink of an eye.
Here’s a thing that happened to me at Parse. We had a bunch of databases with multiple copies of the data. Sometimes, all of the replicas would die except the primary. I could not turn access back on until I copied the entire replica set. That could take a very long time. Other times it wasn’t even a question of best practices and being safe. It could be a case where the database won’t start back up until it performs a consistency check or until we copy over the only remaining copy from the tape archive. There are all sorts of things that can happen when you’re dealing with data, so it can take a lot of time.
Tape drives: storage devices that store data on magnetic tapes. They’ve been around for decades but have been pushed out of personal computing by technologies that allow much faster access, like HDD or SSD. Many companies, however, still back up their data to tape drives. Tapes are secure and extremely durable: they can go decades without maintenance and remain functional.
So when you’re experiencing a particularly nasty outage as a user, it’s down to how fast data can be restored, not because engineers on the other end aren’t typing code fast enough.
I guarantee you they’re working as fast as they can. There’s an amount of time it takes you to figure out what the problem is. And then there’s the amount of time it takes you to recover. There’s also a lot of stuff like, maybe it’s not down for everyone but this particular shard is gone for days because a backup broke down? Or maybe they’re writing tools to help recover for as many people as possible?
It’s kind of ironic that in our quest to keep everything resilient and redundant and up 100% of the time, we’ve sliced and diced and spread everything around so much that now there are a hundred points of failure instead of just one.
Some companies suffer from outages more often than others, even though, at least on paper, Silicon Valley attracts the best engineering talent in the world. Why?
First of all, I’m gonna push back on the idea that Silicon Valley companies have the best engineers in the world. They don’t. Maybe they do, for a very narrow definition of best. But, you know, I’ve been in those hiring meetings, and people will straight up admit there is no correlation between the questions that they ask, how well the interviewees do on them, and how good of an engineer they are. Some of the best engineers I know are not in Silicon Valley. Some of the worst I’ve met are here. It’s definitely a magnet, but I hate that idea that the best engineers are here. It’s not true and it’s harmful.
When we were hiring for Honeycomb, I could have just gone out and hired all of the most senior, awesome engineers I’ve worked with at Parse and Facebook. I didn’t do that because I knew we were building a tool for everyone and I wanted to have diverse backgrounds. And I’m gonna admit something a little bit embarrassing. For a while, I thought it’s too bad my team wouldn’t have the experience of working with those excellent engineers I worked with. But here’s the thing — this team kicks the ass out of any other team I have worked with. They ship more consistently, they ship better quality code, and I’ve had to reckon with my own snobbery and bias. I no longer think that best engineers make for the best teams. They don’t. The best teams are made of people who feel safe with each other, who communicate, who care passionately about what they’re doing, and can learn from their mistakes. The whole best engineers thing is total bullshit.
Now, back to your question. Why can’t Silicon Valley get it right? Well, we’re solving new problems in Silicon Valley. Like problems of scale. Google’s solutions don’t work for anyone but Google. It’s a hard set of problems and it’s hardest the first time. After it’s been solved once, or twice, or three times, we can learn from each other and it gets a lot easier.