In 2014, when Dropbox had around 400 employees, I was sitting in a Friday-afternoon All Hands meeting, upgrading the operating system (OS) on some of our servers from my chair. At the time, we had an automated system in place to do this, but we needed to get the production servers running on the newer OS as soon as possible. To do this, I had to manually run a handful of commands to reimage and upgrade the machines.
Around the same time, I saw several key engineers get up and leave the room. Huh, that’s strange, I thought. I wonder what’s going on. More people began trickling out. And some more. Eventually the entire Infrastructure team got up and left at once, which was pretty alarming. Then a page went out that things were very, very broken. Dropbox was down.
Everyone was scrambling trying to figure things out. Since I was so junior, I was just waiting around on the off-chance I could be helpful. Someone mentioned a particular subset of database machines that seemed to be impacted, and my stomach got that heavy, awful feeling. It sounded suspiciously close to what I had been working on during All Hands. I knew I must have somehow done something wrong.
My mind blanked, and I didn’t really know what to do. You have that moment where you’re like, should I not say something? Or should I tell everyone? And what if this is not even the problem, but I’ve sent an entire team scrambling in a new direction? Ultimately, I knew that speaking up and escalating could be critical to reducing recovery time.
I asked one of the more senior site reliability engineers, a mentor of mine who was smart and really knew our infrastructure, to check it out. We started digging into my shell history to figure out if this was the cause of our outage — and, yes, it was.
Here’s a simplified version of the sort of command I ran:
- dsh --group hwclass=database lifecycle=reinstall perform_upgrade
For those not familiar, dsh is a tool that lets you specify a set of hosts and a command to run on them. My intention was to take all our database hosts that were marked to be reinstalled and run the upgrade command on those machines.
In case you can’t spot the problem with the above command — it actually took several engineers a long time to figure out what was wrong — it’s missing a crucial pair of quotation marks.
It should’ve looked like this:
- dsh --group "hwclass=database lifecycle=reinstall" perform_upgrade
Instead of only upgrading the database hosts that were marked for reinstall, I unintentionally ran this command on every single database host. This led to a large number of the live databases getting wiped out.
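To see why the quotes matter, here's a small Python sketch using shlex to tokenize both commands the way a POSIX shell would. (This illustrates shell word-splitting only, not actual dsh behavior; the resulting argument lists are what any program would receive.)

```python
import shlex

# Without quotes, the shell splits the selector into separate words:
# --group only captures "hwclass=database", and "lifecycle=reinstall"
# becomes a stray argument, so the target set is far wider than intended.
broken = shlex.split('dsh --group hwclass=database lifecycle=reinstall perform_upgrade')

# With quotes, the full selector stays together as a single argument to --group.
fixed = shlex.split('dsh --group "hwclass=database lifecycle=reinstall" perform_upgrade')

print(broken)  # ['dsh', '--group', 'hwclass=database', 'lifecycle=reinstall', 'perform_upgrade']
print(fixed)   # ['dsh', '--group', 'hwclass=database lifecycle=reinstall', 'perform_upgrade']
```

The failure mode is silent: both commands are syntactically valid, which is why it took several engineers a while to spot the difference.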
Now that we had a better understanding of what had happened, we were able to start working on recovering. Teams across the company — Comms, Office, Product, Legal, User Ops, Sales, and Engineering — all came together. It was understood that anything that wasn’t going to contribute to the recovery needed to be addressed later.
Because we had database backups and an amazing team, we were able to get things working for the majority of people fairly quickly, but the full recovery took about four days. Most of this work took place over the course of the weekend, which made coming in the following Monday difficult. I had the entire weekend to think about how things were still broken, how so many people’s weekends had been ruined, and how it all felt like my fault. When I got in, though, I was surprised that so many people — all of my teammates and also really senior people — offered so much support. They explained that this can happen to anyone who works in our field and that we shouldn’t have tools that allow something like this to happen. Ultimately, my relationship with Dropbox was so much stronger.
For every major outage we’ve had, we’ve learned lessons and applied them to make our infrastructure more reliable and resilient. Out of this experience, we made some important technical changes, like verification checks to ensure that no one can perform dangerous operations on hosts that are running important processes. We also decreased the amount of time it takes to restore the databases from backup. We made some non-technical improvements, too, like formalizing the SEV process — a standard way for us to respond to incidents. It enables coordination between the people actively fixing the problem and those communicating with the company and customers, and it makes clear who is responsible for what.
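To give a flavor of the kind of verification check described above, here is a minimal, hypothetical pre-flight guard. All names and states here are illustrative assumptions, not Dropbox's actual implementation: the idea is simply that a destructive operation should refuse to proceed unless the host is explicitly marked for reinstall and is not serving live traffic.

```python
# Hypothetical guard: states and field names are invented for illustration.
LIVE_STATES = {"serving", "primary", "replica"}

def assert_safe_to_reimage(host_state: str, lifecycle: str) -> None:
    """Raise unless the host is explicitly marked for reinstall and not live."""
    if lifecycle != "reinstall":
        raise RuntimeError(f"host lifecycle is {lifecycle!r}, not 'reinstall'; refusing")
    if host_state in LIVE_STATES:
        raise RuntimeError(f"host state is {host_state!r} (live); refusing destructive op")

# A tool would call this per host before reimaging, so a bad selector
# fails loudly instead of wiping live databases.
assert_safe_to_reimage("drained", "reinstall")  # passes silently
```

A check like this turns a quoting mistake from a catastrophic wipe into a refused command, which is exactly the "tools shouldn't allow this" lesson from the outage.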
I learned a lot personally from this experience, as well. I’ve learned to find the value in something going wrong. I’ve learned to be defensive when I’m building things — to remember everything can break, so what can I do to make sure that doesn’t happen? I’ve also learned that when things fail, you have to remember that not everything is your fault! Just learn, grow, and come back better than ever.