What I learned from deleting a production database

Rafael Gaino
Motorway Engineering
4 min read · Aug 17, 2021

Photo by Elisa Ventur on Unsplash

The funny thing about the day I dropped our main production database is that I can't really remember what I was working on. What I remember instead is how I felt the second my finger hit the Return key on that script, and how that day turned into a lesson in disaster recovery and a brilliant display of the no-blame culture at Motorway.

Part 1: The Drop

In the early days of Motorway we were just three engineers wearing all the hats we could find. It wasn't uncommon for us to have to log in to a database with write credentials to manually change some data. And that's the first lesson here, people: anything that can go wrong, will go wrong. Murphy's Law spares no one.

Whatever that forgotten task was, it involved heavy changes to some tables. To make development quicker, I wrote a script that would drop my local DB and re-create it from a dump file. That was when I was called to investigate an urgent production problem. So I logged in to the production DB with full permissions and did what I had to do. Then I closed my laptop, took a break for about an hour, and came back later to continue the work I had been doing before.
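
For illustration, that kind of reset script can be as short as the sketch below. The PostgreSQL tooling and every name in it are assumptions I'm making for the example; the important part is the host check at the top, exactly the kind of guard my script didn't have.

```python
# reset_local_db.py: an illustrative sketch of a "drop and restore from a dump"
# helper. PostgreSQL tools, names, and paths are assumptions for the example.
import os
import subprocess
import sys

DB_NAME = os.environ.get("DB_NAME", "app_dev")
DB_HOST = os.environ.get("DB_HOST", "localhost")
DUMP_FILE = os.environ.get("DUMP_FILE", "dev_dump.sql")

# The guard that would have saved my evening: refuse to touch anything non-local.
if DB_HOST not in ("localhost", "127.0.0.1"):
    sys.exit(f"Refusing to drop '{DB_NAME}' on non-local host '{DB_HOST}'")

subprocess.run(["dropdb", "--if-exists", "-h", DB_HOST, DB_NAME], check=True)
subprocess.run(["createdb", "-h", DB_HOST, DB_NAME], check=True)
subprocess.run(["psql", "-h", DB_HOST, "-d", DB_NAME, "-f", DUMP_FILE], check=True)
```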

That was when I ran the script using the production connection.

I instantly knew what I had done. My hands began to shake, and my stomach began to churn. I took a deep breath and decided two things right then: I was going to own that mistake, and I was going to see the fix through.

Part 2: Owning My Mistake

I immediately turned to Slack to notify everyone, but it was already late and most people were offline. So I opened our emergency WhatsApp channel and wrote a short, clear message: I had dropped one of the production DBs by accident and was looking for ways to restore it. If anyone could join and work with me, please do so as soon as possible.

Before people could properly react, I found a backup on an external service, but it was a few hours old. I knew most of the data missing from that window could be pieced together from other DBs, from our logs, and from external storage, so I decided to restore the service as quickly as possible and then begin repopulating the gap. I restored the DB and the service was operational again. The total outage lasted only 18 minutes. My hands stopped shaking, but we were not out of the woods yet. That's when my colleagues came online and started to help me piece the puzzle together.

Part 3: Restoring and Learning

As we worked together to fill the data gap, I was amazed to find that I was the only one who was incredibly mad at me. Everyone else was focused on the solution. We wrote scripts to collect data from other sources and were able to fill the gap almost completely. I worked overnight to finish those scripts, and before the rest of the company woke up, things looked normal.

We communicated the incident to the rest of the company and explained what kind of anomalies they could expect for the affected records. I wrote a detailed email to my manager and the CEO explaining how something like this could even happen, how it was restored, and how to prevent it from ever happening again (and it never did).

The responses from the founders were very gratifying. Here I'll paste some of the things they wrote: "this isn't your fault and could just as easily have happened to anyone else on the team (…) This is a wake-up call to all of us over database security and access (…) This incident is bad, but it could also have been an awful lot worse. Let's take it as an opportunity to dramatically improve our processes around data durability." And also: "It's how we handle and learn from things like this that matters, so well done — and thank you 😊"

You read that right. They wrote "thank you" with a smile emoji on an email where I said I had dropped a production DB. If this is not a great place to work, I don't know what is.

Part 4: The Aftermath

Once the dust settled, we went through a review of security and access, wrote a guide around Disaster Recovery, ensured credentials had limited power, and made sure problems would be communicated properly. This happened over two years ago, and we haven't had a disaster since. But if (or when) one comes, we'll be ready for it.
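
To make "limited power" concrete, here is one way it can look, sketched for PostgreSQL with hypothetical names rather than our actual setup: a role that can read data for day-to-day debugging but cannot drop, alter, or write anything.

```python
# grant_readonly.py: an illustrative sketch of a least-privilege support role.
# The role, database, and connection string are hypothetical.
import psycopg2

ADMIN_DSN = "postgresql://admin@localhost/app"  # hypothetical admin connection

STATEMENTS = [
    "CREATE ROLE support_readonly LOGIN PASSWORD 'change-me'",
    "GRANT CONNECT ON DATABASE app TO support_readonly",
    "GRANT USAGE ON SCHEMA public TO support_readonly",
    "GRANT SELECT ON ALL TABLES IN SCHEMA public TO support_readonly",
    # Tables created later should be read-only for this role as well.
    "ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO support_readonly",
]

# psycopg2's connection context manager commits the transaction on success.
with psycopg2.connect(ADMIN_DSN) as conn:
    with conn.cursor() as cur:
        for sql in STATEMENTS:
            cur.execute(sql)
```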

We plan as much as we can, but things will sometimes go wrong. Mistakes are rare, but when they happen we'll learn from them, come together as a team, fix them, and never let them happen again.

This whole ordeal was a great lesson for me. At the time I had only been with the company for a few months, and I fully expected to be fired or at the very least reprimanded. Instead, I saw my colleagues come together to help me fix my mistake as if it had happened to them. I saw our leadership commend me for communicating and owning my mistake, and I saw a great display of a no-blame culture. And when (or if) this happens to anyone else here, I'll be sure to react the same way.

Rafael Gaino
Motorway Engineering

Tech Lead at motorway.co.uk. Failed rock star. Perpetually working on The Next Big Thing™.