I Deleted a Database in Production
And I had no backup
There comes a moment in every developer’s life when they must face this inevitable situation. Really, it’s a rite of passage for our community.
I’m writing this short post to chronicle the events that led to this dreadful accident, along with a few important lessons to learn from the fiasco.
Before we begin, a quick overview of my infrastructure. The infrastructure at WorkIndia is hosted on AWS, and some of the terms below refer to AWS services, but this could happen to anyone hosting databases on virtual machines. This particular microservice was hosted on a single virtual machine running the API, a PostgreSQL server, and a MongoDB server as Docker containers.
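As a rough sketch of that layout (the image names, volume paths, and ports here are illustrative, not the actual WorkIndia configuration), everything ran as sibling containers on one host:

```shell
# All three containers on the same EC2 instance -- the API is stateless,
# but both databases persist their data to the instance's local disk.
docker run -d --name postgres -v /data/pg:/var/lib/postgresql/data postgres:12
docker run -d --name mongo -v /data/mongo:/data/db mongo:4.2
docker run -d --name api -p 80:8000 \
    -e PG_HOST=postgres -e MONGO_HOST=mongo my-api-image
```

Note that the database volumes live on the instance itself, which is exactly what makes the instance's fate matter so much in what follows.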
Here’s what happened:
I try to SSH into the machine, but the connection fails due to unnaturally high CPU load. At this point the usual solution is to reboot or stop/start the instance from the AWS Console. BIG MISTAKE!
Rebooting doesn’t work.
I stop the instance and give it a few minutes to come back up on a new host and start the Docker containers.
I see that another instance with the same name has been spawned.
Sudden realisation that the service has been configured behind an Auto Scaling Group (ASG). When I stopped the original instance, the ASG couldn’t find a healthy machine and it created a new instance from a very old launch template.
It’s okay, at least I still have the original instance and all its data. I’ll just change the database host settings on the new machine.
The original instance is automatically terminated by the ASG.
Once I stopped the instance, the ASG marked it as unhealthy and proceeded to permanently terminate it. This is the default behaviour of an ASG, but an instance can be saved from such a fate by moving it to Standby, or by suspending the group’s HealthCheck and ReplaceUnhealthy processes, before stopping it. (A termination policy only controls which instance the ASG terminates first during scale-in, not whether it replaces an unhealthy one.)
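Had I known, the safe sequence would have been to take the instance out of the ASG’s hands before stopping it. A sketch with the AWS CLI (the instance ID and group name are placeholders):

```shell
# Option 1: move the instance to Standby so the ASG stops
# health-checking it while you work on it.
aws autoscaling enter-standby \
    --instance-ids i-0123456789abcdef0 \
    --auto-scaling-group-name my-service-asg \
    --should-decrement-desired-capacity

# Option 2: suspend the processes that terminate and replace
# "unhealthy" instances for the whole group.
aws autoscaling suspend-processes \
    --auto-scaling-group-name my-service-asg \
    --scaling-processes HealthCheck ReplaceUnhealthy
```

Either way, the ASG will no longer treat a stopped instance as a casualty to be replaced, and you can stop/start it freely before exiting standby or resuming the processes.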
I realise I have no backup.
I call in reinforcements, but it’s too late. The damage is done. All the data in both PostgreSQL and MongoDB was lost.
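Even a crude nightly dump shipped off the machine would have changed this story. A minimal sketch of one (the bucket name, credentials, and paths are placeholders, not what WorkIndia runs today):

```shell
#!/bin/sh
# Nightly dump of both databases to S3 -- run from cron, e.g.:
#   0 2 * * * /opt/backup.sh
STAMP=$(date +%F)

# Dump PostgreSQL and MongoDB from their containers, compressed.
docker exec postgres pg_dumpall -U postgres | gzip > /tmp/pg-$STAMP.sql.gz
docker exec mongo mongodump --archive --gzip > /tmp/mongo-$STAMP.archive.gz

# Ship the dumps off the instance so they survive its termination.
aws s3 cp /tmp/pg-$STAMP.sql.gz s3://my-backup-bucket/pg/
aws s3 cp /tmp/mongo-$STAMP.archive.gz s3://my-backup-bucket/mongo/
```

The crucial property is that the backups leave the machine: a dump sitting on the same disk dies with the instance.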
Recover some data?
My trusted colleagues immediately got to work and devised a plan to regenerate most of the lost data using logic from another service. It took them no more than a day, and thankfully the data was recovered, more or less.
Infrastructure is an error-prone business. As an infrastructure engineer, I feel more of my energy is spent on creating fail-safes than on trying to build a perfectly robust application. Things will fail; it’s our job to handle the failures gracefully.
Here are a few lessons I’ll take away from the above incident:
Never keep stateless and stateful applications in the same virtual machine.
At our scale, it is unwise to mix stateless applications like API servers with stateful applications like PostgreSQL and MongoDB which have storage requirements. Stateless and stateful applications play by different rules and therefore need to have vastly different failure strategies.