A week or so before the Christmas holidays, I made a classic mistake at work.
We were reaching the end of a milestone for a project, and the team was working hard adding the final features and fixing bugs.
I was working on a particular feature that required all of our documents stored in Apache Solr and S3 to be updated. We’re building a REST API using the Serverless Framework, and the particular lambda function I was creating went something like this:
- Retrieve all relevant documents from Solr, and then update them with the new information
- THEN retrieve all the relevant objects from S3 and then update them
This also needed to work at scale for thousands documents. As a side note: in the future I think this sort of job which could take anywhere form a few seconds to a few minutes would need to be added to a Que Layer in our architecture, but because of time constraints we just did it all in one Lambda function.
While I was testing it, things went great. It was updating everything smoothly. Then I edited one line of code, so that Solr would only return the fields I needed (e.g ID and summary) instead of the whole object. However, I forgot that Solr will just replace the old record with the new record.
So, a few minutes later we noticed the front page of the production website was displaying articles from years ago. Turns out, that I had updated 5,000+ records on the production Solr core and also the back up core, with incomplete data.
Luckily the data we had stored in S3 was enough to piece back together our Solr core. Our first priority was to fix the home page asap, which needed to show correct data from the last week. After making sure our most recent snapshot for the Solr Server (on an ec2 instance) was too far out of date to reload, Adam Malowiecki and Sam Gaunt quickly put together a Python script that cross referenced the data in S3 with Solr and had the home page back to normal in less than an hour. After that we made sure everything was working as normal and fixed the rest of the broken documents within another hour.
Fortunately, this is a good time to reflect on this mistake. It shed light on a lot of shortcuts we were taking and best practices we were ignoring. The solutions to prevent this are fairly obvious, but we just never put a proper system in place to prevent this from happening.
First of all, the local development environment needs to be decoupled from production. This is obvious, but we had only just launched the production site and hadn’t switched over yet. Solr has a docker image that we used in the first few weeks of developing this website. One of the first things Adam said when this happened is “Guys, we really need to use that docker image locally”, and I couldn’t agree more.
We stopped using this because we set up a Solr instance on an ec2 server which became somewhat of a snowflake. We had the configured the schema, loaded in data from the old clients site and had multiple Solr cores running on it. Unfortunately, we never took the time to update our local docker image so we never used it.
Another very easy simple solution to preventing Solr data being lost, is adding in daily/weekly snapshots to our ec2 instance. This would allow us to quickly fix any issues we have with the server, and would have helped us quickly patch up the production site if a snapshot had been taken the previous day.
Another solution to this problem is to use a staging environment. From the outset of this project I knew it would be important to have a staging environment, which I had done some work on but hadn't had time to properly set everything up. Having a staging server would allow us to test a build before we deploy to the production site. A staging build that has a separate environment with a Solr server would be a great step towards preventing the production site from being affected again. In addition to this, we also added in separate S3 buckets to store staging documents.
Currently we just have a Postgres database to add to our staging environment before we have a completely decoupled staging environment.
This is a great start towards a more structured environment set up that prevents production being broken in the future. However, there are still improvements that we can make in the future. For a start, we could reload our staging/local dev environments every night with production data. This helps us in a few ways.
- Our staging environment will closely mirror our production site, and allow us to debug issues quickly.
- If production data was ever corrupted, we would have a pretty good back up we could quickly load into the production environment.
Another practice we could implement is starting to use Infrastructure As Code. In an ideal world, we could have our infrastructure set up in a separate repo, and easily deploy a production stack and staging stack with identical infrastructure. This would have prevented our Solr server becoming a “snowflake” in the first place.
Cheers for reading this post! It’s the first time I’ve made a serious mistake on a production website and wanted to reflect on what went wrong and think about future steps that would prevent it from happening again.
On Twitter @awoldes.