How to make changes the right way
Change needs to be managed
Software is a living entity and changes often as businesses expand and mature. A solution that worked well 6 months ago may well not cut the mustard today, especially in a startup environment where things change at a rapid rate.
Every so often we need to change out part of our infrastructure to meet new requirements from both our internal and external users. In the last year especially, carwow has grown rapidly and we need our tech stack to keep pace with that growth.
It's hard to gain users and easy to lose them. All the marketing and sales effort that goes into attracting and retaining users can easily be for nought if we deliver a bad experience.
When we are upgrading components and making changes our users do not want to see a maintenance page, an error page or, even worse, lose data.
It is important to plan changes out, trial-run them and then script them so they are reproducible. There is not much worse than being in the middle of an upgrade when something unexpected occurs… sometimes things happen that you cannot control, but that does not mean we should not attempt to control the risk as much as we can.
Here are some principles that we use at carwow to manage making changes to our tech stack:
Don’t do it unless you have to
This might seem counter-intuitive to say in a blog post about change, but it is important to consider. Any change to infrastructure is risky, no matter how well you plan and prepare. If it is something that can be deferred until a time when the risk is lower, great: you have solved the problem and it's time to move on to the next one.
Limit the scope
Don’t change everything at once. Modern tech stacks are composed of many different components and the more components that change at once the greater the risk of services going down, data being lost and users having a bad experience.
Script all the things
Every step / command / action must be written down. Don't rely on your memory to help you out. Steps that are written down are much harder to skip or forget.
Go one step further and create a script that runs the changes: copying and pasting commands or clicking around a UI is prone to human error, especially when there are more than a few steps involved. Automating the process removes that source of error, as the steps are always run in the same way, in the same order.
Writing it all down has another advantage: the process is kept for future use. You never know when you might need to repeat it, and having it written down will save you time and hassle. No need to reinvent what you have already done.
We use bash scripts and rake tasks for smaller changes, with Ansible and/or Terraform being used to manage larger changes.
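As a flavour of what that looks like, here is a minimal bash sketch (not our actual process; the backup command, rake task and service name are placeholders for whatever your change involves):

```bash
#!/usr/bin/env bash
# Illustrative change script: every step is written down, runs in the same
# order every time, and the script stops at the first failing command.
set -euo pipefail

echo "Step 1: back up the database"
pg_dump "$DATABASE_URL" > "backup-$(date +%Y%m%d%H%M%S).sql"

echo "Step 2: apply the schema change"
bundle exec rake db:migrate

echo "Step 3: restart the application"
sudo systemctl restart our-app.service   # placeholder service name

echo "All steps completed"
```

Because the script stops at the first failure, a broken step can never be silently skipped.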
Split the change into stages
If your changes are applied in multiple steps, break them up. Verify that each step has been successful before continuing. You do not want to get to the end of the change process before realising that an earlier step had not actually done what it was meant to do.
This can take the form of performing counts, checking that a web request returns an expected response, or checking that a service has started (and has stayed running!).
Log every step, even if it is something simple. Log output from API calls and results from database queries. The more information you have, the better prepared you are to diagnose any problems that occur along the way.
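A rough sketch of those kinds of checks in bash, with a placeholder health-check URL, table name and service name:

```bash
#!/usr/bin/env bash
# Illustrative stage checks: verify each step worked, and log everything.
set -euo pipefail
exec > >(tee -a change.log) 2>&1   # append every step and its output to a log

echo "$(date -u) checking the app answers with HTTP 200"
status=$(curl -s -o /dev/null -w "%{http_code}" https://example.com/health)
[ "$status" = "200" ] || { echo "health check failed with $status"; exit 1; }

echo "$(date -u) checking row counts after the data copy"
count=$(psql "$DATABASE_URL" -tAc "SELECT count(*) FROM orders")
echo "orders row count: $count"

echo "$(date -u) checking the service is (still) running"
systemctl is-active --quiet our-app.service || { echo "our-app is not running"; exit 1; }
```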
Give yourself a way out
What happens when you are halfway through a change process and realise it cannot be completed? This could be due to a step not working properly, the process taking too long for a maintenance window, or something totally unexpected happening.
Being able to reverse the changes means you can get your environment back to where you started (read: undo the damage).
With some changes this is not easy to do: database migrations where large datasets are being manipulated, for instance. Here it is important to think ahead about what happens when something goes wrong and how it can be fixed.
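One pattern that helps, sketched below with placeholder commands for a Rails-style app, is to write the rollback path into the change script before the change is ever applied, and to take a backup before anything destructive runs:

```bash
#!/usr/bin/env bash
# Illustrative apply/rollback pair: the way out exists before the change runs.
set -euo pipefail

apply() {
  echo "backing up before touching anything"
  pg_dump "$DATABASE_URL" > pre-change.sql
  echo "applying the change"
  bundle exec rake db:migrate
}

rollback() {
  echo "rolling back the schema change"
  bundle exec rake db:rollback
  # For destructive data changes the backup is the real way out:
  #   psql "$DATABASE_URL" < pre-change.sql
}

case "${1:-}" in
  apply)    apply ;;
  rollback) rollback ;;
  *)        echo "usage: $0 apply|rollback" >&2; exit 1 ;;
esac
```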
Practice makes perfect
You're going to apply the change straight to your production environment. That's a brave move… isn't that kinda like playing Russian roulette?
Test the change first. Have a duplicate environment (cloud services make this easy) set up like your production environment. Then test your change.
Did it work? No? Good thing this was a test environment and no users were affected, right?
Fix whatever went wrong and try again. Rinse and repeat until it all goes smoothly.
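As a rough illustration (the database URLs and script names are placeholders), the rehearsal can be as simple as restoring a recent production snapshot into a disposable staging database and running the exact scripts that will later run against production:

```bash
#!/usr/bin/env bash
# Illustrative rehearsal: copy production data into a throwaway staging
# database, then run the same change and verification scripts against it.
set -euo pipefail

pg_dump "$PRODUCTION_DATABASE_URL" > prod-snapshot.sql
psql "$STAGING_DATABASE_URL" < prod-snapshot.sql

./apply-change.sh    # the very same script that will run in production
./verify-change.sh   # and the very same checks
```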
Run in parallel / Phased changeover
Flicking a switch from one service to another might work but it’s less risky to have a phased changeover (aka soft launch). If it is possible, run both old and new services side-by-side. Then you can switch between them to test that the new service / component works.
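As a minimal sketch of what side-by-side can look like, assuming hypothetical old.example.com and new.example.com endpoints, you can send the same requests to both services and compare the answers before any real traffic is switched:

```bash
#!/usr/bin/env bash
# Illustrative side-by-side comparison of an old and a new service.
set -euo pipefail

paths=("/api/cars" "/api/dealers" "/api/quotes")   # placeholder endpoints

for path in "${paths[@]}"; do
  old=$(curl -s "https://old.example.com${path}")
  new=$(curl -s "https://new.example.com${path}")
  if [ "$old" = "$new" ]; then
    echo "OK   ${path}"
  else
    echo "DIFF ${path} (responses do not match)"
  fi
done
```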
Be retrospective
After the change has happened (successfully, we hope!) it is important to reflect on the process.
What worked? What did not work? How can we improve the process for next time?
There is no point making the same mistakes again. Make sure you share your experience with your team members and others.
Interested in making an impact? Join the carwow team!
Feeling social? Connect with us on Twitter and LinkedIn :-)