Rolling back after an Amazon RDS blue/green deployment
Major production database upgrades for high-traffic systems are hard. They always have been, and they always will be. They often require a significant amount of planning and preparation, and potentially some scheduled downtime for the actual switchover. Thankfully, AWS announced blue/green deployments for RDS updates in November 2022, and it’s fair to say they have been a game changer. The feature takes care of a lot of the heavy lifting for you and, if you plan carefully, the switchover can happen with zero disruption to your end-user experience.
Minimising risk
For a recent database upgrade, my colleagues and I looked into how we could proceed if an Amazon RDS blue/green deployment was successful but we still wanted to switch back to the blue primary and its replicas. After all, it’s better to be safe than sorry: if you don’t actually test your rollback or disaster recovery plans, you don’t have any.
After a successful blue/green deployment switchover, the blue primary will be in read-only mode and, according to the documentation, a simple reboot of the instance will allow writes again. After this, it should be a matter of updating the instance endpoints to point the traffic back at the blue primary and its replicas.
But it turned out not to be that straightforward. In our test, the blue primary instance still wasn’t writeable after we rebooted it.
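One way to check writeability without attempting a real write is to ask the server for its read_only value. The following is a minimal sketch, assuming a MySQL-compatible engine and any DB-API 2.0 cursor that returns rows as tuples (for example from PyMySQL; the driver choice and the function name is_writeable are assumptions, not part of our original tooling).

```python
def is_writeable(cursor) -> bool:
    """Return True when the server reports read_only = 0.

    `cursor` is any DB-API 2.0 cursor connected to the instance under test.
    The helper name is hypothetical and only used for illustration.
    """
    cursor.execute("SELECT @@global.read_only")
    row = cursor.fetchone()
    # MySQL-compatible servers return 0 (writeable) or 1 (read-only).
    return int(row[0]) == 0


# Usage (assumes PyMySQL is installed; host and credentials are placeholders):
#   import pymysql
#   conn = pymysql.connect(host="...", user="...", password="...")
#   with conn.cursor() as cur:
#       print(is_writeable(cur))
```

Running a check like this before and after the reboot makes it obvious whether the instance has actually left read-only mode.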
The way forward
This prompted us to have a look at the parameter group attached to the blue primary. The read_only parameter was set to {TrueIfReplica}, which does exactly what you’d expect it to do. For primary instances, it means that read_only evaluates to 0, and for replicas it evaluates to 1. Or so we thought. Despite the fact that the blue primary is seen as a primary instance, somehow, after the completed blue/green deployment switchover, it is still seen as a replica in (presumably) the RDS internals.
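The parameter group inspection can also be scripted. The sketch below assumes a boto3 RDS client (boto3.client("rds")) is passed in; the function name read_only_setting and the instance identifier in the usage comment are hypothetical. It returns the configured value of read_only, such as {TrueIfReplica}, not the value the server ends up evaluating it to.

```python
def read_only_setting(rds, instance_id: str):
    """Return (parameter_group_name, configured read_only value).

    `rds` is expected to behave like a boto3 RDS client. The second
    element is None when the group does not list a read_only parameter.
    """
    response = rds.describe_db_instances(DBInstanceIdentifier=instance_id)
    instance = response["DBInstances"][0]
    # Assumes the first listed parameter group is the one of interest.
    group_name = instance["DBParameterGroups"][0]["DBParameterGroupName"]

    # Parameter groups hold many parameters, so the results are paginated.
    paginator = rds.get_paginator("describe_db_parameters")
    for page in paginator.paginate(DBParameterGroupName=group_name):
        for param in page["Parameters"]:
            if param["ParameterName"] == "read_only":
                return group_name, param.get("ParameterValue")
    return group_name, None


# Usage (assumes boto3 is installed and AWS credentials are configured):
#   import boto3
#   print(read_only_setting(boto3.client("rds"), "my-blue-primary"))
```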
We found that it is currently necessary to explicitly set the read_only parameter to 0 in the parameter group attached to the blue primary, to take it out of the read-only mode.
After updating the parameter, and a quick reboot of the blue primary, we confirmed the desired behaviour.
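The parameter change and reboot could be scripted roughly as follows. This is a minimal sketch rather than the exact procedure we ran: it assumes a boto3 RDS client and a custom (non-default) parameter group, and the function name force_writeable and the identifiers in the usage comment are hypothetical.

```python
def force_writeable(rds, parameter_group: str, instance_id: str) -> None:
    """Set read_only to 0 in the parameter group, then reboot the instance.

    `rds` is expected to behave like a boto3 RDS client.
    """
    rds.modify_db_parameter_group(
        DBParameterGroupName=parameter_group,
        Parameters=[
            {
                "ParameterName": "read_only",
                "ParameterValue": "0",
                # Applied at the next reboot, mirroring the manual steps.
                "ApplyMethod": "pending-reboot",
            }
        ],
    )
    # A real runbook may want to pause here until the instance reports the
    # parameter group change as pending before rebooting.
    rds.reboot_db_instance(DBInstanceIdentifier=instance_id)
    # Block until RDS reports the instance as available again.
    rds.get_waiter("db_instance_available").wait(DBInstanceIdentifier=instance_id)


# Usage (assumes boto3 is installed and AWS credentials are configured):
#   import boto3
#   force_writeable(boto3.client("rds"), "my-blue-params", "my-blue-primary")
```

If the blue replicas share the same parameter group, the change would presumably apply to them as well, so a dedicated group for the blue primary may be the safer choice.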
Best practices
It is better to be prepared than to be ready. Whilst we (and you) will hopefully never have to roll back after a successful blue/green deployment, it really pays off to have a tested plan in place for that eventuality. Knowing exactly what to do in moments that can be incredibly stressful, especially when there’s an impact on the end-user experience and/or your bottom line, is imperative. But knowing that those rollback plans actually work, and have been tested, is invaluable.
Acknowledgements
When I was researching our specific edge case, I stumbled across a great post by Matthew Gleeson. Whilst it didn’t provide the answer I was looking for, it was very helpful in narrowing down the culprit in our scenario.
Last but certainly not least, the people that worked on RDS blue/green deployments to facilitate major database engine upgrades on AWS have delivered an outstanding feature. It makes a relative breeze of something that used to be a daunting and often disruptive task, which is an impressive achievement.