Upgrading Database with Zero Downtime

Gabriel Koo
Published in Inside Bowtie
8 min read · Sep 23, 2022
Normally you cannot use a system that is under an upgrade. — Photo by Clint Patterson on Unsplash

Achievement in Just 90 Minutes

On the evening of 30 May 2022, Bowtie’s core insurance system database was upgraded from the PostgreSQL 10.x family to the PostgreSQL 13.x family. That is a jump of three entire major versions, a bit like going from Windows 8 to Windows 11. It was a bold move, but we are proud and confident to say that we achieved this big goal without taking down our entire core insurance system.

All read traffic remained unaffected. To be concrete, during this huge system upgrade, features like

  • making a quick quotation for our insurance products
  • logging into your Bowtie account
  • viewing the details of your insurance applications/claims/policies

all kept working as usual.

Our upgrade can be called “zero downtime” because we only disabled the functionality for making new changes. Unlike common industry practice, we did not shut down the entire site behind a cold maintenance notice and lock our customers out, because we know how unsettling it feels when you cannot log into a web platform.

We were very greedy about how we performed the upgrade: we wanted the impact on our customers to be minimal, while still completing the upgrade with no data loss or other unexpected incidents.

Some traditional financial institutions take their entire system offline for a few hours every week for maintenance. We instead finished the whole process within only 90 minutes, and we have performed major infrastructure upgrades like this at most once per year since we started our business in 2019.

P.S. For application changes, we deploy every weekday (except Friday!) afternoon during working hours with zero downtime, thanks to the way we automate our Continuous Integration and Continuous Deployment (CI/CD) processes and the way we break a single breaking change into a few non-breaking deployments (sketched below).
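To make that last idea concrete, here is a hypothetical sketch of the expand/contract pattern, splitting a breaking schema change (renaming a column) into non-breaking steps. The table, column names and connection string are made up for illustration and are not our actual migrations; in reality each phase would ship as its own deployment.

```python
# Hypothetical expand/contract migration: renaming policies.holder -> policies.policy_holder
# without breaking the running application. Each phase would ship as a separate deployment.
import psycopg2

conn = psycopg2.connect("dbname=example")  # placeholder connection string
cur = conn.cursor()

# Phase 1 ("expand"): add the new column; the old code keeps writing the old one.
cur.execute("ALTER TABLE policies ADD COLUMN IF NOT EXISTS policy_holder TEXT")

# Phase 2: the application now writes to both columns and reads the new one,
# while a one-off backfill copies over historical rows.
cur.execute("UPDATE policies SET policy_holder = holder WHERE policy_holder IS NULL")

# Phase 3 ("contract"): once nothing reads the old column any more, drop it.
cur.execute("ALTER TABLE policies DROP COLUMN IF EXISTS holder")

conn.commit()
cur.close()
conn.close()
```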

Disaster Recovery Isn’t Just for BCP

For an online business ☁️ like us, data integrity and system uptime are of utmost importance, which is why we have set up a disaster recovery (DR) site in a standby AWS Region. By “disaster recovery” we mean an entire backup system (including our websites and backend APIs) that can serve production traffic in case of a severe infrastructure-level outage, such as a submarine cable disruption like this.

Data is replicated automatically using Amazon Web Services (AWS) features such as cross-region replication (CRR) for our database on RDS and for S3, with replication latency typically under a few seconds. If a real failover is ever needed, data loss is kept to a minimum thanks to the AWS infrastructure.
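To give a feel for how such replication can be set up, here is a minimal sketch of creating a cross-region RDS read replica with boto3. The regions, account ID and instance identifiers below are placeholders rather than our real configuration; the S3 side would use bucket replication rules in a similar spirit.

```python
# Minimal sketch: create a cross-region read replica for an RDS instance with boto3.
# Regions, account ID and identifiers are placeholders for illustration only.
import boto3

# The replica is created in the DR region, pointing at the primary instance's ARN.
rds_dr = boto3.client("rds", region_name="ap-northeast-1")  # hypothetical DR region

rds_dr.create_db_instance_read_replica(
    DBInstanceIdentifier="core-db-dr-replica",
    SourceDBInstanceIdentifier="arn:aws:rds:ap-east-1:123456789012:db:core-db-primary",
    DBInstanceClass="db.r5.large",
    PubliclyAccessible=False,
)
```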

It is already 2022, and you would not expect submarine cable outages to be as frequent as they were 14 years ago, but there is still a point in maintaining such DR infrastructure.

While most online businesses that set up a DR site keep it purely for their business continuity plan (BCP), we have tweaked ours so that it can run in read-only mode, with its data lagging only a few seconds behind our primary system thanks to AWS’s CRR features.

A DR site in read-only mode serves at least two purposes for us:

  • Help absorb spikes in traffic
    — e.g. during the 5th COVID wave in Hong Kong, when our marketing website’s traffic hit an all-time high, and so did the load on our system.
  • Serve production traffic temporarily while we conduct breaking changes, e.g. a major change to our insurance system.

Here is a simplified version of our architecture diagram, illustrating how we perform a failover to the DR site:

Traffic would be shifted to the DR site when needed.

Preparing all the Ingredients

You may have already guessed what we did: promote our read-only disaster recovery site to serve production traffic, perform the database upgrade, then switch back to the primary. Isn’t that easy?

We can tell you the answer is: YES and NO.

Yes: the rationale is as simple as described above, compared with upgrading in place and having customers complain that our system is not working.

No: we had to prepare much more to make sure the database upgrade would run smoothly and to guarantee that there would be no data loss.

First, we put announcements on our websites in advance, to communicate clearly with customers and manage their expectations. This is what traditional financial institutions would do too.

https://cdn-images-1.medium.com/max/1600/1*vMdzCunGERAtOxqJosupLQ.png
A screenshot of Bowtie’s customer portal during those 90 minutes.

Second, our DR site runs the same codebase as our primary one. In fact, every deployment ships the latest features to both the primary and the DR site. This means our system, or more specifically our frontend customer portal, expects the backend API to always be available and accepting both read and write requests; we did not hand-craft every part of the customer portal to cope with a backend running in read-only mode.

In case customers missed our pinned announcements, and to make sure they understood that a page might be temporarily unusable during the short upgrade, we configured our DR site to return a special response whenever a browser attempted a “write” call. Our customer website captures this special response and renders the following error page:

https://cdn-images-1.medium.com/max/1600/1*RFe38drgRRrbeMxclHoLiA.png
Note that we have half of our system online, even during an upgrade.

Again, note that all other read actions, such as viewing details on any other page in our customer portal, were unaffected, as if the primary system were still serving production traffic.
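We will not go into our exact implementation here, but as a rough sketch, such a guard can be a small piece of backend middleware that short-circuits write requests while a read-only flag is switched on. The example below assumes a Django-style middleware with a hypothetical READ_ONLY_MODE setting; the status code and payload are illustrative, not our actual contract with the frontend.

```python
# Hypothetical read-only guard, sketched as a Django-style middleware.
# READ_ONLY_MODE and the response payload are illustrative assumptions.
from django.conf import settings
from django.http import JsonResponse

SAFE_METHODS = {"GET", "HEAD", "OPTIONS"}

class ReadOnlyModeMiddleware:
    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        if getattr(settings, "READ_ONLY_MODE", False) and request.method not in SAFE_METHODS:
            # Return a special response that the frontend recognises and turns
            # into a friendly "upgrade in progress" page instead of a hard error.
            return JsonResponse({"code": "read_only_maintenance"}, status=503)
        return self.get_response(request)
```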

The Recipe

Now that communications are properly handled, let’s talk about the upgrade itself. For us, the playbook is the most important part of the entire upgrade. With it in place, we can make sure every step runs in the correct order, and we can also rehearse it beforehand in other environments such as UAT.

Here are the most critical steps we have documented:

Initial failover:

  • Put up announcements beforehand
  • Fail over our backend to the DR site for read-only traffic.
    — The failover could be done at various levels, such as DNS, the load balancer, or other approaches that pair well with blue-green deployment, but we chose the safest and easiest way.
    — We decided to fail over via the internal DNS records of our primary/DR load balancers. These DNS records are only cached by Amazon CloudFront edge workers, so changes are reflected in less than 5 minutes, and we did not have to worry about clients caching the old hostname of our API.
    — We did so with the help of Route 53 weighted routing. It is handy: you conduct the failover by simply adjusting the weight of each DNS record (see the sketch after this list), with no need to redeploy our web applications to point at a new endpoint.
Conducting failover with an internal DNS is a good choice.
  • Monitor the traffic on both the primary and the DR site.
    — Traffic to the primary site should drop towards zero while the DR site’s metrics catch up.
    — We waited until primary-site traffic had stayed at zero for at least 5 minutes (just to be sure).
Failover traffic to a disaster recovery site and monitor the shift in traffic
Failing over all traffic to our read-only disaster recovery system
  • Validate that our primary database has zero connections.
    — Again, just to be sure; a quick way to check this is sketched below.
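Here is roughly what adjusting those weights looks like with boto3. The hosted zone ID, record name and load balancer hostnames are placeholders, and in practice the same change can be made from the Route 53 console.

```python
# Sketch: shift the internal API record's weight between the primary and DR load balancers.
# Hosted zone ID, record name and ALB hostnames below are placeholders.
import boto3

route53 = boto3.client("route53")

def set_weights(primary_weight: int, dr_weight: int) -> None:
    route53.change_resource_record_sets(
        HostedZoneId="Z0000000000000",  # hypothetical private hosted zone
        ChangeBatch={
            "Comment": "Database upgrade failover",
            "Changes": [
                {
                    "Action": "UPSERT",
                    "ResourceRecordSet": {
                        "Name": "api.internal.example.com.",
                        "Type": "CNAME",
                        "SetIdentifier": "primary",
                        "Weight": primary_weight,
                        "TTL": 60,
                        "ResourceRecords": [{"Value": "primary-alb.ap-east-1.elb.amazonaws.com"}],
                    },
                },
                {
                    "Action": "UPSERT",
                    "ResourceRecordSet": {
                        "Name": "api.internal.example.com.",
                        "Type": "CNAME",
                        "SetIdentifier": "dr",
                        "Weight": dr_weight,
                        "TTL": 60,
                        "ResourceRecords": [{"Value": "dr-alb.ap-northeast-1.elb.amazonaws.com"}],
                    },
                },
            ],
        },
    )

set_weights(primary_weight=0, dr_weight=100)    # fail over to the DR site
# set_weights(primary_weight=100, dr_weight=0)  # ...and later, fail back
```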

After these steps, our system was in full read-only mode with no new data changes, and it was ready for the upgrade. Even in the worst case of a failed upgrade, we could still recover the data from our DR site with zero loss.
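A quick way to confirm that nothing is still talking to the primary is a query against pg_stat_activity, for example (connection details are placeholders):

```python
# Sketch: confirm the primary database has no client connections left,
# other than this check itself. The connection string is a placeholder.
import psycopg2

conn = psycopg2.connect("host=primary-db.internal dbname=example user=readonly")
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT count(*)
        FROM pg_stat_activity
        WHERE datname = current_database()
          AND pid <> pg_backend_pid()
        """
    )
    remaining = cur.fetchone()[0]

print(f"Connections still open: {remaining}")  # expect 0 before proceeding
conn.close()
```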

The database upgrade itself:

  • Before making any changes, make a backup first.
  • Perform the actual upgrade (the backup and upgrade calls are sketched after this list).
    — This step took us 45 minutes, longer than one might typically expect (<10 minutes), because AWS RDS performs a major version upgrade on your Multi-AZ instances and same-region replicas together, which is why it took about three times the usual duration.
    — When we ran our upgrade tests, we replicated the production setup as closely as possible, so this behaviour did not come as a shock.
  • Wait for the upgrade to finish and monitor the upgrade logs.
    — If any errors had occurred, we would have aborted the mission.
    — This worst-case scenario did not happen, because we had performed many dry runs with various setups beforehand.
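For the curious, the backup and upgrade steps map to a couple of RDS API calls. The sketch below uses boto3 with placeholder identifiers and a placeholder target version, and waits for the instance to come back; the same can be done from the AWS console or CLI.

```python
# Sketch: snapshot the instance, then trigger a major version upgrade on RDS.
# Instance identifier, region and target engine version are placeholders.
import boto3

rds = boto3.client("rds", region_name="ap-east-1")
instance_id = "core-db-primary"  # hypothetical
snapshot_id = f"{instance_id}-pre-pg13-upgrade"

# 1. Take a manual snapshot before making any changes.
rds.create_db_snapshot(
    DBInstanceIdentifier=instance_id,
    DBSnapshotIdentifier=snapshot_id,
)
rds.get_waiter("db_snapshot_available").wait(DBSnapshotIdentifier=snapshot_id)

# 2. Kick off the major version upgrade.
rds.modify_db_instance(
    DBInstanceIdentifier=instance_id,
    EngineVersion="13.4",                # placeholder target version
    AllowMajorVersionUpgrade=True,
    ApplyImmediately=True,
)

# 3. Wait until the instance is back to "available", then review the logs and events.
rds.get_waiter("db_instance_available").wait(DBInstanceIdentifier=instance_id)
```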

Wrapping Up:

  • Fail back to the primary site.
    — Monitor the traffic until the DR site’s metrics stabilize back at zero and the primary site is serving 100% of the traffic again.
  • Take down the announcements
    — and communicate with our engineering team.
  • Monitor database metrics after the upgrade (see the sketch below).
    — A major upgrade can clear out your entire database cache, and it takes time for the database to rebuild it.
    — Expect CPU usage to spike, but it should not last long; it would not be surprising for it to climb and stay maxed out for a while before eventually settling back down.
Sample Database CPU Utilization after an upgrade
You might find that the CPU utilization graph looks like the famous illustration in The Little Prince.
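As an example of the kind of check we mean, here is a small boto3 sketch that pulls the RDS CPUUtilization metric from CloudWatch for the past hour; the instance identifier and region are placeholders.

```python
# Sketch: fetch average RDS CPU utilization for the past hour from CloudWatch.
# The DB instance identifier and region are placeholders.
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="ap-east-1")
now = datetime.now(timezone.utc)

response = cloudwatch.get_metric_statistics(
    Namespace="AWS/RDS",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "core-db-primary"}],
    StartTime=now - timedelta(hours=1),
    EndTime=now,
    Period=300,                # 5-minute buckets
    Statistics=["Average"],
)

for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"].isoformat(), f"{point['Average']:.1f}%")
```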

These steps do not apply only to database upgrades; they can also be used when we perform bigger system changes, such as column-level modifications to our database tables.

Do not ask your engineers to work from midnight until Monday morning

That’s all! We do not ask our engineers to stay up all night working until Monday morning just for an upgrade.

Midnights are for sleeping instead of working.

If you have followed along, you will have noticed that we spent 90 minutes on the whole process, while the database upgrade itself took only half of that time. The extra time was well worth it to us, because it guaranteed that there was no data loss and that no customer data was left in an unrecoverable state. When we do software engineering at Bowtie, we move fast while still staying safe.

Lastly, I would like to thank my former colleague @sunnychan10 for setting up the initial version of the DR site, and our lead engineer Krystian for the subsequent enhancements, such as migrating from traditional VM instances to serverless containerized tasks.

If you are interested in making similar or even bigger achievements with us, do take a look at our openings!

https://career.bowtie.com.hk/departments/software-engineering

Gabriel Koo

Lead Engineer at Bowtie | DevSecOps | Cloud | Automation | Anime Enthusiast