RDS, Upgrading Postgres Versions

The challenging yet satisfying experience of dealing with AWS

Recently, our team started working on a new feature to support Venues Preferences for TheKnot members. Since we store Membership data in Postgres, we wanted some flexibility in how the preferences data could be configured.

The goal was to avoid further migrations if more Preferences settings were added later. We also prefer to avoid columns where most of the rows are null. With those goals in mind, storing the Preferences data as JSON became the obvious choice.

Fortunately, Postgres introduced the JSONB datatype in v9.4 (HSTORE is not JSON). Unfortunately, our production database instance was still on v9.3, so an upgrade was required.
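To illustrate why JSONB fit the requirement, here is a minimal sketch of the kind of table we had in mind; the member_preferences table, the column names, the connection details, and the sample preference keys are all hypothetical, not our actual schema:

    import psycopg2
    from psycopg2.extras import Json

    # Hypothetical connection details, for illustration only.
    conn = psycopg2.connect(host="localhost", dbname="membership",
                            user="app", password="secret")
    cur = conn.cursor()

    # A single JSONB column can hold any combination of preference settings,
    # so adding a new setting does not require another schema migration.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS member_preferences (
            member_id  bigint PRIMARY KEY,
            prefs      jsonb NOT NULL DEFAULT '{}'::jsonb
        )
    """)

    # Json() adapts a Python dict to a Postgres JSON value.
    cur.execute(
        "INSERT INTO member_preferences (member_id, prefs) VALUES (%s, %s)",
        (42, Json({"venue_capacity": 150, "outdoor": True})),
    )
    conn.commit()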

I would like to share the steps we followed to upgrade a Postgres database in RDS. The process does not guarantee zero downtime, meaning an outage is unavoidable, so we needed to coordinate upfront with each affected team to ensure minimal impact.

The metrics and conclusions mentioned here are based on a full exercise we ran first on a cloned version of the production database and then on the real production environment.
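If you want to run the same kind of dry run, a clone can be restored from any existing snapshot with boto3. A minimal sketch, where the region, identifiers, and instance class are placeholders:

    import boto3

    rds = boto3.client("rds", region_name="us-east-1")

    # Restore a throwaway copy of production from an existing snapshot.
    rds.restore_db_instance_from_db_snapshot(
        DBInstanceIdentifier="membership-upgrade-dryrun",
        DBSnapshotIdentifier="membership-prod-pre-upgrade",
        DBInstanceClass="db.m4.large",
    )

    # Block until the clone is reachable.
    rds.get_waiter("db_instance_available").wait(
        DBInstanceIdentifier="membership-upgrade-dryrun"
    )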

1. Take a manual snapshot

When an upgrade is performed on a Multi-AZ configuration, backups are taken on each instance of the Multi-AZ setup. These backups are incremental, meaning that a new backup includes only the delta between the current data and the latest backup. Taking this first manual backup minimizes that delta and, by extension, the time the upgrade-time backups need to finish.
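Taking the snapshot is a single API call. Here is a sketch using boto3, where the instance and snapshot identifiers are placeholders:

    import boto3

    rds = boto3.client("rds", region_name="us-east-1")

    # Manual snapshot of the current primary.
    rds.create_db_snapshot(
        DBInstanceIdentifier="membership-prod",
        DBSnapshotIdentifier="membership-prod-pre-upgrade-1",
    )

    # Wait for the snapshot to become available before moving on.
    rds.get_waiter("db_snapshot_available").wait(
        DBSnapshotIdentifier="membership-prod-pre-upgrade-1"
    )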

2. Reboot with Failover

In a Multi-AZ setup, an automated snapshot of the primary instance won’t be created unless a failover occurs. By performing a reboot with failover, we force the standby instance in the Multi-AZ pair to take over as the new primary.
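A sketch of forcing the failover with boto3, with a placeholder instance identifier:

    import boto3

    rds = boto3.client("rds", region_name="us-east-1")

    # ForceFailover=True makes the standby take over as the new primary.
    rds.reboot_db_instance(
        DBInstanceIdentifier="membership-prod",
        ForceFailover=True,
    )

    # Wait until the instance reports "available" again.
    rds.get_waiter("db_instance_available").wait(
        DBInstanceIdentifier="membership-prod"
    )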

3. Take manual snapshot again

Once the failover happens, the next backup will be based on the new primary. These three pre-upgrade steps guarantee a recent backup of both instances in the Multi-AZ configuration. In addition, the pre- and post-upgrade backups will finish faster and the upgrade outage will be kept to a minimum.

4. Switch to Maintenance Mode

This is the best time to put everything that depends on the database into Maintenance Mode: start with the main API that talks to the database, then all the APIs, apps, and systems that communicate with that API, and so on.

5. Upgrade

Here is where the “magic” happens. The upgrade starts by taking automated snapshots of each instance in the Multi-AZ setup. Since we already created a manual backup of each of them (steps 1-3), the new automated backups take only a short time (~5 minutes).

In a production deployment, the whole upgrade process takes ~20-30 minutes, and the database will be unreachable, not accepting connections, for ~8 minutes.
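The upgrade itself can be triggered with one modify call. A sketch using boto3, where the instance identifier and the exact 9.4.x target version are placeholders:

    import boto3

    rds = boto3.client("rds", region_name="us-east-1")

    # Major version upgrades must be explicitly allowed, and ApplyImmediately
    # starts the upgrade now instead of waiting for the maintenance window.
    rds.modify_db_instance(
        DBInstanceIdentifier="membership-prod",
        EngineVersion="9.4.9",          # placeholder 9.4.x target
        AllowMajorVersionUpgrade=True,
        ApplyImmediately=True,
    )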

6. Reboot the master instance

When the upgrade is done, the database instance will be in a reboot-pending state. There is no need to wait for the DB service to reboot on its own; we can go ahead and force the reboot.

7. Vacuum the database tables

When we run a query, the database engine uses the query optimizer to parse the query, consult the table statistics, and build an execution plan. That plan guides the engine to access, join, and process the data efficiently, making the best use of the available indexes.

Unfortunately, pg_upgrade does not migrate optimizer statistics between versions. Leaving your tables with stale statistics will lead the optimizer to build poor execution plans, causing queries to degrade and underperform.

The solution is to run VACUUM ANALYZE on all the tables after the upgrade. This command updates the statistics the planner uses to determine the most efficient way to execute a query.
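Here is a minimal sketch of running it from Python with psycopg2 (the connection details are placeholders). VACUUM cannot run inside a transaction block, so autocommit has to be enabled first:

    import psycopg2

    # Placeholder connection details.
    conn = psycopg2.connect(host="membership-prod.example.rds.amazonaws.com",
                            dbname="membership", user="admin", password="secret")

    # VACUUM cannot run inside a transaction block.
    conn.autocommit = True

    with conn.cursor() as cur:
        # With no table name, VACUUM ANALYZE processes every table the
        # connected role can vacuum, refreshing the planner statistics.
        cur.execute("VACUUM ANALYZE")

    conn.close()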

8. Terminate Maintenance Mode

Once the database is ready to be used again, all the systems can be brought back online.

9. Recreate the read-replica

At some point during the upgrade process, the read-replica will fall behind on replication. In addition, the replica’s version will be incompatible with the new master’s, so replication will break. The corresponding error message will show up in the replica instance logs.

For this reason, the read-replica should be deleted and re-created.
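A sketch of dropping and recreating the replica with boto3; the identifiers and the instance class are placeholders:

    import boto3

    rds = boto3.client("rds", region_name="us-east-1")

    # Drop the broken replica; it holds no primary data, so skip the final snapshot.
    rds.delete_db_instance(
        DBInstanceIdentifier="membership-prod-replica",
        SkipFinalSnapshot=True,
    )
    rds.get_waiter("db_instance_deleted").wait(
        DBInstanceIdentifier="membership-prod-replica"
    )

    # Recreate it from the freshly upgraded master.
    rds.create_db_instance_read_replica(
        DBInstanceIdentifier="membership-prod-replica",
        SourceDBInstanceIdentifier="membership-prod",
        DBInstanceClass="db.m4.large",
    )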

Done!

If everything went well, after ~2-3 hours of full suspense, you will get your life back.

Conclusions

Although we received decent support from AWS while planning the upgrade, the lack of a zero-downtime guarantee will make more than one team think twice before moving to a newer version. Running VACUUM on every table took nearly 50% of the total time (the exact share depends on the size of your data), and this step can’t be skipped. In our case, almost two and a half hours passed before we could turn Maintenance Mode off, but this time we can celebrate that everything went according to plan.

About the Author

Vlad is a Lead Software Engineer in the Core Services squad at XOGroup. 
He is an agile practitioner and analytical thinker who enjoys his days architecting, implementing and testing distributed applications in AWS.
He’s also an amateur astronomer hoping for good weather and less light pollution on the weekends so he can track some planets and nebulas at night.
