Upgrading PostgreSQL At Scale

Preparation, process, principles: Everything that went into upgrading PostgreSQL.

Vishwesh Jainkuniya
Feb 2 · 6 min read

PostgreSQL is one of the most commonly used relational databases today. With over 30 years of active development, it has earned a strong reputation for reliability, feature robustness, and performance.

Thus, as development continues, we get more and more features, and to use them we have to upgrade. Sometimes it's a minor upgrade that can be performed on the fly; sometimes it's a major upgrade that requires attention and should be carried out carefully.

PostgreSQL 10 was released with declarative table partitioning, which simplifies many use cases. So, in order to take advantage of it, PostgreSQL should run on version ≥ 10. But our app was already running on an older version of PostgreSQL, with hundreds of millions of live tuples and serving hundreds of thousands of live requests. This blog talks about our experience upgrading core PostgreSQL databases.

The process

Here's what the setup looks like: the app, running either on VMs or Kubernetes, connects to the database. Alongside the live master (01) and its slave (02), we bring up two new replicas (03 and 04) that will become the upgraded master and slave.

The preparation

Depending on your database size and network bandwidth, it will take some time to create snapshots and for the replicas to catch up with the master. Here's one way to set up replication using pg_basebackup.
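A minimal sketch of what that can look like — the host, replication user, and data directory below are placeholders, and exact flags may vary with your PostgreSQL version:

# pg_basebackup -h <master-host> -U replicator -D /data/pg/9.6/main -X stream -R -P

// -X stream keeps streaming WAL while the backup runs
// -R writes the recovery settings so the instance comes up as a streaming replica
// -P reports progress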

⚠️ Make sure to tune the PostgreSQL configs on the new master/slave. Since they are not serving any traffic at this point, a restart won't cause any downtime.
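Which parameters to tune depends entirely on your hardware and workload; the postgresql.conf values below are purely illustrative placeholders:

shared_buffers = 8GB
effective_cache_size = 24GB
work_mem = 64MB
max_connections = 500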

Once the replicas are up, start the upgrade process.

👉 We will be scaling down all the application pods running on k8s (and VMs).
👉 Ensure there are no writes to the current master (01).
👉 As a verification step, compare the Log Sequence Number (LSN) across all 4 instances.
👉 Then promote the new master (03), run pg_upgrade on it, and start the upgraded PostgreSQL server on that instance (03), as sketched below.
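Here's a rough sketch of that step. The versions, data directories, and binary paths are placeholders for whatever your old and new installations use, and the new data directory must already be initialised with the new version's initdb:

// promote the replica (03) so it becomes a standalone master
# pg_ctl promote -D /data/pg/9.6/main

// stop it cleanly; pg_upgrade needs both the old and new clusters shut down
# pg_ctl stop -D /data/pg/9.6/main

// upgrade in place, hard-linking data files instead of copying them
# pg_upgrade --old-datadir /data/pg/9.6/main --new-datadir /data/pg/10/main \
    --old-bindir /usr/lib/postgresql/9.6/bin --new-bindir /usr/lib/postgresql/10/bin --link

// start the upgraded cluster
# pg_ctl start -D /data/pg/10/main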

The master is now ready with upgraded PostgreSQL, but the slave is still on the older version, and pg_upgrade doesn't work on a slave.

So how do we upgrade slave?

pg_upgrade provides an option (--link) to hard-link the old cluster's data files into the new cluster instead of copying them, which is why pg_upgrade is quick on the master. We then need to reproduce the same linking on the slave so that the newer cluster can run there as well. This is where rsync helps.

rsync is known for its extraordinary speed of copying data.

We run rsync on the data directories directly from the master (03) to the slave (04). Once the slave has the same hard links, we can start the new, upgraded cluster on it, and it will begin replicating from the upgraded master (03).
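A sketch of that step, run from the new master (03) while its old and new clusters and the slave's server are shut down; the paths and the slave's hostname are placeholders. --hard-links is what carries the links created by pg_upgrade --link over to the slave:

# rsync --archive --delete --hard-links --size-only --no-inc-recursive \
    /data/pg/9.6 /data/pg/10 slave-04:/data/pg

Because most files on the master are hard links between the old and new directories, rsync transfers very little actual data; it mostly recreates the link structure on the slave.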

As a verification step, check the Log Sequence Number (LSN) on the new master (03) and the slave (04), and it's good practice to verify the PostgreSQL logs as well before scaling up the application.
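On PostgreSQL ≥ 10 that comparison looks like this (the pre-10 equivalents are listed in the end notes):

// on the new master (03)
# SELECT pg_current_wal_lsn();

// on the slave (04); this should match, or quickly catch up with, the master's LSN
# SELECT pg_last_wal_receive_lsn();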

📝 Note: At any stage, if the LSNs don't match, don't worry, check again. Keep an eye on the PostgreSQL logs and look for any errors.

Voila! We have upgraded PostgreSQL 🎉

Here are five simple principles that are crucial while working at scale. These rules helped us carry out the whole process effortlessly, and the upgrade was achieved with minimal user impact and an overall downtime of 15 minutes.

Have a playbook

Have a playground

How should I make sure that my playbook is sufficient?
The playbook needs to be self-explanatory. It should contain all the information on when to do what, listing all the commands step by step. Most importantly, you should be able to hand this playbook over to anyone, and they should be able to perform the upgrade on their own without asking any questions.

Define concrete rollback strategies

There could also be a situation where your production environment configs differ from the playground. Be prepared for such scenarios and make note of them in your rollback section.

In our case, we were not actually touching the live master/slave at all. So in case of any issue, we could simply abort the process and scale the application back up.

Minimise user impact

Introduce a new component between the app and the database, which is responsible for serving requests while the database is being upgraded.

Depending on your use case and how critical the flow is, this can be a simple Redis instance that syncs back to the database once the upgraded database is available, or a queue/log like RabbitMQ/Kafka that accepts requests and queues them up, to be replayed once the database is back up. Both require setting up additional components, such as workers.

Prepare ingredients, scripting

End notes:

Leaving you with some useful PostgreSQL commands:

Promote to master

# pg_ctl promote -D /data/dir/path -t 60 

Check if the PostgreSQL instance is in slave mode

# SELECT pg_is_in_recovery(); 

Get current LSN on master

# SELECT pg_current_xlog_location(); // for PostgreSQL < 10
# SELECT pg_current_wal_lsn(); // for PostgreSQL >= 10

Get last received LSN on replica

# SELECT pg_last_xlog_receive_location(); // for PostgreSQL < 10
# SELECT pg_last_wal_receive_lsn(); // for PostgreSQL >= 10

Click here for more stories about how we build our Gojek #SuperApp.
Click here if you’d like to build it with us. 💚

