Our Journey to PostgreSQL 12

Tommy Li
Coffee Meets Bagel Engineering
Jan 25, 2021 · 6 min read

In November 2020 we began a large migration to upgrade our PostgreSQL cluster from version 9.6 to 12.4. In this post I’ll give an overview of our architecture at Coffee Meets Bagel, walk through the steps we took to perform this upgrade with less than 30 minutes of cumulative downtime, and share some lessons learned along the way.

Architecture

If you are unfamiliar with Coffee Meets Bagel, we are a curated dating app. Every day at their local noon, our daters receive a limited batch of high-quality candidates. This leads to very predictable load patterns. Looking back over the week prior to writing this article, we hovered around 30,000 transactions per second on average, maxing out at ~65k TPS in our larger markets.

Prior to the upgrade we ran 6 Postgres servers on i3.8xlarge instances in AWS. The fleet consisted of one primary node, three replicas serving read-only web traffic (load balanced by HAProxy), one server dedicated to async workers, and one server for ETLs and BI.

We use Postgres’ built-in streaming replication to keep the fleet of replicas up to date.
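A quick way to confirm that every replica is connected and streaming is to check pg_stat_replication on the primary. A minimal sketch:

```sql
-- Run on the primary: one row per connected streaming replica.
SELECT client_addr, state, sync_state
FROM pg_stat_replication;
```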

Reasons For Upgrading

Over the last few years we’ve largely neglected our data tier, and as a result it has grown a bit long in the tooth. Our primary server in particular had accumulated quite a bit of cruft: it had been online for about 3.5 years straight, with various system libraries and services patched in place.

My offering to r/uptimeporn

As a result, it had a number of oddities that made us nervous: new services would refuse to run under systemd (we ended up running the Datadog agent in a screen session), and sometimes, when CPU usage climbed above 50%, the machine would become completely unresponsive to SSH (while continuing to serve queries).

Additionally, our disk usage was becoming precarious. As I mentioned earlier, we run Postgres on i3.8xlarge instances in EC2, which come with about 7.6TB of NVMe storage. Unlike EBS, this storage can’t be resized dynamically, so what you get is what you get, and we had already filled about 75% of the disk. To support future growth, we had to move to a larger instance size.
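If you want to see how much of that disk the database itself is responsible for, a query like the following works (a sketch; WAL and temp files also consume the same disk but aren’t included in this number):

```sql
-- Total on-disk size of the current database (heap, indexes, TOAST), pretty-printed.
SELECT pg_size_pretty(pg_database_size(current_database()));
```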

Our Requirements

  1. Minimal downtime — our goal was a hard limit of 4 hours of cumulative downtime, including unintended outages resulting from mistakes during the upgrade.
  2. Build out a new cluster of databases on new instances to replace our current fleet of aging servers.
  3. Go up to i3.16xlarge instances so we have headroom to grow.

We were aware of three methods for performing a Postgres major version upgrade: the classic backup-and-restore process, pg_upgrade, and logical replication.

We immediately ruled out the backup-and-restore method, as it would take far too long on our 5.7TB dataset. pg_upgrade, while fast, is an in-place upgrade and did not satisfy requirements 2 and 3. So we moved forward with logical replication.

The Process

There is a lot of existing literature about the nitty-gritty of installing and using pglogical, so rather than repeat that material I’ll just link a few articles that I found immensely helpful:

https://www.depesz.com/2016/11/08/major-version-upgrading-with-minimal-downtime/

https://info.crunchydata.com/blog/upgrading-postgresql-from-9.4-to-10.3-with-pglogical

http://thedumbtechguy.blogspot.com/2017/04/demystifying-pglogical-tutorial.html

We created one Postgres 12 server that would become our new primary and used pglogical to synchronize all our data to it. Once it had caught up and was replicating incoming changes, we began adding streaming replicas behind it. As we provisioned each new streaming replica, we added it to HAProxy and removed one of the old 9.6 replicas. We continued until the only 9.6 server left was the old primary: it still handled all writes and fed the new Postgres 12 primary via pglogical, which in turn fed its own fleet of streaming replicas.
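For reference, the pglogical wiring boils down to a handful of function calls. This is a minimal sketch, assuming pglogical 2.x, placeholder hostnames and credentials, and that all tables live in the public schema; see the articles above for the full procedure.

```sql
-- On the old 9.6 primary (the provider); dsn values are placeholders.
SELECT pglogical.create_node(
    node_name := 'pg96_provider',
    dsn       := 'host=old-primary.internal dbname=app user=replicator'
);
-- Add every table in the public schema to the default replication set.
SELECT pglogical.replication_set_add_all_tables('default', ARRAY['public']);

-- On the new Postgres 12 primary (the subscriber).
SELECT pglogical.create_node(
    node_name := 'pg12_subscriber',
    dsn       := 'host=new-primary.internal dbname=app user=replicator'
);
SELECT pglogical.create_subscription(
    subscription_name := 'upgrade_sub',
    provider_dsn      := 'host=old-primary.internal dbname=app user=replicator',
    synchronize_data  := true
);
```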

Once we were in this state, we scheduled a maintenance window to perform the failover. Again, this process is well documented across the internet, so I’ll just lay out the rough structure of what we did:

  1. Put the site into maintenance mode
  2. Switch the DNS record of the primary database to the new host’s IP
  3. Force a sync of all the primary key sequences (see the sketch after this list)
  4. Run a manual checkpoint on the old primary
  5. Perform some data validation and tests against the new primary
  6. Bring the site back up
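For steps 3 through 5, the SQL is short. A minimal sketch, assuming pglogical 2.x, that every sequence was previously added to a replication set, and a hypothetical users table for the spot check:

```sql
-- Step 3, on the old 9.6 primary: push current sequence values to the subscriber.
DO $$
DECLARE
    seq regclass;
BEGIN
    FOR seq IN SELECT c.oid::regclass FROM pg_class c WHERE c.relkind = 'S'
    LOOP
        PERFORM pglogical.synchronize_sequence(seq);
    END LOOP;
END
$$;

-- Step 4, on the old 9.6 primary: flush dirty buffers before cutting over.
CHECKPOINT;

-- Step 5, on the new primary: example spot check, compared against the old primary.
SELECT max(id) FROM users;
```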

All in all, this process went really well for us. Despite such a large shift in our infrastructure, we had no unintended downtime or errors of any kind.

Learnings Along the Way

While this project was largely successful, we did hit a few problems along the way, the scariest of which nearly killed our Postgres 9.6 primary…

Learning #1: Slow synchronization can be dangerous

First, some background on how pglogical works: a sender process on the provider database (in this case, our old 9.6 primary) decodes the write-ahead log, extracts the logical changes, and sends them off to the subscriber database.

If the subscriber is lagging behind, then the provider will start to accumulate WAL segments so that when the subscriber catches up it won’t be missing any data.

When you add a table to the replication stream, pglogical first needs to synchronize the table data. This is done using Postgres’s COPY command. WAL segments will then start to accumulate on the provider database so that the changes committed after the COPY started can be streamed over to the subscriber after the initial synchronization, ensuring no data loss.
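Concretely, kicking off that initial synchronization is a single function call on the provider. A sketch, with a hypothetical table name:

```sql
-- Add one large table to the 'default' replication set and request an initial
-- data copy. From this point the provider retains WAL until the subscriber
-- finishes the COPY and catches up.
SELECT pglogical.replication_set_add_table(
    set_name         := 'default',
    relation         := 'public.messages',
    synchronize_data := true
);
```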

In practice this means if you are synchronizing a large table in a write/update heavy system, you need to keep a close eye on your disk usage. Our first attempt at synchronizing our largest (4TB) table had the initial COPY statement running for over a day, during which we accumulated about one terabyte of WALs on the provider node.
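One way to keep an eye on that accumulation is to check how much WAL each replication slot is holding back. On a 9.6 provider that looks roughly like this (Postgres 10 and later rename these functions to pg_current_wal_lsn and pg_wal_lsn_diff):

```sql
-- On the provider: WAL retained on behalf of each replication slot.
SELECT slot_name,
       active,
       pg_size_pretty(
           pg_xlog_location_diff(pg_current_xlog_location(), restart_lsn)
       ) AS retained_wal
FROM pg_replication_slots;
```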

You may recall from earlier in the article that our old servers only had about two terabytes of free disk space left. Judging by the disk usage on the subscriber database, only about a quarter of the table had been copied over at that point, so we determined the synchronization would not complete before we ran out of disk space and promptly cancelled it.

Available disk space on our old primary database during the initial synchronization

We then made the following changes to the subscriber database in order to speed up the synchronization:

  • Dropped all the indexes on the table being synchronized
  • Set fsync to off
  • Set max_wal_size to 50GB
  • Set checkpoint_timeout to 1h

Those four changes significantly sped up the rate at which the subscriber database was able to ingest data, and our second attempt at synchronizing the table completed in 8 hours.
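In SQL terms, the changes looked roughly like the following. The index name is hypothetical, and fsync = off is only tolerable here because the subscriber held nothing we couldn’t re-copy from the provider:

```sql
-- On the Postgres 12 subscriber, before retrying the large table sync.
DROP INDEX IF EXISTS messages_created_at_idx;   -- repeat for each index on the table

ALTER SYSTEM SET fsync = off;                   -- crash-unsafe: bulk load only
ALTER SYSTEM SET max_wal_size = '50GB';
ALTER SYSTEM SET checkpoint_timeout = '1h';
SELECT pg_reload_conf();

-- After the initial COPY finishes: recreate the indexes and revert.
ALTER SYSTEM SET fsync = on;
SELECT pg_reload_conf();
```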

Learning #2: Every update is logged as a conflict

When pglogical determines that a conflict has occurred, it emits a log message like “CONFLICT: remote UPDATE on relation PUBLIC.foo. Resolution: apply_remote.”

However, we observed that every single update the subscriber processed was logged as a conflict. After only a few hours of replication, the subscriber database had written a gigabyte of conflict logs.

We silenced these messages by adding pglogical.conflict_log_level = DEBUG to our postgresql.conf.
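The same setting can also be applied without editing the file by hand, assuming superuser access and pglogical loaded via shared_preload_libraries (a sketch):

```sql
-- Demote pglogical's per-row conflict messages below the default
-- log_min_messages threshold so they stop flooding the log.
ALTER SYSTEM SET pglogical.conflict_log_level = 'debug';
SELECT pg_reload_conf();
```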

