TLDR Troubleshooting Postgres Streaming Replication by Jonathan Katz

Be prepared — use WAL archiving, replication slots, and ensure you’re are constantly monitoring.

pavel trukhanov
some-tech-tldrs
2 min readAug 8, 2018

--

Highly available Postgres or balanced load across several nodes, or a special read-only db instance for analytical queries — all that can be achieved in Postgres world with Streaming Replication.

There are two (and a half) types of replication:

  • Asynchronous — replica servers allowed to lag from Primary server: a transaction is considered complete when it is committed on the Primary; replicas will replay transactions on their own time.
  • Synchronous— TX will be considered complete when it has been committed on all of the replicas.
  • Synchronous quorum— when you have N replicas, and you want M replicas (M≤N) to commit the tx to consider it complete.

Why a replica can become out-of-sync:

  • Due to being down or some network issue — replica has not communicated with the primary for a while.
  • Primary is writing changes faster than the replica can process them.
  • Configuration prevents WAL logs to either be shipped or processed.

PostgreSQL determines how many WAL files to keep depending on settings.

But Primary might delete a WAL segment file, while replica didn’t get it yet.

And you’re in trouble now.

Fixing An Out-of-Sync Replica: The Good & The Ugly

So you’re in the situation when Primary doesn’t have all the needed WAL.

The Good Scenario: Loading From Your Archive

If you prudently had WAL archiver configured — via archive_command setting — and it was working all the time and WAL archive contains all that Primary was missing, you’ll be able to restore replica.

The Ugly Scenario: Rebuilding Your Replica From a Base Backup

Basically you’re about to recreate a replica from scratch:

  • Shut down your replica.
  • Remove your PostgreSQL data directory on the replica, not on primary!
  • Perform the base backup — if db is large, this will take time.
  • Start your replica again.

Using Replication Slots to Prevent Replica Desynchronization

Since 9.4 there’s a special mechanism — replication slots — that will keep all WAL files that have not been delivered to a replica.

Beware:

While replication slots solve the “replication desynchronization” problem, they could create a new problem: filling up your primary disk.

If a replica goes down for an extended period of time, the amount of saved WAL files could grow to the point where you run out of disk space.

You should actively monitor how much of disk is consumed. You should also be monitoring replicas to ensure that they are up and connected to the primary.

TLDR: wal archiver and replication slots help a lot, but need monitoring!

This week sponsor okmeter.io — a complete Postgres monitoring. We’ll track WAL archiv, replication lags and replication slots and notify you in time.

--

--