How we managed Postgres HA with Logical Replication using Patroni
Streaming replication, a standard feature of PostgreSQL, allows the updated information on the primary server to be transferred to the standby server in real-time, so that the databases of the primary & standby servers can be kept in sync.
Let’s relook at a couple of concepts that will help us understand the problem.
What is Logical Decoding?
Logical decoding is the process of extracting all persistent changes to a database’s tables into a coherent, easy-to-understand format that can be interpreted without detailed knowledge of the database’s internal state. In other words, Logical decoding uses an output plugin to convert Postgres’s write-ahead log (WAL) into a readable format.
Logical decoding in a way, allows you to stream data changes to external consumers in the form of event streaming or CDC events.
What is Replication slot?
Per definition, replication slot represents a stream of changes that can be replayed to a client in the order they were made on the primary server. It ensures that the primary server will retain the WAL logs that are needed by the standby even when they are disconnected from the primary server.
In other words, if the standby goes down, the primary server keeps track of how much the standby lags and preserves the WAL files it requires until the standby reconnects. The WAL files are then decoded and played back on the standby server.
Now let’s look at the problem statement:
Postgres maintains the replication slot on the Primary server. If there is a switchover/failover to promote one of the standby server, the new primary server won’t have any idea about the replication slot maintained by the previous primary node. Which breaks the logical replication from the downstream systems or if a new slot is created, it becomes unsafe to use.
What is the expected result?
- Automatically percolate the replication slot information from the primary server to all active standby servers.
- When the LSN numbers advance on the primary server, percolate these changes to all the active standby servers.
Let’s start the setup
For setting up Postgres HA using patroni take look at the following post
Configuration changes in Patroni
For the detailed petroni configuration refer to this git repo
Step 1- Change the wal_level property to “logical”
Step 2- Add the logical slot property
As soon as the step 2 is completed, patroni in the background copies the replication slot information from the primary server to all the active standby servers.
Step 3- Verification of logical slot and LSN number on all servers
select * from pg_replication_slots;
On Primary Server
On Standby servers
To check the outstanding sync status on all standby servers
SELECT redo_lsn, slot_name,restart_lsn, round((redo_lsn-restart_lsn) / 1024 / 1024 / 1024, 2) AS GB_behind FROM pg_control_checkpoint() c, pg_replication_slots s;
Summary
We see that replication slot information and LSN number in primary and standby servers are in sync. This means that any failover should not affect the downstream applications that have subscribed to the change data capture(CDC) events.