Failover with repmgr and automatic failover implementation

Sylvain
4 min readOct 16, 2018

--

This tutorial is part of a multipage tutorial. Read the introduction first if you haven’t done so.

The previous tutorials explained how to implement a backup with barman , how to recover your database and how to implement replication. Review those tutorials first if you haven’t done so.

Target architecture

We start with the architecture that was implemented in this tutorial. We add a server server-d, the server that is connected to either the primary or standby. For the sake of this example, we will assume an application that is connecting to either servers using an environment variable DATABASE_URL from server-d.

We start with a hypothetical app on server-d that connects to our database on server-a

DATABASE_URL=postgres://user@10.20.0.2:5432/appdb

Make sure the server-d to accept incoming connection from server-c and server-a.

Let’s start this tutorial with preparing our servers to support failover.

Standby promotion

Let’s shutdown server-a and see what happens.

# On server-a
sudo /etc/init.d/postgresql stop

Our app cannot connect to the database anymore, fortunately the server-c is ready and standing by. Let’s change DATABASE_URL to reflect this change.

DATABASE_URL=postgres://user@10.20.0.4:5432/appdb

Our app is now connected to the standby server. Unfortunately it cannot accept write operation just yet, we first need to promote it as primary node.

# On server-c
sudo -u postgres repmgr standby promote -f /etc/repmgr.conf --log-to-file

Now the server-c node should accept read and write operation and act as the primary node:

$ psql -U repmgr -c 'select node_id, upstream_node_id, active, node_name, type from repmgr.nodes' -d repmgrnode_id | upstream_node_id | active | node_name |  type   |          
--------+------------------+--------+-----------+---------+
1 | | f | server-a | primary |
2 | | t | server-c | primary |

It looks like we have two primaries but this is actually not the case since server-a is not active.

Now that we have a new primary node, it needs be backed up. Implement backup for server-c. Follow our previous tutorial.

Failed primary rejoin

At the moment, we are left with no hot standby and have only one active node in our architecture. After investigation on the cause of failure of server-a and if we are happy with server-c as primary, we can set our old primary as the new standby. For this we need to do what’s called a rejoin:

# On server-a
sudo -u postgres repmgr -f /etc/repmgr.conf node rejoin -d "postgres://repmgr@10.20.0.4:5432/repmgr" --force-rewind

As per documentation “synchronize a PostgreSQL data directory with another data directory that was forked from it” which is why we use the --force-rewind option.

We should now get the following:

$ psql -U repmgr -c 'select node_id, upstream_node_id, active, node_name, type from repmgr.nodes' -d repmgrnode_id | upstream_node_id | active | node_name |  type   |          
--------+------------------+--------+-----------+---------+
1 | 2 | t | server-a | standby |
2 | | t | server-c | primary |

You might also notice after this that the replication is not working. That’s because in the previous tutorial we set it up so that server-a accepts replication connection from server-c but not the other way around. So in server-c, add those lines to the pg_hba.conf file.

# Make sure repmgr, from server-c, can ask server-a repmgr for
# information and update its status
host repmgr repmgr 10.20.0.2/32 trust
# Make sure replication connection is accepted from server-c
host replication repmgr 10.20.0.2/32 trust
# Make sure repmgr can connect to the local repmgr database,
# which we will later reference using its network IP, so use
# this instead of the localhost IP.
host repmgr repmgr 10.20.0.4/32 trust

Switchover

Sometimes we just want to be able to switch servers over without having to go through the shutdown, promotion, rejoin process.

On the primary server, run:

# On server-c
/usr/bin/repmgr standby switchover -f /etc/repmgr.conf --log-to-file

We should be back in the situation we were at the beginning of this tutorial:

$ psql -U repmgr -c 'select node_id, upstream_node_id, active, node_name, type from repmgr.nodes' -d repmgrnode_id | upstream_node_id | active | node_name |  type   |          
--------+------------------+--------+-----------+---------+
1 | | t | server-a | primary |
2 | 1 | t | server-c | standby |

Don’t forget to change the DATABASE_URL to it’s original value.

Autofailover

We have seen how to switch over to an active node in case of failure of the primary. It would be nice to have all this done automatically when a failure happens, this is called autofailover and it can be implemented using repmgr.

Add the following lines to the /etc/repmgr.conf of server-a and server-c.

failover=automatic
promote_command='repmgr standby promote -f /etc/repmgr.conf --log-to-file'
follow_command='/usr/bin/repmgr standby follow --log-to-file --upstream-node-id=%n'
monitoring_history=yes
monitor_interval_secs=10
service_start_command='sudo pg_ctlcluster 10 main start'
service_stop_command='sudo pg_ctlcluster 10 main stop'
service_restart_command='sudo pg_ctlcluster 10 main restart'
service_reload_command='sudo pg_ctlcluster 10 main reload'

You can now try this out by simply stopping the database on server-a.

# On server-a
sudo /etc/init.d/postgresql stop

After a while you should get the promotion of server-c as primary and server-a marked as an inactive node.

This is the end of this 4 parts tutorial. I hope it helped you figure out how to setup a full replication and backup setup for your PostgreSQL database. Please comment if you have any questions or if there are important points missing.

--

--