How PostgreSQL replication works

A quick (almost deep) dive into replication and HA configurations

Adriano
Jan 25 · 8 min read

Table of Contents

  1. Intro
  2. Glossary
  3. Failover
  4. Configurations

1 - Intro

Have you ever heard about Eric Brewer’s CAP theorem? It says that it’s impossible for a distributed system to provide all three of Consistency, Availability and Partition tolerance at the same time. There is one more theorem, PACELC, which is a bit more complete than CAP, but I’d rather not spend too much time on theory.

Diagram by Hamzeh Khazaei from ResearchGate

PostgreSQL isn’t immune to this, and when it comes to Availability and Consistency there are different solutions, each with its own pros and cons.
Let’s see what we mean by these three properties:

  • We have Consistency when all nodes in the cluster see the same data at the same time.
  • We have Availability when every request receives a response, even when some nodes are down.
  • We have Partition tolerance when the cluster keeps working even if the network between nodes loses or delays messages.

The following is true for PostgreSQL 11, currently the latest available version on Amazon RDS.

2 - Glossary

First of all, let’s clarify some terms:

  • Master or Primary → A server that can modify data.
  • Standby or Replica → A server that tracks changes from the master and typically serves only read traffic.
  • Failover → The process of switching to a standby after an abnormal termination of the master server.

3 - Failover

As we previously said, failover is the process of switching to a standby after an abnormal termination of the master server. Many failover systems connect the master and the standby with some kind of heartbeat which tests the connection and the master’s liveness. However, PostgreSQL does not provide an official software component to identify a failure on the primary and notify the standby. These monitoring tools are often third-party plugins, distributed as open source projects or provided by SaaS vendors.

Typically, when a master database crashes, the most up-to-date replica is restarted as the new master.

What if the old master comes back up after a new primary has been promoted? There is a mechanism, known as STONITH (Shoot The Other Node In The Head), which prevents the situation where both systems think they are the primary.
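
To make this less abstract, here is a minimal sketch of a manual promotion on PostgreSQL 11; the data directory path and the trigger file name are made up for the example:

# On the standby, once the old master is confirmed down (or fenced):
pg_ctl promote -D /var/lib/postgresql/11/main

# Alternatively, if recovery.conf on the standby defines
#   trigger_file = '/tmp/postgresql.trigger'
# creating that file promotes the server:
touch /tmp/postgresql.trigger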

4 - Configurations

We can identify 4 main kinds of configurations:

  1. File or disk based
  2. Log shipping based or WAL based
  3. SQL based
  4. Commercial solutions

4.1 - File or disk based

These configurations are essentially physical replication, since it happens at a low, almost hardware, level. In this category we have two main configurations: Shared Disk and File System Replication (or Disk Mirroring).

  • Shared Disk uses a single disk array that is shared by multiple servers.

If the main database server fails, the standby server is able to mount and start the database as though it were recovering from a database crash. This allows rapid failover with no data loss.

  • File System Replication is a common technique shared by many RDBMSs. Disk Mirroring uses Distributed Replicated Block Device (DRBD), which is a distributed replicated storage system for Linux.
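
To give an idea of what this looks like, here is a rough sketch of a DRBD resource definition that mirrors a block device between two hosts; the host names, devices and addresses are made up, and protocol C (the synchronous mode) is the default:

resource r0 {
  on alice {
    device    /dev/drbd0;
    disk      /dev/sdb1;
    address   10.1.1.31:7789;
    meta-disk internal;
  }
  on bob {
    device    /dev/drbd0;
    disk      /dev/sdb1;
    address   10.1.1.32:7789;
    meta-disk internal;
  }
}

PostgreSQL would then keep its data directory on the file system built on top of /dev/drbd0, so every write is replicated to the peer at the block level.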

With these configurations, going back to the CAP theorem, we could say that we have Consistency and Availability when everything runs fine, but when things go wrong (a partition occurs) we keep only Consistency. Why don’t we keep Availability? Because there is a downtime, even if small, between the master failure and the promotion of the new primary.

4.2 - Log shipping based or WAL based

4.2.1 - WAL? What?
Before we start talking about log shipping configurations, we have to say that every transaction in PostgreSQL is written to a transaction log called the Write-Ahead Log (WAL).
Let’s imagine we have a service which starts a transaction and flushes every new statement to PostgreSQL: on the database side, each change is first appended as a record to the WAL before it reaches the data files.


So the WAL is made up of segment files of 16 MB by default, and each segment contains one or more records. The Log Sequence Number (LSN) is a pointer to a specific record in the WAL.

postgres@f0bc03174671:~/11/main/pg_wal$ ls -l --block-size=M
total 33M
postgres postgres 16M [...] 000000010000000000000001
postgres postgres 16M [...] 000000010000000000000002
postgres postgres 1M [...] archive_status
postgres@f0bc03174671:~/11/main/pg_wal$
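
You can also ask the server where it currently is inside the WAL; a couple of examples using the standard functions available since PostgreSQL 10:

-- On the master: current write position and the segment file it maps to
SELECT pg_current_wal_lsn();
SELECT pg_walfile_name(pg_current_wal_lsn());

-- On a standby: last WAL position received and last position replayed
SELECT pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn();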

4.2.2 - WAL, why?

A standby server uses WAL segments (XLOGs in PostgreSQL terminology) to continuously replicate changes from its master.
Write-ahead logging is used to guarantee atomicity and durability in a DBMS by serializing chunks of byte-array data (each one with its own LSN) to stable storage before they are applied to the database.

Applying a mutation to a database can result in many file system operations. How can a database then guarantee atomicity if, for example, the server stops working due to a power outage while it is in the middle of a file system update? When a database boots, it starts a replay (or startup) process which reads the available WAL segments and compares them with the LSN stored in each data page (each data page is marked with the LSN of the latest WAL record affecting that page).
As the PostgreSQL doc says:

During WAL replay, we can check the LSN of a page to detect whether the change recorded by the current log entry is already applied (it has been, if the page LSN is >= the log entry’s WAL location).
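
To get a feeling for what these records look like, PostgreSQL ships with pg_waldump, which prints WAL records (and their LSNs) in a human-readable form; the directory and segment name below are just the ones from the listing above:

pg_waldump -p ~/11/main/pg_wal 000000010000000000000001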


The WAL Sender is a process on the master server which is responsible for sending WAL segments to a receiver.
The WAL Receiver is a process on the standby server, responsible for writing the WAL segments to disk as soon as it receives updates from a WAL Sender.
The Startup process, on standby servers, loads transactions from the WAL into the database process, which then updates the database files.
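
On a running cluster you can watch these processes from SQL; a minimal sketch using the built-in statistics views of PostgreSQL 10/11:

-- On the master: one row per connected WAL Sender / standby
SELECT client_addr, state, sent_lsn, replay_lsn FROM pg_stat_replication;

-- On the standby: the status of the WAL Receiver
SELECT status, received_lsn, latest_end_lsn FROM pg_stat_wal_receiver;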

4.2.3 - What happens when a standby reboots after maintenance or a crash?
In short, the standby first replays the WAL segments it already has on disk, then reconnects to the master and streams the segments it missed until it catches up.


Now that we have an idea of what the WAL is, we can finally start talking about the Write-Ahead Log Shipping configurations.

4.2.4 - Log based configurations

  • In Synchronous Multi-master Replication each server can accept write requests, and modified data is transmitted from the original server to every other server before each transaction commits. It uses the 2PC (two-phase commit) protocol and follows the all-or-none rule.

One obvious thing is that with synchronous replication there is no replication lag, while with asynchronous replication a replication lag is possible and there could be data loss if the master fails. The last thing to say about this approach is that Write-Ahead Log shipping can only be done for the entire database server: it’s not possible to choose a subset of tables or schemas.
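
To make this concrete, here is a minimal sketch of a PostgreSQL 11 streaming replication setup; the host name, user and application_name are made up:

# postgresql.conf on the master
wal_level = replica
max_wal_senders = 5
# empty means asynchronous; naming a standby makes commits wait for it
synchronous_standby_names = 'standby1'

# recovery.conf on the standby (merged into postgresql.conf from version 12 on)
standby_mode = 'on'
primary_conninfo = 'host=master.example.com user=replicator application_name=standby1'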

There is one more configuration in this category, called Logical Replication. It’s based on the WAL too, but it uses a quite different model: instead of shipping raw WAL segments, it uses a pub/sub model which extracts, decodes and streams data changes from the WAL; these changes are then delivered through publications.
Logical Replication can also be combined with the WAL shipping methods (e.g. with synchronous replication to avoid data loss).
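
A rough sketch of what this looks like in SQL (PostgreSQL 10 onwards); the publication, subscription, connection string and table names are made up:

-- On the publisher (requires wal_level = logical)
CREATE PUBLICATION my_pub FOR TABLE orders, customers;

-- On the subscriber
CREATE SUBSCRIPTION my_sub
  CONNECTION 'host=publisher.example.com dbname=shop user=replicator'
  PUBLICATION my_pub;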


Analyzing the previous configurations from the CAP point of view, we can say that:

  • Synchronous approaches favour consistency over availability when a partition occurs.
  • Asynchronous approaches favour availability over consistency, at the cost of replication lag and possible data loss.

4.3 - SQL based

SQL based configurations basically work with a program which intercepts all SQL mutation queries and sends them to a standby server, which executes them independently.

If queries are broadcast unmodified, functions like random(), CURRENT_TIMESTAMP, and sequences can produce different values on different servers. This can be avoided by configuring the interceptor program to query those values from the source database and substitute them into the SQL statement before broadcasting it.
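
A tiny illustration of the problem; the table and sequence names are made up:

-- Broadcast as-is, this can store different values on each server:
INSERT INTO events (id, created_at)
VALUES (nextval('events_id_seq'), CURRENT_TIMESTAMP);

-- What the interceptor would broadcast after resolving the
-- non-deterministic parts on the source server:
INSERT INTO events (id, created_at)
VALUES (42, '2020-01-25 10:15:00+00');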

4.4 - Commercial solutions

Last but not least we have the commercial solutions, which are products built on top of the PostgreSQL open source project.

In the project I’m currently working on at Moveax, we chose Amazon Aurora DB clusters as the relational database backing our micro-services environment. Aurora has a single-master replication configuration, made up of:

  • A primary server which supports read and write operations.
  • Up to 15 Aurora Replicas which support only read operations and act as failover targets.

Furthermore, Aurora uses a distributed cluster volume spread across different Availability Zones. Physically the underlying storage is made up of SSDs, and the replication lag is often below 100 milliseconds.

Image from the AWS documentation

To conclude our brief tour of Postgres replication, we have to say that there are other replication implementations, which may be described in another post.

If you think there is something wrong in this article or you want to suggest improvements, let's get in touch!

… and thanks for reading!


Thanks to Michelle, Simone Bronzini, and Furio Dipoppa
