RDS PostgreSQL Logical Replication COPY from AWS RDS Snapshot

Bhuvanesh
Bhuvanesh
May 31, 2020 · 8 min read

AWS RDS PostgreSQL has logical replication enabled from 9.4+ onwards. Generally in a logical replication once the subscriber is enabled, CDC will be captured in the logical replication slots and also it’ll start sync the historical data(initial sync) from the Publisher itself with the COPY process. If you have a mission-critical database on AWS RDS PostgreSQL, then the logical replication will start COPY from the Master node. This may reduce the performance of the master node. And it's not good for your business. We can’t do read from Read Replica, logical replication only works on the master node. So we did Proof Of Concept to solve this issue with Snapshots.

Inspired by Instacart:

If you are already familiar with RDS and Logical replication and you may recently read a blog post from instacart. Great article. They already achieved this. But if your RDS instance already having any replication slots then? Also here the requirement is RDS to an external replica. But do the COPY from the snapshot. That's why we are doing this POC.

Importance of Logical Replication in RDS:

PostgreSQL’s replication is not like MySQL. If you want to replicate from/to RDS, then the native streaming replication will not help you, also it's not possible. When this feature was not available on RDS, we used trigger-based replications with Bucardo, Slony to migrate. But this Logical Replication changed everything.

The logic behind this POC:

  1. When you created the Publication and added the table, then all the write operations will be tracked in your replication slot. Until a subscriber consumes it’ll be persisted.
  2. The pg_replication_slots will mark the LSN of WAL which still might be required for the subscriber.
  3. So take a snapshot and restore it as a new RDS instance.
  4. Go to the log file most recent one, or before the recent one, get the LSN number.
  5. Create a publication on the Snapshot RDS. Then create the subscription on your Target. Let's sync the data.
  6. Once the sync has been done, disable the subscription.
  7. Create a new subscription on Target(with disabling mode) that will be pointing the slot and publication of the one on the source which we created for this CDC.
  8. Manually advance the replication slot on Target to set the last LSN as the value that we copied from the log file.
  9. Finally, enable the Subscription to resume the CDC from the Master node.
  10. You can use the same method to replicate without performing the initial COPY.

Infra Setup:

  • RDS instance with PostgreSQL 10.6+
  • On the Target database, set rds.logical_replication=1 and shared_preload_libraries = 'pglogical' in the database parameter group.
  • Launch 2 new EC2 instances and install PostgreSQL 10. Because we are trying to simulate that the RDS is already having logical replication.
  • Make sure then wal_level=logical in the config file.
  • Port 5432 should be opened between the RDS and the EC2.

Prepare the source database:

Create a sample database and add 2 tables with some values.

create database bhuvi;
\c bhuvi
create table mytbl (id int PRIMARY KEY, name varchar(10));
create table mytbl_new (id int PRIMARY KEY, name varchar(10));
insert into mytbl values (1, 'postgresql');
insert into mytbl values (2, 'mysql');
insert into mytbl values (3, 'sqlserver');
insert into mytbl_new values (1, 'aws');
insert into mytbl_new values (2, 'rds');
insert into mytbl_new values (3, 'aurora');

Now let's create the publication.

CREATE PUBLICATION replication1;
ALTER PUBLICATION replication1 ADD TABLE mytbl;

Replicate to the first EC2 instance:

We need the same database and tables before setting up the Replication.

create database bhuvi;
\c bhuvi
create table mytbl (id int PRIMARY KEY, name varchar(10));
create table mytbl_new (id int PRIMARY KEY, name varchar(10));

Create the subscription.

CREATE SUBSCRIPTION subscription1 CONNECTION 'host=RDS-ENDPOINT port=5432 password=password user=postgres dbname=bhuvi' PUBLICATION replication1;

This will automatically start the replication and perform the COPY on the master node. Run the following common to verify the replication.

-- On source insert a new row
insert into mytbl values (4, 'redis');
--Check the rows on the target
select * from mytbl;

id | name
----+------------
1 | postgresql
2 | mysql
3 | sqlserver
4 | redis

Now, let's see the LSN from the pg_replication_slots on the master.

select * from pg_replication_slots ;
-[ RECORD 1 ]-------+--------------
slot_name | subscription1
plugin | pgoutput
slot_type | logical
datoid | 16405
database | bhuvi
temporary | f
active | t
active_pid | 17627
xmin |
catalog_xmin | 718
restart_lsn | 0/34000028
confirmed_flush_lsn | 0/340002F8
Where are we now?

COPY from Snapshot Approach:

In this method, before taking the snapshot we have to create one more publication on the master node. And then create a slot for this.

CREATE PUBLICATION replication2;
ALTER PUBLICATION replication2 ADD TABLE mytbl_new;
SELECT * FROM pg_create_logical_replication_slot('replication2', 'pgoutput');

Now check the LSN from the slots table.

-[ RECORD 1 ]-------+--------------
slot_name | subscription1
plugin | pgoutput
slot_type | logical
datoid | 16405
database | bhuvi
temporary | f
active | t
active_pid | 17627
xmin |
catalog_xmin | 723
restart_lsn | 0/36000028
confirmed_flush_lsn | 0/36000060
-[ RECORD 2 ]-------+--------------
slot_name | replication2
plugin | pgoutput
slot_type | logical
datoid | 16405
database | bhuvi
temporary | f
active | f
active_pid |
xmin |
catalog_xmin | 723
restart_lsn | 0/36000028
confirmed_flush_lsn | 0/36000060

Now take the snapshot of the RDS master node from the Console/CLI or API. Once done, then restore it. Meanwhile, insert some more data on the Master node during the snapshot and after the snapshot completed. Because it has to resume the CDC until how many rows that we inserted.

--During the snapshot
insert into mytbl_new values (4, 'replica');
insert into mytbl_new values (5, 'replica1');
insert into mytbl_new values (6, 'replica2');
insert into mytbl_new values (7, 'replica3');
--After the snapshot
insert into mytbl_new values (8, 'replica4');
insert into mytbl_new values (9, 'replica5');
insert into mytbl_new values (10, 'replica6');

Launch the new RDS from the snapshot. Once the instance is in the available state go to theand open the log file(most recent one, or before the recent one). The LSN number will start like this.

2020-05-30 11:40:17 UTC::@:[5102]:LOG: database system was interrupted; last known up at 2020-05-30 11:28:21 UTC
2020-05-30 11:40:17 UTC::@:[5102]:LOG: database system was not properly shut down; automatic recovery in progress
2020-05-30 11:40:17 UTC::@:[5102]:LOG: redo starts at 0/38000028
2020-05-30 11:40:17 UTC::@:[5102]:LOG: invalid record length at 0/39000470: wanted 24, got 0
2020-05-30 11:40:17 UTC::@:[5102]:LOG:
Where are we now?

Lets do the COPY from the Snapshot RDS for the table mytbl_new . For this, we have to create a new publication.

CREATE PUBLICATION copy;
ALTER PUBLICATION copy ADD TABLE mytbl_new;

Go to your second EC2 instance. (make sure the Database and table has been created). Create a subscription to COPY all the data from the mytbl_new

CREATE SUBSCRIPTION copy_sub CONNECTION 'host=SNAPSHOT-RDS-ENDPOINT port=5432 password=password user=postgres dbname=bhuvi' PUBLICATION copy;
Where are we now?

Perfect, till here, the initial COPY is done, now let's start the CDC from the Master RDS Node.

CDC from Master Node:

We don’t need the COPY subscription, so let's disable this on the Target EC2.(Perform all the steps on the second EC2)

alter subscription copy_sub disable;

Now create a subscription on the target(with disable mode). It should refer to the publication that we created on the master for this CDC.

CREATE SUBSCRIPTION subscription2 CONNECTION 'host=MASTER-RDS-NODE port=5432 password=password user=postgres dbname=bhuvi' PUBLICATION 
WITH (
copy_data = false,
create_slot = false,
,
synchronous_commit = false,
connect = true,
slot_name = ''
);

Manually advance(set flushed LSN) the replication slot.

SELECT * FROM pg_replication_origin;roident |  roname
---------+--------
1 | pg_16418
2 | pg_16419
3 | pg_16423

Your publication is created last right, so pick the roname of the higher roident. So it should be Also, add the LSN number to advance the LSN for the slot.

pg_replication_origin_advance-------------------------------(1 row)

It won’t return any values, but that's fine. Let's enable the subscription.

alter subscription subscription2 enable ;
Where are we now?

Check data on Target.

select * from mytbl_new;id |  name
----+---------
1 | aws
2 | rds
3 | aurora
4 | replica
5 | replica1
6 | replica2
7 | replica3
8 | replica4
9 | replica5
10 | replica6

It worked! go to Master node and you can see the flushed LSN value is changed on the pg_replication_slots table.

select * from pg_replication_slots ;
-[ RECORD 1 ]-------+--------------
slot_name | subscription1
plugin | pgoutput
slot_type | logical
datoid | 16405
database | bhuvi
temporary | f
active | t
active_pid | 17627
xmin |
catalog_xmin | 746
restart_lsn | 0/440005E8
confirmed_flush_lsn | 0/45000000
-[ RECORD 2 ]-------+--------------
slot_name | replication2
plugin | pgoutput
slot_type | logical
datoid | 16405
database | bhuvi
temporary | f
active | t
active_pid | 6309
xmin |
catalog_xmin | 746
restart_lsn | 0/440005E8
confirmed_flush_lsn | 0/45000000

Is there any drawback?

Even though we are eliminating a huge data transfer from the master node for the COPY process, but the CDC data will be stored on the logical replication slot until you read. So if you add all the tables at one shot then a huge amount of data will be there on the slots. This may fill up your disk quickly. So to avoid this, . Then you can avoid this.

Conclusion:

This is just Proof Of Concept for replicating from RDS and COPY from a snapshot. This will help you if you are going to replicate a huge RDS instance to an EC2, or On-Prem or somewhere else. We played with fewer data and a single table. If you are going to implement this for a production workload, better do this on your staging/dev infra and then implement it.

The same approach will work on Aurora as well, Try it out and let us know your experience in the comments.

Happy Replication :)

Searce Engineering

We identify better ways of doing things!

Medium is an open platform where 170 million readers come to find insightful and dynamic thinking. Here, expert and undiscovered voices alike dive into the heart of any topic and bring new ideas to the surface. Learn more

Follow the writers, publications, and topics that matter to you, and you’ll see them on your homepage and in your inbox. Explore

If you have a story to tell, knowledge to share, or a perspective to offer — welcome home. It’s easy and free to post your thinking on any topic. Write on Medium

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store