Hi. My name is Konstantin Evteev, I’m a DBA Unit Leader of Avito.ru, one of worlds top classifieds. In Avito, ads are stored in PostgreSQL databases. At the same time, for many years already the logical replication has been actively used. With its help, the following issues are successfully solved: the growth of data volume and growth of number of requests to it, the scaling and the distribution of the load, the delivery of data to the DWH and to the search subsystems, inter-base and intersystem data synchronization etc. But nothing happens “for free” — at the output we have a complex distributed system. Hardware failures can happen — you need to be always ready for it. There is plenty of samples of logical replication configuration and lots of success stories about using it. But with all this documentation there is nothing about samples of the recovery after crashes and data corruptions, moreover there are no ready-made tools for it. Over the years of constantly using PgQ replication, we have gained extensive experience, implemented our own add-ins and extensions to restore and synchronize data after crashes in distributed data systems.
In this report, we would like to show how our recovery use cases around Londiste (PGQ in general) in distributed data processing could be switched to a new logical replication in PostgreSQL 10. This research was done by Mikhail Tyurin (firstname.lastname@example.org), Sergey Burladyan (email@example.com) and me (firstname.lastname@example.org). And many thanks to Stas Kelvich from Postgres Professional for useful comments, discussions and review. We started making it when PostgreSQL 10 Alpha was released and are still working on it.
I want to highlight one aspect of infrastructure: streaming and logical replication types are set up in asynchronous mode. Synchronous replication may not cope with a huge OLTP load.
Logical replication is a common case of data denormalization. We usually put some data in normal form in one place and then we need to redistribute it to a different place and even to a different structure (do some data transformations) due to different conditions:
- storing requirements (optimal hardware utilization);
- making easier data processing;
- access management;
The typical use-cases for logical replication are:
- giving an access to the replicated data to different groups of users;
- storage flexibility;
- event tracking;
- data distribution;
- flexible replication chains;
- data transformation;
- upgrading between major versions of PostgreSQL.
Different logical replication solutions, tools and frameworks from community allow us to make our own improvements and tools. We replicate data from one Postgres to another to achieve different use-cases. In Avito we successfully use Logical Replication for:
- Dictionaries delivery.
- Load balancing. Data copies are stored in different places so we can get it in parallel.
- Partial replication to services. Delivery of a part of the table for example a category or user’s group for a particular service or application.
- Data streaming to search systems. In our case it is data delivery to an external index.
- Persistent queue. Our own implementation of central storage for all states of our tables/objects at the end of transaction which are serialized on the lock on primary key. The main customer of this persistent queue is DWH.
- Interservice communication.
That is all about values of logical replication in brief. In PostgreSQL 10 logical replication became a built-in feature. And Avito architecture is a successful example of using standalone implementation of the logical replication (SkyTools, Londiste, PgQ).
The infrastructure for Londiste is as following Provider, Subscriber, Ticker, Londiste Worker.
In Londiste replication queue is set of tables and changes are written to them with the help of triggers. At the same time the ticker writes down snapshots on provider. And then londiste worker get changes with following query — give me changes that was not visible in previous snapshot and becomes visible in current snapshot. Then applies them on subscriber. That is all about Londiste architecture in brief. The main thing I want to highlight for you that replication queue is stored in tables.
Builtin logical replication
Builtin logical replication is built with an architecture similar to physical streaming replication.
It is implemented by “walsender” and “apply” processes. The walsender process starts logical decoding of the WAL and loads the standard logical decoding plugin (pgoutput). The plugin transforms the changes read from WAL to the logical replication protocol and filters the data according to the publication specification. The data is then continuously transferred using the streaming replication protocol to the apply worker, which maps the data to local tables and applies the individual changes as they are received, in correct transactional order.
The apply process on the subscriber database always runs with session_replication_role set to replica, which produces the usual effects on triggers and constraints. The logical replication apply process currently only fires row triggers, not statement triggers. The initial table synchronization, however, is implemented like a COPY command and thus fires both row and statement triggers for INSERT. For our Undo case we use row trigger, you will see it later.
The main difference Between trigger based and builtin replication is queue — in builtin replication it is WAL while in trigger based solution it is the table.
Before describing our recovery use cases I want to draw your attention to few details in implementation of built in logical replication that make it easier to understand implementation of our recovery use cases.
Replication origins used for tracking replication progress in PostgreSQL 10. Replication progress is tracked in a shared memory table (ReplicationState) that’s dumped to disk every checkpoint. And progress of applying logical replication is written to WAL ( COMMIT 2017-XX-XX XX::XXX::XX origin: node 1 , lsn 0/30369C0, at 2017–11–01 XXXXXX msk).
Replication progress in shared memory is not transactional. Even if you are inside a Repeatable Read transaction, the value of pr_replication_origin_status can be changed. It means that nowadays without stopping logical replication we can’t make a consistency database dump of the logical subscription. But we can stop the replication before taking a snapshot, log the LSN to some external place, start dump and after these actions start replication. Then switch a new logical consumer to a new slot, and set its LSN to the subscription’s LSN value which we logged before we start dumping it.
pg_subscrition and pg_dump
When dumping logical replication subscriptions, pg_dump will generate CREATE SUBSCRIPTION commands that use the NOCONNECT option. After restoring from this dump all replicated tables are unlinked to subscription. To create a link we need to refresh the subscription, but it is not allowed (ERROR alter subscription is not allowed for disabled subscriptions) e.d. subscription is restored with option connect = false, but we need connect = true and enable = false. And if we turn on the replication, there is a time interval before the refresh publication command is executed, and as there was no link between the replicated tables and subscription there will be a silent data corruption which means that all the changes for logical replication will not be applied to the subscription. That’s why it is not possible to connect a new subscription with pg_dump and to execute the refresh publication command. To create a new copy of subscriber you should do the following:
- run pg_dump without subscriptions;
- create the subscription manually with a link to an existing slot, after restoring the dump;
- before starting the replication, switch the slot to the logical subscription’s LSN value which we logged before we start dumping it.
Let’s see how to move our experience on restoring and synchronizing data after crashes in distributed data processing systems from SkyTools to built-in logical replication in PostgreSQL 10.
Avito recovery use cases
Reinitializing subscriber from another subscriber
Reinitalizing in brief
1. The Infrastructure for this user case is as follows:
- Provider (main) — is a publisher — the source of our logical replication.
- Two replicas (repca) — logical consumers: repca1, repca2.
There are two replicas for recovery cases. Crashes can happen so if you want to have fast recovery, reserve every node.
This case is suItable for systems where:
- the provider is overloaded;
- the size of replicated tables is extremely big and there is some logic in triggers on these tables on the subscriber’s side, i.e. the duration of initial data copying process is estimated to be too long, e.g. few days;
- the case when the replica is a derivative of the original data, for example some signals of events on the source side.
Reinitializing subscriber from another subscriber in brief:
- “Resubscribe” — create a new logical slot without a consumer
- Wait till the new subscriber’s position is seen on the old one and copy (pg_dump — j)
- Change the queue position according to destination
- Start replication
In short, we have two logical replicas. We recreated a logical subscriber with no influence on provider — this is the main feature of our solution. We did it with Londiste and now let’s discover how to do the same with logical replication in PostgreSQL 10
- Creating a logical slot for a new subscriber:
2. Disabling active subscription:
3. Logging current LSN:
5. At the same time:
6. Creating subscription.
7. Moving LSN.
8. Checking the state of subscription:
9. Enabling subscription:
UNDO in brief
1 The Infrastructure for this use case is as follows:
- Provider (main) — is a publisher — the source of our logical replication.
- Provider’s standby
- Replica (repca1) — logical consumer.
There is a master with its standby. If the master crashes, there is a probability that the repca (logical consumer) will be in future in relation to provider’s standby, e.g. logical replication works faster than binary due to overloaded standby or queries on standby that locks applying of the new WAL etc. For this case Avito has developed Undo framework. Undo looks like an audit log of events, that were applied on a logical consumer. It is the opposite actions: for Delete it is Insert, for Update it is another Update, for Insert it is Delete. After promotion of standby we take the progress of logical replication on standby and compare it to the subscriber’s position. Then apply Undo on the subscriber for all the same actions related to standby . As a result the provider’s and subscriber’s data will be in a consistent state, and starting from this point it is ok to turn on the logical replication.
If we don’t set our provider and subscriber databases in consistent state after the crash mentioned above, there is a probability that replication will crash due to unique key constraint violation error. And still there is a chance that everything will look fine, but the data is not consistent and there is a silent data corruption which is very hard to detect and even harder to explain and reproduce.
Here are schema images with the main steps for Undo recovery case:
Step 1: provider’s crash
Step 2: promoting provider’s standby
Step 3: applying Undo
Step 4: Starting Logical Replication
The full implementation of Undo can be found on Github. I just want to highlight the main components here:
- Undo log table.
select id, LSN, dst_schema, dst_table, undo_cmd, cmd_data, cmd_pkfrom undo_log order by id
2. Undo trigger that writes opposite actions to Undo log table: for Delete it is Insert, for Update it is another Update, for Insert it is Delete.
3. Undo apply function — for applying Undo in descendent order.
Key features and specialties for Undo implementation in PostgreSQL 10:
- In trigger-based replication solutions like Londiste tracking the progress of logical replication for Undo recovery case is very easy in comparison with the built-in logical replication. In PostgreSQL 10 implementation of the logical replication you should wait until the provider’s standby has replayed all WAL files, then log the LSN position, you can get it by executing
pg_last_xlog_replay_locationfunction . If recovery is still in progress, LSN will increase monotonically. If recovery has completed then this value will remain static at the value of the last WAL record applied during that recovery. When the server has been started normally without recovery the function
pg_last_xlog_replay_locationreturns NULL. We will use this LSN for applying Undo to this state, after applying Undo we can promote the provider’s standby.
- A physical slot can be created on the standby unlike a logical one. So you mustn’t turn on client’s activity on the promoted standby before creating a logical slot. It is necessary to save all data changes for logical replication, if you do not create a logical slot there will be a gap in your data.
- To prevent undoing new changes, writing transactions on the subscriber can be turned on only after applying Undo.
- Also you can read from the subscriber before applying Undo, but it is very risky — you can get inconsistent data.
Full implementation of Undo case — provider’s crash and switching to the provider’s standby:
- Write down the WAL replay LSN before promotion.
2. Logical Replication Slot isn’t replicated to the standby, that’s why you shouldn’t turn on traffic immediately after promotion of standby.
3. There are some changes for Undo.
4. Current subscriber’s LSN.
5. Applying Undo.
6. Enabling logical replication.
Few sources, one subscription and Undo
- In trigger get subscriber’s name and write it to undo log with opposite actions.
select subnamefrom pg_stat_subscription pwhere p.pid = pg_backend_pid()
- On the publication side there is no possibility to find out who consumes the slot. As we don’t know links between subscriber and publisher — we need to make an external list with logical consumers ( for londiste we do the same thing), to apply undo if the source is crashed.
- So the consumer name has to be in special format. This will be useful to find out the link between publication and subscription.
REDO — reposition source (subscriber’s crash)
REDO in brief
1. The Infrastructure for this use-case is as follows:
- Provider(main) — is a publisher — the source of our logical replication.
- Subscriber’s master (repca1) — logical consumer.
- Subscriber’s standby (repca2) — logical consumer’s standby.
There are 2 kinds of replication:
- logical replication between the provider (the main one which is the source of the logical replication) and subscriber’s master (repca1);
- streaming replication between subscriber’s master (repca1) and subscriber’s standby (repca2).
If the subscriber’s primary crashes, there is a probability that the subscriber’s standby will fall behind the crashed subscriber’s master, it might occur when the replication is set up in an asynchronous mode. In this case a logical slot’s position is not at the LSN, which subscriber’s standby expects i.e. we need a different logical slot (it can be called a slot for a failover) which tracks the progress of applying the logical changes on subscriber’s standby.
Now let’s look at the main schema of that case on the images below:
Step 1: the crash of subscriber’s master
Step2: promoting subscriber’s standby
Step3: changing the queue position according to destination
Step4: Start the replaying of the logical replication
The command for moving a replication slot. We create the slot manually for the promoted consumer’s standby to prevent replicated queue (WAL) rotation on provider’s side:
psql -p 5433 -U postgres -X -d src-c “select * from pg_logical_slot_get_binary_changes( ‘repca2’::name, ‘0/38AFCC0’::pg_lsn, null::int, variadic array[‘proto_version’, ‘1’, ‘publication_names’, ‘pub’])”
- Creating a logical slot to prevent WAL’s rotation on provider’s side(these WAL files can be needed for promoted subscriber’s standby).
2. Adding new changes for our logical consumer:
3. Replication slot for our subscriber’s standby is in the past:
4. Checking pg_replication_origin status on subscriber’s side and subscriber’s standby side:
5. “Moving” replication slot with the help of SQL protocol
6. LSN for both replication slots are equal.
7. “Emulating delay of subscriber’s standby replication”.
8. Adding one more record in the replicated table.
9. As expected subscriber’s standby falls behind.
10. “Subscriber’s crash”. Subscriber’s standby is still behind.
11. Dropping slot which was consumed by crashed subscriber’s primary.
12. Checking pg_replication_origin_status.
13. Sync “subscriber’s standby slot” actual with subscriber’s standby pg_replication_origin.
14. Subscriber hasn’t had new changes yet.
15. Promoting subscriber’s standby.
16. Promoted subscriber’s standby is still behind
17. Alter subscription: set actual slot, we prepared previously.
18. Slot turned on and started being consumed by subscriber.
19. Subscriber replayed the changes.
The algorithm for consuming a reserved slot
- Take the LSN position for the subscriber’s master’s slot from pg_replication_slots.
2. Check pg_replication_origin on the subscriber’s master.
3. Wait until the subscriber’s master’s pg_replication_origin is seen on the subscriber’s standby side.
4. Consume the reserved slot (subscriber’s standby slot):
select * from pg_logical_slot_get_binary_changes(‘repca2’::name,
‘0/3037B68’::pg_lsn, null::int, variadic array[‘proto_version’, ‘1’, ‘publication_names’, ‘pub’])
REDO 2 — on provider’s side (provider’s crash and switching to the provider’s standby, subscriber is falling behind)
1 The Infrastructure for this use-case is as follows:
- Provider (main) is a publisher — the source of our logical replication.
- Provider’s standby.
- Replica (repca1) logical consumer.
There is a master with its standby (the source of logical replication). If the primary crashes, there is a probability that the repca (logical consumer) will fall behind the provider, i.e. logical replication might work slower than binary due to many reasons:
- by design many backends can make data changes on the provider side and only one backend replay all data changes from the provider on the subscriber side,
- subscriber might be overloaded,
- there can be queries on the subscriber that lock some replicated relations and prevent applying the logical changes,
In trigger based solution (Londiste in my case) we do nothing with that case, i.e. we just promote the standby, make some changes in the config file and continue replaying the logical replication. It is possible because the replication queue and the replication progress are stored in tables (PgQ and Londiste tables) which are also replicated to binary provider’s standby. On the other hand, in Logical Replication in PostgreSQL 10 WAL files, pg_catalog and a logical replication slot are used for these purposes.
Let me give you a simple example with a potential data loss in some production system. Imagine a system, where a table with orders is replicated. On the subscriber’s side there is a trigger, which fires on any order table change and writes all these states of each row to the audit table. Such solutions are used for audit/analytics/other purposes. If there is a crash like the one I have described above, even if you have all WAL files on your provider’s standby, you can’t get logical changes for the subscriber, so there is a gap in your data.
In PostgreSQL 10 implementation of the logical replication a logical slot is not replicated to the provider’s standby. We can’t create a logical replication slot on the provider’s standby and there is no other way today to continue our replication without a data loss on the provider’s side.
A range of large technology companies in Russia face the same problem:
- Avito, the biggest classified site of Russia,
- Yandex, the largest technology company in Russia (there was a mentioning of the same problem in the talk https://www.youtube.com/watch?v=2QJgCTF1cIE),
- a database architect from Tinkoff bank, one of the largest Russian banks.
This problem is very important, I see community activity in developing solution for it, e.g.
The short list above compares tools which we use to recover and make data consistent after crashes in distributed systems, built with the help of Logical Replication.
My thoughts about (desirable features)
I would be on cloud nine if there were an api and commands to implement all the recoverу cases we discussed above as simply as setting up logical replication in PostgreSQL 10 :)
- Make pg_replication_origin “transactional” ~ “pin to snapshot”, to prevent stopping Logical Replication process for Reinitializing subscriber from another subscriber case
- Dumping with pg_dump replication_origin or option to enable it will make it easier to implement Reinitializing subscriber from another subscriber case
- Implement “logical UNDO” / or SQL API
- Tracking the progress of consuming a logical slot on the subscriber (“provider_restart_lsn”) to make it easier to implement Redo case (the algorithm for consuming a reserved slot).
- Comparing LSN between subscriber and provider (SerialConsumer). Raise error when provider state does not match subscriber to prevent silent data corruptions, like the ones I’ve described above in Undo and Redo 1 and 2 cases.
- A function to move a slot forward on provider (not get the changes, just move, waiting for PostgreSQL 11 pg_replication_slot_advance (slot_name name, upto_lsn pg_lsn) )
- Any implementation of failover slot or another mechanism to get logical changes for the subscriber after provider’s failover. It’s needed for two cases:
— In the Redo 2 on provider’s side (Subscriber in the past) the absence of a failover slot blocks the production usage of logical replication in PostgreSQL 10 at least in our case
— Undo — to prevent losing new changes for the logical subscriber we need to create a logical slot before letting clients connect to promoted provider. It can be solved today by following the steps described above.
- Implementing a transactional queue (like PgQ) can be done with using a proxy table or writing a custom logical decoder and using pg_logical_emit_message.
I and my colleagues are hoping to see the solution for it in future PostgreSQL versions. Logical replication in PostgreSQL 10 is a great achievement, it is very simple today to set up a logical replication with high performance. This article is written to highlight main demands of Russian PostgreSQL community (I suppose that members of PostgreSQL community worldwide face the same problems).