Published in


Standby in production: scaling application in the second largest classified site in the world

Hi. My name is Konstantin Evteev, I’m a DBA Unit Leader of Avito. Avito is the biggest Russian classified site, and the second largest classified site in the world (after Craigslist of USA). Items offered for sale on Avito can be brand new or used. The website also publishes job vacancies and CVs.

Via its web and mobile apps, the platform monthly serves more than 35 million users. They add approximately a million new ads a day and close over 100,000 transactions per day. The back office has accumulated more than a billion ads. According to Yandex, in some Russian cities (for example, in Moscow), Avito is considered a high load project in terms of page views. Some figures can give a better idea of the project’s scale:

At the same time, these volumes of data need not only to be accumulated and stored but also processed, filtered, classified and made searchable. Therefore, expertise in data processing is critical for our business processes.

The picture below shows the dynamic of pageviews growth.

Our decision to store ads in PostgreSQL helps us to meet the following scaling challenges: the growth of data volume and growth of number of requests to it, the scaling and distribution of the load, the delivery of data to the DWH and the search subsystems, inter-base and internetwork data synchronization, etc. PostgreSQL is the core component of our architecture. Rich set of features, legendary durability, built-in replication, archive, reserve tools are found a use in our infrastructure. And professional community helps us to effectively use all these features.

In this report, I would like to share Avito’s experience in different cases of standby usage in the following order:

- a few words about standby and its history in general;

- problems and solutions in replication based horizontal scale-out;

- Avito’s implementation for solution to avoid stale reads from replica;

- possible pitfalls while using standby with high request rate, applying DDL, receiving WAL files from the archive;

- handling issues with technique of using several standbys in production and routing queries between them;

- logical replication based scaling example to compare;

- conclusions and standby major upgrade features.

A few words about standby and its history in general

Standby can be used for the following purposes.

In early 2000s, PostgreSQL community didn’t think that replication should be a built-in feature. Vadim Mikheev after implementation of MVCC, developed a replication solution «Rserv» based on MVCC internal mechanics. The following logic was supposed to be used to replicate data: taking changes that were not visible in a previous snapshot and became visible in the current snapshot. Then Jan Wieck developed an enterprise logical replication solution based on Vadim Mikheev’s idea and named it Slony (Russian for “elephant”) as a reverence for its author. Then the team from Skype created SkyTools with PgQ (the transactional queue) and Londiste (the logical replication solution based on PgQ). PostgreSQL 8.3 аdd several txid_*() functions to query active transaction IDs. This is useful for various replication solutions. (Jan). And finally Logical replication became a built-in feature in PostgreSQL 10.

Meanwhile, built-in binary replication was evolving.

So we have discussed few classifications of standby: logical and physical, built-in and stand alone solutions, read only and not only read only, sync and async.

Now let’s look closer at physical built-in read only async standbу.

In Streaming Replication, three kinds of processes work together. A walsender process on the primary server sends WAL data to а standby server; and then, a walreceiver and startup processes on standby server receives and replays these data. A walsender and a walreceiver communicate using a single TCP connection.

Problems and solutions in replication based horizontal scale-out

When you use asynchronous standby, it can fall behind the primary. That’s why different rules for routing read queries to the primary or to the standby match different types of applications.

But I don’t like this approach, let’s look at better techniques for routing read queries to standby.

2. Sometimes there is a specific business logic that can allow exploiting stale reads. For example, few years ago on Avito there was a 30-minute interval before our users could see new ads (time for anti-fraud check and other restrictions). So we successfully use this specific business rule for routing your read queries for other people’s ads. At the same time we route read queries to your own ads to the primary. Thus, you may edit your own ad and work with the actual ad’s state. To make this routing approach work in a correct way we need to support a replication lag lower than the 30-minute interval.

3. There can be situations when stale reads are not acceptable, even more interesting when you need to use few standbys to achieve a greater level of scale. To deal with them you need to use special techniques. Let’s look closer at our example and the technique to avoid stale reads

Avito’s implementation to avoid stale reads from replica

We successfully used the logical routing technique (number 2: based on business specific logic) when we faced the following spikes. In the picture you can see a diagram with the Load Average on a 28-physical-core machine (2X Intel(R) Xeon(R) CPU E5–2697 v3 @ 2.60GHz). When we started using Hyper-threading, we benchmarked and found out that most of our queries were successfully scaled with enabling Hyper-threading. But then the distribution of queries changed and the result of this can be seen on the diagram. When the level of load average got closer to the number of physical cores, the spikes happened.

The same spikes were observed on the CPU diagram below. From my side, I saw that all queries started being executed dozens of times longer.

So I experienced the capacity problem of the primary server. Actually, the capacity of one server is bounded by the most powerful market offering. If it does not fit you, it is a capacity problem. Normally, changing the architecture of your application is not so that fast and moreover it might be complicated and take a lot of time. Being under such circumstances I had to find a quick solution. As you can notice that the spikes are gone, we coped with the capacity problem by switching more read queries to standby (you can see the number of queries on standby on the following TPS diagram).

We distributed load among more nodes in the system by re-routing queries that don’t need to run on the primary. The main challenges in spreading read queries to replicas are stale reads and race conditions.

On the links you can see detailed description of routing reads based on replica WAL position technique. The main idea is to store the last committed LSN for the entity on which mutating request was executed. Then, when we subsequently want to fulfill a read operation for the same entity, we’ll check which replicas have consumed to that point or beyond it, and randomly select one from the pool. If no replicas are sufficiently advanced (i.e. say a read operation is being run very closely after the initial write), we’ll fall back to the master. Stale reads become impossible regardless of the state of any given replica.

We made some simplification in the technique above. We didn’t use exactly the same one because we didn’t have enough time and resources to rewrite application by implementing sticky connections based on LSN tracking.

At that time we were facing the capacity problem of the primary server, we had a huge monolithic application (it was in progress of splitting to microservice architecture). So it had a lot of complicated logic and deep nested calls — which forced us to make migration to sticky LSN sessions unreachable in short term period. But on the other hand, we had some historical background: we had a timestamp column in the user and the item tables (our main entities), which we filled with the help of PostgreSQL function now(): current date and time — start of current transaction. So we decided to use timestamp instead of LSN. Probably you know that timestamp can’t be used for the task of serialization data operations (or tracking position of replica in relation to primary) in PostgreSQL. To make this assumption we used the following arguments:

Eventually, we came up with the following solution (we called it Avito Smart Cache). We replaced our cache based on Redis with the one implemented in Tarantool. (Tarantool is a powerful fast data platform that comes with an in-memory database and an integrated application server).

We made two levels of cache. The first level — the hot cache — stores sticky sessions. Sticky connections are supported by storing the timestamp of last user’s data mutation. Each user is given its own key so that the actions of one user would not lead to all other users being affected.

Let’s look closer at the approach we used to deliver user’s timestamp information to Tarantool. We have made some changes in our framework for working with database: we have added «on commit hook». In this «hook» we invoke sending data mutation’s timestamp to the 1st level of cache. You may notice, that there is a possibility when we successfully have made changes in PostgreSQL database but couldn’t deliver this info to cache (for example there was a network problem or application error or smth else). To deal with such cases we have made the Listener. All mutations on the database side are sent to the Listener using PostgreSQL LISTEN and NOTIFY features. Also, the Listener tracks last_xact_replay_timestamp of standby and has HTTP API from which Tarantool takes this information.

The second level is the cache for our read data to minimize utilization of PostgreSQL resources (replacing the old cache in Redis).

Some special features:

The main logic of working with smart cache is as follows.

Trying to find data on the main level of the cache, if we find it, it is success. If we do not then we try to get data on the 1st level hot cache. If we find data in the hot cache, it means that there were recent changes in our data and we should get data from primary.

Or if the standbys lag is greater than routing interval — we also should get data from the primary.

Otherwise, data hasn’t been changed for a long time and we can route read request to standby.

You may notice different ttl values — they are different because there is a probability of races and In cases where the probability is greater I use smaller ttl to minimize losses.

Thus we successfully implemented one of the techniques for routing queries and made eventually synchronized cache. But on the graph below you may notice the spikes. These spikes appeared when standby started falling behind due to some reasons such as different locks or specific profile of utilizing hardware resources. In the following part of my report, I would like to share Avito’s experience in solving those issues on the standbys.

Сases highlighting possible problems while using standby with high request rate, applying DDL, receiving WAL files from archive and handling some issues with technique of using few standbys in production and routing queries between them

1st: Deadlock on standby

Step 1:

The Infrastructure for this case is as follows:

Step 2

Open transaction on the primary and alter the options table.

Step 3

On the standby execute the query to get data from the items table.

Step 4

On the primary side alter the items table.

Step 5

Execute the query to get data from the items table.

In PostgreSQL versions lower than 10 we have a deadlock that is not detected. Such deadlocks have successfully been detected by deadlock detector since PostgreSQL 10.

2nd: DDL (statement_timeout and deadlock_timeout)

The infrastructure for this case is the same.

How to apply DDL changes to the table, receiving thousands read requests per second. Of course with the help of statement_timeout setting (Abort any statement that takes more than the specified number of milliseconds, starting from the time the command arrives at the server from the client. ).

Another setting you should use to apply DDL is deadlock_timeout. This is the amount of time, in milliseconds, to wait on a lock before checking to see if there is a deadlock condition. The check for deadlock is relatively expensive, so the server doesn’t run it every time it waits for a lock. We optimistically assume that deadlocks are not common in production applications and just wait on the lock for a while before checking for a deadlock. Increasing this value reduces the amount of time wasted in needless deadlock checks, but slows down reporting of real deadlock errors. The default is one second (1s), which is probably about the smallest value you would want in practice. On a heavily loaded server you might want to raise it. Ideally the setting should exceed your typical transaction time, so as to improve the odds that a lock will be released before the waiter decides to check for deadlock.

Deadlock timeout setting controls the time when conflicting autovacuum is canceled, because the default 1-second value is also the value of duration of the lock while applying DDL and it can lead to a trouble in systems with thousands requests per second.

There is some specifics in the scope of statement_timeout setting and the changes take effect ( for example changing statement_timeout inside the stored procedure will take no effect on this procedure call).

With the help of the settings above we repeatedly try to apply changes to the structure of our tables with a short lock (approximately dozens of ms). Sometimes we can’t do this (take a lock within dozens of ms) during the daytime, and then we try to do it at night, when the traffic rate is lower. Usually we manage to do it. Otherwise we increase the values of those settings(«time window» for getting the lock).

After execution of the DDL statement with timeouts on the primary, these changes are replicated to standby. Timeouts aren’t replicated to the standby that’s why there may be locks between the WAL replay process and read queries.

2018–01–12 16:54:40.208 MSK pid=20949,user=user_3,db=test,host= LOG: process 20949 still waiting for AccessShareLock on relation 10000 of database 9000 after 5000.055 ms2018–01–12 16:54:40.208 MSK pid=20949,user=user_3,db=test,host= DETAIL: Process holding the lock: 46091. Wait queue: 18639, 20949, 53445, 20770, 10799, 47217, 37659, 6727, 37662, 25742, 20771,

We have implemented the workaround to deal with the issue above. We use the proxy to route queries to one of two standbys. And when we want to apply the DDL command on the primary, we stop replaying of the replication on the standby where the queries are executed.

Return the first standby to the pool of active standbys.

3rd: Vacuum replaying on standby and truncating data file

Vacuum can truncate the end of data file — the exclusive lock is needed for this action. At this moment on the standby long locks between read only queries and recovery process may occur. It happens because some unlock actions are not written to WAL. The example below shows a few AccessExclusive locks in one xid 920764961, and not a single unlock… Unlock happens much later. When the standby replays the commit.

The solution for that issue can be like:

In our case (when we faced problem above) we have a kind of table with logs. Almost append only — once a week we executed a delete query with timestamp condition, we cleaned records older than 2 weeks. This specific workload makes it possible to truncate the end of the data files with the autovacuum process. At the same time we actively use the standby for reading queries. The consequences can be seen above. Eventually we successfully solved that issue by executing cleaning queries more frequently: every hour.

4th: Restoring WAL from archive

On Avito we actively use standbys with the replication being set up with the help of an archive. This choice was made to have an opportunity to recreate any crashed node ( primary or standby) from the archive and then set it in a consistent state in relation to the current primary.

In 2015, we faced the following situation: some of our services started generating too many WAL files. As a result our archive command reached up the performance limits, you may notice the spikes with pending WAL files for sending to the archive on the graph below.

You can notice the green arrow pointing to the area, where we solved this issue. We implemented multithreaded archive command.

You can make a deep dive into the implementation following the github link. In short, it has the following logic: if the number of ready WAL files is lower than the threshold value, then the archive is to be done with one thread, else turn on a parallel archive technique.

But then we were faced the problem of CPU utilization on the standby. It was not obvious at first view that we were monitoring CPU utilization for our profile of load in a wrong way. Getting /proc/stat values we can see just a snapshot of the current CPU utilization. Archiving one WAL file in our case takes 60 ms and on the snapshots we don’t see the whole picture. The details could be seen with the help of the counter measurements.

Due to the above reasons there was a challenge for us to find the true answer why the standby was falling behind in relation to the primary (almost all standby CPU resources were utilized by archiving and unarchiving WAL files). The green arrow on the schema above shows the result of the optimization in the standby CPU usage. This optimization was made in a new archive schema, by delegating the archiving command execution to the archive server.

First let’s look under the hood of the old archive schema. The archive is mounted to the standby with the help of NFS. The primary sends the WAL files to archive through the standby (on this standby users’ read queries are routed to minimize the replication lag). The archive command includes the compression step, and it is carried out by the CPU of the standby server.

New archive schema:

Both archive solutions were ideas of Michael Tyurin further developed by Sergey Burladyan and implemented by Victor Yagofarov. You can get it here. The solutions above were made for PostgreSQL versions where pg_rewind and pg_receivexlog were not available. With its help it is easier to deal with crashes in the distributed PostgreSQL infrastructure based on asynchronous replication. You can see the example of a crash of the primary at the schema below and there can be variants:

So there can be different cases with many «IFs» in your Disaster Recovery Plan. On the other hand, when something goes wrong it is harder to concentrate and to do everything right (for example). That’s why DRP must be clear to responsible stuff and be automated. There is a really beneficial report named «WARM standby done right» by Heikki Linnakangas where you can find a good overview and description of the tooling and the techniques to make warm standby.

Besides I want to share one configuration example of PostgreSQL infrastructure from my experience. Even using synchronous replication for standby and archive, it may happen that the standby and the archive are in different states after a crash of the primary. Synchronous replication works in such a manner: on the primary changes are committed, but there is a data structure in memory(it is not written to the WAL) that indicates that clients can’t see these changes. That indicator is stored until synchronous replicas send acknowledgment that they have successfully replayed the changes above.

Standbys pool

The existence of the possible lag of standby and other problems I have described above made us use the standbys pool in Avito’s infrastructure.

With the help of HAProxy and check function, we altered few standbys. The idea is that when the standby lag value is greater than the upper border, we close it for users’ queries until the lag value is smaller than the lower border.

if masterthen falseif lag > maxthen create file and return falseif lag > min and file existsthen return falseif lag < min and file existsthen remove file and return trueelsetrue

To implement this logic we need to make a record on the standby side. It can be implemented with the help of foreign data wrapper or untrusted languages(PL/Perl, PL/Python and etc).

Logical replication as an alternative tool to scale application

It is out of the current topic but I want to say that there is logical replication as an alternative of binary replication. With its help, the following Avito’s issues are successfully solved: the growth of data volume and growth of number of requests to it, the scaling and the distribution of the load, the delivery of data to the DWH and to the search subsystems, inter-base and intersystem data synchronization, etc. Michael Tyurin (ex chief architect) is the author of the majority of core solutions mentioned above, implemented with the help of SkyTools.

Both kinds of replication solutions have their strengths and weaknesses. Since PostgreSQL 10 logical replication has been a built-in feature. And here is an article with Avito’s recovery cases and techniques.

One of successful examples of logical replication usage in Avito is the organization of search results preview (since 2009 to the 1st quarter 2019). The idea is in making trigger-based materialized view (www.pgcon.org/2008/schedule/attachments/64_BSDCan2008-MaterializedViews-paper.pdf, www.pgcon.org/2008/schedule/attachments/63_BSDCan2008-MaterializedViews.pdf ), and then replicating it to another machine for read queries.

This Machine is much cheaper than the main server. And as a result, we serve all our search results from this node. On the graph below you can find information about TPS and CPU usage: it is very effective pattern of scale read load.

There are a lot more things to discuss such as reserve for logical replica, unlogged tables to increase performance, pros and cons and so on. But it is a completely different story and I will tell it next time. The point I want to highlight is that using logical replication can be very useful in many cases. There are no such drawbacks as in streaming replication, but it also has its own operation challenges. As a result there is no «silver bullet», knowledge about both types of replication helps us make the right choice in a specific case.


There are a few types of standby and each fits a specific case. When I was going to start preparing this report I intended to make the following conclusion. There are many operation challenges, that’s why it is better to scale your application with sharding. With the help of standby these tasks can be done without extra efforts:

Moreover, I was going to stop using binary standby for reading queries in Avito production. But opportunities have changed and we are going to build multi-datacenter architecture. Each service should be deployed at least in 3 datacenters. This is how the deploy schema looks like: a group of instances of each microservice and one node of its database (primary or standby) should be located in each data center. Reads will be served locally whereas writes will be routed to the primary.

So the opportunities make us use standby in production.

I have described several approaches and techniques above how to serve reads from standby. There are some pitfalls and a few of them have been found and successfully solved by us. Built-in solutions for pitfalls mentioned in this report could make standby operating experience better. For example, some kind of hints to write deadlock and statement timeout to WAL. If you are aware of other pitfalls let’s share that knowledge with us and the community.

Another crucial moment — is to make backups and reserve right way. Create your disaster recovery plan, support it in actual state with regular drills. Moreover, in order to minimize downtime period automize your recovery. One more aspect about reserve and standby is that if you use standby in production to execute application queries you should have another one for reserve purposes.

Suppose you have to do major upgrade and you can’t afford a long downtime. With primary and standby in production there is a high probability that one node can’t serve all queries (after the upgrade you should also have two nodes). To do this fast you can either use logical replication or upgrade master and standby with pg_upgrade. On Avito we use pg_upgrade. From the documentation, it is not obvious how to do it right if your standby receives WAL files from archive. If you use Log-Shipping standby servers (without streaming), the last file in which shutdown checkpoint record is written won’t be archived. To make the standby servers caught up you need to copy the last WAL file from primary to the standby servers and wait till it is applied. After that standby servers can issue restart point at the same location as in the stopped master. Alternatively, you can switch from Log-Shipping to Streaming Replication.

Yet another useful example from my upgrade experience.

In production config vacuum_defer_cleanup_age = 900000. With this setting pg_upgrade cannot freeze pg_catalog in a new cluster (it executes ‘vacuumdb — all — freeze’) during performing upgrade and upgrade fails:

Performing Upgrade------------------Analyzing all rows in the new cluster okFreezing all rows on the new cluster okDeleting files from new pg_clog okCopying old pg_clog to new server okSetting next transaction ID and epoch for new cluster okDeleting files from new pg_multixact/offsets okCopying old pg_multixact/offsets to new server okDeleting files from new pg_multixact/members okCopying old pg_multixact/members to new server okSetting next multixact ID and offset for new cluster okResetting WAL archives okconnection to database failed: FATAL: database "template1" does not existcould not connect to new postmaster started with the command:"/home/test/inst/pg9.6/bin/pg_ctl" -w -l "pg_upgrade_server.log" -D "new/"-o "-p 50432 -b -c synchronous_commit=off -c fsync=off -cfull_page_writes=off -c listen_addresses='' -c unix_socket_permissions=0700-c unix_socket_directories='/home/test/tmp/u'" startFailure, exitingTest script:#!/bin/bashPGOLD=~/inst/pg9.4/binPGNEW=~/inst/pg9.6/bin${PGOLD}/pg_ctl init -s -D old -o "--lc-messages=C -T pg_catalog.english"${PGNEW}/pg_ctl init -s -D new -o "--lc-messages=C -T pg_catalog.english"echo vacuum_defer_cleanup_age=10000 >> new/postgresql.auto.conf# move txid to 3 000 000 000 in old cluster as in production${PGOLD}/pg_ctl start -w -D old -o "--port=54321--unix_socket_directories=/tmp"${PGOLD}/vacuumdb -h /tmp -p 54321 --all --analyze${PGOLD}/vacuumdb -h /tmp -p 54321 --all --freeze${PGOLD}/pg_ctl stop -D old -m smart#${PGOLD}/pg_resetxlog -x 3000000000 olddd if=/dev/zero of=old/pg_clog/0B2D bs=262144 count=1# # move txid in new cluster bigger than vacuum_defer_cleanup_age may fixproblem# ${PGNEW}/pg_ctl start -w -D new -o "--port=54321--unix_socket_directories=/tmp"# echo "select txid_current();" | ${PGNEW}/pgbench -h /tmp -p 54321 -n -P 5-t 100000 -f- postgres# ${PGNEW}/pg_ctl stop -D new -m smart${PGNEW}/pg_upgrade -k -d old/ -D new/ -b ${PGOLD}/ -B ${PGNEW}/# rm -r new old pg_upgrade_* analyze_new_cluster.sh delete_old_cluster.sh

I did not find any prohibition in the documentation on using production config with pg_upgrade, maybe I am wrong and this has already been mentioned in the documentation.

I and my colleagues are hoping to see improvements in replication in future PostgreSQL versions. Both replication approaches in PostgreSQL are a great achievement, it is very simple today to set up replication with high performance. This article is written to share Avito experience and highlight main demands of Russian PostgreSQL community (I suppose that members of PostgreSQL community worldwide face the same problems).



Stories from Avito.ru engineering team

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store