PostgreSQL: bloat, pg_repack, and deferred constraints

Nikolai Averin
Miro Engineering
Jul 28, 2020

Table/index bloating is a widely known effect and is not unique to PostgreSQL. There are built-in methods to deal with it like VACUUM FULL or CLUSTER, but they lock tables and therefore cannot always be used.

This article contains a bit of theory on how bloat occurs, how to deal with it, what deferred constraints are, and what problems they cause when used together with the pg_repack extension.

Why bloat occurs

PostgreSQL uses a multiversion model (MVCC). The essence of that model is that each table row can have multiple versions, but transactions can only see one of them and not necessarily the same one. This allows multiple transactions to work simultaneously with little or no impact on each other.

Of course, all these versions need to be stored somewhere. PostgreSQL organizes table storage into pages, and one page is the minimum amount of data that can be read from or written to disk. Let’s look at a small example to see how this works.

Suppose we have a table with some newly added records. The first page of that table’s file will contain the new data. These are “live” versions of the rows, which are available to other transactions after the commit (for simplicity, we will assume that the isolation level is Read Committed).

Next, we will update one of the records, which will mark the previous version as obsolete.

Step by step, by updating and deleting rows, we will get a page that is half-filled with garbage data. This data is invisible to any transaction.
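You can observe this mechanism yourself through the hidden system columns ctid, xmin, and xmax. A minimal illustration (the table and column names here are made up for the example): after an UPDATE, the live row gets a new physical location, while the old version stays on the page until VACUUM removes it.

create table demo (id int primary key, payload text);
insert into demo values (1, 'a');

select ctid, xmin, xmax, * from demo;   -- e.g. ctid = (0,1): page 0, tuple 1

update demo set payload = 'b' where id = 1;

-- the live version now sits in a new slot, e.g. (0,2);
-- the old version at (0,1) remains on the page as garbage until VACUUM cleans it up
select ctid, xmin, xmax, * from demo;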

PostgreSQL has the VACUUM process, which removes obsolete row versions and frees up space for new data. But if it is not configured aggressively enough or is busy working with other tables, the garbage data will persist, and we will have to use additional pages for new data.

In our example, the table, at some point, will consist of four pages, but only half of that data will be live. As a result, if we try to access the table, we will read much more data than necessary.

Now, even if VACUUM does remove obsolete row versions, the situation will not improve dramatically. We will have free space in the pages (or even completely free pages) for new rows, but we will still be reading more data than we need to.

By the way, if the completely empty page (the second in our example) were at the end of the file, VACUUM would be able to remove it. But as it is right now, it is in the middle, so nothing can be done with it.

When the number of such empty or heavily fragmented pages becomes large, this is called bloat, and it begins to affect the performance.

Everything described above is the mechanics of how bloat occurs in tables. For indexes, the process is more or less the same.

Do I have bloat?

There are several ways to determine if you have bloat. The first way is to use PostgreSQL internal statistics, which contain approximate information about the total number of rows in tables, the number of live rows, and so on. You can find many variations of scripts for that on the Internet. As a basis, we used the script from PostgreSQL Experts that can estimate table bloat as well as TOAST bloat and B-tree index bloat. In our experience, its margin of error is about 10 to 20 percent.

We also used the postgres-checkup tool, which, in addition to an approximate estimate of the bloat (the evaluation logic is similar), gives a lot of other valuable information about the state of the database.

Another way is to use the pgstattuple extension, which allows you to look inside the pages and get both estimated and exact bloat values. But to get the exact values, you have to scan the entire table.
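For example (assuming the extension is available in your installation), pgstattuple provides both an exact function that scans the whole relation and a cheaper approximate one, plus pgstatindex for B-tree indexes:

create extension if not exists pgstattuple;

-- exact figures: scans the entire table
select * from pgstattuple('tablename');

-- approximate figures: much cheaper, uses the visibility map where possible
select * from pgstattuple_approx('tablename');

-- B-tree index statistics, including avg_leaf_density
select * from pgstatindex('index_name');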

We consider a little bloat, up to 20 percent, acceptable. It can be thought of as an analog of the fillfactor for tables and indexes. At 50 percent and above, performance problems may start to occur.

Dealing with bloat

There are several built-in ways to deal with bloat in PostgreSQL, but all of them are far from universal solutions.

Configure AUTOVACUUM to prevent bloat from happening, or more precisely, to keep it at an acceptable level. This might seem like obvious advice, but in reality, it is not always easy to achieve. Let’s say, for example, that your active development involves regular schema changes or that you are doing some sort of data migration. As a result, your load profile can change frequently, and it is usually different for different tables. That means that you always have to work a little ahead and adjust AUTOVACUUM to the changing profile of each table. Clearly, this is not easy to do.
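One practical lever here is per-table storage parameters, which override the global autovacuum settings. A sketch (the numbers are purely illustrative, not a recommendation):

-- make autovacuum trigger after roughly 1% of the table changes instead of the default 20%
alter table tablename set (
    autovacuum_vacuum_scale_factor  = 0.01,
    autovacuum_vacuum_threshold     = 1000,
    autovacuum_analyze_scale_factor = 0.01
);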

Another common reason that AUTOVACUUM is unable to process tables in time is the presence of long transactions that prevent it from reclaiming the occupied storage because that storage is available to these transactions. The recommendation here is also obvious — get rid of hanging transactions and minimize the duration of active transactions. But if your workload is a mix of OLAP and OLTP, you can have many frequent updates and short queries, and long-running operations (like building a report) at the same time. In such a situation, it is worth thinking about distributing the load across multiple databases, which will allow for more fine-tuning of each of them.
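To find hanging and long-running transactions, you can query pg_stat_activity, for example:

-- transactions that have been open for more than 10 minutes
select pid, now() - xact_start as duration, state, query
from pg_stat_activity
where xact_start is not null
  and now() - xact_start > interval '10 minutes'
order by duration desc;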

As another example, even if the load profile is uniform, if the database is under heavy load, even the most aggressive AUTOVACUUM may not be enough, and bloat will occur. Scaling (vertical or horizontal) is the only solution.

What about the situation when AUTOVACUUM is configured but bloat continues to grow?

The VACUUM FULL command rebuilds the contents of tables and indexes, keeping only the actual data. For eliminating bloat, it works perfectly, but during the execution, it locks the table with an AccessExclusiveLock, which does not allow any queries against that table, even SELECT. If you can afford to temporarily stop your service or part of it for some time (from tens of minutes to several hours, depending on the size of the database and your hardware), then this is the best option. Unfortunately, we do not have enough time to run VACUUM FULL during a scheduled maintenance period, so this approach does not suit us.

The CLUSTER command, like VACUUM FULL, also rebuilds the contents of tables, but it lets you specify an index by which the data will be physically reordered on disk (the order is not maintained for rows added later). In certain situations, this is a good optimization for queries that read multiple rows via an index. The CLUSTER command has the same disadvantage as VACUUM FULL: it locks the table while it runs.

The REINDEX command is similar to the previous two, but it rebuilds a specific index or all indexes on a table. Its locks are slightly weaker: ShareLock on the table (prevents modification but allows SELECT) and AccessExclusiveLock on the index that is being rebuilt (blocks all queries that use that index). However, version 12 of PostgreSQL introduced the CONCURRENTLY option, which allows you to rebuild an index without blocking inserts, updates, and deletes.

With earlier versions of PostgreSQL, you can achieve similar results to REINDEX CONCURRENTLY with CREATE INDEX CONCURRENTLY. It allows you to create a new index without a strict lock (it uses a ShareUpdateExclusiveLock, which allows both reads and writes on the table), then replace the old index with the new one and delete the old one. This eliminates index bloat without interfering with your application. It is important to note that the index rebuild process causes an additional load on the file system.
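A minimal sketch of both approaches (index, table, and column names are placeholders):

-- PostgreSQL 12+: rebuild in place without blocking writes
REINDEX INDEX CONCURRENTLY some_index;

-- Earlier versions: build a replacement, drop the old index, take over its name
CREATE INDEX CONCURRENTLY some_index_new ON tablename (col);
DROP INDEX CONCURRENTLY some_index;
ALTER INDEX some_index_new RENAME TO some_index;
-- note: an index that backs a constraint cannot simply be dropped; it needs
-- ALTER TABLE ... ADD CONSTRAINT ... USING INDEX instead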

Thus, there are ways to eliminate bloat on a running database for indexes, but not for tables. This is where the various external extensions come into play: pg_repack (formerly pg_reorg), pgcompact, pgcompacttable, and others. In this article, I will not compare them and will only talk about pg_repack, which, after some modification, we use in our product.

How does pg_repack work?

Suppose that we have a typical table with indexes, constraints, and (unfortunately) bloat. The first thing pg_repack does is to create a log table to store all data changes that occur during the operation. A trigger will record all inserts, updates, and deletes. Then it creates a table that is structurally similar to the original table, but without indexes and constraints so as not to slow down the data insertion process.
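In a simplified form (this is only a sketch of the idea with made-up names, not pg_repack’s actual implementation), the change capture could look like this:

-- log table that accumulates changes made during the repack
create table repack_log (
    id        bigserial primary key,
    operation text,          -- 'INSERT' / 'UPDATE' / 'DELETE'
    row_data  tablename      -- the affected row, stored as a composite value
);

-- trigger function that records every change to the original table
create function capture_changes() returns trigger as $$
begin
    if tg_op = 'DELETE' then
        insert into repack_log (operation, row_data) values (tg_op, old);
    else
        insert into repack_log (operation, row_data) values (tg_op, new);
    end if;
    return null;
end;
$$ language plpgsql;

create trigger repack_capture
    after insert or update or delete on tablename
    for each row execute procedure capture_changes();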

Next, pg_repack moves the data from the old table to the new table, automatically filtering out all dead rows, and then it creates indexes for the new table. During the execution of all these operations, changes are accumulated in the log table.

The next step is to transfer these changes to the new table. The transfer is performed in several iterations, and when there are less than 20 records remaining in the log table, pg_repack acquires a strict lock, moves the rest of the data, and replaces the old table with the new one in the PostgreSQL system tables. This is the only and very short period of time when you cannot work with the table. After that, the old table and the log table are dropped to free disk space. And that is it, all done.

Everything looks great in theory, but what about in practice? We tested pg_repack both without and under load and also tested how well it works after being interrupted (Ctrl+C, to put it simply). All test results were great.

We moved on to the production environment, and this is where things went not the way we expected.

First failure at the production server

The first cluster gave us an error about a violation of a unique constraint:

$ ./pg_repack -t tablename -o id
INFO: repacking table "tablename"
ERROR: query failed:
ERROR: duplicate key value violates unique constraint "index_16508"
DETAIL: Key (id, index)=(100500, 42) already exists.

This constraint had the autogenerated name index_16508; it was created by pg_repack. By its attributes, we identified our own constraint that it corresponded to. The problem turned out to be that this is not an ordinary constraint but a deferred constraint, i.e., it is enforced later than the SQL statement that triggers it (by default, at commit time), which leads to unexpected consequences.

Deferred constraints: Why you may need them and how they work

A bit of theory on deferred constraints.

Consider a simple example. We have a table of cars with two columns: the name of the car and its order number.

cars
create table cars 
(
name text constraint pk_cars primary key,
ord integer not null constraint uk_cars unique
);

Let’s say we need to swap the order number of the first two cars. A direct approach would be to set the first value to the second value, and vice versa:

begin;
update cars set ord = 2 where name = 'audi';
update cars set ord = 1 where name = 'bmw';
commit;

But if we try to run this code, we will expectedly get a constraint violation error because the order number must be unique:

[23505] ERROR: duplicate key value violates unique constraint "uk_cars"
Detail: Key (ord)=(2) already exists.

How to do this differently? Option one would be to use a temporary value that is guaranteed not to exist in the table, like -1. In programming, this is known as swapping the values of two variables using a third variable. The only drawback to this method is the need for an additional update.
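With a temporary value, the swap for our cars example might look like this:

begin;
update cars set ord = -1 where name = 'audi';  -- park the first car on a value that never occurs
update cars set ord = 1  where name = 'bmw';
update cars set ord = 2  where name = 'audi';  -- the extra, third update
commit;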

The second option is to redesign the table and change the data type of the order number column from integer to floating-point. Then, if we update the value from 1 to, for example, 2.5, the first record will be placed between the second and the third records. This solution works, but there are two limitations. First, it will not work for you if the value is used somewhere in the user interface. Second, depending on the accuracy of the data type, you will have a limited number of possible updates before you will have to recalculate the values of all records.

Option number three is to make the constraint deferred so that it is enforced only on commit:

create table cars
(
name text constraint pk_cars primary key,
ord integer not null constraint uk_cars unique deferrable initially deferred
);

Since the logic of our initial query ensures that all values are unique at the time of a commit, it will execute successfully.

The above example is, of course, very artificial, but it explains the idea. In our application, we use deferred constraints to implement the logic for resolving conflicts that happen when multiple users are working simultaneously with the same objects/widgets on a whiteboard. The use of deferred constraints allows us to make the application code a bit simpler.

In general, depending on the type of constraint, there are three levels of granularity of constraint enforcement in PostgreSQL: row, statement, and transaction.

[Table: constraint enforcement granularity by constraint type. Source: begriffs]

CHECK and NOT NULL are always enforced at the row level; for other constraints, as can be seen from the table, there are different options. You can read more about that here.
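Enforcement timing can also be switched per transaction with SET CONSTRAINTS, as long as the constraint was declared DEFERRABLE. For example, if uk_cars had been declared deferrable initially immediate, the check could still be postponed for a single transaction:

begin;
-- defer the check until commit just for this transaction
set constraints uk_cars deferred;
update cars set ord = 2 where name = 'audi';
update cars set ord = 1 where name = 'bmw';
commit;  -- the unique check runs here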

To summarize briefly, deferred constraints in a number of situations allow for more readable code and fewer queries. The downside of that, however, is the complication of the debugging process, since the moment an error occurs and the moment you find out about it will be separated in time. Another possible problem is that the planner cannot always build an optimal plan if a deferred constraint is involved in the query.

Modifying pg_repack

Now we know what deferred constraints are, but how are they related to our problem? Let’s recall the error we previously encountered:

$ ./pg_repack -t tablename -o id
INFO: repacking table "tablename"
ERROR: query failed:
ERROR: duplicate key value violates unique constraint "index_16508"
DETAIL: Key (id, index)=(100500, 42) already exists.

It occurs when the data is copied from the log table to the new table. It looks strange because the data from the log table is committed together with the data from the original table. If that data satisfies the constraints of the original table, how can it violate the same constraints in the new table?

As it turns out, the root of the problem lies in the previous step of pg_repack where it creates indexes but not constraints: The old table had a unique constraint, but the new table has a unique index instead.

Creating indexes for the new table

It is important to note here that if the constraint is not deferred, the unique index created instead of that constraint will be equivalent to it because, in PostgreSQL, unique constraints are implemented by creating a unique index. But in the case of a deferred constraint, the behavior is different because an index cannot be deferred and is always enforced at the time a query is executed.

Thus, the essence of the problem lies in the fact that the enforcement is deferred: In the original table, it occurs at the time of a commit, and in the new table, it occurs at the time a query is executed. Therefore, we need to make sure that the enforcements are the same in both cases: either always deferred or always immediate.

So, what ideas did we have?

Create an index similar to the deferred constraint

The first idea is to enforce both checks in immediate mode. This may cause some false positive constraint violations, but if there are only a few of them, it should not affect the user experience, since such conflicts are a regular thing for users. They happen, for example, when two users start modifying the same widget at the same time and the second user’s client is not able to receive in time the information that the widget is already locked by the first user. In such a situation, the server responds to the second user with an error, and the client rolls back the changes and locks the widget. Later, when the first user finishes modifying the widget, the second user will receive the information that the widget is no longer locked and they will be able to repeat their action.

To ensure that the checks are always immediate, we created a new index that is similar to the original deferred constraint:

CREATE UNIQUE INDEX CONCURRENTLY uk_tablename__immediate ON tablename (id, index);
-- run pg_repack
DROP INDEX CONCURRENTLY uk_tablename__immediate;

In the test environment, we received only a few expected errors. Success! Then we launched pg_repack in the production environment again and got five errors on the first cluster within an hour, which is an acceptable result. However, on the second cluster the number of errors was many times higher, and we had to stop pg_repack.

Why did it happen? The probability of an error depends on how many users work simultaneously with the same widgets. Apparently, at that time, the first cluster had a much smaller number of concurrent data changes than the others, so we just got “lucky.”

The idea did not work. At that moment, we saw two other solutions: rewrite our application code to stop using deferred constraints or modify pg_repack to add support for them. We chose the second one.

Replacing the indexes in the new table with the deferred constraints from the original table

The purpose of the modification was obvious: if the original table has a deferred constraint, a deferred constraint should be created in the new table, not an index.
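In terms of the SQL generated for the new table, the change boils down to something like this (a simplified illustration with placeholder names, not the literal code of our patch):

-- before: the deferred unique constraint became a plain unique index
CREATE UNIQUE INDEX index_16508 ON new_table (id, index);

-- after: the new table gets an equivalent deferred constraint instead
ALTER TABLE new_table
    ADD CONSTRAINT index_16508 UNIQUE (id, index)
    DEFERRABLE INITIALLY DEFERRED;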

To test our changes, we wrote a simple test:

  • create a table with a deferred constraint and one record;
  • make a loop to insert new data that will conflict with the existing record;
  • make an UPDATE query to resolve the conflicts;
  • commit changes.
create table test_table
(
  id serial,
  val int,
  constraint uk_test_table__val unique (val) deferrable initially deferred
);

INSERT INTO test_table (val) VALUES (0);

-- transaction control inside a DO block requires PostgreSQL 11+
DO $$
DECLARE
  v_id int;
BEGIN
  FOR i IN 1..10000 LOOP
    INSERT INTO test_table (val) VALUES (0) RETURNING id INTO v_id;
    UPDATE test_table SET val = i WHERE id = v_id;
    COMMIT;
  END LOOP;
END;
$$;

The original version of pg_repack always failed on the first INSERT; the modified one worked without errors. Excellent.

Moving to the production environment, we encountered the same error when the data was copied from the log table to the new table:

$ ./pg_repack -t tablename -o id
INFO: repacking table "tablename"
ERROR: query failed:
ERROR: duplicate key value violates unique constraint "index_16508"
DETAIL: Key (id, index)=(100500, 42) already exists.

Classic. The code that works in the test environment does not work in the production environment.

APPLY_COUNT and the juncture of two chunks

We started analyzing the code, literally line by line, and found an important part: data is copied from the log table to the new table in chunks, and the APPLY_COUNT constant defines the size of a chunk:

for (;;)
{
    num = apply_log(connection, table, APPLY_COUNT);
    if (num > MIN_TUPLES_BEFORE_SWITCH)
        continue;   /* there might be still some tuples, repeat. */
    /* ... */
}

The problem is that changes made by a single original transaction that temporarily violates a constraint can end up split across two chunks, which are committed separately. And whether this happens is a complete matter of chance: if the rows in the first chunk do not violate any constraint, everything is fine, but if they do, an error occurs.

APPLY_COUNT is set to 1,000 log records, which explains why our tests were successful: they did not cover the case of a transaction being split across chunks. Each of our transactions produced two records, one INSERT and one UPDATE, so exactly 500 such pairs fit into a chunk, and we did not encounter any problems. After we added a second UPDATE, our modification stopped working:

DO $$
DECLARE
  v_id int;
BEGIN
  FOR i IN 1..10000 LOOP
    INSERT INTO test_table (val) VALUES (1) RETURNING id INTO v_id;
    UPDATE test_table SET val = i WHERE id = v_id;
    UPDATE test_table SET val = i WHERE id = v_id; -- one more update
    COMMIT;
  END LOOP;
END;
$$;

So, our next task was to make sure that the data from the original table that was changed within the scope of one transaction was copied to the new table also within one transaction.

Getting rid of chunking

Again, we had two options. Option number one: get rid of chunking completely and copy all data as one transaction. One advantage of this option is its simplicity: the required code changes are minimal (by the way, this is exactly how the previous version of pg_repack, pg_reorg, was written). But there is a problem: we create a long transaction, and this, as was said before, increases the possibility of bloat.

Option number two is more complicated but probably more correct: create a column in the log table that will identify the transaction that added the data to the table. Then, when copying data, we can group it by that column and ensure that related changes are transferred together. A chunk will consist of several transactions (or one big one), and its size will vary depending on how many changes these transactions contain. It is important to note that since transaction data in the log table is accumulated in a random order, it will no longer be possible to read it sequentially. A sequential scan for every query that filters by txid is too expensive; we need an index, but even that will slow down this method due to the overhead of updating it. So, as always, you have to sacrifice something.
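A rough sketch of what the second option implies (a hypothetical schema built on the simplified log table from earlier, not an actual patch): the log table gets a transaction identifier, and the apply step processes whole transactions at a time.

-- the trigger would also record which transaction made the change
alter table repack_log add column xid bigint default txid_current();

-- the apply step then takes the next batch of *whole* transactions,
-- e.g. everything belonging to the 100 oldest transaction ids;
-- this needs an index on xid, which is the extra overhead mentioned above
select *
from repack_log
where xid in (
    select distinct xid
    from repack_log
    order by xid
    limit 100
)
order by xid, id;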

We decided to start with the first option, the simpler one. First, we needed to figure out whether a long transaction would be a real problem. Since the main data transfer from the original table to the new table also occurs in one long transaction, the question can be formulated as follows: How much will we extend this transaction? The duration of the first transaction depends mainly on the size of the table. The duration of the new transaction depends on how many changes accumulate in the log table during the data transfer, i.e., on the database load. We did a pg_repack run during a period of minimum load, and the number of accumulated changes was vanishingly small compared to the size of the original table. We decided that we could neglect the duration of the new transaction (for comparison: about an hour for the main transfer versus two or three minutes for applying the accumulated changes).

The experiments were successful. The production run also went smoothly. To demonstrate, here is a chart of the size of one database after the run:

Since we were completely satisfied with this solution, we did not try to implement the second one but are considering discussing it with the extension developers. Our current revision, unfortunately, is not yet ready for publication as we have only solved the problem with deferred unique constraints, and a fully fledged patch requires adding support for other types as well. We hope we will be able to do that in the future.

You might be asking yourself: Why did they even bother modifying pg_repack instead of using any other similar tool? At some point, we were thinking about that, but the positive experience we had using it earlier (on tables without deferred constraints) motivated us to try to understand the problem and fix it. Besides, using other solutions would also require time for testing, so we decided that first, we would try to fix the problem with pg_repack, and if we were not able to do that in a reasonable time, then we would consider other tools.

Conclusions

What we can recommend based on our experience:

  1. Monitor your bloat. By using your monitoring data, you will be able to understand how well AUTOVACUUM is configured (see the query sketch after this list).
  2. Configure AUTOVACUUM to keep bloat at an acceptable level.
  3. If bloat still grows and you cannot beat it using built-in methods, do not be afraid to use external extensions. Just remember to properly test everything.
  4. Do not be afraid to modify external solutions to your needs. Sometimes this can be more effective and even easier than changing your own code.
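For the monitoring item, a simple starting point (the limit is illustrative) is the dead tuple counters kept by the statistics collector:

select relname,
       n_live_tup,
       n_dead_tup,
       round(100.0 * n_dead_tup / nullif(n_live_tup + n_dead_tup, 0), 1) as dead_pct,
       last_autovacuum
from pg_stat_user_tables
order by n_dead_tup desc
limit 20;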

P.S. Big thanks to the folks at postgres.ai for their help in detecting the bloat problem and finding possible solutions.

Join our team!

Would you like to be an Engineer, Team Lead or Engineering Manager at Miro? Check out opportunities to join the Engineering team.
