Auto-vacuum tuning in PostgreSQL

Vineet Naik
helpshift-engineering
16 min read · Oct 11, 2023

In May 2021, we were caught off guard by the infamous XID wraparound problem on one of our important PostgreSQL (pg) clusters. Many others have encountered this problem and have written about their experience dealing with it. After employing some hacky workarounds to restore service as quickly as possible, we knew we had to really understand the root cause if we wanted to continue using pg. So we started with the following questions:

  1. Why did it happen?
  2. Can it happen again?
  3. If yes, how do we ensure that it doesn't happen again?

TL;DR of #1 was that the vacuum operation hadn’t run for a long time. More on this later.

The answer to #2 was obviously "yes", thanks to Murphy's law:

Anything that can go wrong will go wrong — Murphy’s law

Finding the answer to #3 was the key. Something had to be done to prevent it from happening in the future.

Fast-forward to today, the vacuum operation has consistently run on the server for the past 2 years, and the cluster continues to operate smoothly. More importantly, we now have clear visibility into when the vacuum operation runs, or if it doesn’t run at all.

Writing about our remediation efforts to address the XID wraparound issue has been long overdue. This post will be long and in-depth. But I strongly believe that diving a bit deeper into the intricacies of vacuum is essential for appreciating the need for it, rather than merely complaining about it.

Here is a rough outline of what follows:

  1. Understanding the vacuum operation
  2. Tuning auto-vacuum config to prevent (a) db bloat and, (b) XID wraparound
  3. Monitoring and alerting
  4. Tuning config to ensure vacuum runs effectively

If you're someone who's looking for the answer to question #3 above, I hope this post will be useful. If you're running a PostgreSQL cluster for any serious use case, I'd highly recommend reading further. It's better to be prepared, because you never know when and how badly XID wraparound may hit you!

Understanding vacuum

In pg, “vacuuming” is a critical maintenance task that needs to be performed periodically. If databases are not vacuumed for long enough, it may result in table bloat and/or XID wraparound. The latter implies serious downtime (of hours, days or even weeks). In other words, vacuum is not something that we should try to avoid as it’s essential for healthy and smooth functioning of the db.

To understand vacuum, let’s begin with the very basic “tuple”. In pg, rows are immutable. A tuple is an internal representation of a row. As application developers, what we usually refer to as a “row” is actually a tuple that’s visible at a given time. There could be other tuples associated with the same row which represent older versions of it. These are invisible to new transactions but could still be visible to ongoing ones (See more: MVCC).
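
As a toy illustration (the demo table below is hypothetical), an update creates a new tuple and leaves the previous version behind as a dead tuple:

CREATE TABLE demo (id int PRIMARY KEY, val text);
INSERT INTO demo VALUES (1, 'a');
UPDATE demo SET val = 'b' WHERE id = 1;  -- the old tuple (1, 'a') is now dead

-- once the stats collector catches up, the dead tuple is visible here
SELECT n_live_tup, n_dead_tup FROM pg_stat_user_tables WHERE relname = 'demo';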

As the name suggests, the primary goal of vacuum is cleanup, but it does much more than that. Mainly, it performs the following four tasks:

  1. Remove bloat: Db bloat means additional disk usage beyond what is required to store the data. In pg, when a row is updated, a new tuple is created and the old one becomes "invisible" to new transactions; when a row is deleted, its tuple is simply marked as no longer visible. It doesn't stop there though: pg also has to add new index entries pointing to the new version of the row, while the old entries keep pointing to the dead tuple. This results in an accumulation of index entries pointing to dead tuples. If there are many indexes on a table, it can lead to significantly large bloat, something referred to as "write amplification". Vacuum does the important job of deleting the dead tuples and pruning the index pointers to them, thereby freeing up space which can be reused for storing future data.
  2. To protect against XID wraparound: pg operates with limited (32 bit or 4 byte) transaction IDs or XIDs. The XID gets associated with a row at the time of a write operation. If the cluster is running for a long time, there’s a risk of running out of XID values. When the limit is reached, new XID values will “wraparound” i.e. the XID space is circular with no endpoint. So, by the time the XID wraps around, all old XIDs associated with old rows need to be gone. Otherwise, XID of a new transaction can be less than the XID of an old transaction, causing data corruption. The vacuum process prevents this by updating the XID of old transactions (older than a configured threshold age) with a special value FrozenTransactionId that is handled by the XID comparison algorithm. This process is commonly referred to as “freezing” tuples.
  3. To update the visibility map: Vacuum maintains a visibility map for each table, which records which pages contain only tuples visible to all transactions. This speeds up index-only scans as well as the vacuum process itself.
  4. To analyze tables and generate statistics: pg looks at samples of data and generates statistics that are used by the query planner to serve queries efficiently. This operation doesn't take much time and can even be done in stages with incremental sample sizes. Hence, pg also provides a way to generate only the statistics, leaving out the other expensive operations (see the example below).
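
For example, a statistics-only run versus a full maintenance run can be triggered manually like this (<tablename> is a placeholder):

-- statistics-only run: cheap, just refreshes planner statistics
ANALYZE <tablename>;

-- full maintenance run: removes dead tuples, freezes old XIDs, updates the
-- visibility map and refreshes statistics in one pass
VACUUM (ANALYZE, VERBOSE) <tablename>;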

Detailed information about vacuum can be obtained here — Routine Vacuuming

One question that arises is, why does vacuum try to do so many things? The answer is — because it involves scanning the heap, it tries to club various unrelated operations and performs them in a single scan.

The autovacuum daemon

Pg has a daemon process called the autovacuum launcher that is in charge of starting worker processes to perform the vacuum operation on all dbs.

While vacuum can be run manually on any db or even specific tables of a db at any given time, it’s best to let the autovacuum daemon do the job as it will ensure that vacuum will be performed on all the dbs that are eligible for vacuuming.

There are 2 main things that pg considers when deciding whether a db needs vacuuming:

  1. No. of dead tuples increasing beyond a threshold
  2. Max XID age of tuples in the db exceeding a limit

However, there could be cases where autovacuum fails to run if it’s not tuned properly. Hence it’s important to tune the autovacuum config.

Side effects of vacuum operation

A normal vacuum doesn't acquire any exclusive locks[1] and can run while the cluster is serving reads and writes. VACUUM FULL, on the other hand, is an aggressive version that acquires an exclusive lock and is a blocking operation: it blocks any reads and writes until completion in order to return the freed up space back to the OS.
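
In terms of commands, the two variants look like this (<tablename> is a placeholder):

-- plain vacuum: no exclusive lock, runs alongside normal reads and writes
VACUUM <tablename>;

-- VACUUM FULL: rewrites the table to return space to the OS, but holds an
-- ACCESS EXCLUSIVE lock that blocks reads and writes until it completes
VACUUM FULL <tablename>;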

Having said that, some of the phases[2] of vacuum (both normal and full) result in a lot of disk I/O. Thus ongoing vacuum operation can interfere with the other active processes on the cluster.

In short, vacuuming isn’t free of costs and may lead to temporary performance degradation for the applications. Hence it needs to be run only when justified. Something to keep in mind when tuning the autovacuum config parameters:

  • If it’s triggered too frequently, there will be unnecessary I/O
  • If it’s triggered very rarely, there’s a chance of db bloat and XID wraparound

Another consideration is the size of the table. For a large table (more rows), vacuum will take more time, which translates to poorer application performance for a longer duration.

Tuning vacuum to prevent DB bloat

There are 2 config parameters which decide when autovacuum is triggered as a reaction to db bloat. One of them is a static component and the other is the dynamic component. The formula is,

actual threshold =  autovacuum_vacuum_threshold 
+ autovacuum_vacuum_scale_factor * no. of rows

autovacuum_vacuum_threshold is the minimum number of updated or deleted tuples needed to trigger a VACUUM in any table. The default is 50 tuples.

autovacuum_vacuum_scale_factor is a fraction of the table size (in terms of no. of rows) to decide whether vacuum should be triggered. Default is 0.2

Which means that,

  • For smaller tables (fewer rows), the static threshold component contributes more towards triggering vacuum
  • For larger tables (more rows), the dynamic scale_factor component contributes more towards triggering vacuum (the query below shows how the two combine per table)
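
To see where your tables currently stand against this formula, a query along the following lines can help (a sketch that assumes the global defaults, i.e. no per-table overrides):

SELECT s.relname,
       s.n_dead_tup,
       round(current_setting('autovacuum_vacuum_threshold')::numeric
             + current_setting('autovacuum_vacuum_scale_factor')::numeric
               * c.reltuples::numeric) AS autovacuum_trigger_point
FROM pg_stat_user_tables s
JOIN pg_class c ON c.oid = s.relid
ORDER BY s.n_dead_tup DESC
LIMIT 10;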

Now let's talk about our specific incident. As it turned out, one large table in our db was the reason why vacuum hadn't run for a long time. This particular table is like an append-only log with a lot of insertions and no updates. Deletions happen at the time of archival (at a later date) or when the customer leaves us. The default scale factor of 0.2 was too large for us, which caused the autovacuum trigger point to be pushed farther and farther as the no. of rows increased.

To arrive at the suitable value for the scale factor, we considered the following:

  1. Rate of dead tuples created per day
  2. No. of rows in the large table of interest
  3. Rate at which new rows are inserted in the table

The rate of dead tuples (1) can be estimated by tracking the n_dead_tup value from the pg_stat_user_tables view over time. Eg.

-- For a specific table
select n_dead_tup from pg_stat_user_tables where relname = '<tablename>';

-- Or for all tables
select sum(n_dead_tup) from pg_stat_user_tables;

It'd be expensive to get the exact count of rows (2) for a large table, but we can get an estimate as follows,

SELECT
    c.relname,
    c.reltuples::bigint AS estimate
FROM
    pg_class c
    JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE
    c.relname = '<tablename>'
    AND n.nspname = 'public';

Finally, to find out the rate of insertions (3), you can use application side metrics if any. In our case, the insertions were done by a Kafka consumer and we already had charts for the throughput of the consumer, so we just used those numbers.

The values in our case were as follows:

  1. Rate of creation of dead tuples ≈ 20m per day
  2. No. of rows in the table ≈ 4.3b
  3. Rate of insertions ≈ 10m on avg per day

Now with the default scale factor of 0.2 and static threshold of 50, autovacuum will be triggered when (4.3b * 0.2) + 50 ≈ 860m dead tuples are created. That's a huge number. With this config, and dead tuples accumulating at ~20m per day, autovacuum would trigger only after about 43 days. But at a rate of 10m insertions per day, the no. of rows in the table will have increased by almost half a billion in those 43 days. You see the problem, right?

If we wanted to have autovacuum triggered daily, we’d need to reduce the scale factor to 0.005 (4.3b * 0.005 = 21m). But we decided to start with a conservative frequency of once a week and reduced it to 0.035 as per the following calculation:

4.3b * 0.035 = 150.5m

This is roughly equal to the no. of dead tuples accumulated in 7.5 days. Note that during these 7.5 days, there would be around 75m insertions but this number is not too large.

This worked quite well for us.
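
For reference, such a scale factor can also be applied to just the large table as a per-table storage parameter instead of changing the global setting (a sketch; <tablename> is a placeholder and the values are from our case):

ALTER TABLE <tablename>
    SET (autovacuum_vacuum_scale_factor = 0.035,
         autovacuum_vacuum_threshold = 50);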

Tuning autovacuum to avoid XID wraparound

Db bloat is one thing, but the real reason why you'd want to prioritize tuning autovacuum is to avoid XID wraparound. You may be able to work your way around db bloat by adding more and more disk (throwing money at the problem). But if the age of the oldest XID (max age) exceeds 2 billion, then the db will go into read-only mode i.e. it will stop accepting writes. At this point the only way to recover is by running a standalone VACUUM FREEZE operation in single user mode i.e. with the database process shut down. This can result in a downtime of hours, days or even weeks[3].

As a guard against reaching this point, pg will react with some warning signs.

A bit before XID wraparound, it will start writing warnings to the log file. But in the absence of regular monitoring of logs, these will most certainly be missed. So this is not very useful.

A more effective warning, or rather a safety measure, is the forced vacuum operation[4] that pg will trigger at a much earlier threshold of 200m (10% of 2b). It is special in the sense that:

  1. It gets run even if the autovacuum daemon is disabled and,
  2. It cannot be cancelled permanently: if a vacuum of this kind is in progress, some statements such as ALTER TABLE or index creation will get blocked because of shared locking. In that case, it's perfectly ok to cancel the vacuum process and allow the other transaction to proceed. But as soon as that transaction is complete, a new vacuum process will start. While such a vacuum is running, the cluster can continue to serve reads and writes.

There's nothing to worry about with the forced vacuum operation as long as it completes in reasonable time and doesn't cause performance degradation. For smaller dbs this is not a problem at all. For larger dbs, it's a rather important consideration.
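
If you want to check whether such a forced (anti-wraparound) vacuum is currently running, autovacuum workers show up in pg_stat_activity with a "(to prevent wraparound)" suffix, so a query along these lines (a sketch) should work:

SELECT pid, datname, query, xact_start
FROM pg_stat_activity
WHERE query LIKE 'autovacuum:%to prevent wraparound%';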

For tuning the config to prevent XID wraparound, there are 3 params of interest:

autovacuum_freeze_max_age:

This is the max age (in no. of transactions) a table’s pg_class.relfrozenxid can attain before vacuum is forced in order to prevent XID wraparound.

Default is 200m, which is ~10% of the possible 2³¹ values (the wraparound point). Now 10% may seem too low and one might be tempted to increase it. But that would just delay the inevitable forced vacuum process further (thereby also increasing the time it'd take to complete).

The right value can be found out based on your transaction burn rate i.e. how fast the XIDs will be used up. But we might just be able to get away with the default value itself if other things are configured correctly (as explained a bit later).

vacuum_freeze_min_age:

The cutoff age (in transactions) that vacuum should use to decide whether to freeze row versions while scanning a table.

Default is 50m, which means all XIDs older than 50m will be frozen, and the oldest unfrozen XID in the table can still have an age of up to 50m. This is a very conservative number. Consider a transaction advancement rate of 50m per 24 hours. Unless we have transactions or queries that run for 24 hours, do we really need such old XIDs to be present? Another way to think about it is that even right after vacuum has run, the oldest unfrozen XID can already be 50m old, leaving only 150m transactions until the next forced vacuum. From a practical standpoint, this value can be reduced by a lot (considering the default 200m autovacuum_freeze_max_age).

The right value can be decided based on two things — the rate of transactions and the duration of longest running query/transaction served by the cluster.

A VACUUM FREEZE instead of a plain VACUUM implicitly sets this value to 0, i.e. the XIDs on all rows will be frozen.

vacuum_freeze_table_age:

If the age of the table's pg_class.relfrozenxid reaches this threshold, then vacuum will perform an aggressive scan i.e. it will visit every page that may contain unfrozen tuples, not just the ones that contain dead tuples. This matters because normally vacuum uses an optimization to avoid scanning all pages: the visibility map tells which pages may contain dead tuples and vacuum scans only those.

Default value is 150m. The recommended value is 80% of autovacuum_freeze_max_age.

Again, a VACUUM FREEZE instead of a plain VACUUM implicitly sets this value to 0, i.e. the XIDs on all rows will be frozen.

To calculate the suitable values for the above config params, we’d need to consider:

  1. Current max XID age, at least for the databases having large tables. It'd also be good to know this value as a percentage of autovacuum_freeze_max_age as well as a percentage of 2b (when XID wraparound would happen)
  2. Transactions burn rate i.e. rate at which XIDs are getting “exhausted”

For current max XID age (1), we can run a query as follows,

WITH max_age AS (
    SELECT 2000000000 AS max_old_xid
         , setting AS autovacuum_freeze_max_age
    FROM pg_catalog.pg_settings
    WHERE name = 'autovacuum_freeze_max_age'
)
, per_database_stats AS (
    SELECT datname
         , m.max_old_xid::int
         , m.autovacuum_freeze_max_age::int
         , age(d.datfrozenxid) AS oldest_current_xid
    FROM pg_catalog.pg_database d
    JOIN max_age m ON (true)
    WHERE d.datallowconn
)
SELECT max(oldest_current_xid) AS oldest_current_xid
     , max(ROUND(100 * (oldest_current_xid / max_old_xid::float))) AS percent_towards_wraparound
     , max(ROUND(100 * (oldest_current_xid / autovacuum_freeze_max_age::float))) AS percent_towards_emergency_autovac
FROM per_database_stats;

The above query will return 2 important values:

  1. percent_towards_wraparound: If this is close to 100%, it means XID exhaustion will happen soon and, if older XIDs are not frozen, the database will stop accepting writes to prevent data corruption. We want to avoid this at all costs because then there would be no time to react.
  2. percent_towards_emergency_autovac: If this is close to 100% it means forced vacuum is around the corner. Ideally we’d want to avoid getting to this point, but things would still be in much better control than the first case.

It also helps to find XID max age per database using the following query

SELECT datname, age(datfrozenxid) FROM pg_database;

For transaction burn rate (2), we can use the xact_commit and xact_rollback stats from pg_stat_database view.

SELECT
    xact_commit,
    xact_rollback
FROM
    pg_stat_database
WHERE
    datname = '<dbname>';

We send these metrics to statsite and plot the nonNegativeDerivative over time windows of 1 hour and 1 day.

Now let's look at how we can tune the above 3 params based on these metrics.

autovacuum_freeze_max_age:

Interestingly, if other values are tuned correctly then there’s no need to change this setting and we can let it be the default. That’s because if other params are tuned properly, then this limit will never be reached. And even if it’s reached, autovacuum will run without any performance degradation. So let’s assume this setting will be the default of 200m.

vacuum_freeze_min_age:

To tune this, besides the transaction burn rate, we also need to consider the duration of the longest running query or transaction that we expect for our specific use case. E.g. if that is 4 hours, then it's ok to freeze tuples older than 4 hours' worth of transactions. For a typical OLTP workload, 4 hours is more than sufficient, but let's use it as a conservative estimate.

In our case, at the time of tuning the config params, the avg. daily transactions were around 11m which amounts to an hourly rate of 460k.

Let’s say we decide that 4 hour old transactions can be frozen, then vacuum_freeze_min_age can be set to, 4 * 460k = 1.84m

As you can see, it's significantly less than the default of 50m. This will have a major impact on how aggressively the tuples get frozen.

Aside: The hourly rate calculated above also gives us an estimate of time between XID wraparounds (in the absence of vacuum),
2000000000 / 460000 hours = 181 days ≈ 6 months

vacuum_freeze_table_age:

We can set it to 80% of autovacuum_freeze_max_age i.e. 0.8 * 200m = 160m
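
Putting the three params together, the resulting settings look roughly like this (a sketch; the numbers come from our burn rate calculations above and will differ for your workload):

-- autovacuum_freeze_max_age is left at the default of 200m
ALTER SYSTEM SET vacuum_freeze_min_age = 1840000;      -- ~4 hours worth of XIDs at ~460k/hour
ALTER SYSTEM SET vacuum_freeze_table_age = 160000000;  -- 80% of autovacuum_freeze_max_age
SELECT pg_reload_conf();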

Monitoring and alerting

Tuning database config is not a one time activity. You may need to regularly re-tune the config based on the changes in the use cases, traffic patterns etc. But how do we know when re-tuning is in order?

Having monitoring and alerting helps in this case. All of the metrics that we considered above can be plotted on the monitoring platform of your choice. Moreover, you can set threshold based alerts on some metrics e.g. we have alerts when percent_towards_wraparound exceeds 10% and percent_towards_emergency_autovac exceeds 120%. This will give us sufficient time to investigate any issues and take corrective actions.

We also plotted the various phases of vacuum operation by querying the pg_stat_progress_vacuum view and it has turned out to be tremendously helpful in understanding what the vacuum operation is currently doing.
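
For reference, a simple query on this view (a sketch; join with pg_stat_activity if you also want the query text) looks like:

SELECT pid,
       datname,
       relid::regclass AS table_name,
       phase,
       heap_blks_scanned,
       heap_blks_total
FROM pg_stat_progress_vacuum;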

Making sure vacuum runs effectively

There's nothing to worry about with the vacuum operation as long as it completes in reasonable time and doesn't cause performance degradation while it's in progress. Both of these are easily achieved if the database is small, but for large databases we need to strike a balance between time taken and performance impact. This can be done by tuning the config params that control the resources available to the vacuum process.

With more resources, there will be more I/O (higher impact) but vacuum will take less time (shorter impact duration).

With fewer resources, there will be less I/O (lower impact) but vacuum will run for a longer duration and, in the worst case, it may not keep up with the transaction burn rate or the dead tuple creation rate. Then we have a problem.

The relevant config parameters that we can tune here are:

maintenance_work_mem: The default value of 64MB is too low. On typical production environments, we can easily increase it to at least 1GB. This allocates more memory to the vacuum process so that it completes faster.

autovacuum_max_workers: The maximum number of autovacuum worker processes (other than the autovacuum launcher) that may be running at any one time. The default value is 3. We can keep the default or set it equal to the no. of large databases, whichever is smaller.

autovacuum_vacuum_cost_delay: Specifies the cost delay value that will be used in automatic VACUUM operations. We can keep it unchanged.

autovacuum_vacuum_cost_limit: Specifies the cost limit value that will be used in automatic VACUUM operations. We can keep it unchanged.
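
As a concrete sketch, the single change suggested above can be applied like this (1GB is our suggested starting point; adjust to your hardware):

ALTER SYSTEM SET maintenance_work_mem = '1GB';
-- autovacuum_max_workers, autovacuum_vacuum_cost_delay and
-- autovacuum_vacuum_cost_limit are left at their defaults
SELECT pg_reload_conf();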

In conclusion,

  • vacuum is an important operation for proper functioning of a postgres server. It’s not something to be avoided or delayed, rather it needs to be tuned as per the specific traffic pattern and workload to ensure effective vacuuming at regular intervals.
  • Further, tuning the (auto) vacuum config is not a one time exercise as it needs to be revisited as and when the nature of traffic and workload changes.
  • It’s worth investing in good monitoring and alerting sooner than later, as it greatly helps in tuning the config. If there’s one takeaway from this article, it should be this.

If you made it this far, thanks for reading, and I hope this post is useful to you (hopefully proactively rather than reactively :-))

Thanks to Narendra Pal for reviewing the drafts.

Footnotes

[1]: This is true even for the forceful vacuum started to avoid XID wraparound at XID max age of 200m.

[2]: The vacuum process happens in multiple phases which are described in detail here.

[3]: During the aforementioned incident, the vacuum full operation in single user mode ran for ~2 weeks. As we didn’t have any monitoring in place yet, we were starting to lose hope. That it eventually did complete felt like a miracle!

[4]: One valid question is why the forced vacuum operation was not able to avoid XID wraparound in our case. Our guess is that forced vacuum did run but it was not able to catch up due to insufficient resources (due to poor configuration, we didn’t know any better at that time!)
