Keep Your Postgres Stable — Vacuum

Mert Yavasca
Published in TOM Tech · 5 min read · Apr 30, 2024

In database management, ensuring optimal performance is critical. If you work with PostgreSQL, you have probably heard of concepts such as vacuum and dead tuples. In this article, we dive into PostgreSQL's vacuum, exploring its purpose, functionality, best practices, and its impact on database performance.

Why Do We Need Vacuum?

To understand vacuum, we first need to be familiar with MVCC (Multi-Version Concurrency Control). MVCC is the concurrency-control implementation used in PostgreSQL and similar database systems. In this approach, each transaction views a consistent snapshot of the database, allowing transactions to proceed concurrently without blocking each other. Every transaction creates its own snapshot and operates on it. This prevents conflicts between read and write operations occurring simultaneously, ensuring database consistency and improving performance.

Because of this approach to transaction management, PostgreSQL performs an update by keeping the old version of the row on disk, marking it as deleted, and then inserting the new version of the row. This leaves behind old row versions that need to be cleaned up, which we call "dead tuples". So an update works as delete + insert, and a delete does not really remove the row from disk either: the row is only flagged as deleted for a later vacuum operation, and the dead tuple still takes up space.
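You can observe this directly. The sketch below uses a throwaway table t (the name is just for illustration); note that pg_stat_user_tables is updated asynchronously, so the dead-tuple counter may take a moment to catch up.

-- A tiny demo: every UPDATE leaves the old row version behind as a dead tuple.
CREATE TABLE t (id int PRIMARY KEY, val text);
INSERT INTO t VALUES (1, 'old');
UPDATE t SET val = 'new' WHERE id = 1;

-- ctid (the row's physical location) changes, because the update is really an insert:
SELECT ctid, xmin, id, val FROM t;

-- The old version is invisible but still occupies space until vacuum reclaims it:
SELECT n_dead_tup FROM pg_stat_user_tables WHERE relname = 't';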

What is Vacuum?

At its core, vacuum in PostgreSQL is designed to reclaim, for reuse, the storage occupied by dead tuples (rows that are no longer visible to any transaction) and to update visibility information for active transactions. This process is crucial for maintaining the health and efficiency of a PostgreSQL database.

Without proper vacuuming, the database can suffer from bloat, where dead tuples accumulate and occupy disk space unnecessarily. Moreover, the lack of updated visibility information can lead to transaction ID wraparound issues, potentially causing data loss and database downtime.
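One common way to keep an eye on wraparound risk is to check the age of each database's oldest unfrozen transaction ID, for example:

-- Databases approaching autovacuum_freeze_max_age (200 million by default)
-- will be force-vacuumed to prevent wraparound:
SELECT datname, age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY xid_age DESC;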

Create a table and load it with dummy data using the script below.

CREATE TABLE users (
    id serial PRIMARY KEY,
    name text NOT NULL,
    created_on timestamptz
);

INSERT INTO users (name, created_on)
SELECT
    'Name Surname ' || i,
    time_hour
FROM
    generate_series(1, 500000) AS i,
    generate_series(
        TIMESTAMPTZ '2024-01-01',
        TIMESTAMPTZ '2024-01-07',
        INTERVAL '2 days'
    ) AS time_hour;

[Screenshot: table size before the delete/update operations]

[Screenshot: table size after an update operation]
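The screenshots showed output along the following lines; you can reproduce the effect with queries like these (the exact UPDATE statement and the sizes you see will differ):

-- Size before the update:
SELECT pg_size_pretty(pg_total_relation_size('users'));

-- Rewrite every row; each one leaves its old version behind as a dead tuple:
UPDATE users SET name = name || ' *';

-- The table roughly doubles in size, since the old versions still occupy pages:
SELECT pg_size_pretty(pg_total_relation_size('users'));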

When I update the table, Postgres flags the old rows as deleted in storage and inserts new ones. Nothing is actually updated or deleted in place, so the table's size increases as shown. Let's run a vacuum operation so the disk space can be reused.
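A plain VACUUM is enough here; the VERBOSE option makes it report what it removed:

-- Mark the space held by dead tuples as reusable (this does not shrink the file):
VACUUM (VERBOSE) users;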

After the vacuum, the total size of the relation does not change. Even though the table continues to take up the same space at the operating-system level, new data goes into the pages that vacuum marked as reusable.
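If you want to verify this, the pgstattuple extension (a contrib module) can show how much space inside the table is free for reuse:

CREATE EXTENSION IF NOT EXISTS pgstattuple;

-- free_space is the room inside existing pages that new rows can reuse:
SELECT pg_size_pretty(free_space) AS reusable_space, dead_tuple_count
FROM pgstattuple('users');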

If you want to give this space back to the operating system, you have to run a VACUUM FULL operation or use an extension like pg_repack. But VACUUM FULL works like a CREATE TABLE AS statement and holds an ACCESS EXCLUSIVE lock on the table, so you cannot access the table while it runs (even SELECT statements are blocked).
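For completeness, this is what that looks like; only run it when you can afford to block all access to the table:

-- Rewrites the whole table into new files and returns the freed space to the OS,
-- but blocks all access (even reads) until it finishes:
VACUUM FULL users;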

Autovacuum

The autovacuum daemon, a background process in PostgreSQL, automatically initiates vacuum operations based on predefined thresholds and configurations.

One of the key decisions in vacuuming PostgreSQL databases is how to adjust autovacuum operations. Autovacuum offers convenience by automating the process, but it still needs to be tuned for your workload.

Configuring autovacuum parameters is essential for optimizing performance and resource utilization. Parameters such as autovacuum_vacuum_threshold, autovacuum_vacuum_scale_factor, and autovacuum_max_workers dictate when and how autovacuum kicks in, ensuring timely maintenance without overwhelming system resources; a per-table tuning sketch follows the list below.

  • autovacuum_vacuum_threshold: Minimum number of updated or deleted tuples needed to trigger a vacuum on a table.
  • autovacuum_vacuum_scale_factor: Fraction of the table size added to the threshold before autovacuum triggers.
  • autovacuum_max_workers: Maximum number of autovacuum worker processes that can be active simultaneously.
  • autovacuum_naptime: Minimum delay between autovacuum runs on any given database.
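Putting those together, autovacuum vacuums a table once its dead tuples exceed autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor × the table's row count. The settings can also be overridden per table; the values below are purely illustrative:

-- Vacuum the hot users table after 1,000 dead tuples plus 1% of its rows,
-- instead of the global defaults of 50 plus 20%:
ALTER TABLE users SET (
    autovacuum_vacuum_threshold = 1000,
    autovacuum_vacuum_scale_factor = 0.01
);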

Do not forget that plain vacuum and autovacuum cannot give storage back to the operating system; they only mark the space inside the files as reusable. If you want to take the space back, you have to run VACUUM FULL or use an extension like pg_repack. Personally, I think VACUUM FULL should not be used, as it is almost never the right tool: it takes an exclusive lock and recreates the table from scratch, like a CREATE TABLE AS statement. Use pg_repack instead.

Best Practices for Vacuuming

To maximize the effectiveness of vacuuming, it’s essential to follow best practices:

  • Never ever disable autovacuum. It is your ally. If you have any problem with autovacuum, tune it properly. Don’t disable it.
  • Schedule manual vacuum operations during off-peak hours to minimize the impact on database performance. Set up daily, or even more frequent, cron jobs to keep tables and statistics clean.
  • Autovacuum behaves more cautiously than a manual vacuum: it cancels itself when another process requests a conflicting lock. Therefore, it is not ideal to rely on autovacuum for all vacuum operations.
  • Monitor database activity and adjust autovacuum parameters accordingly to prevent bloat and ensure timely maintenance.
  • Analyze and address fragmentation issues by vacuuming tables with high update/delete activity or index bloat. Don't be afraid to adjust parameters for specific tables. You can check your bloat level using extensions such as pg_visibility; a simple monitoring query is sketched after this list.
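As a starting point for that monitoring, PostgreSQL's built-in statistics views already expose dead-tuple counts and vacuum timestamps (the 10,000-row threshold here is just an example):

-- Tables with the most dead tuples, plus when they were last vacuumed:
SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
WHERE n_dead_tup > 10000
ORDER BY n_dead_tup DESC;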

In conclusion, we can think of update statements in PostgreSQL as delete + insert operations, and nothing is really deleted when you run a delete; the rows are just flagged as dead on disk. So deletes work like versioning, and a special process called vacuum acts like a garbage collector that reclaims the space those rows occupy.

The VACUUM command is a vital tool for optimizing PostgreSQL database performance. By reclaiming storage occupied by dead tuples and updating visibility information, vacuum operations prevent bloat, mitigate the risk of transaction ID wraparound, and ensure the health and efficiency of the database.
