PG Internals, PostgreSQL Vaccum of A to the Z

Published in

Opensource Relational Database with Graph Analytics

6 min readApr 29, 2024

1. Introduction

In the world of PostgreSQL, the ‘Vacuum’ function plays a crucial role in enhancing database performance. Unique to PostgreSQL, thanks to its implementation of Multi-version Concurrency Control (MVCC), the concept of Vacuum deserves special attention for its distinct features.

Just like vacuuming cleans up a space, the Vacuum process in a database clears out unnecessary space, creating room for new data. In essence, Vacuuming in PostgreSQL involves tidying up disk space by removing unwanted data, thus boosting efficiency. To fully grasp the Vacuum process, it’s imperative to understand the principles of MVCC first.

2. The importance of MVCC

Imagine a bustling database accessed by numerous users simultaneously, with each one modifying or querying data. At such times, consistent and accurate data retrieval becomes critical.

MVCC, or Multi-Version Concurrency Control, ensures this consistency by enabling concurrent access to the database without locking data, preserving the integrity of data reads. When a user accesses the database, they get a snapshot of the data at that moment. Changes made by other users remain invisible until the transaction (commit) is completed.

Different database management systems implement MVCC in various ways, but the goal is always the same.

Let’s take a closer look at how Oracle and PostgreSQL approach MVCC.

ORACLE’s MVCC “Rollback Segment”

Oracle employs Rollback Segments to implement MVCC. If a transaction modifies data, Oracle updates the existing data block with new information while preserving the original data in a Rollback Segment

PostgreSQL’s MVCC “MGA”

PostgreSQL, on the other hand, uses a method known as MGA. When data is modified, the original Tuple is kept intact, and a new Tuple is added.

This approach records the creation and modification timestamps for each Tuple for comparison. As a result, old Tuples become ‘Dead Tuples,’ which occupy space without serving any purpose, thus necessitating a Vacuum.

PostgreSQL manages versions by recording creation and deletion transaction IDs in metadata fields within each Tuple, known as xmin and xmax.

Therefore, Vacuuming is an essential process for implementing MVCC in PostgreSQL.

Let’s delve deeper into the Vacuum process.

3. What is Vacuum?

When data is deleted in PostgreSQL, it’s not immediately removed from the disk because another transaction might still need it. However, when data is no longer required, it takes up unnecessary space and can degrade performance. This is where Vacuuming comes into play.

Specifically, Vacuum is a PostgreSQL feature designed to address space inefficiencies and prevent problems like XID Wraparound, which are unique to PostgreSQL.

Let’s examine the four key roles of Vacuuming:

Clearing Dead Tuples and Returning Space to the FSM (Free Space Map)
For one thing, dead tuples are old data that has already been committed, erased, or updated. FSM also looks at the available space on disk like a map to see if there is any free space, and if there is not, it will free it up. Vacuum cleans up dead tuples, updating the Free Space Map (FSM), which tracks pages in the database that have usable free space. This helps optimize space allocation for new data and prevents unnecessary disk I/O by avoiding full-page reads of pages with no usable space
Preventing Transaction ID Wraparound
If transaction IDs overlap, data loss can occur. Regular Vacuuming helps prevent this overlap, which is crucial as it can lead to severe issues, including service outages. Vacuuming changes old XID values to frozenXID to prevent errors
Updating the Visibility Map to Enhance Index Scan Performance
The Visibility Map tracks pages that contain tuples referenced by active transactions. While Vacuuming cannot clean these pages and skips them, it allows for an index-only scan, which improves performance by checking the Visibility Map first, thus avoiding disk scans
Refreshing Statistical Information
Vacuuming updates both the table information stored in pg_class and the real-time statistical information files

Criteria for Vacuum Execution

Vacuuming follows the criteria set by the FSM, checking if there is enough space before proceeding.

Vacuum analyze query statement

You can monitor Vacuum activity by checking the number of inserted, updated, or deleted rows in pg_stat_all_tables, as well as the number of dead tuples.

TupleCheckQueryStatement

4. Vacuum Utilization & Optimization

Vacuum vs. AutoVacuum

There are two types of Vacuuming:

manual Vacuum, where you input queries as needed, and AutoVacuum, which operates automatically. The main difference between a manual Vacuum and AutoVacuum is that the latter does not lock the database during operation.

Vacuum — The Manual Approach

When precision is key, the manual Vacuum allows for a targeted cleanup initiated through specific queries. This approach is hands-on and offers control for immediate, as-needed maintenance. However, one must consider the operational state of the database, as manual vacuuming can introduce locks that might be disruptive during active periods.

AutoVacuum — The Automated Solution

AutoVacuum is the seamless, behind-the-scenes process that proactively maintains database health. Triggered by predefined thresholds of data age or accumulation, it operates without hindering the database’s ongoing activities, distinguishing itself from the more invasive full Vacuum procedure.

Optimizing AutoVacuum involves configuring it to work efficiently based on pre-set conditions such as Dead Tuple accumulation and Tuple age thresholds.

Here are three ways to optimize AutoVacuum settings:

Optimizing autovacuum_vacuum_scale_factor
Setting this factor to 0 means AutoVacuum will trigger based on the number of dead tuples specified in autovacuum_vacuum_threshold, ensuring stable performance
Adjusting the autovacuum_vacuum_cost_limit
This parameter limits the total cost that a single AutoVacuum operation can incur before it is paused. A higher value can allow AutoVacuum to run longer before pausing, potentially speeding up the vacuum process but at the risk of greater impact on database performance

In short, this parameter determines the duration of Autovacuum’s behavior.

This setting is the conductor of the operation’s timing. Here’s a quick score:

autovacuum_vacuum_cost_limit = 200: Each Vacuum run gets a budget of 200 credits
vacuum_cost_delay = 0: Think of it as a breather. Once AutoVacuum spends its 200 credits, it takes a millisecond nap
vacuum_cost_page_hit = 1: Every time it cleans up data from the shared buffer, it’s just 1 credit off the budget
vacuum_cost_page_miss = 10: Digging into the disk area costs 10 credits per clean-up
vacuum_cost_page_dirty = 20: Tidying up Dead Tuples? That’ll be 20 credits, please

Though you could apply a one-size-fits-all approach to all your tables, fine-tuning autovacuum_vacuum_cost_limit for each one separately could yield better results.

3. Adjusting autovacuum_max_workers
Increasing the number of workers enables the system to handle unattended tables promptly, especially advantageous for partitioned tables. The autovacuum_max_workers parameter defines the number of concurrent AutoVacuum processes allowed.

Optimization is a dynamic process that requires a tailored approach. Identifying and tuning the right parameters for your database service is crucial.

Therefore, continuous monitoring and adjusting of these values is the key to achieving optimal performance.

Get 24/7 PostgreSQL help from us! Our experts are ready to assist you anytime. Let’s enhance your database together.

PG Internals, PostgreSQL Vaccum of A to the Z

1. Introduction

2. The importance of MVCC

3. What is Vacuum?

4. Vacuum Utilization & Optimization

Written by AGEDB