
Checkpoint Tuning For RTO Improvements


By Dwarka Rao and Huseyin Demir, Database Engineers, Adyen.


In the world of database management, disasters can strike at any time. Whether it’s a hardware failure, a software bug or a natural disaster, there’s always a risk that your database could become unavailable. That’s where the Recovery Time Objective (RTO) comes in.

What is RTO, and why do we want it lower?

RTO is the maximum acceptable amount of time that a database can be down after an outage before the disruption significantly impacts your business operations. In simpler terms, it’s the timeframe you set for getting your database back up and running after a failure.

Why is a lower RTO crucial?

  • Minimizing Business Disruption
  • Maintaining Customer Trust
  • Meeting Compliance Requirements
  • Reducing Data Loss

RTO and checkpoint configurations in PostgreSQL are closely related because checkpoints dictate the state from which recovery begins after a crash. The frequency and size of checkpoints, controlled by parameters like checkpoint_timeout and max_wal_size, directly affect how much WAL (Write-Ahead Log) needs to be replayed during recovery. A shorter RTO requires frequent checkpoints to limit the WAL replay time, but this can increase I/O overhead during normal operation. Conversely, infrequent checkpoints reduce write overhead but require more WAL replay during recovery, potentially exceeding the RTO goal. Balancing these settings is essential to ensure the database can meet its recovery objectives without compromising performance.
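
As a quick starting point, you can inspect the current values of these two parameters from SQL. This is a minimal sketch; the parameter names are standard PostgreSQL settings, but the values in your environment will differ.

-- Inspect the checkpoint-related settings discussed above
SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('checkpoint_timeout', 'max_wal_size');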

Understanding Checkpoints in PostgreSQL

Before we delve into checkpoint tuning, it’s crucial to grasp the technical intricacies of checkpoints in PostgreSQL. A checkpoint is a mechanism that ensures data consistency and facilitates database recovery. It’s a point in time when PostgreSQL performs the following critical operations:

  1. Flushing Dirty Pages: PostgreSQL utilizes a buffer cache in memory to hold frequently accessed data pages. During a checkpoint, it writes all modified data pages (known as “dirty pages”) within this buffer cache to the persistent storage (typically disk). This ensures that any changes made to the data since the last checkpoint are safely stored.
  2. Writing the Write-Ahead Log (WAL) Record: PostgreSQL employs a Write-Ahead Logging (WAL) protocol for durability. It first records every data modification in the WAL, a sequential log file, before applying it to the actual data pages. During a checkpoint, it writes a special WAL record to mark the completion of the checkpoint process. This record acts as a pointer for database recovery, indicating that all changes, up to this point, have been persisted to disk.
  3. Updating Control Files: PostgreSQL maintains control files that store essential information about the database cluster, including the location of the latest checkpoint record. During a checkpoint, these control files are updated to reflect the completion of the checkpoint.
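
To see where the latest checkpoint left off, PostgreSQL exposes the control file contents through the pg_control_checkpoint() function (available since PostgreSQL 9.6). A minimal sketch, assuming a reasonably recent version:

-- Latest checkpoint as recorded in the control file
SELECT checkpoint_lsn,    -- WAL location of the latest checkpoint record
       redo_lsn,          -- WAL location where crash recovery would start replaying
       checkpoint_time    -- when that checkpoint was written
FROM pg_control_checkpoint();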

Why are checkpoints important?

  • Crash Recovery: In the event of a database crash, PostgreSQL uses the WAL and the latest checkpoint record to recover the database to a consistent state. The WAL contains all the changes made since the last checkpoint, allowing PostgreSQL to replay them and return the database to its pre-crash state.
  • Minimizing Recovery Time: Checkpoints reduce the amount of WAL that needs to be processed during recovery by periodically flushing dirty pages to disk. This significantly speeds up the recovery process and contributes to a lower RTO.
  • WAL Management: Checkpoints help control the growth of WAL files. After a checkpoint, PostgreSQL can recycle older WAL segments that are no longer needed for recovery.

Types of Checkpoints in PostgreSQL

PostgreSQL performs two main types of checkpoints:

  • Timed Checkpoints: These occur at regular intervals controlled by the checkpoint_timeout parameter. They ensure that database changes are consistently flushed to disk, preventing excessive WAL accumulation.
  • Requested Checkpoints: PostgreSQL initiates these checkpoints in response to specific events, such as reaching the max_wal_size threshold of WAL activity, an explicit CHECKPOINT command, the start of a base backup, or a shutdown request. They ensure data consistency before these operations proceed.

In summary, to achieve the lowest possible RTO, you need to optimize both types of checkpoints.
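
One way to see which kind dominates on your system is the cumulative statistics views. On PostgreSQL 16 and earlier these counters live in pg_stat_bgwriter; from version 17 they moved to pg_stat_checkpointer. A sketch for the pre-17 view:

-- Timed vs. requested checkpoints since the statistics were last reset
SELECT checkpoints_timed,   -- triggered by checkpoint_timeout
       checkpoints_req,     -- triggered by max_wal_size, CHECKPOINT, shutdown, etc.
       stats_reset
FROM pg_stat_bgwriter;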

Our Calculations and Formula to Improve RTO

To adjust PostgreSQL’s max_wal_size and checkpoint_timeout for optimal performance while meeting a desired Recovery Time Objective (RTO), follow these steps:

1. Understand the Metrics:

  • Average WAL Generation (avg_wal_generation): Measure the average amount of WAL generated over 2, 7, 30, or 90 days to understand normal activity.
  • Maximum WAL Generation (max_wal_generation): Measure the peak WAL generated over the same periods to plan for worst-case scenarios.

2. Reference Current Settings:

  • Check the current max_wal_size (in MB) to see how it compares with actual WAL usage patterns.

3. Align with the Desired RTO:

  • Your business defines the desired RTO (e.g., 10 minutes). This determines how much WAL can be allowed to accumulate before a checkpoint must complete.

4. Calculate New Values:

  • New max_wal_size: use the formula
    tweaked_max_wal_size = (desired_rto × avg_wal_generation) × safety_buffer
    The safety buffer is a constant (e.g., 1.2 or 1.5) that leaves room for unexpected spikes in WAL generation. For example, if your RTO is 10 minutes and avg_wal_generation is 20 MB/min, then tweaked_max_wal_size = (10 × 20) × 1.5 = 300 MB (a measurement and calculation sketch follows this list).
  • New checkpoint_timeout: align the checkpoint timeout with the desired RTO (e.g., if the RTO is 20 minutes, set checkpoint_timeout to 20 minutes).
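
The sketch below shows one way to measure avg_wal_generation on a running instance and plug it into the formula. It samples the WAL position twice, one minute apart; the sampling interval, the 20 MB/min rate, the 10-minute RTO and the 1.5 safety buffer are illustrative assumptions, not recommendations, and the \gset variable syntax is psql-specific.

-- Run in psql: capture the current WAL position, wait, then measure the difference.
SELECT pg_current_wal_lsn() AS lsn_start \gset
SELECT pg_sleep(60);   -- sample interval: one minute
SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), :'lsn_start')
       / 1024 / 1024 AS wal_mb_per_minute;

-- Plug the measured rate into the formula from step 4, e.g. a hypothetical
-- 20 MB/min average, a 10-minute RTO and a 1.5 safety buffer:
SELECT (10 * 20) * 1.5 AS tweaked_max_wal_size_mb;   -- = 300 MB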

Example Calculation:

Suppose our avg_wal_generation is 40 MB/min, and we aim for an RTO of 30 minutes. To account for a safety buffer, we’ll multiply by 3. Applying these values to the formula, the recommended max_wal_size comes out to approximately 3600 MB. At this point we have tweaked_max_wal_size, whose value can be used as the actual max_wal_size, and a new checkpoint_timeout from the calculations above. Go ahead and change checkpoint_timeout and max_wal_size on the PostgreSQL cluster/instance in question to achieve your RTO goal.

While adjusting checkpoint_timeout and max_wal_size to control the frequency and size of checkpoints, it’s also essential to fine-tune checkpoint_completion_target to smooth out the I/O load and prevent I/O storms, ensuring a more predictable and efficient recovery process. checkpoint_completion_target influences how gradually PostgreSQL spreads the work of a checkpoint over its duration. Setting this value closer to 1.0 spreads the checkpoint work evenly, reducing the chance of sudden I/O spikes that could hurt performance. This becomes especially important in combination with checkpoint_timeout and max_wal_size, as the interplay between these settings determines how frequently and how much data is written during each checkpoint, impacting both database performance and recovery time.
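
Assuming the numbers from the worked example above, applying the new values could look like the following. The figures are purely illustrative, and all three parameters can be changed with a configuration reload, without a restart.

-- Apply the tuned values from the worked example (illustrative numbers)
ALTER SYSTEM SET max_wal_size = '3600MB';
ALTER SYSTEM SET checkpoint_timeout = '30min';
ALTER SYSTEM SET checkpoint_completion_target = 0.9;

-- Reload the configuration so the new values take effect
SELECT pg_reload_conf();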

How To Monitor the Impact Of Changing Checkpoint Configurations

It’s crucial to monitor and assess I/O performance systematically to ensure your changes to checkpoint configurations improve PostgreSQL performance without introducing new bottlenecks.

Checkpoints are crucial to PostgreSQL’s ability to maintain data integrity, but their I/O-heavy nature means they can significantly impact system performance. When checkpoints are misconfigured, they can become bottlenecks, disrupting the balance between database operations and the underlying hardware’s capacity to handle I/O.

To appreciate the full impact of checkpoints on I/O, we must consider the full_page_writes configuration. This parameter determines whether PostgreSQL writes entire pages to WAL during the first modification after a checkpoint.

When full_page_writes is enabled, PostgreSQL writes the entire content of a page to WAL the first time that page is modified after a checkpoint. In other words, each page is logged in full once per checkpoint cycle. If this setting is disabled, PostgreSQL cannot guarantee crash recovery without corruption: a partially written (torn) page on disk could not be reconstructed from WAL alone. So, in addition to ordinary WAL records, full_page_writes is what protects the database against page-level corruption after a crash.

The Impact of Full Page Writes

  • More WAL Files Generated
  • Higher I/O on PostgreSQL instance
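
On PostgreSQL 14 and later, the pg_stat_wal view lets you quantify this overhead by comparing the number of full page images with the number of ordinary WAL records. A minimal sketch:

-- Ratio of full page images to WAL records (PostgreSQL 14+)
SELECT wal_records,
       wal_fpi,     -- full page images written since the statistics were reset
       wal_bytes,
       round(100.0 * wal_fpi / NULLIF(wal_records, 0), 2) AS fpi_to_records_pct
FROM pg_stat_wal;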

With full_page_writes covered, another area to monitor and analyse is replication lag. Replication lag is a critical metric in PostgreSQL, particularly in high-availability environments, as it indicates how far behind replicas are from the primary server. Checkpoint configurations directly influence this lag through their impact on Write-Ahead Log (WAL) generation: more frequent checkpoints mean more full page images, and therefore more WAL that has to be shipped to and replayed on the replicas. If the network or the replica server can’t keep up with this influx of WAL, replication lag grows. Excessive lag can affect both the database and the application layer, including uptime and connection handling.
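
On the primary, pg_stat_replication (PostgreSQL 10 and later) reports per-replica lag both as time intervals and as WAL positions, which makes it straightforward to watch how a checkpoint configuration change affects the standbys. A sketch of the kind of query we mean:

-- Replication lag per standby, as seen from the primary (PostgreSQL 10+)
SELECT application_name,
       state,
       write_lag,
       flush_lag,
       replay_lag,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;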

To sum up, the key takeaways about the impact of changing checkpoint configurations are:

  • Triggering more frequent checkpoints can increase the amount of WAL generated when full-page writes are enabled, because after each checkpoint the first modification of every page writes a full page image to WAL.
  • Triggering more frequent checkpoints puts additional pressure on the filesystem.
  • Configuring less frequent checkpoints leads to longer wait times during crash recovery.
  • Configuring less frequent checkpoints also increases WAL disk usage whenever WAL is generated faster than checkpoints can recycle it.

Results Before and After Checkpoint Tuning

To measure the impact, we collected two checkpoint log entries: one observed before changing the checkpoint configuration and one observed after.
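
To capture log lines like the ones below, checkpoint logging must be enabled. log_checkpoints defaults to on since PostgreSQL 15; on older versions you may need to switch it on explicitly, for example:

-- Make each checkpoint write a summary line to the server log
ALTER SYSTEM SET log_checkpoints = on;
SELECT pg_reload_conf();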

Before checkpoint tuning:

checkpoint complete: wrote 77931 buffers (0.6%); 0 WAL file(s) added, 0 removed, 4577 recycled; write=3563.153 s, sync=2.680 s, total=3650.802 s; sync files=10553, longest=0.181 s, average=0.001 s; distance=59582267 kB, estimate=94237011 kB

After checkpoint tuning:

checkpoint complete: wrote 664277 buffers (5.1%); 0 WAL file(s) added, 0 removed, 2095 recycled; write=809.597 s, sync=1.808 s, total=818.478 s; sync files=14272, longest=0.021 s, average=0.001 s; distance=34024036 kB, estimate=46944576 kB

Here is an explanation of the fields in these PostgreSQL log lines, using the “before” line as the example:

1. checkpoint complete

  • Indicates that a checkpoint operation has finished successfully.

2. wrote 77931 buffers (0.6%)

  • Buffers Written: The number of shared buffers (each typically 8 KB by default) written to disk during the checkpoint.
  • 0.6%: This is the percentage of the total buffer pool that was written. A small percentage implies that only a fraction of the buffer cache needed to be flushed to disk.

3. 0 WAL file(s) added, 0 removed, 4577 recycled

  • WAL File(s) added: No new WAL (Write-Ahead Log) files were created during this checkpoint.
  • WAL File(s) removed: No WAL files were deleted during this process.
  • WAL File(s) recycled: 4577 WAL files were reused, meaning PostgreSQL did not need to create new files but instead reused existing ones, which helps optimize disk usage.

4. write=3563.153 s, sync=2.680 s, total=3650.802 s

  • write=3563.153 s: The time taken (in seconds) to write all modified buffers to disk.
  • sync=2.680 s: The time spent synchronizing the written data to disk (i.e., ensuring that data is safely stored).
  • total=3650.802 s: The total time (in seconds) the checkpoint process took from start to finish.

5. sync files=10553, longest=0.181 s, average=0.001 s

  • sync files=10553: The number of individual files synced to disk during the checkpoint.
  • longest=0.181 s: The longest time it took to sync any single file.
  • average=0.001 s: The average time taken per file during synchronization.

6. distance=59582267 kB, estimate=94237011 kB

  • distance=59582267 kB: The amount of WAL (in kilobytes) written since the previous checkpoint, representing the “distance” the system would have had to replay if it had crashed just before this checkpoint.
  • estimate=94237011 kB: A moving-average estimate of the checkpoint distance, which PostgreSQL uses to decide how many WAL segments to keep and recycle for upcoming checkpoints.

To summarize,

  • This log message indicates that the checkpoint operation completed after writing 77,931 buffers, took about 3,650 seconds, and recycled 4,577 existing WAL files without creating or removing any.
  • The sync process was efficient: 10,553 files were synced, with an average of 0.001 seconds per file.

The distance covered (almost 57 GB of WAL data) and the estimate for the next checkpoint (about 90 GB) show how much WAL data is managed between checkpoints.

Conclusion

  • The checkpoint process was completed successfully, although it took considerable time (approximately 1 hour). This could indicate heavy disk I/O activity.
  • Recycled WAL files suggest good disk space management.
  • The small percentage of buffers written (0.6%) might indicate that incremental changes were not extensive.

Understanding these metrics can help DBAs fine-tune checkpoint-related settings like checkpoint_timeout and checkpoint_completion_target to optimize performance and recovery time.
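
The same signal is also available in aggregate form. On PostgreSQL 16 and earlier, for example, pg_stat_bgwriter can give an average write and sync time per checkpoint (on 17 and later the checkpointer statistics moved to pg_stat_checkpointer, with a different column layout):

-- Average checkpoint write and sync time in seconds (PostgreSQL 16 and earlier)
SELECT checkpoints_timed + checkpoints_req AS total_checkpoints,
       (checkpoint_write_time / 1000.0)
         / NULLIF(checkpoints_timed + checkpoints_req, 0) AS avg_write_seconds,
       (checkpoint_sync_time / 1000.0)
         / NULLIF(checkpoints_timed + checkpoints_req, 0) AS avg_sync_seconds
FROM pg_stat_bgwriter;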

Checkpoints after tuning are “better” as they show more efficient resource use, faster completion time, and improved synchronization despite handling more buffers. Earlier checkpoints, while processing fewer buffers, took much longer, suggesting higher disk I/O load or less optimized performance.

The duration of checkpoints has drastically decreased after tuning. This enhancement ensures that PostgreSQL can meet tighter RTO requirements, making the system more resilient to unexpected failures.

Before tuning: Total checkpoint duration = 3,650.802 seconds (~1 hour).

After tuning: Total checkpoint duration = 818.478 seconds (~14 minutes).

If your primary goal is to minimize recovery time after a crash or failover, then a shorter checkpoint duration is beneficial. It reduces the volume of WAL that needs replaying, ensuring the system can resume operations more quickly. This is especially critical in systems with stringent high-availability requirements or where downtime directly impacts business operations.
