30 years of PostgreSQL buffer manager locking design evolution
Recently, for no good reason at all, I did some archaeological research into how the Postgres buffer manager's locking design has evolved over the past 30 years. I suppose no one but the most hopeless Postgres nerds will find this topic interesting. So, here we go!
Background: Postgres buffer manager
First of all, what does the Postgres buffer manager do? Briefly, Postgres organizes on-disk data files into fixed-size 8KB pages. It also maintains a fixed-size array of page buffers in memory to cache recent reads and writes of these disk pages, improving performance through caching and lazy disk flushes.
Obviously, there are often fewer page buffers available than on-disk file pages, so you also need a manager to swap pages in and out when the buffer pool is full. To achieve this, the buffer manager introduces two more data structures:
- A buffer lookup hash table, to translate from a BufferTag (a logical page identifier, i.e. which table / file / block number this page belongs to) into a buffer id, which is simply an index into the buffer array.
- Each buffer array slot also has a header type called “BufferDescriptor”, which contains the metadata the buffer manager needs to do its job (both structures are sketched below).
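To make the two structures concrete, here is a simplified C sketch of what the lookup key and the per-buffer header roughly look like. The field names are illustrative, not a verbatim copy of the Postgres source:

```c
#include <stdint.h>

/* Simplified sketch -- field names are illustrative, not the real Postgres definitions. */
typedef struct BufferTag
{
    uint32_t    relation;       /* which relation (table / index) the page belongs to */
    uint32_t    fork;           /* which fork of that relation (main data, FSM, ...) */
    uint32_t    blockNum;       /* block number within the file */
} BufferTag;

typedef struct BufferDescriptor
{
    BufferTag   tag;            /* identity of the disk page currently held in this slot */
    int         buf_id;         /* index of this slot in the buffer array */
    uint32_t    refcount;       /* "pin" count: how many processes are using the page */
    uint32_t    usage_count;    /* recency counter driving the clock-sweep algorithm */
    /* ... plus the locks and flags discussed throughout this post ... */
} BufferDescriptor;
```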
When a process needs to access a disk page, it first looks up the buffer id in the buffer table. If found, it happily reads the page from the buffer pool directly. If missing from the hash table, the page needs to be read from disk. If the buffer pool is full, the manager must also select a victim page to evict to free up a slot.
Postgres uses a “clock sweep” algorithm to efficiently decide which page to evict first:
The aforementioned buffer descriptors array is logically seen as a “clock face”. A nextVictimBuffer pointer points to an element and moves around like a “clock hand”. Each buffer descriptor also contains a “usage count” field. On a page visit, the count is increased by 1. When a page needs to be selected as a victim, nextVictimBuffer sweeps through the clock. On each page, if its usage count > 0, the count is decremented by 1, and nextVictimBuffer moves on to the next. If the count is 0, the page is selected as the victim. Simple.
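Here is a minimal, single-process C sketch of the idea (illustrative names and pool size; it ignores pins and all locking, which is what the rest of this post is about; real Postgres also caps the usage count at a small maximum):

```c
#include <stdint.h>

#define NBUFFERS 16384              /* illustrative pool size */

static int      nextVictimBuffer;           /* the "clock hand" */
static uint8_t  usage_count[NBUFFERS];      /* one recency counter per buffer */

/* Bump the usage count when a page is visited (capped, as in real Postgres). */
static void
visit_page(int buf_id)
{
    if (usage_count[buf_id] < 5)
        usage_count[buf_id]++;
}

/* Sweep the clock hand until a buffer with usage count 0 is found. */
static int
clock_sweep_victim(void)
{
    for (;;)
    {
        int buf_id = nextVictimBuffer;

        nextVictimBuffer = (nextVictimBuffer + 1) % NBUFFERS;

        if (usage_count[buf_id] > 0)
            usage_count[buf_id]--;      /* give the page another chance */
        else
            return buf_id;              /* found our victim */
    }
}
```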
The real complexity, I think, comes from concurrency control: how do many processes access these shared-memory data structures at the same time without stepping on each other’s toes? How do they dance in harmony, maximizing overall concurrency and performance?
Prehistory
Postgres, ever since its first commit on GitHub, had a very simple answer to concurrency: a monolithic BufMgrLock guarding access to everything.
Anyone who needs to look up, evict, read, or write a buffer must compete for the same BufMgrLock. No doubt, this created a huge bottleneck for everything; it was essentially a “toy” implementation meant to quickly unblock other development efforts.
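In rough modern-day C, the pattern looks like this (a pthread mutex stands in for the original lock, and the helper functions are hypothetical; the point is only that every operation serializes on one global lock):

```c
#include <pthread.h>

/* A pthread mutex stands in for the original global BufMgrLock. */
static pthread_mutex_t BufMgrLock = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical helpers, not the real Postgres API. */
int lookup_buffer_table(BufferTag tag);
int evict_and_load(BufferTag tag);

int
read_buffer(BufferTag tag)
{
    pthread_mutex_lock(&BufMgrLock);        /* everyone waits here */

    int buf_id = lookup_buffer_table(tag);  /* hash table lookup */
    if (buf_id < 0)
        buf_id = evict_and_load(tag);       /* pick a victim, read the page from disk */
    /* pinning, flag updates, etc. also all happen under the same lock */

    pthread_mutex_unlock(&BufMgrLock);
    return buf_id;
}
```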
In addition, each buffer descriptor also has a “pin” count (refcount): it signals that a process is still interested in a page, so other processes cannot select it as an eviction victim.
Postgres 6.5: introducing buffer context lock
The first evident effort to break down the lock monolith, by Vadim Mikheev in 1998, was to extract the concurrency control of buffer pages into a new cntx_lock per buffer. (The same commit also introduced the initial MVCC in Postgres!)
The cntx_lock protects access to the buffer’s content: whoever reads the page must hold a shared lock on the buffer, and whoever writes the page must hold an exclusive lock. The cntx_lock thus prevents concurrent read-write and write-write conflicts.
The cntx_lock and the buffer pin have a subtle difference: the pin protects a disk page from being evicted, while the cntx_lock coordinates concurrent reads / writes of the page’s data. For example, to decide whether a buffer can be evicted, a process does not need to hold the cntx_lock; it only needs to check the pin count (protected by BufMgrLock). But to grab a cntx_lock, one must first pin the buffer. Multiple processes can pin the same buffer at the same time, but only one process can write to a buffer at a time, by holding the cntx_lock exclusively.
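The resulting ordering can be sketched like this (a pthread rwlock stands in for the per-buffer cntx_lock; pin_buffer() / unpin_buffer() are hypothetical helpers that adjust the refcount under BufMgrLock; NBUFFERS is an illustrative pool size):

```c
#include <pthread.h>

#define NBUFFERS 16384

/* One shared/exclusive lock per buffer, standing in for cntx_lock. */
static pthread_rwlock_t cntx_lock[NBUFFERS];   /* assume initialized at startup */

void pin_buffer(int buf_id);     /* hypothetical: refcount++ under BufMgrLock */
void unpin_buffer(int buf_id);   /* hypothetical: refcount-- under BufMgrLock */

void
read_page(int buf_id)
{
    pin_buffer(buf_id);                         /* page can no longer be evicted */
    pthread_rwlock_rdlock(&cntx_lock[buf_id]);  /* shared: many readers at once */

    /* ... read tuples from the 8KB page in the buffer pool ... */

    pthread_rwlock_unlock(&cntx_lock[buf_id]);
    unpin_buffer(buf_id);
}

void
write_page(int buf_id)
{
    pin_buffer(buf_id);
    pthread_rwlock_wrlock(&cntx_lock[buf_id]);  /* exclusive: a single writer at a time */

    /* ... modify the page, mark the buffer dirty ... */

    pthread_rwlock_unlock(&cntx_lock[buf_id]);
    unpin_buffer(buf_id);
}
```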
This is the first “low hanging fruit” of concurrency improvement: many processes can access many pages concurrently and do useful work. Meanwhile, a BufMgrLock holder mostly needs to do only short tasks, like looking up the buffer table, picking a victim buffer to evict, updating a buffer’s pin count, maintaining a buffer’s flags, etc. This simple model worked well enough that it survived for many years, until Postgres 8.1.
Postgres 8.1: decouple the BufMgrLock monolith
Up to pg 8.0, it was very clear that the monolithic BufMgrLock, although doing only a tiny amount of work each time, had become the single biggest throughput bottleneck of the whole system. Therefore, in 2005, Postgres 8.1 introduced a critical commit from Tom Lane that heavily redesigned it (and also first implemented the aforementioned clock sweep algorithm):
https://github.com/postgres/postgres/commit/5d5087363d7cdbd00fc432a1216e83a00f7139bd
You can find a detailed design proposal in this email thread:
https://www.postgresql.org/message-id/15341.1108332472%40sss.pgh.pa.us
This decoupled the BufMgrLock monolith into two smaller monoliths, a BufMappingLock and a BufFreelistLock, plus a new granular buf_hdr_lock on each buffer header.
The BufMappingLock protects concurrent reads and modifications of the buffer table. It is acquired when anyone needs to add, remove, or look up a page in the buffer table. Also (not drawn on the graph), each buffer descriptor contains a BufferTag field recording the buffer page’s own identity; it must be modified together with the buffer table, while holding BufMappingLock, whenever a buffer slot is swapped to a different page.
The BufFreelistLock protects access to the “clock hand” in the clock-sweep algorithm: whoever moves the nextVictimBuffer pointer must hold BufFreelistLock. It also protects access to the freelist (not drawn in the graph), a linked list stitching together all the unused buffer slots; this is where the lock’s name comes from. When a process looks for a victim buffer to evict, it first acquires BufFreelistLock, then consults the freelist. If the freelist is empty (the buffer pool is full), the caller (while still holding BufFreelistLock) runs the clock sweep algorithm until a victim is found, and returns it.
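Putting it together, a victim search in 8.1 roughly follows this shape (names are illustrative; a pthread mutex stands in for BufFreelistLock, clock_sweep_victim() is the earlier sketch, and freelist_head / free_next are a made-up representation of the freelist):

```c
#include <pthread.h>

#define NBUFFERS 16384

static pthread_mutex_t BufFreelistLock = PTHREAD_MUTEX_INITIALIZER;

static int freelist_head = -1;          /* first unused buffer, or -1 if none */
static int free_next[NBUFFERS];         /* per-buffer link forming the freelist */

int clock_sweep_victim(void);           /* the clock-sweep sketch from earlier */

/* Pick a buffer slot to reuse: from the freelist if possible, otherwise by
 * sweeping the clock -- all while holding BufFreelistLock. */
int
get_victim_buffer(void)
{
    int buf_id;

    pthread_mutex_lock(&BufFreelistLock);

    if (freelist_head >= 0)
    {
        buf_id = freelist_head;             /* pop an unused slot off the freelist */
        freelist_head = free_next[buf_id];
    }
    else
    {
        /* Pool is full: sweep the clock hand while still holding the lock --
         * exactly the long hold time that Postgres 9.5 later removes. */
        buf_id = clock_sweep_victim();
    }

    pthread_mutex_unlock(&BufFreelistLock);
    return buf_id;
}
```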
Finally, the buf_hdr_lock protects concurrent accesses to most fields in the buffer descriptors array, one lock per page.
With this design, many code paths are now decoupled and concurrent. For example, a process does not need to hold BufMappingLock when modifying the flags of a page header or running the clock-sweep algorithm. It also does not need to hold BufFreelistLock if the buffer already exists in memory, since there is no need to evict a victim.
This is a big leap forward in finer-grained locking design, and a big performance improvement. But (you’ve probably guessed it) the global BufMappingLock and BufFreelistLock became the new bottlenecks!
Postgres 8.2: partitioning the BufMappingLock
The bigger of the two problems was the BufMappingLock: every page lookup and swap needed to go through the buffer table, which made the lock a major bottleneck.
Postgres 8.2 introduced a new design in 2006 (again, by Tom Lane) to decouple the monolithic hash table into a partitioned table:
- https://www.postgresql.org/message-id/9019.1153424488%40sss.pgh.pa.us
The idea is simple (and seen in many other systems and standard libraries): a hash table contains a static array of buckets, and each bucket contains a dynamic list of elements whose keys hash into it. Therefore, you can allocate a separate lock per bucket (or per group of buckets, i.e. a partition), to protect concurrent access to that partition only. This removed the major concurrency bottleneck.
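In sketch form (the partition count, hash function, and table lookup are all illustrative, not the real Postgres values or API; BufferTag is as sketched earlier):

```c
#include <pthread.h>
#include <stdint.h>

#define NUM_PARTITIONS 16       /* illustrative; the real value has changed over versions */

/* One lock per partition of the hash table, instead of one lock for the whole table. */
static pthread_mutex_t partition_locks[NUM_PARTITIONS];    /* assume initialized at startup */

uint32_t hash_buffer_tag(BufferTag tag);                    /* hypothetical hash function */
int      hash_table_lookup(uint32_t hash, BufferTag tag);   /* hypothetical bucket search */

/* A lookup only serializes against other operations that hash into the same
 * partition; lookups on other partitions proceed in parallel. */
int
lookup_buffer_id(BufferTag tag)
{
    uint32_t hash = hash_buffer_tag(tag);
    pthread_mutex_t *lock = &partition_locks[hash % NUM_PARTITIONS];

    pthread_mutex_lock(lock);
    int buf_id = hash_table_lookup(hash, tag);
    pthread_mutex_unlock(lock);

    return buf_id;
}
```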
Postgres 9.5: remove the BufFreelistLock
You’ve probably wondered: why did Postgres rely so heavily on locks, even for simple counters? Wouldn’t atomic numbers do the job?
It turned out to take a surprisingly huge amount of heavy lifting to introduce even simple atomic compare-and-swap operations into the Postgres code base. The main difficulty (I think) was abstracting away the large variety of hardware while still getting the details right. The foundational work was done in 2014 by Andres Freund:
https://github.com/postgres/postgres/commit/b64d92f1a5602c55ee8b27a7ac474f03b7aee340
After this groundwork, we soon started to see many impressive performance wins from replacing locks with atomics. In the buffer manager, two major changes in 2015, by Robert Haas and Andres Freund, dissolved the BufFreelistLock monolith:
- https://github.com/postgres/postgres/commit/5d7962c6797c0baae9ffb3b5b9ac0aec7b598bc3
- https://github.com/postgres/postgres/commit/d72731a70450b5e7084991b9caa15cb58a2820df
Before these changes, whenever a victim buffer was being looked up, the BufFreelistLock had to be held for a (potentially) long time, because the holder may need to sweep the “clock hand” (nextVictimBuffer) for however long it takes to discover the first page with zero refcount (unpinned) and zero usage count (not used recently), which is “disastrous for concurrency”.
So the first change replaced the long-held BufFreelistLock with a short-held buffer_strategy_lock, released after every single step of the freelist traversal or clock-hand tick, instead of being held for the whole search. The second change went one step further and flipped nextVictimBuffer to an atomic integer, so clock ticking no longer requires buffer_strategy_lock at all. (buffer_strategy_lock was still kept to control access to the freelist.)
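With the second change, a clock tick reduces to an atomic counter bump. Here is a sketch of the idea using C11 atomics (Postgres itself uses its own pg_atomic_* wrappers; the names and pool size are illustrative):

```c
#include <stdatomic.h>
#include <stdint.h>

#define NBUFFERS 16384

/* The clock hand is now just an ever-increasing atomic counter. */
static atomic_uint_fast64_t nextVictimBuffer;

/* Advance the clock hand by one tick -- no lock needed. */
static int
clock_tick(void)
{
    uint64_t ticks = atomic_fetch_add(&nextVictimBuffer, 1);
    return (int) (ticks % NBUFFERS);    /* map the counter onto the circular buffer array */
}
```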
Postgres 9.6: more atomicity
No surprise at all: Andres did the final push in 2016, flipping everything else in a buffer descriptor (refcount, usage count, other flags) into a single atomic number, thus retiring the buf_hdr_lock.
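The key trick is to pack the refcount, usage count, and flag bits into one 32-bit word and update it with compare-and-swap. Below is a simplified sketch with a made-up bit layout (the real layout lives in buf_internals.h and differs in detail):

```c
#include <stdatomic.h>
#include <stdint.h>

/* Made-up bit layout: low bits hold the refcount, higher bits hold the
 * usage count and flags. The real Postgres layout differs in detail. */
#define REFCOUNT_MASK   0x0003FFFFu
#define USAGECOUNT_ONE  0x00040000u
#define FLAG_DIRTY      0x80000000u

static _Atomic uint32_t buf_state;      /* one such word per buffer descriptor */

/* Pin a buffer by bumping the refcount with a compare-and-swap loop:
 * no per-buffer spinlock (buf_hdr_lock) is needed anymore. */
static void
pin_buffer_atomic(void)
{
    uint32_t old = atomic_load(&buf_state);

    for (;;)
    {
        uint32_t newval = old + 1;      /* refcount lives in the low bits */

        if (atomic_compare_exchange_weak(&buf_state, &old, newval))
            break;                      /* we won the race */
        /* CAS failed: 'old' has been refreshed with the current value, retry. */
    }
}
```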
That’s the end of the story! Since then, I have not found another major change to the concurrency algorithm in the buffer manager layer.
The buffer_strategy_lock that protects access to the freelist (the linked list of unused buffers) is the only remaining global monolith, and it has survived for the last 11 years. Evidently, it is not a bottleneck: if you need to take a node off the freelist, you likely need to do expensive disk read IO next, so the overhead of grabbing buffer_strategy_lock is likely minor in comparison.
Closing thoughts
This history retells some timeless lessons in software engineering:
- Begin simple.
- Optimize performance only on bottlenecks.
- Do not optimize what you merely think is the bottleneck; root decisions in real-world evidence.
- Evolve the system iteratively and incrementally.
Another observation: concurrency design is closely coupled with the underlying data structures and algorithms. For example, the clock sweep algorithm’s success in Postgres came not just from its own effectiveness, but also from how easy it was to implement fine-grained locking and atomic counters on top of it (it would be wayyy harder with a least-recently-used style algorithm like the 2Q of pg 8.0). Choosing a proper algorithm can be foundational in making a system highly concurrent. (Speaking of which, check out this algorithm list referenced from Suzuki’s book!)
Finally, I must appreciate how many concerns were simplified by fine-grained locking, and even more so by atomic numbers. A monolithic lock is easier to implement and prove correct initially, but probably harder to maintain and operate in the long term. The big bottleneck makes every line of code inside the locked section extremely performance sensitive, so even simple changes get over-scrutinized out of performance worries. Over time, this churn may outweigh the cost of breaking down the monolith.
In comparison, a fine-grained locking design takes a lot of hard work to get (and prove) right. But if done well, everything may end up looking natural and effortless.