If you like the series, check out my upcoming book on Database Internals!
The series consists of 5 pieces:
- Flavors of IO: Page Cache, Standard IO, O_DIRECT
- More Flavors of IO: mmap, fadvise, AIO
- LSM Trees
- Access Patterns in LSM Trees
- B-Trees and RUM Conjecture
A new series on Distributed System concepts in Databases can be found here.
Memory Mapping
Memory mapping (mmap) allows you to access a file as if it were loaded entirely in memory. It simplifies file access and is frequently used by database and application developers.
With mmap a file can be mapped into a memory segment privately or in shared mode. A private mapping allows reading from the file, but any write triggers a copy-on-write of the page in question in order to leave the original page intact and keep the changes private, so none of the changes are reflected in the file itself. In shared mode, the file mapping is shared with other processes so they can see updates to the mapped memory segment. Additionally, changes are carried through to the underlying file (precise control over this requires the use of msync).
Unless specified otherwise, file contents are not loaded into memory right away, but in a lazy manner. Space required for the memory mapping is reserved, but is not allocated right away. The first read or write operation results in a page-fault, triggering the allocation of the appropriate page. By passing MAP_POPULATE it is possible to pre-fault the mapped area and force a file read-ahead.
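To make this concrete, here’s a minimal sketch of a shared mapping in C. The file name data.bin is hypothetical, and the file is assumed to exist and be non-empty:

```c
/* A minimal sketch: map a file in shared mode, modify it, and flush
 * the changes back with msync. The file name "data.bin" is hypothetical;
 * the file is assumed to exist and to be at least one byte long. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    int fd = open("data.bin", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    /* MAP_SHARED: writes are carried through to the underlying file.
     * Pages are loaded lazily on first access; adding MAP_POPULATE
     * would pre-fault the whole mapping instead. */
    char *data = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); return 1; }

    data[0] = 'x';                     /* first access triggers a page fault */
    msync(data, st.st_size, MS_SYNC);  /* force write-back of the dirty page */

    munmap(data, st.st_size);
    close(fd);
    return 0;
}
```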
In the previous post, we touched on the subject of Virtual Memory and the Page Cache. Memory mapping is done through the page cache, the same way as Standard IO operations such as read and write, and uses demand paging.
During the first memory access, a Page Fault is issued, signalling to the kernel that the requested page is not currently loaded into memory and has to be loaded. The kernel identifies which data has to be loaded and where to load it from. Page Faults are transparent to the developer: the program flow carries on as if nothing happened. However, Page Faults may have a negative impact on performance, and we will discuss possible ways to improve the situation later in this post.
It’s also possible to map a file into memory with protection flags (for example, in read-only mode). If an operation on the mapped memory segment violates the requested protection, a Segmentation Fault is issued.
mmap is a very useful tool for working with IO: It avoids creating an extraneous copy of the buffer in memory (unlike Standard IO, where the data has to be copied into user-space buffers before the system call is made). It also avoids the overhead of a system call (and the subsequent context switch) for triggering the actual IO operation, except when Page Faults occur. From a developer’s perspective, issuing a random read on an mmapped file looks just like a normal pointer operation and doesn’t involve lseek calls.
The disadvantages of mmap that are mentioned most of the time are less relevant with modern hardware:
- mmap imposes the overhead of the kernel data structures required for managing the memory mappings: with today’s memory sizes, this argument does not play a major role.
- Memory-mapped file size limit: most of the time, the kernel code is much more memory-friendly anyway, and 64-bit architectures allow mapping larger files.
Of course, this doesn’t imply that everything has to be done with the memory-mapped files.
mmap is quite frequently used by database implementers. For example, MongoDB’s default storage engine used to be mmap-backed, and SQLite uses memory mapping extensively.
Page Cache Optimizations
From what we’ve discussed so far, it looks like using Standard IO simplifies many things and has some benefits, but at the cost of losing control: you’re at the mercy of the kernel and the page cache. This is true, but only to a certain extent. Usually the kernel, using its internal statistics, can do a much better job of predicting when to perform a write-back and when to prefetch pages. However, sometimes it’s possible to help the kernel manage the page cache in a way that benefits the application.
One of the ways of informing the kernel about your intentions is fadvise. Using the following flags, it is possible to tell the kernel how you intend to access the file and let it optimize its use of the page cache:
- FADV_SEQUENTIAL specifies that the file is read sequentially, from lower offsets to higher ones, so the kernel can make sure to fetch the pages in advance, before the actual read occurs.
- FADV_RANDOM specifies that the file is accessed in random order, so the kernel disables read-ahead and fetches only the pages that are actually requested.
- FADV_WILLNEED notifies the OS that the page will be needed by the process in the near future. This gives the kernel an opportunity to cache the page ahead of time and, when the read operation occurs, to serve it from the page cache instead of page-faulting.
- FADV_DONTNEED advises the kernel that it can free the cache for the corresponding pages (the caller should make sure the data has been synchronised to disk beforehand, for example with fsync).
- There’s one more flag (FADV_NOREUSE), but on Linux it has no effect.
Just as the name suggests, fadvise is purely advisory: the kernel is not obligated to do exactly as it suggests.
Since database developers can often predict access patterns, fadvise is a particularly useful tool for them. For example, RocksDB uses it to notify the kernel about access patterns, depending on the file type (SSTable or HintFile), mode (Random or Sequential) and operation (Write or Compaction).
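Here’s a minimal sketch of how such hints look in practice, using the POSIX wrapper posix_fadvise (where the flags carry a POSIX_ prefix). The file name and the 1 MiB range are hypothetical:

```c
/* A minimal sketch of page cache hints via the POSIX wrapper
 * posix_fadvise. The file name and the 1 MiB range are hypothetical. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    int fd = open("data.bin", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    /* len = 0 means "until the end of the file": the whole file will be
     * read sequentially, so read-ahead pays off. */
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    /* This 1 MiB range will be needed soon: ask the kernel to start
     * caching it ahead of the actual reads. */
    posix_fadvise(fd, 0, 1 << 20, POSIX_FADV_WILLNEED);

    /* ... subsequent reads are likely to be served from the page cache ... */

    close(fd);
    return 0;
}
```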
Another useful call is mlock. It allows you to force pages to be held in memory. This means that once the page is loaded into memory, all subsequent operations will be served from the page cache. It has to be used with caution, since calling it on every page will simply exhaust the system resources.
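For illustration, here’s a minimal sketch of pinning a mapping with mlock; the 4 KiB anonymous mapping is arbitrary, and note that RLIMIT_MEMLOCK caps how much memory a process may lock:

```c
/* A minimal sketch: pin a region in physical memory with mlock so it
 * cannot be paged out. The 4 KiB anonymous mapping is just for
 * illustration; RLIMIT_MEMLOCK caps how much a process may lock. */
#include <stdio.h>
#include <sys/mman.h>

int main(void) {
    size_t len = 4096;
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    if (mlock(p, len) != 0) { perror("mlock"); return 1; }

    /* ... latency-critical accesses: the pages stay resident ... */

    munlock(p, len);
    munmap(p, len);
    return 0;
}
```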
AIO
In terms of IO flavors, the last piece we’re going to discuss is Linux Asynchronous IO (AIO). AIO is an interface that allows initiating multiple IO operations and collecting their results upon completion. The operations are performed asynchronously (i.e., the submission call returns immediately), so the application can continue working on its main thread while the submitted IO jobs are processed.
The two main syscalls responsible for Linux AIO are io_submit and io_getevents. io_submit allows passing one or more commands, each holding a buffer, an offset and the operation to be performed. Completions can be queried using io_getevents, a call that collects the result events for the corresponding commands. This allows for a fully asynchronous interface for handling IO, pipelining IO operations and freeing application threads, potentially reducing the number of context switches and wake-ups.
Unfortunately, Linux AIO has several shortcomings: the syscall API isn’t exposed by glibc and requires a library to wire it up (libaio seems to be the most popular). Despite several attempts to fix that, only file descriptors opened with the O_DIRECT flag are supported, so buffered asynchronous operations won’t work. Besides, some operations, such as stat, fsync, open and others, aren’t fully asynchronous.
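To make the flow concrete, here’s a minimal sketch of a single asynchronous read using libaio (compile with -laio). The file name data.bin and the 4096-byte read are hypothetical, and error handling is kept to a minimum:

```c
/* A minimal sketch of a single asynchronous read with libaio
 * (compile with -laio). The file name "data.bin" is hypothetical and
 * the file is assumed to be at least 4096 bytes long. */
#define _GNU_SOURCE   /* for O_DIRECT */
#include <fcntl.h>
#include <libaio.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    io_context_t ctx = 0;
    if (io_setup(1, &ctx) < 0) { fprintf(stderr, "io_setup failed\n"); return 1; }

    /* Linux AIO requires O_DIRECT, which in turn requires aligned buffers. */
    int fd = open("data.bin", O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    void *buf;
    if (posix_memalign(&buf, 4096, 4096) != 0) return 1;

    struct iocb cb;
    struct iocb *cbs[1] = { &cb };
    io_prep_pread(&cb, fd, buf, 4096, 0);  /* read 4096 bytes at offset 0 */

    if (io_submit(ctx, 1, cbs) != 1) { fprintf(stderr, "io_submit failed\n"); return 1; }

    /* The thread is free to do other work here; completions are
     * collected later with io_getevents. */
    struct io_event events[1];
    int n = io_getevents(ctx, 1, 1, events, NULL);  /* block until done */
    printf("completed %d request(s), res=%ld\n", n, (long)events[0].res);

    io_destroy(ctx);
    close(fd);
    free(buf);
    return 0;
}
```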
It’s worth mentioning that Linux AIO shouldn’t be confused with POSIX AIO, which is a different thing altogether. POSIX AIO on Linux is implemented entirely in user space and does not use the Linux-specific AIO subsystem at all.
Vectored IO
One, possibly less popular, method of performing IO operations is Vectored IO (also known as Scatter/Gather). It is called this way because it operates on a vector of buffers and allows reading and writing data to/from disk using multiple buffers per system call.
When performing a vectored read, bytes are read from the source into the first buffer, up to its length. Then bytes starting at the first buffer’s length and up to the combined length of the first two buffers are read into the second buffer, and so on, as if the source were filling the buffers one after another (although the operation order and parallelism are not deterministic). A vectored write works similarly: the buffers are written as if they were concatenated before the write.
Such an approach can help by allowing smaller chunks to be read (thereby avoiding the allocation of large memory areas for contiguous blocks) while reducing the number of system calls required to fill all these buffers with data from disk. Another advantage is that both reads and writes are atomic: the kernel prevents other processes from performing IO on the same descriptor during the read or write operation, guaranteeing data integrity.
From the development perspective, if data is laid out in a certain way in the file (say, split into a fixed-size header and multiple fixed-size blocks), it is possible to issue a single call that fills separate buffers allocated for these parts, as the sketch below shows.
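A minimal sketch of exactly that scenario with readv; the file layout (a 16-byte header followed by 4096-byte blocks) and the file name are hypothetical:

```c
/* A minimal sketch of a vectored read: a single readv call fills a
 * header buffer and two block buffers. The file layout (a 16-byte
 * header followed by 4096-byte blocks) and the file name are hypothetical. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void) {
    int fd = open("data.bin", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    char header[16];
    char block1[4096], block2[4096];

    struct iovec iov[3] = {
        { .iov_base = header, .iov_len = sizeof(header) },
        { .iov_base = block1, .iov_len = sizeof(block1) },
        { .iov_base = block2, .iov_len = sizeof(block2) },
    };

    /* The kernel fills the buffers in order, as if the file's contents
     * were copied into their concatenation. */
    ssize_t n = readv(fd, iov, 3);
    if (n < 0) { perror("readv"); return 1; }
    printf("read %zd bytes across 3 buffers\n", n);

    close(fd);
    return 0;
}
```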
This sounds rather useful, yet only a few databases use Vectored IO. This might be because general-purpose databases work with many files simultaneously, trying to guarantee liveness for each running operation and to reduce their latencies, so data is accessed and cached block-wise. Vectored IO is more useful for analytics workloads and/or columnar databases, where data is stored on disk contiguously and its processing can be done in parallel in sparse blocks. One example is Apache Arrow.
Closing Words
As you can see, there are quite a few things to pick from, each with its own advantages and shortcomings. Using a specific tool does not guarantee a positive outcome: because of their specifics, these IO flavors are easily misunderstood and misused. Implementation, tuning and the way a database is used by the end user still play a big role.
You can see that the existing approaches follow a pattern: using O_DIRECT might require you to write a buffer cache, using the page cache might require using fadvise, and using AIO might require hooking it up to a Futures-like interface. Some of these fit more concrete use cases, some are more generic. The main purpose of this series is to help people build a basic vocabulary and understand what databases have under the hood, making it easier to look into their subsystems, tune them, optimize them and pick the right tool for the job.
In the next post, we’ll be talking about immutable on-disk data structures and LSM Trees.
If you find anything to add, or there’s an error in my post, do not hesitate to contact me; I’ll be happy to make the corresponding updates.
If you liked the post and would like to be notified about the next parts, you can follow me on Twitter or subscribe to my mailing list.