On Disk IO, Part 2: More Flavours of IO
In the first post, we’ve started an overview of IO flavours (Standard IO, O_DIRECT flag) and discussed the Page Cache and buffering. To get a complete picture about the tooling available on Linux, we need to cover a couple more things.
Memory mapping allows you to access the files as if it was loaded in memory completely. It can simplify the file access and is frequently used by the database developers.
When using mmap, the file can be mapped to the memory segment privately (which would allow reading from the file, but any write would trigger copy-on-write of the page in question in order to leave the original page intact and keep the changes private, so none of the changes will get reflected on the file itself) and in the shared mode (which would share the mapping to the file with the other processes, allowing both writing the changes back to the file and seeing the writes issued by the other processes).
Unless specified otherwise (by passing MAP_POPULATE, which causes the mapped area to pre-fault and the file to be read ahead), the contents of file aren’t loaded into main memory right away, but rather done in a lazy manner. The space required for the memory mapping is reserved, but is not allocated right away. The first read or write operation will allocate an appropriate page.
In the previous post, we’ve touched the subject of Virtual Memory and Page Cache. Memory Mapping is done through the Page Cache, the same way as the Standard IO operations such as read and write and is using a demand paging.
During the first memory access, the Page Fault is issued, which means that the page is currently not loaded into the main memory and will have to be loaded. The Kernel takes over the control and identifies which exactly data has to be present on the page. Page Faults are transparent from the developer’s perspective: the program flow will carry on as if nothing had happened. Unfortunately, Page Faults may have a negative impact on performance, and we will discuss the possible ways to improve the situation later in this post.
It’s also possible to map the file into the memory with protection flags (for example, in a read-only mode). If the operation done on the mapped memory segment violates the requested protection, the Segmentation Fault is issued.
mmap is a very useful tool for working with IO: it avoids creation of an extraneous copy of the buffer in memory (unlike the Standard IO, where the data has to be copied into the user-space buffers before the system call is made). Besides, it avoids an overhead of the system call (and context switch) for triggering an actual IO operation, except for the possible Page Faults. From the development perspective, issuing a random read using mmapped file looks just like a normal pointer operation and doesn’t involve lseek calls.
The disadvantages of mmap that are mentioned most of the time are less relevant with the modern hardware: page mapping size and an overhead of the Kernel data structures required for managing the memory mappings, a limit on the memory-mapped file size. Most of the time, the Kernel code is much more memory-friendly anyways and 64bit architectures allow mapping larger files. Of course, this doesn’t imply that everything has to be done with the memory-mapped files.
Page Cache Optimisations
From what we discussed so far it looks like using Buffered IO simplifies many things and has a lot of benefits, but the cost is loss of control: you’re at the grace of the Kernel and Page Cache. This is true, but to a certain extend. Usually, the Kernel can do a much better job in predicting when to perform a write-back to the disk and when prefetch the pages, using internal statistics. Sometimes though it’s still possible to help the Kernel to manage the Page Cache the way that would be beneficial for the application.
One of the ways informing the Kernel about your intentions is using fadvise. The Kernel will be able to adjust and optimise the use of the Page Cache, possibly reducing an amount of Page Faults by prefetching, improving memory usage by freeing unneeded pages sooner, and so on.
This system call can be used in order to specify the access pattern (distinguishing between the FADV_RANDOM and FADV_SEQUENTIAL). If the file is read sequentially, from lower offsets to the higher ones, the operating system will make sure to prefetch the pages ahead of reads. FADV_RANDOM will disable the read-ahead, freeing the memory from the pages that are very unlikely to be accessed any time soon. FADV_WILLNEED notifies the OS that the page will be needed by the process in the near future. This gives Kernel an opportunity to cache the page ahead of time and, when the read operation occurs, to serve it from the Page Cache instead of pagefaulting. FADV_DONTNEED advises the Kernel that it can free the cache for the corresponding pages (making sure that the data is synchronised with the disk beforehand). There’s one more flag (FADV_NOREUSE), but on Linux it has no effect.
Just as the name suggests, fadvise is only acting advisory. Kernel is not obligated to do exactly as fadvise suggests. Database developers often can predict accesses, which makes it quite a useful tool. For example, rocksdb uses fadvise for notifying the Kernel for the access patterns depending on the file type (SSTable or HintFile), mode (Random or Sequential) and operation (Write or Compaction).
Another useful call is mlock, which lets you to force pages to be held in the memory. This means that once the page is loaded into the memory, all subsequent operations will be served from the Page Cache. It has to be used with caution, since calling it on every page will simply exhaust the system resources.
In terms of IO flavours, the last piece we’re going to discuss is Linux Asynchronous IO. AIO is an interface allowing to initiate multiple IO operations and register callbacks to be triggered on their completion. The operations will be performed asynchronously (e.g. the system call will return immediately). Using async IO helps the application to continue work on the main thread while the submitted IO job is processed.
The main two syscalls responsible for the Linux AIO are io_submit and io_getevents. io_submit allows passing one or multiple commands, holding a buffer, offset and an operation that has to be performed. Completions can be queried by using io_getevents, an call allowing collecting result events for the corresponding commands. This allows for fully asynchronous interface for handling the IO, pipelining the IO operations and freeing the application threads, potentially reducing an amount of context switches and wakeups.
Unfortunately, Linux AIO has several shortcomings: the syscalls API aren’t exposed by the glibc and requires a library for wiring them up (out of which, libaio seems to be the most popular). Despite several attempts to fix that, only file descriptors with O_DIRECT flag are supported, so asynchronous buffered operations won’t work. Besides, some operations, such as stat, fsync, open and some others aren’t fully asynchronous, too.
It’s worth mentioning that Linux (Kernel) AIO shouldn’t be confused with Posix AIO, which is a different thing altogether. Posix AIO implementation on Linux is implemented completely in user space and does not use this Linux-specific subsystem at all.
One, possibly less popular, method of performing IO operations is Vectored IO (also known as Scatter/Gather). It is called this way because it operates on a vector of buffers and allows reading the data from disk into them and writing to disk from them.
When performing a vectored read, the bytes will be read from the source into the first buffer (up to the first buffers’s length offset). Then, bytes from the source starting at the first buffer’s length and up to the second buffer’s length offset will be read read into the second buffer and so on, as if the source was filling up the buffers one after another (although in fact the operation order and parallelism aren’t deterministic). A vectored write would work in a similar manner: the buffers will be written as if they were concatenated before the write.
Such an approach can help by allowing reading smaller chunks (therefore avoiding allocation of large memory areas for contiguous blocks) and, at the same, time reducing an amount of system calls required to fill up all these buffers with the data from disk. Another advantage is that both reads and writes are atomic: Kernel prevents the other processes from performing IO on the same descriptor during read and write operations, guaranteeing the data integrity.
From the development perspective, if the data is laid out in a certain way in the file (say, it’s split out into the fixed-size header and multiple fixed size blocks), it’s possible to issue a single call in order to fill up the separate buffers allocated for these parts.
This seems sounds rather useful but, somehow, just a few databases use the Vectored IO. One possible explanation can be because the general purpose databases are working with a bunch of files simultaneously, trying to guarantee liveness for the each running operation and reduce their latencies, and the data is accessed and cached block-wise. Vectored IO sounds much more useful for analytics workloads and/or columnar databases, where the data is stored in a contiguous manner on the disk, although it’s processing can be done in parallel in sparse blocks. One of the examples is Apache Arrow.
As you can see, there are quite a few things to pick from, every one of them having their own advantages and shortcomings. Using a specific tool does not guarantee any good outcome: because of their specifics, these IO flavours may get easily misunderstood and misused. Implementation, tuning and the way database is used by the end user still might play a big role.
But you can see that the existing approaches still seem to have a pattern: using O_DIRECT will most likely require you to write a buffer cache, using Kernel Buffer Cache will require using fadvise and using AIO might require you to hook it up to some Futures-like interface. Some of these things are for more concrete use-cases, some are more generic. The purpose of this series is to help people to get basic vocabulary and understand how databases are ticking from the bottom layers up to make it easier to look into their subsystems, tune, optimise and pick the right tool for the right job.
In the next post, we’ll be talking about immutable on-disk data structures and LSM Trees.
If you liked the post and would like to be notified about the next parts, you can follow me on twitter.