If you like this story, check out my upcoming book on Database Internals!
Series consist of 5 pieces:
- Flavors of IO: Page Cache, Standard IO, O_DIRECT
- More Flavors of IO: mmap, fadvise, AIO
- LSM Trees
- Access Patterns in LSM Trees
- B-Trees and RUM Conjecture
New series on Distributed System concepts in Databases can be found here.
Knowing how IO works and understanding use-cases and trade-offs of algorithms and storage systems can make lives of developers and operators much better: they will be able to make better choices upfront (based on what’s under the hood of the database they’re evaluating), troubleshoot performance issues when database misbehaves (by comparing their workloads to the ones the database of their choice is intended to be used for) and tune their stack (by balancing load, switching to a different medium, file system, operating system, or picking a different index type).
While the Network IO is frequently discussed and talked about, Filesystem IO gets much less attention. Partly, the reason is that Network IO has many more features and implementation details, varying from one operating system to another, while Filesystem IO has a much smaller set of tools. Also, in modern systems people mostly use databases as storage means, so applications communicate with them through the drivers over the network and Filesystem IO is left for database developers to understand and take care of. I still believe it is important to understand how data is written on disk and read from it.
There are several “flavors” of IO (some functions omitted for brevity):
- Syscalls: open, write, read, fsync, sync, close
- Standard IO: fopen, fwrite, fread, fflush, fclose
- Vectored IO: writev, readv
- Memory mapped IO: open, mmap, msync, munmap
Let’s start by discussing Standard IO combined and some “userland” optimizations as this is the what application developers end up using the most.
There’s a bit of confusion in terms of “buffering” when talking about stdio.h functions. When using the Standard IO, it is possible to choose between full and line buffering or opt out from any buffering whatsoever. This “user space” buffering has nothing to do with Kernel buffering (Page Cache) which we will discuss later in this article. You can also think about as a distinction between “buffering” and “caching” which might draw a clear distinction between these concepts.
Block Device is a special file type providing buffered access to hardware devices such as HDDs or SSDs. Block Devices work act upon sectors, group of adjacent bytes. Most disk devices have a sector size of 512 bytes. Sector is the smallest unit of data transfer for block device, it is not possible to transfer less than one sector worth of data. However, often it is possible to fetch multiple adjacent segments at a time. The smallest addressable unit of File System is block. Block is a group of multiple adjacent sectors requested by a device driver. Typical block sizes are 512, 1024, 2048 and 4096 bytes. Usually IO is done through the Virtual Memory, which caches requested filesystem blocks in memory and serves as a buffer for intermediate operations. Virtual Memory works with pages, which map to filesystem blocks. Typical page size is 4096 bytes.
In summary, Virtual Memory pages map to Filesystem blocks, which map to Block Device sectors.
Standard IO uses read() and write() syscalls for performing IO operations. When reading the data, Page Cache is addressed first. If the data is absent, the Page Fault is triggered and contents are paged in. This means that reads, performed against the currently unmapped area will take longer, as caching layer is transparent to user.
During writes, buffer contents are first written to Page Cache. This means that data does not reach the disk right away. The actual hardware write is done when Kernel decides it’s time to perform a writeback of the dirty page.
Page Cache stores the recently accessed fragments of files that are more likely to be accessed in the nearest time. When working with disk files, read() and write() calls do not initiate disk accesses directly and go through Page Cache instead.
When the read operation is performed, the Page Cache is consulted first. If the data is already loaded in the Page Cache, it is simply copied out for the user: no disk access is performed and read is served entirely from memory. Otherwise file contents are loaded in Page Cache and then returned to the user. If Page Cache is full, least recently used pages are flushed on disk and evicted from cache to free space for new pages.
write() call simply copies user-space buffer to kernel Page Cache, marking the written page as dirty. Later, kernel writes modifications on disk in a process called flush or writeback. Actual IO normally does not happen immediately. Meanwhile, read() will supply data from the Page Cache instead of reading (now outdated) disk contents. As you can see, Page Cache is loaded both on reads and writes.
Pages marked dirty will be flushed to disk as since their cached representation is now different from the one on disk. This process is called writeback. writeback might have potential drawbacks, such as queuing up IO requests, so it’s worth understanding thresholds and ratios that used for writeback when it’s in use and check queue depths to make sure you can avoid throttling and high latencies. You can find more information on tuning Virtual Memory in Linux Kernel Documentation.
Logic behind Page Cache is explained by Temporal locality principle, that states that recently accessed pages will be accessed again at some point in nearest future.
Another principle, Spatial Locality, implies that the elements physically located nearby have a good chance of being located close to each other. This principle is used in a process called “prefetch” that loads file contents ahead of time anticipating their access and amortizing some of the IO costs.
Page Cache also improves IO performance by delaying writes and coalescing adjacent reads.
Disambiguation: Buffer Cache and Page Cache: previously entirely separate concepts, got unified in 2.4 Linux kernel. Right now it’s mostly referred to as Page Cache, but some people people still use term Buffer Cache, which became synonymous.
Page Cache, depending on the access pattern, holds file chunks that were recently accessed or may be accessed soon (prefetched or marked with fadvise). Since all IO operations are happening through Page Cache, operations sequences such as read-write-read can be served from memory, without subsequent disk accesses.
When performing a write that’s backed by the kernel and/or a library buffer, it is important to make sure that the data actually reaches the disk, since it might be buffered or cached somewhere. The errors will appear when the data is flushed to disk, which can be while fsyncing or closing the file. If you would like to learn more about it, check out LWN article Ensuring Data Reaches the Disk.
There are situations when it’s undesirable to use the Kernel Page Cache to perform IO. In such cases, one can use O_DIRECT flag when opening a file. It instructs the Operating Systems to bypass the Page Cache, avoid storing extra copy of data and perform IO operations directly against the block device. This means that buffers are flushed directly on disk, without copying their contents to the corresponding cached page first and waiting for the Kernel to trigger a writeback.
For a “traditional” application using Direct IO will most likely cause a performance degradation rather than the speedup, but in the right hands it can help to gain a fine-grained control over IO operations and improve performance. Usually applications using this type of IO implement their own application-specific caching layer.
Using Direct IO is often frowned upon by the Kernel developers. It goes so far, that Linux man page quotes Linus Torwalds: “The thing that has always disturbed me about O_DIRECT is that the whole interface is just stupid”.
However, databases such as PostgreSQL and MySQL use Direct IO for a reason. Developers can ensure fine-grained control over the data access , possibly using a custom IO Scheduler and an application-specific Buffer Cache. For example, PostgreSQL uses Direct IO for WAL (write-ahead-log), since they have to perform writes as fast as possible while ensuring its durability and can use this optimization since they know for sure that the data won’t be immediately reused, so writing it bypassing Page Cache won’t cause performance degradation.
It is discouraged to open the same file with Direct IO and Page Cache simultaneously, since direct operations will be performed against disk device even if the data is in Page Cache, which may lead to undesired results.
Because Direct IO involves direct access to backing store, bypassing intermediate buffers in Page Cache, it is required that all operations are aligned to sector boundary.
In other words, every operation has to have a starting offset of a multiple of 512 and a buffer size has to be a multiple of 512 as well. When using Page Cache, because writes first go to memory, alignment is not important: when actual block device write is performed, Kernel will make sure to split the page into parts of the right size and perform aligned writes towards hardware.
For example, RocksDB is making sure that the operations are block-aligned by checking it upfront (older versions were allowing unaligned access by aligning in the background).
Whether or not O_DIRECT flag is used, it is always a good idea to make sure your reads and writes are block aligned. Crossing segment boundary will cause multiple sectors to be loaded from (or written back on) disk as shown on images above. Using the block size or a value that fits neatly inside of a block guarantees block-aligned I/O requests, and prevents extraneous work inside the kernel.
Nonblocking Filesystem IO
I’m adding this part here since I very often hear “nonblocking” in the context of Filesystem IO. It’s quite normal, since most of the programming interface for Network and Filesystem IO is the same. But it’s worth mentioning that there’s no true “nonblocking” Filesystem IO, which can be understood in the same sense.
O_NONBLOCK is generally ignored for regular files, because block device operations are considered non-blocking (unlike socket operations, for example). Filesystem IO delays are not taken into account by the system. Possibly this decision was made because there’s a more or less hard time bound on operation completion.
For same reason, something you would usually use in Network context, like select and epoll, does not allow monitoring and/or checking status of regular files.
Today we’ve discussed Page Cache, Standard IO and usage of the O_DIRECT flag, an optimization often used by the database developers in order to gain control over the buffer caches that Standard IO delegates to the Kernel and discussed where it’s used, how it works, where it can be useful and what downsides it has.
The next post will be featuring Memory Mapping, Vectored IO and Page Cache Optimisations.
If you find anything to add or there’s an error in my post, do not hesitate to contact me, I’ll be happy to make corresponding updates.