On Disk IO, Part 1: Flavours of IO
In October, I’ll be in New York at the O’Reilly Velocity Conference, giving a talk called “What We Talk About When We Talk About On Disk IO”. I’ve decided to release some of my preparation notes as a series of blog posts.
Knowing how IO works, which algorithms are used and under which circumstances, can make the lives of developers and operators much better: they will be able to make better choices upfront (based on what is in use by the database they’re evaluating), troubleshoot performance issues when the database misbehaves (by comparing their workloads to the ones the database stack is intended for) and tune their stack (by spreading the load, switching to a different disk type, file system or operating system, or simply picking a different index type).
While Network IO is frequently discussed, Filesystem IO gets much less attention. Of course, in modern systems people mostly use databases as a means of storage, so applications communicate with them through drivers over the network. I believe it is still important to understand how data is written onto the disk and read back from it. Moreover, Network IO has many more things to discuss and more ways to implement them, which differ considerably from one operating system to another, while Filesystem IO has a much smaller set of tools.
There are several “flavours” of IO (some functions omitted for brevity):
- Syscalls: open, write, read, fsync, sync, close
- Standard IO: fopen, fwrite, fread, fflush, fclose
- Vectored IO: writev, readv
- Memory mapped IO: open, mmap, msync, munmap
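To make the four flavours a little more concrete, here is a minimal sketch in Python, whose os and mmap modules are thin wrappers over the same calls. PAYLOAD and the helper names are made up for illustration, and Python’s buffered file objects only stand in for stdio.h: they behave similarly but do not use it. A Unix-like system is assumed.

```python
import mmap
import os
import tempfile

PAYLOAD = b"hello, disk\n"  # hypothetical data, used by every helper below


def via_syscalls(path):
    """open / write / fsync / close."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        os.write(fd, PAYLOAD)
        os.fsync(fd)
    finally:
        os.close(fd)


def via_buffered(path):
    """Buffered file object: the rough analogue of fopen / fwrite / fflush."""
    with open(path, "wb") as f:
        f.write(PAYLOAD)
        f.flush()


def via_vectored(path):
    """writev: several buffers handed to the kernel in one call."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        os.writev(fd, [PAYLOAD[:7], PAYLOAD[7:]])
    finally:
        os.close(fd)


def via_mmap(path):
    """open / mmap / msync / munmap."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        os.ftruncate(fd, len(PAYLOAD))   # an empty file cannot be mapped
        with mmap.mmap(fd, len(PAYLOAD)) as m:
            m[:] = PAYLOAD               # a plain memory write
            m.flush()                    # msync
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for fn in (via_syscalls, via_buffered, via_vectored, via_mmap):
            target = os.path.join(tmp, fn.__name__)
            fn(target)
            print(fn.__name__, os.path.getsize(target), "bytes")
```

All four helpers leave the same bytes in the file; what differs is who owns the buffer and when the data is handed to the Kernel.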
Today, we’ll discuss Standard IO combined with a series of “userland” optimisations. Most of the time, application developers use it with a couple of different flags on top. Let’s start with that.
There’s a bit of confusion around the term “buffering” when talking about stdio.h functions, since they do some buffering themselves. When using Standard IO, it is possible to choose between full and line buffering, or to opt out of buffering altogether. This “user space” buffering has nothing to do with the buffering that will be done by the Kernel further down the line. You can also think of it as the distinction between “buffering” (in user space) and “caching” (in the Kernel), which should make the two concepts distinct and intuitive.
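The effect of user space buffering can be observed from the outside. The sketch below uses Python’s own buffered file objects as a stand-in for stdio.h (in C the equivalent knob would be setvbuf with _IOFBF, _IOLBF or _IONBF); the helper name and the 5-byte payload are made up for illustration.

```python
import os
import tempfile


def size_before_and_after_flush(buffering):
    """Write 5 bytes, then report the file size seen by the kernel
    before and after the user space buffer is flushed."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.bin")
        # buffering=0 disables the user space buffer (binary mode only);
        # a larger value enables full buffering of about that many bytes.
        with open(path, "wb", buffering=buffering) as f:
            f.write(b"hello")
            before = os.stat(path).st_size  # what the kernel has received
            f.flush()                       # hand the buffer to the kernel
            after = os.stat(path).st_size
        return before, after


if __name__ == "__main__":
    print("unbuffered:    ", size_before_and_after_flush(0))
    print("fully buffered:", size_before_and_after_flush(4096))
```

With the buffer disabled, the Kernel sees the bytes immediately; with full buffering, it sees nothing until the flush. Neither case says anything about whether the data has reached the disk, which is the Kernel’s side of the story.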
Disks (HDDs, SSDs) are called Block Devices, and the smallest addressable unit on them is called a sector: it is not possible to transfer an amount of data smaller than the sector size. Similarly, the smallest addressable unit of the file system is a block (which is generally larger than a sector). The block size is usually smaller than (or the same as) the page size (a concept coming from Virtual Memory).
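Two of these sizes can be queried from user space. A small sketch, assuming a Unix-like system; the function name and dictionary keys are made up, and the sector size is left out because it usually has to be read from the device itself rather than from the file system.

```python
import mmap
import os


def io_sizes(path="."):
    """Report the sizes relevant to IO for the file system holding `path`."""
    st = os.stat(path)
    vfs = os.statvfs(path)
    return {
        "page_size": mmap.PAGESIZE,          # virtual memory page size
        "fs_block_size": vfs.f_bsize,        # file system block size
        "preferred_io_size": st.st_blksize,  # size the OS suggests for efficient IO
    }


if __name__ == "__main__":
    for name, value in io_sizes().items():
        print(name, value)
```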
Everything that we’re trying to address on disk ends up being loaded into RAM, and is most likely cached by the Operating System for us in between.
The Page Cache (previously entirely separate, the Buffer Cache and the Page Cache were unified in the 2.4 Linux kernel) helps to keep in memory the buffers that are most likely to be accessed in the near future. The temporal locality principle implies that pages that were read will be accessed multiple times within a short period of time, and spatial locality implies that related elements have a good chance of being located close to each other, so it makes sense to keep the data around to amortise some of the IO costs. In order to improve IO performance, the Kernel buffers data internally by delaying writes and coalescing adjacent reads.
The Page Cache does not necessarily hold whole files (although that certainly can happen). Depending on the file size and the access pattern, it may hold only the chunks that were accessed recently. Since all IO operations happen through the Cache, sequences of operations such as read-write-read can be served entirely from memory, without accessing the (meanwhile outdated) data on disk.
When a read operation is performed, the Page Cache is consulted first. If the data is already in the Page Cache, it is copied out for the user. Otherwise, it is loaded from the disk and stored in the Page Cache for further accesses. When a write operation is performed, the page is written to the Cache first and marked as dirty.
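A consequence of this design is that a write is visible to readers as soon as the write call returns, long before any writeback. The sketch below shows that with two independent descriptors; the helper name and payload are hypothetical, and the code can only show that the read is coherent, not prove where the bytes were served from.

```python
import os
import tempfile


def write_then_read_without_sync():
    """Write through one descriptor and read through another, with no
    fsync in between."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "cache.bin")
        wfd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
        rfd = os.open(path, os.O_RDONLY)
        try:
            os.write(wfd, b"dirty page")  # lands in the Page Cache, marked dirty
            # No fsync: the data may not be on disk yet, but the read still
            # sees it, since both descriptors go through the same cache.
            return os.read(rfd, 64)
        finally:
            os.close(rfd)
            os.close(wfd)


if __name__ == "__main__":
    print(write_then_read_without_sync())
```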
Pages that were marked dirty, since their cached representation is now different from the persisted one, will eventually be flushed to disk. This process is called writeback. Of course, writeback has its own potential drawbacks, such as queuing up too many IO requests, so it’s worth understanding the thresholds and ratios that are used for writeback, and checking queue depths to make sure you can avoid throttling and high latencies.
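On Linux, the writeback thresholds are exposed as sysctls under /proc/sys/vm. A minimal sketch for inspecting them follows; the function name is made up, the list of knobs is a selection rather than the complete set, and on other systems (or in restricted containers) the result may simply be empty.

```python
from pathlib import Path

# A selection of Linux writeback sysctls; other systems will not have them.
WRITEBACK_KNOBS = (
    "dirty_background_ratio",     # % of memory dirty before background writeback starts
    "dirty_ratio",                # % of memory dirty before writers are throttled
    "dirty_expire_centisecs",     # age at which dirty data becomes eligible for writeback
    "dirty_writeback_centisecs",  # how often the flusher threads wake up
)


def writeback_settings(base="/proc/sys/vm"):
    """Return the writeback knobs that are readable as integers."""
    settings = {}
    for name in WRITEBACK_KNOBS:
        try:
            settings[name] = int(Path(base, name).read_text().strip())
        except (OSError, ValueError):
            continue  # knob missing or unreadable on this system
    return settings


if __name__ == "__main__":
    for name, value in writeback_settings().items():
        print(name, "=", value)
```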
When performing a write that’s backed by a Kernel and/or a library buffer, it is important to make sure that the data actually reaches the disk, since it might be buffered or cached somewhere along the way. Errors will appear when the data is flushed to disk, which can happen while syncing or closing the file.
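In practice this means flushing both layers and checking for errors at that point. A minimal sketch, with a hypothetical helper name and file name; it does not cover everything a real database would do for durability.

```python
import os
import tempfile


def durable_write(path, data):
    """Write `data` and return only after fsync has succeeded."""
    with open(path, "wb") as f:
        f.write(data)          # may sit in the user space buffer
        f.flush()              # user space buffer -> Page Cache
        os.fsync(f.fileno())   # Page Cache -> storage; raises OSError on failure


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "journal.bin")
        try:
            durable_write(target, b"committed record\n")
        except OSError as exc:
            # This is where a failed flush would surface.
            print("write was not persisted:", exc)
        else:
            print("persisted", os.path.getsize(target), "bytes")
```

Skipping either step leaves a window in which the write call has returned successfully but the data exists only in memory.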
Standard IO uses the read() and write() syscalls for performing IO operations. It takes user space buffers created by the process and, during reads, fills them with data from the Page Cache.
When reading data, the cache is looked up first. If the data is absent, a Page Fault is triggered and the contents are paged in. This means that a read performed against a currently unmapped area will take longer, and because the caching layer is transparent to the user, the application sees only the added latency.
During writes, the buffer contents are first written to the Page Cache. This means that, when using Standard IO, data does not reach the disk right away. The actual hardware write is done when the Kernel decides it’s time to perform a writeback of the dirty page.
There are situations when it’s undesirable to use the Kernel cache layer to perform IO. In such cases, the O_DIRECT flag can be passed when opening a file. It instructs the Operating System to bypass the Page Cache and perform IO operations directly against the block device. This means that the buffer can be flushed directly to the disk, without copying its contents to the corresponding page first and waiting for the Kernel to trigger a writeback.
For a “traditional” application, using Direct IO will most likely cause a performance degradation rather than a speedup, but in the right hands it can help to gain fine-grained control over IO operations and improve performance. Applications using this type of IO usually implement their own application-specific caching layer.
Using Direct IO is often frowned upon by the Kernel developers, and it goes so far that the Linux man page quotes Linus Torvalds: “The thing that has always disturbed me about O_DIRECT is that the whole interface is just stupid”.
However, databases such as PostgreSQL and MySQL use Direct IO for a reason. Developers gain more fine-grained control over data access patterns, possibly using a custom IO Scheduler and an application-specific Buffer Cache. For example, PostgreSQL uses Direct IO for the WAL (write-ahead log): the write has to be performed as fast as possible while ensuring its durability, and since the data won’t be immediately reused, bypassing the Kernel Page Cache won’t result in a performance degradation.
A direct read fetches the data straight from the disk, even if it was recently accessed and might be sitting in the cache. This helps to avoid creating an extra copy of the data. The same is true for writes: when a write operation is performed, the data is written directly from the user space buffers.
Because Direct I/O involves direct access to the backing store, bypassing the intermediate Kernel buffers in the Page Cache, it is required that all operations are sector-aligned (typically to a 512-byte boundary).
In other words, every operation has to start at an offset that is a multiple of 512, and the buffer size has to be a multiple of 512 as well. When using the Page Cache, alignment is not important because writes first go into main memory: when performing a write to the actual block device, the Kernel will make sure to split the page into parts of the right size and perform aligned writes towards the hardware storage.
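Here is a hedged sketch of a Direct IO write that respects these constraints. The helper name is made up, ALIGNMENT is an assumption (4096 is a multiple of 512 and matches common page and sector sizes), and an anonymous mmap is used only as a convenient way to get a page-aligned buffer from Python. Some file systems reject O_DIRECT altogether, in which case the function reports that instead of failing.

```python
import mmap
import os
import tempfile

ALIGNMENT = 4096  # assumption: a multiple of the device's sector size


def direct_write(path, payload):
    """Try to write `payload` with O_DIRECT, zero-padded to ALIGNMENT.

    Returns True on success, False if Direct IO is not available here.
    """
    if not hasattr(os, "O_DIRECT"):
        return False  # platform has no O_DIRECT flag
    # The length must be a multiple of the alignment: round up and pad.
    size = max(ALIGNMENT, -(-len(payload) // ALIGNMENT) * ALIGNMENT)
    buf = mmap.mmap(-1, size)  # anonymous mapping: page-aligned, zero-filled
    buf.write(payload)
    try:
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_DIRECT, 0o600)
    except OSError:
        buf.close()
        return False  # e.g. the file system rejects the flag
    try:
        # Offset 0, aligned address, aligned length.
        return os.write(fd, buf) == size
    except OSError:
        return False
    finally:
        os.close(fd)
        buf.close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print("direct write ok:", direct_write(os.path.join(tmp, "direct.bin"), b"record"))
```

Note that the file ends up ALIGNMENT bytes long: the padding is part of what gets written, which is one of the reasons applications using Direct IO tend to organise their data in fixed-size pages.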
For example, RocksDB makes sure that operations are block-aligned by checking it upfront (older versions allowed unaligned access by aligning in the background).
Whether or not the O_DIRECT flag is used, it is always a good idea to make sure your reads and writes are block-aligned: an unaligned access will cause multiple sectors to be loaded from the disk (or written back to disk).
Using the block size or a value that fits neatly inside of a block guarantees block-aligned I/O requests, and prevents extraneous work inside the kernel.
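The arithmetic behind this is simple enough to sketch. The helper names are made up, and the default block size of 512 just follows the figure used above.

```python
def align_down(offset, block=512):
    """Largest multiple of `block` that is <= offset."""
    return offset - (offset % block)


def align_up(offset, block=512):
    """Smallest multiple of `block` that is >= offset."""
    return align_down(offset + block - 1, block)


def is_aligned(offset, length, block=512):
    """True if both the offset and the length are multiples of `block`."""
    return offset % block == 0 and length % block == 0


def aligned_span(offset, length, block=512):
    """The (offset, length) of the aligned region that covers the request."""
    start = align_down(offset, block)
    end = align_up(offset + length, block)
    return start, end - start


if __name__ == "__main__":
    # A 100-byte read that straddles a boundary touches two blocks...
    print(aligned_span(500, 100))   # (0, 1024)
    # ...while an aligned request maps onto exactly one.
    print(aligned_span(512, 512))   # (512, 512)
```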
Nonblocking Filesystem IO
I’m adding this part here since I very often hear “nonblocking” in the context of Filesystem IO. That’s quite natural, since most of the programming interface for network and Filesystem IO is the same. But it’s worth mentioning that there’s no true “nonblocking” Filesystem IO in the sense the term has for network sockets.
O_NONBLOCK is generally ignored for regular (on disk) files, because the block device operations are usually considered non-blocking (unlike sockets, for example). The Filesystem IO delays are not taken into account by the system. Possibly this decision was made because there’s a more or less hard time bound on when the data will arrive.
For the same reason, the tools you would usually use for this, such as select and epoll, cannot be used to monitor or check the status of regular files in any meaningful way.
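This is easy to check. The sketch below, with a made-up helper name and payload, opens a regular file with O_NONBLOCK, reads from it, and then asks select and (where available, which in practice means Linux) epoll about it. On the systems I’m aware of, the read behaves like any other read, select reports the file as always ready, and epoll refuses to register it, but treat the exact results as platform-dependent.

```python
import errno
import os
import select
import tempfile


def probe_regular_file():
    """Return (data read with O_NONBLOCK, select says readable, epoll errno name)."""
    with tempfile.NamedTemporaryFile() as tmp:
        tmp.write(b"payload")
        tmp.flush()
        fd = os.open(tmp.name, os.O_RDONLY | os.O_NONBLOCK)
        try:
            data = os.read(fd, 64)  # the flag is accepted but changes nothing here
            readable, _, _ = select.select([fd], [], [], 0)
            ready = fd in readable
            epoll_error = None
            if hasattr(select, "epoll"):
                ep = select.epoll()
                try:
                    ep.register(fd, select.EPOLLIN)
                except OSError as exc:
                    epoll_error = errno.errorcode.get(exc.errno, str(exc.errno))
                finally:
                    ep.close()
        finally:
            os.close(fd)
    return data, ready, epoll_error


if __name__ == "__main__":
    print(probe_regular_file())
```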
Today we’ve discussed the most popular types of IO: Standard IO, and the use of the O_DIRECT flag, an optimisation often used by database developers in order to gain control over the buffer caches that Standard IO delegates to the Kernel. We’ve covered where it’s used, how it works, where it can be useful and what downsides it has.
It is hard to find an optimal post size given there’s so much material to cover, but it felt about right to have a clear break after Standard IO before moving on to Part 2, featuring Memory Mapping, Vectored IO and Page Cache Optimisations.
If you find anything to add or there’s an error in my post, do not hesitate to contact me; I’ll be happy to make the corresponding updates.
If you liked the post and would like to be notified about the next parts, you can follow me on twitter.