On Disk IO, Part 1: Flavors of IO

αλεx π
Sep 3, 2017 · 9 min read

If you like this story, check out my upcoming book on Database Internals!

The series consists of 5 pieces:

A new series on Distributed Systems concepts in Databases can be found here.

Knowing how IO works and understanding the use-cases and trade-offs of algorithms and storage systems can make the lives of developers and operators much better: they will be able to make better choices upfront (based on what’s under the hood of the database they’re evaluating), troubleshoot performance issues when the database misbehaves (by comparing their workloads to the ones the database of their choice is intended for) and tune their stack (by balancing load, switching to a different medium, file system, operating system, or picking a different index type).

Flavors of IO, according to Wikipedia

While Network IO is frequently discussed and talked about, Filesystem IO gets much less attention. Partly, the reason is that Network IO has many more features and implementation details, varying from one operating system to another, while Filesystem IO offers a much smaller set of tools. Also, in modern systems people mostly use databases as their means of storage, so applications communicate with them through drivers over the network, and Filesystem IO is left for database developers to understand and take care of. I still believe it is important to understand how data is written to disk and read from it.

There are several “flavors” of IO (some functions omitted for brevity):

Let’s start by discussing Standard IO combined with some “userland” optimizations, as this is what application developers end up using the most.

Buffered IO

Sector/Block/Page

In summary, Virtual Memory pages map to Filesystem blocks, which map to Block Device sectors.
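To see the three sizes side by side, here is a minimal C sketch for Linux: it queries the Virtual Memory page size via sysconf, the filesystem block size via statvfs, and the logical sector size via the BLKSSZGET ioctl. The device path is a placeholder, and reading it usually requires root.

```c
/* Query page, block, and sector sizes on Linux.
 * The device path below is a placeholder; opening it usually requires root. */
#include <fcntl.h>
#include <linux/fs.h>     /* BLKSSZGET */
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/statvfs.h>
#include <unistd.h>

int main(void) {
    /* Virtual Memory page size (commonly 4096 bytes). */
    long page_size = sysconf(_SC_PAGESIZE);
    printf("page size:   %ld\n", page_size);

    /* Filesystem block size for the filesystem backing "/". */
    struct statvfs st;
    if (statvfs("/", &st) == 0)
        printf("block size:  %lu\n", (unsigned long)st.f_bsize);

    /* Logical sector size of the underlying block device. */
    int fd = open("/dev/sda", O_RDONLY);  /* placeholder device */
    if (fd >= 0) {
        int sector_size = 0;
        if (ioctl(fd, BLKSSZGET, &sector_size) == 0)
            printf("sector size: %d\n", sector_size);
        close(fd);
    }
    return 0;
}
```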

Standard IO

During writes, buffer contents are first written to the Page Cache. This means that data does not reach the disk right away. The actual hardware write is done when the Kernel decides it’s time to perform a writeback of the dirty page.

Standard IO takes a user space buffer and copies its contents to the Page Cache. When the O_DIRECT flag is used, the buffer is written directly to the block device.

Page Cache

How Buffered IO works: applications perform reads and writes through the Kernel Page Cache, which allows sharing pages between processes, serving reads from the cache and throttling writes to reduce IO.

When a read operation is performed, the Page Cache is consulted first. If the data is already loaded in the Page Cache, it is simply copied out for the user: no disk access is performed and the read is served entirely from memory. Otherwise, the file contents are loaded into the Page Cache and then returned to the user. If the Page Cache is full, the least recently used pages are flushed to disk and evicted from the cache to free space for new pages.

A write() call simply copies the user-space buffer to the kernel Page Cache, marking the written page as dirty. Later, the kernel writes the modifications to disk in a process called flush or writeback. Actual IO normally does not happen immediately. Meanwhile, read() will serve data from the Page Cache instead of reading the (now outdated) disk contents. As you can see, the Page Cache is populated both on reads and writes.
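A minimal sketch of this behavior (the file name is illustrative): write() returns as soon as the bytes are copied into the Page Cache, and only fsync() forces the dirty pages out to the device.

```c
/* Buffered write: write() only copies into the Page Cache;
 * fsync() forces the dirty pages to be written back to disk. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    int fd = open("data.bin", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    const char buf[] = "hello, page cache";
    /* Returns once the bytes are in the Page Cache: the page is
     * now dirty, but no device IO has necessarily happened yet. */
    if (write(fd, buf, strlen(buf)) < 0) { perror("write"); return 1; }

    /* Block until the kernel has written the dirty pages back. */
    if (fsync(fd) < 0) { perror("fsync"); return 1; }

    close(fd);
    return 0;
}
```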

Pages marked dirty will be flushed to disk, since their cached representation now differs from the one on disk. This process is called writeback. Writeback has potential drawbacks, such as queuing up IO requests, so it’s worth understanding the thresholds and ratios used for writeback and checking queue depths to make sure you can avoid throttling and high latencies. You can find more information on tuning Virtual Memory in the Linux Kernel Documentation.
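Besides tuning the kernel thresholds, an application can initiate writeback of a specific range itself. A hedged sketch using sync_file_range, which is Linux-specific and, unlike fsync, makes no durability guarantees about metadata:

```c
/* Kick off writeback for a byte range of an already-written file
 * without blocking for completion (fd and range are illustrative). */
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

int start_writeback(int fd, off_t offset, off_t nbytes) {
    /* Starts write-out of dirty pages in the range; does not wait. */
    return sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);
}
```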

The logic behind the Page Cache is explained by the Temporal Locality principle, which states that recently accessed pages will be accessed again at some point in the near future.

Another principle, Spatial Locality, implies that elements located near a recently accessed element have a good chance of being accessed as well. This principle is used in a process called “prefetch” that loads file contents ahead of time, anticipating their access and amortizing some of the IO costs.
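Applications can help the kernel prefetch through posix_fadvise. A minimal sketch (the path and the 1 MiB range are illustrative):

```c
/* Hint the kernel about the expected access pattern:
 * POSIX_FADV_SEQUENTIAL increases readahead for the file;
 * POSIX_FADV_WILLNEED starts loading a range into the Page Cache. */
#include <fcntl.h>

int advise_prefetch(const char *path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;

    /* Whole file (len = 0) will be read front to back. */
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
    /* Pull the first 1 MiB into the Page Cache ahead of time. */
    posix_fadvise(fd, 0, 1 << 20, POSIX_FADV_WILLNEED);
    return fd;
}
```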

Page Cache also improves IO performance by delaying writes and coalescing adjacent reads.

Disambiguation: the Buffer Cache and the Page Cache, previously entirely separate concepts, were unified in the 2.4 Linux kernel. Nowadays it’s mostly referred to as the Page Cache, but some people still use the term Buffer Cache, which has become synonymous.

The Page Cache, depending on the access pattern, holds file chunks that were recently accessed or may be accessed soon (prefetched or marked with fadvise). Since all IO operations happen through the Page Cache, operation sequences such as read-write-read can be served from memory, without subsequent disk accesses.

Delaying Errors
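A consequence of deferred writeback is deferred error reporting: a write() that merely dirtied a page can “succeed” even though the later asynchronous flush fails, and the error only surfaces on a subsequent fsync(). A hedged sketch of the check every buffered writer should perform:

```c
/* With buffered IO a failed device write may only be reported later:
 * always check fsync()'s return value, not just write()'s. */
#include <stdio.h>
#include <unistd.h>

int write_durably(int fd, const void *buf, size_t len) {
    if (write(fd, buf, len) < 0)   /* may "succeed": only dirties a page */
        return -1;
    if (fsync(fd) < 0) {           /* deferred write errors surface here */
        perror("fsync");
        return -1;
    }
    return 0;
}
```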

Direct IO

For a “traditional” application, using Direct IO will most likely cause performance degradation rather than a speedup, but in the right hands it can help gain fine-grained control over IO operations and improve performance. Usually, applications using this type of IO implement their own application-specific caching layer.

How Direct IO works: the application bypasses the Page Cache, so writes go to the hardware storage right away. This might result in performance degradation, since the Kernel normally buffers and caches writes and shares cache contents between applications. When used well, it can result in major performance gains and improved memory usage.

Using Direct IO is often frowned upon by the Kernel developers. It goes so far that the Linux man page quotes Linus Torvalds: “The thing that has always disturbed me about O_DIRECT is that the whole interface is just stupid”.

However, databases such as PostgreSQL and MySQL use Direct IO for a reason. Developers can ensure fine-grained control over data access, possibly using a custom IO Scheduler and an application-specific Buffer Cache. For example, PostgreSQL uses Direct IO for the WAL (write-ahead log), since writes there have to be performed as fast as possible while ensuring durability; this optimization is safe because the data won’t be immediately reused, so writing it past the Page Cache won’t cause performance degradation.

Opening the same file with Direct IO and through the Page Cache simultaneously is discouraged, since direct operations will be performed against the disk device even if the data is in the Page Cache, which may lead to undesired results.
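A minimal sketch of a Direct IO write, assuming a 512-byte alignment requirement (the real value depends on the device and filesystem; alignment rules are covered in the next section). The file name is illustrative.

```c
/* Direct IO write: O_DIRECT bypasses the Page Cache, which requires
 * the buffer, offset, and length to be aligned. 512 is an assumption;
 * query the real sector/block size at startup. */
#define _GNU_SOURCE           /* for O_DIRECT on Linux */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define ALIGNMENT 512

int main(void) {
    int fd = open("data.bin", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) return 1;

    /* Allocate an aligned buffer; a plain malloc()'d one would
     * typically make the write fail with EINVAL. */
    void *buf;
    if (posix_memalign(&buf, ALIGNMENT, ALIGNMENT) != 0) return 1;
    memset(buf, 'x', ALIGNMENT);

    /* Offset (0) and length (one full block) are aligned too. */
    ssize_t n = pwrite(fd, buf, ALIGNMENT, 0);

    free(buf);
    close(fd);
    return n == ALIGNMENT ? 0 : 1;
}
```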

Block Alignment

Examples of unaligned writes (highlighted). Left to right: the write neither starts, nor ends on the block boundary; the write starts on the block boundary, but the write size isn’t a multiple of the block size; the write doesn’t start on the block boundary.

In other words, every operation has to have a starting offset that is a multiple of 512, and the buffer size has to be a multiple of 512 as well. When using the Page Cache, alignment is not important, because writes first go to memory: when the actual block device write is performed, the Kernel will make sure to split the page into parts of the right size and perform aligned writes towards the hardware.

Examples of aligned writes (highlighted). Left to right: the write starts and ends on the block boundary and is exactly the size of the block; the write starts and ends on the block boundary and has a size that is a multiple of the block size.

For example, RocksDB makes sure that operations are block-aligned by checking this upfront (older versions allowed unaligned access by performing alignment in the background).

Whether or not the O_DIRECT flag is used, it is always a good idea to make sure your reads and writes are block-aligned. Crossing a segment boundary will cause multiple sectors to be loaded from (or written back to) disk, as shown in the images above. Using the block size, or a value that fits neatly inside of a block, guarantees block-aligned I/O requests and prevents extraneous work inside the kernel.
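A common trick to keep sizes aligned is to round every request up to the next multiple of the block size. A tiny sketch, assuming a 4096-byte, power-of-two block size (query the real value as shown earlier):

```c
/* Round a requested size up to the next multiple of the block size,
 * so requests never straddle block boundaries unnecessarily.
 * 4096 is an assumed block size and must be a power of two here. */
#include <stddef.h>

#define BLOCK_SIZE 4096

static size_t align_up(size_t size) {
    return (size + BLOCK_SIZE - 1) & ~((size_t)(BLOCK_SIZE - 1));
}
```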

Nonblocking Filesystem IO

O_NONBLOCK is generally ignored for regular files, because block device operations are considered non-blocking (unlike socket operations, for example): Filesystem IO delays are not taken into account by the system. Possibly, this decision was made because there is a more or less hard time bound on operation completion.

For the same reason, the tools you would usually use in a network context, such as select and epoll, do not allow monitoring and/or checking the status of regular files.
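For instance, registering a regular file with epoll fails outright; a small sketch of the behavior (on Linux, epoll_ctl reports EPERM for regular files, since they are always considered “ready”):

```c
/* epoll refuses regular files: epoll_ctl() fails with EPERM. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/epoll.h>
#include <unistd.h>

int main(void) {
    int epfd = epoll_create1(0);
    int fd = open("data.bin", O_RDONLY | O_CREAT, 0644);

    struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) < 0)
        printf("epoll_ctl failed: %s\n", strerror(errno)); /* EPERM */

    close(fd);
    close(epfd);
    return 0;
}
```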

Closing Words

The next post will feature Memory Mapping, Vectored IO and Page Cache Optimisations.

If you find anything to add or there’s an error in my post, do not hesitate to contact me, I’ll be happy to make corresponding updates.

If you liked the post and would like to be notified about the next parts, you can follow me on Twitter or subscribe to my mailing list.
