On Disk IO, Part 5: Access Patterns in LSM Trees
If you like the series, check out my upcoming book on Database Internals!
The series consists of 5 pieces:
- Flavors of IO: Page Cache, Standard IO, O_DIRECT
- More Flavors of IO: mmap, fadvise, AIO
- LSM Trees
- Access Patterns in LSM Trees
- B-Trees and RUM Conjecture
New series on Distributed System concepts in Databases can be found here.
In the first and second parts, we discussed the underlying Operating System mechanisms that help to perform writes on disk. In the third part, we started talking about an immutable on-disk data structure, the LSM Tree. Today we’ll be discussing random and sequential access patterns in LSM Trees.
An access pattern is the order in which a program reads and writes data. In general, we distinguish between random and sequential access patterns. But, of course, nothing is absolute. Fully sequential reads are not possible for ad-hoc queries, since the data has to be located first, but as soon as it is located, it can be read sequentially.
By sequential access we usually mean reads that move monotonically from lower offsets to higher ones, where each read starts immediately after the previous one ends.
Random access is reading non-contiguous chunks of data. It usually involves disk seeks, skipping portions of the file in order to locate the data. Hop size is often hard to predict and spans many pages (for example, when traversing a B-Tree on disk, we have to skip entire levels in order to continue the search). In summary, sequential access implies reading contiguous blocks monotonically and random access is pretty much anything else.
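To make the distinction concrete, here is a minimal sketch that checks whether a trace of reads is sequential in the sense described above. The trace format, a list of (offset, length) pairs, and the function name are assumptions made for illustration only.

```python
def is_sequential(reads):
    """Return True if every read starts exactly where the previous one ended.

    `reads` is a list of (offset, length) pairs (a hypothetical trace format).
    """
    return all(
        prev_off + prev_len == off
        for (prev_off, prev_len), (off, _) in zip(reads, reads[1:])
    )


# Contiguous 4 KiB reads, one right after another.
sequential_trace = [(0, 4096), (4096, 4096), (8192, 4096)]
# Offsets jump back and forth across the file.
random_trace = [(0, 4096), (40960, 4096), (8192, 4096)]
# Monotonic, but with a gap between reads, so still not sequential.
gapped_trace = [(0, 4096), (8192, 4096)]

print(is_sequential(sequential_trace))
print(is_sequential(random_trace))
print(is_sequential(gapped_trace))
```

Note that the gapped trace is monotonic but still fails the check: by the definition above, anything that is not both contiguous and monotonic counts as random.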
Sequential access is often preferred because of its predictability. In one of the previous posts we discussed the fact that avoiding Page Faults allows for better performance, since reads are served from RAM rather than from disk. When data is read sequentially, the kernel may load pages ahead of time in a process called prefetching: speculative reading from disk based on a prediction of future requests. In addition, sequential reads avoid additional seeks.
Optimising for sequential reads and for sequential writes are orthogonal problems. Records written sequentially are not always read together (for example, point queries in a sequentially written LSM Tree are still random). Similarly, data read together wasn’t necessarily put on disk in a sequential manner (for example, a sequential read of a level in a B-Tree, which might have been updated randomly).
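A program can also tell the kernel which access pattern to expect. The sketch below uses Python's os.posix_fadvise with POSIX_FADV_SEQUENTIAL, which is only available on some Unix platforms, so the call is guarded. The function name, chunk size and temporary file are assumptions for the example, and how much readahead the kernel actually performs is up to the kernel.

```python
import os
import tempfile


def read_sequentially(path, chunk_size=64 * 1024):
    """Read a file front to back, hinting the kernel about the pattern."""
    total = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        # posix_fadvise does not exist on every platform, so guard the call.
        if hasattr(os, "posix_fadvise"):
            # offset=0 and len=0 apply the advice to the whole file.
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
        while True:
            chunk = os.read(fd, chunk_size)
            if not chunk:
                break
            total += len(chunk)
    finally:
        os.close(fd)
    return total


# Create a throwaway file so the example is self-contained.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * 200_000)

bytes_read = read_sequentially(tmp.name)
os.remove(tmp.name)
print(bytes_read)
```

The hint is advisory: the read loop is correct with or without it, and only the amount of prefetching may differ.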
Random Reads on SSDs
On HDDs, sequential access is preferred to random access because of their physical organisation and the way they work. The read/write head is attached to a mechanical arm that has to travel across the disk in order to read blocks of data, and the disk has to rotate to position the track sector under the read/write head. All of this involves a non-trivial amount of movement. The Operating System tries to amortise the costs by caching, buffering and scheduling operations optimally.
SSDs are made of electronics and do not have any moving components. In this regard, SSDs are inherently different from HDDs and there’s no performance degradation caused by where data is stored on disk physically. However, current SSD technology suffers from the performance degradation caused by write amplification. Lack of moving parts allows for several other characteristics, such as parallelism, but we won’t be discussing them in this article.
The minimal read unit on an SSD is a page. Reads and writes are performed in pages. Deleting a page worth of data does not immediately remove the data physically. Instead, the page is marked as stale and waits for Garbage Collection to reclaim the space.
Because writes are performed in pages, even if a single byte has to be updated, the whole page will be written anyway. At the same time, because of the specifics of NAND storage, pages cannot be updated in place, so writes can only be performed into empty pages. These two properties account for the write amplification on SSDs.
After an extensive amount of random writes, the FTL (Flash Translation Layer) runs out of free pages and has to perform Garbage Collection: a process that reads active pages, consolidates them and writes them into free blocks, freeing the blocks occupied by stale pages and reclaiming disk space.
Some SSDs implement background Garbage Collection, which takes advantage of idle time to consolidate blocks and reclaim stale pages before new data has to be written. This ensures that future foreground writes have enough free pages available. But given enough write pressure, the Garbage Collection process may not keep up with the amount of work, negatively impacting write performance.
A key goal of log-structured systems is sequentialising writes. However, if the FTL is shared by two log-structured applications (or even a single application with multiple append streams), the incoming data into the FTL is likely to look random or disjoint. You can read more about “stacking” log operations in this paper.
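A bit of arithmetic shows how quickly small updates add up. The sketch below computes the ratio of bytes physically written to bytes the application changed, assuming a 4096-byte page and a write that starts on a page boundary. Both the page size and the function name are assumptions for illustration; real devices differ.

```python
import math


def write_amplification(logical_bytes, page_size=4096):
    """Ratio of bytes physically written to bytes the application changed.

    Assumes the write starts on a page boundary and that the device
    always writes whole pages.
    """
    pages = math.ceil(logical_bytes / page_size)
    return (pages * page_size) / logical_bytes


one_byte = write_amplification(1)      # a whole page written for 1 byte
full_page = write_amplification(4096)  # exactly one full page
print(one_byte, full_page)
```

Under these assumptions a one-byte update costs a full page of physical writes, while a full-page write has no overhead at all. This only models the page granularity, not the extra copies made by Garbage Collection.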
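The following toy model illustrates the consolidation step: live pages are copied out of a victim block into a spare one, after which the whole victim block can be erased. It is a deliberately simplified sketch, not how any particular FTL works; the STALE and FREE markers and the function name are made up for the example.

```python
STALE = "<stale>"  # page whose data was overwritten or deleted
FREE = "<free>"    # erased page, ready to be written


def garbage_collect(victim, spare):
    """Move live pages from `victim` into `spare`, then erase `victim`.

    Blocks are lists of page payloads. Returns the number of pages that
    had to be copied, i.e. the extra writes caused by the collection.
    """
    live = [page for page in victim if page not in (STALE, FREE)]
    if len(live) > spare.count(FREE):
        raise ValueError("not enough free pages in the spare block")
    for payload in live:
        spare[spare.index(FREE)] = payload
    # Erasure happens per block, so every page of the victim becomes free.
    victim[:] = [FREE] * len(victim)
    return len(live)


victim = ["a", STALE, "c", STALE]
spare = [FREE, FREE, FREE, FREE]
copied = garbage_collect(victim, spare)
print(copied, spare, victim)
```

Even in this small example, reclaiming two stale pages requires rewriting two live ones, which is where the additional write amplification comes from.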
We’ve discussed multiple things one has to take into consideration when working with SSDs. Writing complete pages is better than writing data smaller than the page size, since the smallest unit of storage on an SSD is a page. Because updating a page effectively allocates a new page and invalidates the previous one, updates may result in Garbage Collection. It’s better to keep write operations page-aligned in order to avoid additional write amplification. And last, keeping data with a similar lifecycle together (e.g. data that would be both written and discarded at the same time) is beneficial for performance. Most of these points speak in favour of immutable LSM-like storage rather than systems that allow in-place updates: writes are batched and SSTables are written sequentially, files are immutable and, when deleted, the whole file is invalidated at once.
Some data structures are inherently sequential, for example the Write Ahead Log used by databases and filesystems. It is used to facilitate durability: changes to the data files are first appended to the log sequentially.
When the main storage catches up and records are committed to the data files, the commit log segment holding the recovery data for them is discarded. If the process dies before the main storage has a chance to catch up, the Write Ahead Log is replayed to restore the state the database had before the restart. If we follow this procedure, data files don’t have to be flushed to disk on every operation: operations can be batched together, while still guaranteeing durability. Using a Write-Ahead Log significantly reduces the number of writes for both mutable and immutable storage types.
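The effect of alignment can be illustrated by counting how many pages a write touches. This is a minimal sketch assuming a 4096-byte page; the function name and the page size are assumptions for the example.

```python
def pages_touched(offset, length, page_size=4096):
    """Number of pages covered by a write of `length` bytes at `offset`."""
    if length <= 0:
        return 0
    first = offset // page_size
    last = (offset + length - 1) // page_size
    return last - first + 1


aligned = pages_touched(0, 4096)      # starts and ends on page boundaries
unaligned = pages_touched(100, 4096)  # same size, shifted by 100 bytes
print(aligned, unaligned)
```

The same 4096 bytes cost one page when aligned and two pages when shifted by 100 bytes, so the unaligned write doubles the amount of data the device has to rewrite.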
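The append-then-replay cycle can be sketched in a few lines. This is a simplified illustration, not the format of any real database: the one-JSON-record-per-line layout, the function names and the temporary path are all assumptions, and segment management and checksums are left out.

```python
import json
import os
import tempfile


def wal_append(log_file, key, value):
    """Append one record to the log and force it to stable storage."""
    log_file.write(json.dumps({"key": key, "value": value}) + "\n")
    log_file.flush()
    os.fsync(log_file.fileno())


def wal_replay(path):
    """Rebuild in-memory state by re-applying records in log order."""
    state = {}
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            state[record["key"]] = record["value"]
    return state


wal_path = os.path.join(tempfile.mkdtemp(), "wal.log")
with open(wal_path, "a", encoding="utf-8") as log:
    wal_append(log, "a", 1)
    wal_append(log, "b", 2)
    wal_append(log, "a", 3)  # a later record overrides an earlier one

# Simulate a restart: the only thing that survived is the log file.
recovered = wal_replay(wal_path)
print(recovered)
```

Every write here is an append to the end of a single file, which is what makes the log inherently sequential. Calling fsync on every record is the simplest option; batching several records per fsync is a common trade-off between latency and throughput.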
It’s often advised to use a separate physical device for Write Ahead Log to make sure both memory table flushes and WAL writes are sequential. There are many other reasons to do so, too: to avoid IO saturation, for better failover, more predictable latencies.
LSM Trees use memory tables, where data is stored before it gets to the main storage, for serving reads and batching writes together. After reaching a size threshold, the memory table is written to disk.
Here, the memory table serves as a buffer: read, write and update operations are performed against it, which allows batching a few items together. When data is written to disk, it’s done sequentially, in one pass. This amortises the cost of small random writes and converts them into larger sequential allocations on disk, transforming updates of logically unrelated data into physically sequential I/O.
Unlike the Write-Ahead Log (which writes items in the incoming order), memory tables pre-sort the data before it reaches the disk in order to facilitate sequential read access. Records that are more likely to be read together are written together.
As you can see, all write operations in LSM Trees are sequential: Write-Ahead Log appends, Memtable flushes, Compactions. Using per-SSTable indexes or pre-sorting data can also help to make at least some read operations sequential. This can only be done to a certain extent, as reads have to be performed against multiple files and then merged together.
At least for now, this is going to be the last article in the IO series: it’s been a lot of work to prepare all of it for publication and some editing is due. Over the next several months I’ll be refining the existing articles and adding more useful information to them.
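Here is a minimal sketch of a memory table that buffers writes and flushes them in key order in a single pass. The class, the entry-count threshold and the tab-separated file layout are assumptions for illustration; real implementations use a sorted in-memory structure and a binary SSTable format.

```python
import os
import tempfile


class MemTable:
    """Buffers writes in memory and flushes them to disk in sorted order."""

    def __init__(self, max_entries=3):
        self.entries = {}
        self.max_entries = max_entries

    def put(self, key, value):
        # Updates to the same key are absorbed in memory.
        self.entries[key] = value

    def is_full(self):
        return len(self.entries) >= self.max_entries

    def flush(self, path):
        """Write all entries sorted by key, in one sequential pass."""
        with open(path, "w", encoding="utf-8") as f:
            for key in sorted(self.entries):
                f.write(f"{key}\t{self.entries[key]}\n")
        self.entries.clear()


table = MemTable()
for key, value in [("c", "1"), ("a", "2"), ("b", "3"), ("a", "4")]:
    table.put(key, value)

sstable_path = os.path.join(tempfile.mkdtemp(), "sstable.txt")
if table.is_full():
    table.flush(sstable_path)

with open(sstable_path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(lines)
```

The writes arrive in the order c, a, b, a, but the file on disk is ordered by key, and the two writes to "a" collapse into one record.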
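The merge step on the read path can be sketched with heapq.merge from the Python standard library. The example assumes each SSTable is an already sorted list of (key, value) pairs and that tables are passed newest first, so the newest value for a key wins. The function name and data are made up for illustration, and tombstones for deleted keys are not handled.

```python
import heapq


def merged_scan(sstables):
    """Merge sorted runs into one sorted stream, newest value per key.

    `sstables` is ordered from newest to oldest; each table is a list of
    (key, value) pairs sorted by key, with unique keys within a table.
    """
    # Tag each entry with the table's age so that, for equal keys,
    # the entry from the newest table sorts first.
    tagged = (
        [(key, age, value) for key, value in table]
        for age, table in enumerate(sstables)
    )
    last_key = object()  # sentinel that never equals a real key
    for key, _age, value in heapq.merge(*tagged):
        if key != last_key:
            yield key, value
            last_key = key


newest = [("a", "A2"), ("c", "C2")]
oldest = [("a", "A1"), ("b", "B1")]
result = list(merged_scan([newest, oldest]))
print(result)
```

Each input is read front to back, so every individual file is scanned sequentially even though the query touches several files.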