Writing A Database: Part 2 — Write Ahead Log

Daniel Chia
4 min read · Sep 21, 2017


As always, code at https://github.com/danchia/ddb

So Uh, Your Data Isn’t Very Durable…

In Part 1, I wrote a really simple server using gRPC and Go that serves Get and Put requests from an in-memory map. If the server quit, it lost all the data, which I must admit is pretty terrible for a database.

I implemented write ahead logging, allowing recovery of the in-memory state upon server restart. While the idea itself is really simple, implementing it turned out to be quite a big can of worms! I ended up looking at how LevelDB, Cassandra and etcd approach this problem.

The Write Ahead Log

The Write Ahead Log (WAL) is a commonly used technique in database systems to maintain atomicity and durability of writes. The key idea behind the WAL is that before we make any actual modifications to the database state, we first log the complete set of actions we want to be atomic and durable to durable storage (e.g. disk).

Durability is provided by writing the intended mutation to the WAL before applying the change to, for example, the in-memory representation. Because the WAL is written first, should the database then crash, we can recover the mutation and reapply it if necessary.

Atomicity is a little more subtle. Suppose a mutation requires changes A, B and C to happen, but we have no means of atomically applying all of them at once. We could first log

intending to apply A
intending to apply B
intending to apply C

and only then start making the actual applications. Should the server crash halfway, we can look at the log and see what operations potentially need to be redone.

In DDB, the WAL is an append-only file of records:

record:
length: uint32 // length of data section
checksum: uint32 // CRC32 checksum of data
data: byte[length] // serialized ddb.internal.LogRecord proto

Since serialized protos are not self-describing, we need a length field to know how big the data payload is. Additionally, to guard against various forms of corruption (and bugs!) we have a CRC32 checksum of the data.

Performance Is Critical, but Disk is Slow

Generally, the WAL ends up in the critical path of all mutating operations, since we have to perform our write-ahead logging before proceeding with the mutations.

You might think we’d be good to go once the File.Write call returns; however, that’s generally not the case due to OS caches.

I’ll use Linux as an example here. In Linux, files are cached in memory in the page cache, and furthermore, disk blocks are cached in the buffer cache. These caches help improve performance since applications often read what they wrote recently, and applications don’t always read or write in sequential order.

Linux typically operates in write-back mode, where the buffer cache is only flushed to disk periodically (~30 seconds). This means that even after File.Write completes, our data is not guaranteed to be on disk yet — a power loss at this point could result in data loss. To force your data to disk, you additionally have to call fsync().

I wrote a quick benchmark to judge the performance of my WAL. The test repeatedly logs either 100 bytes or 1KB, calling fsync() every n calls. These tests were run on my Windows 10 machine with a local SSD.

Benchmarking

DDB WAL Benchmarks
BenchmarkAppend100NoSync       529 ns/op   200.23 MB/s
BenchmarkAppend100NoBatch   879939 ns/op     0.12 MB/s
BenchmarkAppend100Batch10    88587 ns/op     1.20 MB/s
BenchmarkAppend100Batch100    9727 ns/op    10.90 MB/s
BenchmarkAppend1000NoSync     2213 ns/op   455.45 MB/s
BenchmarkAppend1000NoBatch  906057 ns/op     1.11 MB/s
BenchmarkAppend1000Batch10   94318 ns/op    10.69 MB/s
BenchmarkAppend1000Batch100  14384 ns/op    70.08 MB/s

Not surprisingly, fsync() is slow! A 100 byte log entry takes 529ns without syncing, and 880us with syncing. 880us would limit us to ~1.1k QPS. This would probably be much worse on a regular HDD, where the disk seek would probably cost us ~10ms. It’s not uncommon with HDD to use a dedicated drive just for the WAL in order to reduce the seek time.

To sanity check my results, I ran etcd’s WAL benchmarks.

etcd WAL Benchmarks
BenchmarkWrite100EntryWithoutBatch   868817 ns/op    0.12 MB/s
BenchmarkWrite100EntryBatch10         79937 ns/op    1.35 MB/s
BenchmarkWrite100EntryBatch100         9512 ns/op   11.35 MB/s
BenchmarkWrite1000EntryWithoutBatch  875304 ns/op    1.15 MB/s
BenchmarkWrite1000EntryBatch10        84618 ns/op   11.92 MB/s
BenchmarkWrite1000EntryBatch100       12380 ns/op   81.50 MB/s

etcd’s single 100 byte write is 869us, so I’m pretty close! Their larger batches do a fair bit better, but that’s not surprising given their implementation is more optimized. I suspect if I were to measure a latency histogram instead, their performance would likely have shorter tail latencies.

To Sync Or Not To Sync?

Given that sync-ing is so expensive, what do other databases do?

  • LevelDB actually defaults to not syncing. They claim that non-sync writes can often be used safely, and that a user should choose when they wish to sync.
  • Cassandra defaults to periodic syncing every 10s. Writes are acknowledged once placed in the OS file buffer.
  • etcd has some logic around whether to sync, but as best I can tell, user writes end up causing a sync.

I decided for now to err on the side of correctness and always sync. One potential optimization I’ll be on the lookout for is batching updates to the WAL, amortizing the cost of each sync across many writes.

Other Sneaky Correctness Issues / Things I Didn’t Do

Most WAL implementations write their logs in segments. After a segment reaches a certain size, the WAL starts a new segment. This makes it easy to truncate earlier parts of the log once they are no longer needed.

Dealing with multiple files, or really the filesystem in general, can get tricky. In particular, just like with compilers and memory, the OS is often free to re-order operations to disk, and many file operations are not atomic. Techniques like writing to a temp file and then renaming it into its final place to achieve an atomic file write are common. For kicks, you can check out this Github issue of another project lamenting the difficulty of getting ACID filesystem writes.

Next Time: Consensus

I hope to get to replication via a consensus algorithm (e.g. Raft) next.

If you enjoyed reading this, please re-share, and tweet me (@DanielChiaJH) any feedback!


Daniel Chia

Software Engineer at Google. Opinions stated are my own, not of my company.