WAL usage looks broken in modern Time Series Databases?

Photo by Stephen Dawson on Unsplash

Write-ahead logging (WAL) is a common practice among modern time series databases — Prometheus uses WAL, InfluxDB uses WAL, TimescaleDB transiently uses WAL from PostgreSQL, Cassandra also uses WAL.

Let’s look into WAL theory. WAL is used as a protection against losing of recently added data on power loss. All the incoming data must be written into write ahead log before returning success to the client. This guarantees that the data may be recovered from WAL file after power loss. Looks simple and great in theory! What’s in the practice?

Page cache and WAL

Database developers know that the operating system (OS) doesn’t write data to persistent storage on every write syscall. The data is just written into page cache residing in RAM unless direct IO is used. So data successfully written into WAL file may disappear on power loss. How to deal with this? Either to use cumbersome direct IO or to explicitly tell the OS to flush recently written data from page cache to persistent storage via fsync syscall. But fsync has a major downside — it is very slow on SSD (1K-10K rps) and extremely slooow on HDD (100 rps). For instance, 1M inserts per second may be easily slowed down to 100 inserts per second with fsync-after-each-inserted-row strategy.

What do DB devs do with slow fsync? They relax data safety guarantees in various ways:

  • Prometheus calls fsync only after big chunk of data (aka segment) is written into WAL, so all the segment data may be lost / corrupted on power loss before fsync. The data may be corrupted if the OS flushes a few pages with the written data to disk, but doesn’t flush the remaining pages. Prometheus fscync’s segments every 2 hours by default, so a lot of data may be corrupted on hardware reset.
  • Cassandra by default calls fsync on WAL only every 10 seconds, so the last 10 seconds of data may be lost / corrupted on power loss. Probably, replication can help in this case.
  • InfluxDB by default calls fsync on every write request, so it is recommended feeding InfluxDB with write requests containing 5K-10K data points in order to alleviate fsync slowness. It recommends setting wal-fsync-delay to non-zero value for workloads with high volume of writes and/or for slow HDDs, so data may be lost on power loss.
  • TimescaleDB relies on PostgreSQL’s WAL mechanism, which puts data into WAL buffers in RAM and periodically flushes them to WAL file. This means that the the data from unflushed WAL buffers is lost on power loss or on process crash.

So, modern TSDBs provide relaxed data safety guarantees — recently inserted data may be lost on power loss. The following questions arise:

  • Don’t these relaxations defeat the main purpose of write ahead logging? IMHO, the answer to this question is “yes”. Sad, but true :(
  • Are there better approaches with similar data safety guarantees exist? Yes — SSTable.

SSTable instead of WAL?

The idea is simple — just buffer data in memory and atomically flush it into SSTable-like data structure on disk without the need of WAL. The flush may be triggered either by timeout (i.e. every N seconds) or by reaching the maximum buffer size. This gives similar data safety guarantees as the “optimized WAL usage” described in the previous chapter — recently inserted data may be lost on power loss / process crash.

Careful reader may notice the difference — "optimized WAL usage” can result in data corruption, while “write directly to SSTable” approach is vulnerable to process crash. IMHO, recently written data loss on process crash has lower severity comparing to data corruption. Properly implemented database shutdown procedure significantly reduces the risk of data loss. The shutdown procedure is quite simple — stop accepting new data, then flush in-memory buffers to disk, then exit.

The following databases prefer writing directly into SSTable instead of WAL:

  • ClickHouse. By default it writes incoming data directly do persistent storage in SSTable-like format. It supports in-memory buffering via Buffer table.
  • VictoriaMetrics. It buffers incoming data in RAM and periodically flushes it to SSTable-like data structure on disk. Flush interval is hard-coded to one second.

Conclusions

WAL usage looks broken in modern time-series databases. It doesn’t guarantee data safety for recently inserted data on power loss. WAL has two additional drawbacks:

  • Write-ahead logging tends to consume significant portion of disk IO bandwidth. It is recommended to put WAL into a separate physical disk due to this drawback. “Write directly to SSTable” approach requires less disk IO bandwidth, so higher volumes of data may be consumed by the database without WAL.
  • WAL may slow down database startup times due to slow recovery step and even may lead to OOMs and crash loops.

Prometheus, InfluxDB and Cassandra already use LSM-like data structures with SSTables, so they may quickly switch to the new approach. It is unclear yet whether TimescaleDB could use the new approach, since it doesn’t use LSM.