Performance impact of Write Cache for Hard/Solid State disk drives

Crademann
Coccoc Engineering Blog
Oct 8, 2021 · 9 min read

This study is part of our research into accelerating Ceph’s IO subsystem and is mostly applicable to Ceph-specific IO workloads.

In various discussions around the internet people suggest different settings for the drive’s write cache, but there is no complete study detailing the effect of the write cache on the performance of the disk subsystem.

Let’s figure out what the write cache is first.

The write cache uses volatile (RAM) memory inside the disk drive for temporary storage of written data. The disk cache (usually 64–128 MB nowadays) is used for both reads (read-ahead) and writes.

When the write cache is enabled (the default), host writes land in the write cache first and, if no additional flags are provided, are later flushed to the storage media (platters/NAND) in the background.

Write cache Pros:

HDD: data can be re-organized before writing, sorted and flushed to the platters in chunks, allowing NCQ to work efficiently and spend less time on seeks.

SSD: theoretically, it reduces write amplification and NAND cell wear, improving the drive’s lifespan. SSDs can’t rewrite a single 4k page; they have to rewrite a whole block, which is usually much bigger than a page. The write cache can organize written data into bigger chunks, reducing the amount of data actually written.

Write cache Cons:

The write cache is volatile memory, so written data can be suddenly lost during a power failure.

There are two ways to control the persistence of written data from the write cache to the storage media:

https://www.kernel.org/doc/Documentation/block/writeback_cache_control.txt

a) the FLUSH CACHE command to the disk drive, instructing it to flush all data from the volatile cache to the media. Controlled by setting the REQ_PREFLUSH bit.

b) the FUA bit, which is more granular and applies only to the specific data it is set on. Controlled by the REQ_FUA bit (a quick sysfs check for both is shown below).

https://en.wikipedia.org/wiki/Disk_buffer#Force_Unit_Access_(FUA)
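
From userspace you can check how the kernel treats a particular drive’s cache and whether the drive accepts FUA via sysfs (a minimal sketch; /dev/sdc is the device used in the traces later in this article, and the queue/fua attribute is only exposed on reasonably recent kernels):

# cat /sys/block/sdc/queue/write_cache
write back
# cat /sys/block/sdc/queue/fua
1

“write back” means the kernel will send flush/FUA requests for this device, “write through” means it assumes there is no volatile cache to flush; fua shows whether the device accepts the FUA bit.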

Top-grade SSDs come with super-capacitors (something like the BBU on RAID adapters with onboard cache), allowing cached data to be flushed after a power loss, so they can simply ignore FLUSH CACHE / FUA, pretend that they do not have a volatile cache, and show higher performance.

So, let’s figure out how the write cache affects IO performance with different write patterns.

To find out, I tested HDD/SSD drives with the write cache enabled/disabled and different sync settings (async / fsync / O_SYNC).
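
For SATA drives like the ones above, the write cache is usually toggled with hdparm (a sketch only; the exact commands used for these tests aren’t shown, and some drives reset the setting after a power cycle):

# hdparm -W /dev/sdX     <- report the current write-caching state
# hdparm -W0 /dev/sdX    <- disable the write cache
# hdparm -W1 /dev/sdX    <- enable the write cache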

Test setup:

HDD:

Model Family:     HGST Ultrastar 7K6000
Device Model: HGST HUS726040ALE610
Serial Number: N8GTWL6Y
LU WWN Device Id: 5 000cca 244cb5077
Firmware Version: APGNT907
User Capacity: 4,000,787,030,016 bytes [4.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-2, ATA8-ACS T13/1699-D revision 4
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)

SSD:

Model Family:     Samsung based SSDs
Device Model: SAMSUNG MZ7LH1T9HMLT-00005
Serial Number: S455NW0R407752
LU WWN Device Id: 5 002538 e7140c4e2
Firmware Version: HXT7904Q
User Capacity: 1,920,383,410,176 bytes [1.92 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-4 T13/BSR INCITS 529 revision 5
SATA Version is: SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)

CPU: 2x Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz

MEM: 8x32GB DDR4-2400 @ 2133 MT/s

MB: X10DRi

HBA: MB embedded SATA controller: Intel Corporation C610/X99 series chipset sSATA Controller, AHCI

We’ll use fio with the following base settings:

# fio -ioengine=libaio -direct=1 -invalidate=1 -bs=4k -runtime=60 -filename=/dev/sdX

and changing:

-fsync=1 - issues an fsync() call after each write, triggering FLUSH CACHE (also tested with -fdatasync; there is no difference in the final results)

-sync=1 - opens the block device file with O_SYNC, setting the FUA bit on each write

-rw=randwrite|write - random/sequential write patterns

-iodepth=1|128 - IO queue depth of 1 or 128 to emulate single- and multi-threaded applications
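
For example, the single-threaded random-write run with an fsync() after every write looks like this when fully assembled (note that fio also needs a -name parameter for command-line jobs; the name itself is arbitrary):

# fio -ioengine=libaio -direct=1 -invalidate=1 -bs=4k -runtime=60 -rw=randwrite -iodepth=1 -fsync=1 -name=randwrite-fsync -filename=/dev/sdX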

Results:

HDD:

SSD:

The IO patterns are divided into 4 groups for easier comparison:

  • baseline (purple) — direct (unbuffered) writes with no sync flags set; the worst case of this group is single-threaded random writes, where the OS scheduler can’t help and optimizations can be done by the drive only.
  • journal write (yellow) — a load pattern similar to journal writes; the worse of the sync/fsync values should be taken.
  • worst case (red) — single-threaded random writes, where writes cannot be merged by the OS scheduler and are dispatched to the drive “as is”. Should be compared with the baseline values.
  • the other group, with multi-threaded random/sequential writes, shows the cases where the OS IO scheduler can work effectively and merge IO requests into bigger blocks before dispatching them to the device; this works especially well for multi-threaded sequential writes (hence the unusually high numbers there).

As we can see, disabling the write cache gives a noticeable performance boost for writes performed with fsync()/O_SYNC.

Let’s take a look at how the writes appear at the block level with different cache settings.
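
The traces below combine strace output for the fio process with blkparse output for the device; a typical way to capture something similar (an assumption; the exact commands aren’t shown in the article) is:

# strace -f -e trace=openat -o fio.strace fio ...    <- record the open() flags fio uses
# blktrace -d /dev/sdc -o - | blkparse -i -          <- watch block-layer requests in real time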

Write cache is enabled:

O_SYNC:

[pid 1688687] openat(AT_FDCWD, "/dev/sdc", O_RDWR|O_SYNC|O_DIRECT|O_NOATIME) =
8,32  21      577     1.159821013 1696065  Q WFS 314869920 + 8 [fio]
8,32 21 578 1.159822910 1696065 G WFS 314869920 + 8 [fio]
8,32 21 579 1.159837744 494 D WS 314869920 + 8 [kworker/21:1H]
8,32 21 580 1.160078514 0 C WS 314869920 + 8 [0]
8,32 21 581 1.160083818 494 D FN [kworker/21:1H]
8,32 21 582 1.168708932 0 C FN 0 [0]
8,32 21 583 1.168710581 0 C WS 314869920 [0]

WFS here is a REQ_WRITE with the REQ_FUA flag set

We can see that no additional flush request happens after the write completes, and this way of persisting data to the backing media works faster than a write followed by a flush of the whole cache.

fsync()/fdatasync():

[pid 1697421] openat(AT_FDCWD, "/dev/sdc", O_RDWR|O_DIRECT|O_NOATIME) = 5
8,32  21     2883     5.028204572 1700232  Q  WS 6286222192 + 8 [fio]
8,32 21 2884 5.028205818 1700232 G WS 6286222192 + 8 [fio]
8,32 21 2885 5.028206940 1700232 P N [fio]
8,32 21 2886 5.028207718 1700232 U N [fio] 1
8,32 21 2887 5.028208500 1700232 I WS 6286222192 + 8 [fio]
8,32 21 2888 5.028210961 1700232 D WS 6286222192 + 8 [fio]
8,32 21 2889 5.028552938 0 C WS 6286222192 + 8 [0] <- write is done
8,32 21 2890 5.028568438 1700232 Q FWS [fio] <- flush op
8,32 21 2891 5.028569112 1700232 G FWS [fio]
8,32 21 2892 5.028575422 494 D FN [kworker/21:1H]
8,32 21 2893 5.036893221 0 C FN 0 [0]
8,32 21 2894 5.036894784 0 C WS 0 [0] <- flush op done

Here we can see that after a successful write operation, fio issues an fsync() call, forcing the drive to perform a FLUSH CACHE.

FWS here is REQ_PREFLUSH | REQ_WRITE | REQ_SYNC

Thus, for each write operation with fsync() / fdatasync():

a) two independent operations (write and flush) are issued to the device, and each has to be received, queued, dispatched and completed. All of this adds overhead and reduces performance.

b) flushing the whole write cache may cause significant write amplification; a rotational disk drive may have to perform plenty of random writes to empty its write cache, making this operation pretty slow.

Write cache is disabled:

O_SYNC:

8,32  39      136     0.620358451 1680674  P   N [fio]                        
8,32 39 137 0.620359884 1680674 U N [fio] 1
8,32 39 138 0.620360414 1680674 I WS 3288341176 + 8 [fio]
8,32 39 139 0.620363128 1680674 D WS 3288341176 + 8 [fio]
8,32 39 140 0.620945955 0 C WS 3288341176 + 8 [0]
8,32 39 141 0.620967308 1680674 Q WS 84473584 + 8 [fio]
8,32 39 142 0.620968329 1680674 G WS 84473584 + 8 [fio]
8,32 39 143 0.620969738 1680674 P N [fio]
8,32 39 144 0.620971128 1680674 U N [fio] 1
8,32 39 145 0.620971508 1680674 I WS 84473584 + 8 [fio]
8,32 39 146 0.620973965 1680674 D WS 84473584 + 8 [fio]
8,32 39 147 0.621559643 0 C WS 84473584 + 8 [0]

fsync()/fdatasync():

8,32  21     1185     1.355483245 1702607  P   N [fio]                        
8,32 21 1186 1.355484215 1702607 U N [fio] 1
8,32 21 1187 1.355484755 1702607 I WS 3986931968 + 8 [fio]
8,32 21 1188 1.355495503 1702607 D WS 3986931968 + 8 [fio]
8,32 21 1189 1.356063749 0 C WS 3986931968 + 8 [0]
8,32 21 1190 1.356090475 1702607 Q WS 4309057840 + 8 [fio]
8,32 21 1191 1.356091253 1702607 G WS 4309057840 + 8 [fio]
8,32 21 1192 1.356092037 1702607 P N [fio]
8,32 21 1193 1.356092497 1702607 U N [fio] 1
8,32 21 1194 1.356092791 1702607 I WS 4309057840 + 8 [fio]
8,32 21 1195 1.356094313 1702607 D WS 4309057840 + 8 [fio]
8,32 21 1196 1.356976650 0 C WS 4309057840 + 8 [0]

As we can see, the write patterns now look the same, and so does the performance with these options.

No F flags are present, no FUA bit is set and no FLUSH CACHE is issued; writes go directly to the media, without caching.

Analysing the results:

HDD:

a) For non-sync operations, even with single-threaded random writes, the enabled write cache actually helps to increase performance: write requests land in fast volatile memory (RAM) first, then can be re-organized by NCQ and flushed in batches, allowing the device to spend less time on IO and deliver more throughput. With the write cache disabled, performance degrades to the level of sync writes, which is expected.

b) With O_SYNC / fsync(), the write cache adds extra overhead and reduces performance. Disabling the write cache increases performance, since no additional cache-flush operation is requested → no overhead.

c) O_SYNC / fsync() results with the write cache disabled are identical, since at the block layer they are translated into the same commands.

d) O_SYNC / fsync() performance with the write cache enabled differs for 128-threaded writes: O_SYNC performs better, since each thread sets the FUA bit independently and “touches” only its own data rather than everything in the write cache, which adds less overhead and increases performance for both random and sequential writes.

SSD:

The picture is pretty much the same as for the HDD, due to the similar logic, but the scale is different because NAND memory is much faster than rotational media.

a) For non-sync writes, disabling the write cache reduces performance only slightly for a single-threaded workload. For a multi-threaded workload the drop may be more significant; and since storage-layer performance depends on many factors (garbage collection in action, lack of free writable blocks, worn NAND cells, the way NAND cells are organized (MLC/TLC/QLC)), the performance drop may vary within wide limits. We should treat this particular SSD’s results as an example only; for other drives they will be different.

b) Disabling the write cache gives a noticeable performance boost, though not as big as for the HDD, since NAND cells are still pretty fast and the overhead of cache flushing is not that large.

Conclusions:

1) Writes with the FUA bit set appear to work in Linux. It’s better to use them rather than fsync()/fdatasync() because:

a) they are more fine-grained, causing less overhead and write amplification

b) they can be used even with the write cache enabled, keeping the write-cache benefits for the non-sync workload while making sure that data written with O_DIRECT is safely flushed to the media for specific files (see the dd example after the conclusions).

2) For workloads where fsync()/fdatasync()/O_SYNC make up the major part, disabling the write cache gives a significant performance boost for HDDs. For SSDs as well, but at the cost of lifespan and durability, since it should lead to higher write amplification (theoretically; we didn’t test this carefully). Disabling the write cache on SSDs may be considered for maximum performance when lifespan is not important.
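
To apply conclusion 1 outside of fio: any tool that opens the block device with O_DIRECT and O_SYNC should go down the same FUA path. A minimal (and destructive, it overwrites the beginning of the disk) sketch with plain dd:

# dd if=/dev/zero of=/dev/sdX bs=4k count=1000 oflag=direct,sync

With the write cache enabled, blktrace should show these requests as WFS (write + FUA), just like the fio -sync=1 runs above.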

Practical application

Applying all this knowledge to Ceph, I tested the performance of OSDs created in different ways, with different write cache settings.

To test raw OSD performance (not through a particular application like RBD or CephFS), I used ceph osd bench as follows:

# ceph tell osd.X bench 268435456 4096

where

268435456 - the total payload size in bytes (256 MiB)

4096 - the IO block size in bytes

It performs single-threaded writes with the given block size (the default is 4M).

I ran tests with 4k and 4M blocks.
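
In other words, with the same 256 MiB payload the two runs per OSD looked roughly like this:

# ceph tell osd.X bench 268435456 4096       <- 4k blocks
# ceph tell osd.X bench 268435456 4194304    <- 4M blocks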

OSDs were configured in 3 variants:

a) HDD only

b) SSD only

c) data on HDD + WAL/db on SSD

Results:

HDD, SSD:

OSD on HDD+SSD(WAL), different write cache configurations

As we can see from the results, disabling the write cache on HDDs improves their performance, with both small and big blocks.

Disabling the write cache on SSDs doesn’t give a significant boost in any configuration, so we can keep it enabled and preserve the SSD’s lifespan.

OSD processes issue fdatasync() calls after each write, for both the journal and data. Since an OSD uses its disk drives exclusively, disabling the write cache leads to better performance, as shown in the tests above.
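
If you decide to disable the write cache on the HDDs permanently, keep in mind that the hdparm setting may not survive a drive power cycle. One common way to re-apply it automatically is a udev rule; the sketch below is an assumption (file name, hdparm path and match rules should be adapted to your environment):

# contents of /etc/udev/rules.d/99-hdd-write-cache.rules (hypothetical)
ACTION=="add|change", SUBSYSTEM=="block", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="1", RUN+="/usr/sbin/hdparm -W0 /dev/%k"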

Thanks for reading :)
