NVMe storage verification and benchmarking

Krzysztof Ciepłucha
Mar 22, 2022


In the previous article (Proper network connectivity verification and benchmarking) I focused on network connectivity and Network Adapters (NICs). This time, let’s talk about storage.

We won’t cover diskless servers here (servers booting and running entirely from the network or other remote storage, including SAN or NAS systems). The assumption is that your servers have at least one NVMe SSD.

Why NVMe? If you pick a random server, there is still a high chance you will find SSDs or HDDs with a SATA or SAS interface, and there are some scenarios where these interfaces are “good enough”. But it’s time to realize that in 2022, with PCIe Gen4 NVMe drives on the market and even faster generations coming, both SAS and SATA are simply too slow and should no longer be used in the data center. I won’t go into details here (you can find multiple articles on that topic elsewhere), but the performance difference between same-class NVMe and SATA SSDs is tremendous, not only in terms of bandwidth but also in the number of operations per second (IOPS) and latency. The only reason you should still consider a SATA SSD is as a small boot drive (usually in an M.2 form factor).

Form factor

There are several different form factors on the market, with M.2 (mainly for internal use) and 2.5" U.2 (externally accessible) being the most popular. New form factors are coming (like EDSFF E1 and E3), and you might still find some older drives in the form of a PCIe add-in card (AIC), but as long as we are dealing with locally attached NVMe-over-PCIe drives, the form factor doesn’t really matter. You can learn more about the various form factors on the SNIA web site. Keep in mind that the M.2 form factor also supports other protocols, SATA being the most popular, but we are going to focus only on NVMe drives.

For the purpose of this article, what really matters is the number of PCIe lanes needed by a specific drive / form factor. M.2 NVMe drives can use either two (x2) or four (x4) PCIe lanes, and most if not all data center U.2 drives on the market use four (x4) lanes. In addition to the number of PCIe lanes, you also need to pay attention to the PCIe generation. Previous-generation drives (still the majority on the market) support PCIe Gen3 (8GT/s per lane) and newer drives support PCIe Gen4 (16GT/s per lane). It’s easy to guess that a Gen4 drive can be twice as fast as a Gen3 drive, but that’s only half the truth. Drive performance is limited by many factors, including controller performance, flash module type and organization, and firmware, but if you look at the specifications of various drives from the leading vendors, it is clear that Gen4 drives are usually much faster, especially in terms of bandwidth.
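
To put “twice as fast” into perspective, here is a quick back-of-the-envelope calculation of the theoretical per-direction bandwidth of an x4 link (using the 128b/130b encoding of Gen3/Gen4 and ignoring protocol overhead):

Per-direction bandwidth ≈ transfer rate * lanes * 128/130 / 8 bits
Gen3 x4:  8GT/s * 4 * 128/130 / 8 ≈ 3.9GB/s
Gen4 x4: 16GT/s * 4 * 128/130 / 8 ≈ 7.9GB/s

In practice the drive itself, not the link, is usually the limiting factor, which is why a Gen4 drive rarely delivers exactly twice the Gen3 numbers.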

Topology

NVMe drives are essentially PCIe devices, so we are going to check the connection between the controller on the drive and the server’s PCIe bus in a similar way to how we checked NICs in the previous article. At the same time, there are some differences. It’s easy to imagine you might need many more drives in a server than network interfaces, and even though each generation of CPUs and servers gives you more PCIe lanes, they are still fairly limited, so in some servers you can’t directly connect all NVMe drives without some tricks. Most server vendors allow you to choose between several options or topologies. They may use different names or hide them behind other options, but in general the most popular ones are:

  1. Direct connection, where each drive is directly connected to the PCIe bus, without any additional hubs, switches or controllers. This provides the best performance, since each drive has dedicated bandwidth and no additional latency is introduced.
  2. Switched connection, where a group of drives is connected to a larger number of PCIe lanes through one or more PCIe switches. This is often used in 2RU servers in configurations with more than 8 or 16 drives (depending on the server vendor and model). The most common scenario is a server supporting 24 NVMe drives, where each half of the drives is connected via a PCIe switch and usually shares sixteen (x16) PCIe lanes to the PCIe bus. That means there is 3:1 oversubscription (12 times x4 = 48 lanes towards the drives with only 16 lanes from the switch to the host PCIe bus). This topology is still much better than a typical SAS or SATA-only configuration where all drives share a single HBA or RAID controller via a SAS expander, but it’s clear the overall system performance will be limited if you try to saturate all the drives at the same time.
  3. Tri-Mode RAID adapter, where a group of drives (usually up to 8) is connected through an additional Tri-Mode RAID adapter. Tri-Mode means such an adapter is supposed to support all three popular protocols (SATA, SAS and NVMe), but in reality you cannot mix NVMe and other drives on the same adapter, so it’s either NVMe-only or SAS/SATA RAID. The obvious benefit of such an adapter is the ability to use a bunch of NVMe drives in a RAID configuration, providing additional performance (by using striping), redundancy, or both. But there are several drawbacks too: the logical volumes are presented to the host system as traditional SCSI devices, meaning a more complicated and slower software stack compared to native NVMe. The controller itself also has limited performance and introduces additional latency. Finally, currently available controllers support only up to 8 drives, so if you need more capacity or performance you have to use multiple controllers in a server, and they aren’t cheap.
  4. Mixed topology, where some drives are directly connected and others are connected via a switch or Tri-Mode RAID controller, or any other combination of the above. In that case, you should carefully choose the drive bays: put high-performance drives often used for caching (like Intel Optane SSDs) in directly connected bays and distribute the other drives in a way that provides the lowest level of oversubscription (if possible).

As you can easily guess, the preferred option should always be direct connection, so unless you are building a dedicated storage solution or need hardware RAID, you should choose this topology whenever it is available. It’s also the simplest and cheapest.
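
If you are not sure which topology your server actually uses, a quick first check is the PCIe device tree shown by the lspci tool (covered in more detail below). This is just a sketch; the exact tree depends on your hardware:

$ sudo lspci -tv

Directly attached drives hang straight off a CPU root port, while drives behind a PCIe switch show up nested under an additional bridge device.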

Listing NVMe drives in the system

Let’s start with listing the available NVMe drives. The easiest way to do that is the nvme command with the -v (verbose) option from the nvme-cli package. In most cases this package is not installed by default, so use your OS package manager to install it if necessary. To get all the details you need to run this tool as a privileged (root) user, so we are going to use it with sudo:

$ sudo nvme list -v
NVM Express Subsystems
Subsystem Subsystem-NQN Controllers
---------------- ------------------------------------------------------------------------------------------------ ----------------
nvme-subsys0 nqn.2021-06.com.intel:PHAB1234567ABCDEFG nvme0
NVM Express Controllers
Device SN MN FR TxPort Address Subsystem Namespaces
-------- -------------------- ---------------------------------------- -------- ------ -------------- ------------ ----------------
nvme0 PHAB1234567ABCDEFG Dell Ent NVMe P5600 MU U.2 3.2TB 1.1.5 pcie 0000:31:00.0 nvme-subsys0 nvme0n1
NVM Express Namespaces
Device NSID Usage Format Controllers
------------ -------- -------------------------- ---------------- ----------------
nvme0n1 1 3.20 TB / 3.20 TB 512 B + 0 B nvme0

Unfortunately the output is very wide and doesn’t fit nicely here, but I’ll try to explain the most important parts. Also, some older versions of nvme-cli do not support the -v argument. In that case you can use “sudo nvme list-subsys” to map a specific disk to its PCIe address.

First, you need to know that each NVMe drive has a built-in controller, represented in the system by a device named nvmeX (where X is a number starting from 0). Dual-port or other special models can have more than one controller, but you will rarely find those in typical servers. The whole drive capacity can be divided into smaller pieces called namespaces. This is somewhat similar to partitioning, but it’s done at the drive level instead of the operating system level. Each namespace is presented to the system as a separate device and is usually named nvmeXnY (where the nvmeX part represents the drive itself and Y is the namespace number starting from 1), so for example nvme0n1 is the first namespace (1) on the first NVMe drive (nvme0). Most drives on the market still support only a single namespace, and even for drives supporting more, the factory default configuration is a single namespace covering all externally available capacity. The actual drive capacity is usually larger, but some space is reserved for internal purposes like firmware images, controller logs and configuration, spare space used to improve performance and endurance, etc. From the operating system perspective, the block device used to store data is the namespace, not the controller.

Getting back to our output, you can also find the drive model and maker (here: Dell Ent NVMe P5600 MU U.2 3.2TB, which is a Dell-specific version of the Intel SSD D7-P5600 3.2TB), serial number (here: PHAB1234567ABCDEFG), firmware version (1.1.5), device PCIe address (0000:31:00.0) and a list of namespaces (nvme0n1). For each namespace you can find its capacity and format (logical sector size, either 512B or 4KB). From our perspective, the most interesting item here is the PCIe address, which we are going to use to investigate the device state in the next step. You should also take note of the model and capacity, as we will need the drive specification later to see what performance levels we should expect. Not all operating systems work properly with 4KB-formatted drives, so in most cases you will see namespaces with a 512B block size. This can be changed, but the operation is destructive (ALL DATA IS LOST!) and you shouldn’t do it without a good reason.
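
If you want to see which logical block formats your drive supports and which one is currently in use, nvme-cli can show that too (a sketch; the exact wording of the output differs between nvme-cli versions):

$ sudo nvme id-ns /dev/nvme0n1 -H | grep "LBA Format"

The format marked as “in use” is the active one; switching to a different format is done with the nvme format command and, as noted above, destroys all data on the namespace.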

PCIe slot and link speed

Like a Network Adapter, an NVMe SSD is connected to the system PCIe bus via a set of connectors and cables, but compared to a simple PCIe add-on card there are multiple connection points along the way. Take for example a standard 2.5" U.2 hot-swappable drive in a Dell R650 server: the drive is installed in a carrier and plugged directly into the drive backplane. On the other side of the backplane there are multiple connectors with cables leading directly to the mainboard or RAID controller. In some other configurations and topologies you might find additional midplanes or PCIe switches and more cables, and sometimes additional PCIe lanes are borrowed from a PCIe add-on card or a special riser card. Each connection and cable is a potential place where issues might arise, like poor contact or a completely broken lane, causing problems similar to those we saw with Network Adapters: either the device does not work at all and is not even detected by the operating system, or the drive is detected but not working properly (the error rate is too high for any proper I/O), or it is unable to negotiate higher speeds (for example, a PCIe Gen4 drive works only as a Gen3 device) or to use all PCIe lanes (an x4 drive uses only two lanes). To detect these kinds of issues we are going to use the lspci tool from the pciutils package (again, if the tool is not already installed on your system, use your OS package manager to install it first). Running as a privileged user (using sudo) is needed to get all the details. Also, since the output is very long, we are going to filter it with grep to get only the necessary information. The -s argument is the drive PCIe address obtained earlier from the nvme tool.

$ sudo lspci -vv -nn -s 0000:31:00.0|grep Lnk
LnkCap: Port #0, Speed 16GT/s, Width x4, ASPM L0s L1, Exit Latency L0s <512ns, L1 <16us
LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk-
LnkSta: Speed 16GT/s (ok), Width x4 (ok)
LnkCap2: Supported Link Speeds: 2.5-16GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+ EqualizationPhase1+
LnkCtl3: LnkEquIntrruptEn- PerformEqu-

First, we need to look at the link capabilities (the line starting with LnkCap): here we can see the link supports a speed of 16GT/s (meaning PCIe Gen4) and four PCIe lanes (Width x4). Next we check the actual link status (the line starting with LnkSta): we have the same numbers here with an “ok” comment (Speed 16GT/s (ok), Width x4 (ok)), so everything looks good and our drive should deliver optimal performance. It is also perfectly fine to use a Gen3 drive (like the Intel P4500 or P4600 series) in a Gen4-capable server; in that case you should see Speed 8GT/s in the LnkSta line. Gen4 drives should also work in a Gen3-capable server, but then the drive performance will be limited. Typical problems seen in the field include dirty contacts on the drive or backplane, and damaged or improperly connected internal PCIe cables; in both cases the drive is usually detected and seems to work fine, but the performance will be limited due to missing PCIe lanes or a lower negotiated speed.
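
If you have many drives, checking them one by one gets tedious. The small helper below is a sketch that walks over all NVMe controllers visible in sysfs (assuming the standard /sys/class/nvme layout) and prints their negotiated link parameters:

$ for dev in /sys/class/nvme/nvme*; do
      addr=$(basename "$(readlink -f "$dev/device")")
      echo "=== $(basename "$dev") at $addr ==="
      sudo lspci -vv -s "$addr" | grep -E 'LnkCap:|LnkSta:'
  done

Any drive reporting a lower Width or Speed in LnkSta than in its LnkCap deserves a closer look.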

In some topologies with lots of NVMe drives and PCIe switches you might see that only two PCIe lanes are available per drive. In that case you need to confirm with your server vendor whether this is by design and not caused by some hardware failure or bad connection, and if you are after optimal performance you might want to consider a different configuration.

Now, I’m not going to dive into discovering and verifying the full PCIe topology here, which might be important if you have a switched topology; just keep in mind that in that case you should also verify the connection between the PCIe switch itself and the host side, and check with the vendor what level of oversubscription to expect.

If you are using a Tri-Mode controller, you will not see the individual NVMe drives connected through it in the lspci output, so the only thing you can do is verify whether the adapter itself is properly installed, and use the tools provided with the adapter to verify connectivity to the drives (if such a tool is available).
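
The adapter itself still shows up as a regular PCIe device, so you can at least confirm its presence and link status (a sketch; the class string and the PCIe address depend on the adapter model):

$ lspci | grep -iE 'raid|sas'
$ sudo lspci -vv -s <adapter PCIe address> | grep Lnk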

Performance testing

Now that we have the basic checks behind us, you should also run some benchmarks to confirm your drives can achieve the expected performance. The good news is that most vendors these days publish performance numbers for sequential and random reads and writes in their data sheets.

Warning! Before you run any benchmark on a new SSD (no matter whether it uses an NVMe, SAS or SATA interface) you should first fill it up with random data, otherwise you might see strange results (usually much better than the spec, but I also recently stumbled upon a case where one drive reported almost 2x worse performance than the spec until it was properly conditioned). The reason is that the controllers in modern SSDs are smart and remember which blocks have been used and which have never been written (or were successfully erased). When an application requests an unused block, the controller doesn’t have to read the actual data from the flash memory modules; it just returns an empty block (filled with zeros, or in some rare cases random data). In some rare cases the controller might not be able to respond fast enough, resulting in much worse performance than the drive is actually capable of.
For that reason you really should pre-condition the drive (fill it with random data) before running the actual benchmark.
See the “Drive pre-conditioning” section near the end of the article if you want to learn how to pre-condition your drives before the test.

The data sheet for the Intel D7-P5600 3.2TB drive used here can be found on the Intel website ark.intel.com. Pay attention to the drive form factor and capacity, as different capacities and form factors of the same model might have different performance. Also note that our drive is an OEM version specific to Dell, so it uses different firmware and might or might not have different performance characteristics. I found the Dell-specific data sheet for that drive here. This time the numbers are the same, but that’s not always the case.

We are not going to do full benchmarking here; our goal is just to quickly verify whether there are hidden issues impacting drive performance, so we will focus on a non-destructive sequential read test (to verify bandwidth) and a random read test (to verify IOPS and latency). According to the specification, we should see up to 7000MB/s for sequential reads and up to 780k IOPS for random reads with a 4KB block size. Sometimes vendors also specify the block size for sequential operations (if not, we can assume 128KB or larger) and the queue depth and/or number of workers for random workloads. This is very important, as it is often impossible to saturate the drive from a single process doing a single I/O operation at a time, so we need to simulate multiple workloads talking to the drive at the same time.

The “gold standard” storage benchmarking tool is fio, written by Jens Axboe. You can find the full documentation here, and a copy of the source code with examples is also available on GitHub. In almost every case this tool needs to be installed first, either via your OS package manager or downloaded and compiled manually. Note that the tool is actively developed, so the version available via the package manager might be a little older, but unless you are using a very old Linux distribution you don’t have to worry. New versions usually focus on more advanced use cases and features and rarely fix serious bugs in the basic functionality, but just to be sure you can always compare the version available in your OS against the release notes to confirm there are no known issues that could produce incorrect results.
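
Installation is usually a one-liner, as the package is simply called fio in most distributions (the apt example below assumes Debian/Ubuntu; use dnf, yum or zypper as appropriate), and you can check the installed version right away:

$ sudo apt install fio
$ fio --version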

In our example server we have only a single NVMe drive (nvme0) with a single namespace (nvme0n1), but if you have multiple drives installed, you should not only test the performance of each drive individually, but also all drives at the same time. The reason is that this can uncover any oversubscription or other bottleneck in the system limiting the total aggregated performance of your NVMe drives. The fio tool can easily handle that for you, and I will explain how in a moment.

Now, fio needs a set of parameters to know what and how it should test; they can be provided either as command-line arguments or in a text configuration file. I prefer the config file option, as it is much more readable and easier to use, especially with multiple drives or more complicated scenarios. Below you can find an example config for a sequential read test on a single drive with a single namespace (nvme0n1):

$ cat nvme-seq-read.fio
[global]
name=nvme-seq-read
time_based
ramp_time=5
runtime=30
readwrite=read
bs=256k
ioengine=libaio
direct=1
numjobs=1
iodepth=32
group_reporting=1
[nvme0]
filename=/dev/nvme0n1

Here you can see we are running a sequential read test (readwrite=read) for 30 seconds with a 5-second ramp time. The I/O block size is 256KB (bs=256k), we are using direct I/O (direct=1), and there is a single worker (numjobs=1) but with a larger I/O depth (iodepth=32), so we don’t have to wait for each I/O request to complete before sending the next one. Just to remind you: our goal here is to test the maximum bandwidth, so we want to force the drive to move as much data as possible using a larger block size and a suitable iodepth. If you have more drives in the system, just copy the last section starting with [nvme0], change the name in the square brackets to something else (preferably the NVMe device name) and adjust the filename= parameter to point to that namespace’s block device, as shown below.
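
For example, if the server also had a second drive exposed as /dev/nvme1n1 (a hypothetical device name here), the extra section appended to the same config file would look like this:

[nvme1]
filename=/dev/nvme1n1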

To start the test, run fio as the root user (we need full access to the block device representing each namespace we want to test) and provide the config file name as the only parameter.

$ sudo fio nvme-seq-read.fio
nvme0: (g=0): rw=read, bs=(R) 256KiB-256KiB, (W) 256KiB-256KiB, (T) 256KiB-256KiB, ioengine=libaio, iodepth=32
fio-3.19
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=6622MiB/s][r=26.5k IOPS][eta 00m:00s]
nvme0: (groupid=0, jobs=1): err= 0: pid=1193233: Tue Mar 22 11:47:25 2022
read: IOPS=26.5k, BW=6616MiB/s (6938MB/s)(194GiB/30001msec)
slat (nsec): min=3859, max=90489, avg=5500.47, stdev=1745.66
clat (usec): min=246, max=2580, avg=1203.17, stdev=45.26
lat (usec): min=253, max=2587, avg=1208.77, stdev=45.21
clat percentiles (usec):
| 1.00th=[ 1106], 5.00th=[ 1205], 10.00th=[ 1205], 20.00th=[ 1205],
| 30.00th=[ 1205], 40.00th=[ 1205], 50.00th=[ 1205], 60.00th=[ 1205],
| 70.00th=[ 1205], 80.00th=[ 1205], 90.00th=[ 1205], 95.00th=[ 1205],
| 99.00th=[ 1205], 99.50th=[ 1369], 99.90th=[ 1893], 99.95th=[ 2073],
| 99.99th=[ 2278]
bw ( MiB/s): min= 6620, max= 6632, per=100.00%, avg=6627.51, stdev= 1.76, samples=59
iops : min=26483, max=26530, avg=26509.98, stdev= 7.03, samples=59
lat (usec) : 250=0.01%, 500=0.01%, 750=0.03%, 1000=0.40%
lat (msec) : 2=99.50%, 4=0.07%
cpu : usr=2.45%, sys=14.80%, ctx=789519, majf=0, minf=70
IO depths : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=100.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
issued rwts: total=793962,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=32
Run status group 0 (all jobs):
READ: bw=6616MiB/s (6938MB/s), 6616MiB/s-6616MiB/s (6938MB/s-6938MB/s), io=194GiB (208GB), run=30001-30001msec
Disk stats (read/write):
nvme0n1: ios=1850052/0, merge=0/0, ticks=1359757/0, in_queue=1359757, util=99.88%

There are lots of interesting statistics here, but the one we are looking for is the read bandwidth for each group in MB/s (most vendors, including Intel, use megabytes per second in their specs, and fio reports results in both MiB/s and MB/s, so pay attention to pick the right number). Here we achieved 6938MB/s, which is very close to the 7000MB/s stated in the spec, so we can assume everything is fine.
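
For reference, the conversion between the two units fio prints: 1MiB/s = 1.048576MB/s, so 6616MiB/s works out to roughly 6938MB/s (fio does this conversion for you in the parentheses), and that is the number to compare against the data sheet.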

For comparison, a single 6Gb/s SATA drive installed in the same server, connected via the latest-generation Dell HBA355i PCIe Gen4 SAS/SATA adapter, achieves only about 550MB/s in the same test. With two drives we get double the aggregated bandwidth, and with eight drives in parallel we can reach about 4400MB/s (8*550MB/s), still far from the almost 7000MB/s of a single NVMe drive. But with more drives you won’t get more aggregated bandwidth; in fact there is no increase at all: with 9 drives we still get about 4400MB/s (9*489MB/s), and with 10 drives, not surprisingly, we are still at 4400MB/s (10*440MB/s). Clearly the HBA355i becomes the limiting factor. And this is with the latest PCIe Gen4 adapter (16GT/s) using eight PCIe lanes (x8). By the way, the maximum theoretical throughput of a PCIe Gen4 x8 link is 15.754GB/s, and the ASIC used (Broadcom LSI SAS3816) is rated at 13700MB/s according to its spec, so the bottleneck is probably a SAS/SATA port expander chip used to support more drives. If you are unlucky and are using an older server or a Gen3 adapter, your aggregated performance might be limited even further. I hope you can see now why I believe SAS/SATA drives should be considered legacy and should not be used anymore. But let’s get back to the main topic.

The bandwidth test is usually enough for a quick system evaluation, but you might also be interested in an IOPS and latency test, which can help uncover some more obscure issues. We can use a slightly modified config file with the readwrite mode set to randread, block size set to 4KB (bs=4k), sixteen parallel workers per drive (numjobs=16) and iodepth set to 16 (iodepth=16). Different vendors may use slightly different parameters (iodepth or number of workers), so adjust the config if necessary. We also specify the lfsr random generator, which does not require keeping a map of used blocks in memory (saving a lot of memory and processing power). Also, each drive is a separate group now (the new_group statement in the drive-specific section). The full config should look like this:

$ cat nvme-rand-read.fio
[global]
name=nvme-rand-read
time_based
ramp_time=5
runtime=30
readwrite=randread
random_generator=lfsr
bs=4k
ioengine=libaio
direct=1
numjobs=16
iodepth=16
group_reporting=1
[nvme0]
new_group
filename=/dev/nvme0n1

We start the test the same way, just specifying the new config file as the argument to fio:

$ sudo fio nvme-rand-read.fio
nvme0: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=16
...
fio-3.19
Starting 16 processes
Jobs: 16 (f=16): [r(16)][100.0%][r=3911MiB/s][r=1001k IOPS][eta 00m:00s]
nvme0: (groupid=0, jobs=16): err= 0: pid=49038: Tue Mar 22 15:21:44 2022
read: IOPS=1000k, BW=3907MiB/s (4096MB/s)(114GiB/30001msec)
slat (nsec): min=887, max=274397, avg=2170.83, stdev=1302.31
clat (usec): min=11, max=2586, avg=253.29, stdev=100.50
lat (usec): min=12, max=2588, avg=255.58, stdev=100.58
clat percentiles (usec):
| 1.00th=[ 104], 5.00th=[ 131], 10.00th=[ 149], 20.00th=[ 174],
| 30.00th=[ 196], 40.00th=[ 217], 50.00th=[ 237], 60.00th=[ 258],
| 70.00th=[ 281], 80.00th=[ 314], 90.00th=[ 379], 95.00th=[ 457],
| 99.00th=[ 578], 99.50th=[ 644], 99.90th=[ 807], 99.95th=[ 873],
| 99.99th=[ 1074]
bw ( MiB/s): min= 3475, max= 4301, per=100.00%, avg=3913.52, stdev=10.76, samples=944
iops : min=889676, max=1101097, avg=1001861.85, stdev=2755.37, samples=944
lat (usec) : 20=0.01%, 50=0.01%, 100=0.74%, 250=55.97%, 500=40.26%
lat (usec) : 750=2.85%, 1000=0.16%
lat (msec) : 2=0.02%, 4=0.01%
cpu : usr=5.76%, sys=15.21%, ctx=15938319, majf=0, minf=1102
IO depths : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=100.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=30002688,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=16
Run status group 0 (all jobs):
READ: bw=3907MiB/s (4096MB/s), 3907MiB/s-3907MiB/s (4096MB/s-4096MB/s), io=114GiB (123GB), run=30001-30001msec
Disk stats (read/write):
nvme0n1: ios=34875030/0, merge=0/0, ticks=8598952/0, in_queue=8598952, util=99.78%

This time we are looking for the aggregated IOPS, which in our case is roughly 1,000,000 (r=1001k IOPS), noticeably higher than the 780,000 specified by Dell and Intel, even though the test was run on a fully written (pre-conditioned) drive. Possible reasons include improved firmware or a newer revision of the drive that performs better than the initially released and tested units. Anyway, this shouldn’t bother us, as anything close to the spec or better is good news. You should also take a look at the latency percentiles to check for unusual latency spikes. We tried to generate as many I/O operations as possible, so the latency is slightly elevated for a small percentage of reads, but 99% of operations finished in 578 microseconds or less, which is reasonable.

Interestingly, on a new (empty) drive without proper pre-conditioning we observed around 1654k IOPS, which is much better than the spec says, but as mentioned earlier, this is cheating: the drive controller doesn’t have to actually read the data from the flash modules. For the same reason the latency is much lower, with 99% of reads finishing below 100 microseconds and 99.99% below 235 microseconds.

Drive pre-conditioning

Warning! By following this guide you acknowledge that all existing data on the drives specified in the fio job config file will be COMPLETELY and IRREVERSIBLY ERASED, including any partitions and filesystems. If there is any data you want to keep, DO NOT run the following commands.

If you want to pre-condition your drives easily, you can use fio with the following job configuration file. Specify each drive in a separate section so that all drives are written in parallel.

$ cat nvme-precond-all.fio
[global]
name=nvme-precond-all
readwrite=write
bs=1M
ioengine=libaio
direct=1
numjobs=1
iodepth=32
group_reporting=0
[nvme0]
filename=/dev/nvme0n1
[nvme1]
filename=/dev/nvme1n1
[nvme2]
filename=/dev/nvme2n1
[nvme3]
filename=/dev/nvme3n1

With the job config ready, you can use it in the familiar way. Once again: make sure you specified the right drives and that you really want to erase them completely.

$ sudo fio nvme-precond-all.fio
[...]

Please wait until all jobs have finished to make sure all drives were fully written. After the drives are properly pre-conditioned, you can run the actual performance tests and be confident you are getting real performance results.
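
As a rough estimate of how long the pre-conditioning pass takes, divide the drive capacity by its sustained sequential write speed (the 2GB/s figure below is only an assumed example; check your drive’s data sheet):

3.2TB at ~2GB/s sustained writes ≈ 3200GB / 2GB/s = 1600s, or roughly 27 minutes per drive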

Final words

We barely scratched the surface of the art of benchmarking, and by no means should this article be used as guidance in that area, but I hope that after reading it you have some idea of how to identify the most common issues with NVMe drives and how to run some basic tests to confirm there is no performance degradation in the system.

Please let me know if you found this useful or would like me to focus more on specific aspects, and if you like my articles, please consider following my account to get notified about new content. I don’t have a specific schedule and like to write about the problems (and solutions) I encounter personally, so please don’t expect frequent updates.
