Over-Provisioning SSD for Increased Performance and Write Endurance
Introduction
This is not a deep dive into SSD technology; however, the high-level concepts related to performance and write endurance are summarized. Many aspects of SSD technology have been left out for brevity.
The goal of this article is to help you understand why SSDs fail under heavy write workloads, and how I improved the SSD write endurance of my on-premise Kubernetes cluster, which runs etcd on SSD devices. At a high level, etcd is the key-value database used to store the configuration of the Kubernetes cluster, the actual state of the system, and the desired state of the system. etcd produces a steady stream of disk writes to storage, 24x7x365. This article can apply to any application dealing with heavy write workloads on SSD storage.
What Got Me Here?
The abbreviated story is that I used a popular name-brand consumer grade SSD in a small on-premise Kubernetes cluster being used as a proof of concept (not enterprise grade servers) and experienced 100% failure rate of the six SSDs within 1 week of use. All of them had to be replaced.
This led me to look more carefully at the SSD brand and to research how to improve write endurance. I won’t say which brand failed so quickly, as my sample size was not large enough to bad-mouth the vendor, and they refunded me without issue. I will say that I switched to Samsung brand SSDs, which I manually over-provisioned, and these have been running great for several months.
I didn’t have access to devices well known for endurance, such as Intel Optane or Kioxia SSDs. This experiment is still in progress and being monitored.
What is Over-Provisioning?
Over-provisioning an SSD simply means setting aside extra storage capacity to increase the endurance of the SSD by distributing writes and erases across a larger population of NAND flash memory blocks and pages over time. In addition, it improves write performance by increasing the probability that a write operation will have immediate access to a pre-erased block.
All SSDs reserve some amount of space to support write operations (see How SSDs are Written below). The reserved space can be used for the controller firmware, replacements for failed blocks, and so on. The factory determines the minimum amount of space the SSD will have for over-provisioning. However, you can allocate more space for better performance and improved write endurance.
How SSDs are Written
The smallest addressable unit of NAND flash memory is typically a 4KiB page. Typically, 64 of these pages are grouped into a 256KiB block. While the SSD is able to write one 4KiB page at a time, it can only write to an empty (blank) page.
With NAND flash it is impossible to directly update an existing page which holds data. To update an existing page it would have to be erased first, but individual pages cannot be erased; only an entire block of 64 4KiB pages can be erased at one time. Instead, the SSD writes the updated data to a different blank page and then updates the Logical Block Address (LBA) table. In the LBA table, the original page is marked as invalid and the new page is marked as the current location of the data.
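The remap-on-write behavior described above can be sketched with a toy mapping table. This is illustrative only; a real flash translation layer is far more involved, and all the names here are made up:

```python
# Toy model of an SSD's logical-to-physical remapping on "update".
# An update never touches the old page; it writes a fresh page and
# re-points the LBA table, leaving the old page marked invalid.
lba_table = {}    # logical page number -> physical page number
page_state = {}   # physical page number -> "valid" | "invalid"
next_free = 0     # next blank physical page (simplified free-page pool)

def write(logical_page):
    """Write (or rewrite) a logical page to a fresh physical page."""
    global next_free
    old = lba_table.get(logical_page)
    if old is not None:
        page_state[old] = "invalid"   # stale copy: not erased, just flagged
    lba_table[logical_page] = next_free
    page_state[next_free] = "valid"
    next_free += 1

write(7)   # first write of logical page 7 lands on physical page 0
write(7)   # "update" lands on physical page 1; page 0 becomes invalid
print(lba_table[7], page_state[0])   # 1 invalid
```

The invalid page stays physically occupied until garbage collection erases its whole block, which is exactly why the next section matters.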
Garbage Collecting
While the old page is marked as invalid, that space is not yet available for reuse. On some schedule, the SSD erases the pages marked as invalid. This scheduled process is called Garbage Collecting:
1. The SSD controller, which manages the NAND flash memory, locates block(s) which have page(s) marked as invalid. The controller reads all the valid pages of those block(s) (skipping the invalid pages) and re-writes the valid pages to a new block.
2. The original block of 64 pages is then erased, allowing it to be used for new data.
Garbage Collecting does not improve performance within a single pass. It takes a number of write and garbage-collection cycles to consolidate the freed-up space enough to improve performance.
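The two steps above can be sketched as follows. This is a toy model with 4 pages per block instead of 64, purely for readability; invalid pages are represented as `None`:

```python
# Rough sketch of one garbage-collection pass: copy out the valid pages,
# then erase the ENTIRE source block at once (pages can't be erased singly).
PAGES_PER_BLOCK = 4

def garbage_collect(block, free_block):
    """Copy valid pages of `block` into `free_block`, then erase `block`."""
    survivors = [p for p in block if p is not None]   # skip invalid pages
    free_block[:len(survivors)] = survivors           # re-write valid data
    block[:] = ["blank"] * PAGES_PER_BLOCK            # erase whole block
    return block, free_block

# A block where two pages were invalidated by later updates:
old_block = ["A", None, "C", None]
new_block = ["blank"] * PAGES_PER_BLOCK
old_block, new_block = garbage_collect(old_block, new_block)
print(new_block)   # ['A', 'C', 'blank', 'blank']
print(old_block)   # ['blank', 'blank', 'blank', 'blank'] -- reusable again
```

Note that the two still-valid pages had to be written a second time just to reclaim the block, which is the root of write amplification discussed below.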
Factory Over-Provisioning Basics
Most technical people are aware that one gigabyte of storage is not 1,000,000,000 bytes. They know it is some multiple of 1024 bytes (1,048,576 * 1024, or more precisely 2³⁰ bytes, which is 1,073,741,824 bytes). This is the difference, in marketing terms, between one gigabyte (GB) and one gibibyte (GiB). There is roughly a 7.37% difference between the two. That difference is typically the minimum reserved space for over-provisioning, sometimes called the “built-in” amount, but the actual amount could be higher, especially with higher-quality SSDs.
Typically the minimum reserved space for Over-Provisioning is just the difference between binary and decimal naming conventions used to describe how large an SSD is.
For example, a 128 GB SSD would have a minimum reserved space of 128 * 73,741,824 = 9,438,953,472 bytes of over-provisioning (73,741,824 is the per-gigabyte difference between 2³⁰ and 10⁹ bytes).
So in theory, even if you completely filled up the SSD, it would still have that 7.37% of disk space to support disk writes (although write performance will suffer with so little space to work with).
To improve write performance, some manufacturers simply reduce the advertised capacity of the SSD. If you see a 100GB SSD, it is probably a native 128GB device with 28GB used for over-provisioning. That is 28% over-provisioning in addition to the 7.37% built-in minimum.
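Both percentages can be checked with a little arithmetic:

```python
# Decimal GB vs binary GiB for a "128 GB" device:
marketed = 128 * 10**9      # what "128 GB" means on the box
physical = 128 * 2**30      # actual NAND if the device carries 128 GiB
built_in_op = physical - marketed
print(built_in_op)                              # 9438953472
print(round(built_in_op / marketed * 100, 2))   # 7.37 (% built-in OP)

# Selling the same NAND as "100 GB" adds vendor over-provisioning on top:
vendor_op = marketed - 100 * 10**9
print(round(vendor_op / (100 * 10**9) * 100))   # 28 (% additional OP)
```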
Write Amplification
SSD writes often require writing data more than once: the initial write to save the data the first time, plus later writes when moving data during multiple garbage-collection cycles. This results in more data being written to the SSD flash memory than was originally issued by the host system. This is known as write amplification, and it is undesirable because it increases wear on the flash and reduces the available bandwidth to flash memory.
Many factors contribute to write amplification, such as whether data is written randomly or sequentially. As less space is available, and fewer contiguous blocks are available, data is written more randomly. Writing to non-sequential LBAs has the largest impact on write amplification.
Data compression reduces the number of pages and blocks written and helps reduce write amplification. Consider using a filesystem that supports compression.
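A common way to quantify this is the write amplification factor (WAF): total bytes written to NAND divided by the bytes the host actually issued. The numbers below are made up purely for illustration:

```python
# Write amplification factor = NAND writes / host writes.
# Illustrative: the host writes 10 GiB, but garbage collection had to
# re-write 5 GiB of still-valid pages while reclaiming blocks.
host_writes = 10 * 2**30
gc_rewrites = 5 * 2**30
waf = (host_writes + gc_rewrites) / host_writes
print(waf)   # 1.5
```

A WAF of 1.0 is the ideal (every host byte written exactly once); heavily fragmented random-write workloads push it well above that.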
TRIM
The SSD is usually not aware of which blocks of data are invalid and available for reuse. It is not until the Operating System tries to store new data in a previously used location that the SSD learns that a particular location contains invalid data. Otherwise, the SSD can only track free space not consumed by user data; everything else the SSD considers “valid data”.
The TRIM command enables the Operating System to alert the SSD, on file deletions, that pages containing unneeded data should be tagged as invalid. These invalid pages are then not copied during the next garbage collection and wear-leveling pass. This all helps to reduce write amplification and improve performance.
The larger the amount of over-provisioning, the better the SSD will perform over long periods of time, especially for workloads with lots of random writes.
NAND Cell Write Limits
NAND flash memory can only be written and erased a limited number of times. You would expect this limit to improve with each generation of storage devices; however, the opposite has happened as manufacturers try to make SSD storage cheaper:
- SLC (Single Level Cell) flash holds 1 bit of data per cell. Depending on the brand, it had a lifetime of about 50,000 to 100,000 write/erase cycles.
- MLC (Multi Level Cell) flash holds 2 bits of data per cell. Lower cost, lower speed, lower lifetime of about 1,000 to 10,000 write/erase cycles.
- TLC (Triple Level Cell) flash holds 3 bits of data per cell, with lifetimes between 3,000 and 5,000 write/erase cycles. In some TLC variants, sometimes called 3D NAND or V-NAND (Vertical NAND), lifetimes dropped as low as 1,000 cycles.
- QLC (Quad Level Cell) flash holds 4 bits of data per cell….
- PLC (Penta Level Cell) flash holds 5 bits….
Wear-Leveling
This wearing out of cells is responsible for the physical limit on the lifetime of NAND flash memory. When data is repeatedly written to the same areas of the NAND, the respective cells wear out quickly. Wear-leveling is a function of the SSD that helps prevent repeated writes to the same cells. It enables cells to be utilized more evenly by swapping blocks exposed to high write cycles with free blocks.
How Much to Allocate to Over-Provisioning
While increasing the space allocated to over-provisioning has the advantage of improving the performance and lifetime of the SSD (write endurance), it also decreases the space available to the host.
How much to allocate will depend on your storage requirements and the types of applications used. I personally do not bother to adjust the over-provisioning area on my laptops or desktops, nor on my servers where SSDs are used for generic file storage as part of larger storage pools. Once an SSD is going to host database-type applications or high-write transaction logs, then I will adjust the over-provisioned storage area.
In my case, it was planning for the Kubernetes etcd key-value database store. The etcd storage is not large. By default the storage size limit is 2GB, and something significantly smaller than that is expected; somewhere in the 50MB to 500MB range is typical for small to medium clusters.
The etcd Raft consensus protocol depends on persistently storing metadata to a log: the majority of etcd cluster members must write EVERY request to disk. etcd stores multiple versions of keys, takes snapshots, and merges them back with previous on-disk snapshots. It also performs inflight compaction to remove prior key revisions. This constant writing, rewriting, and data shuffling 24x7x365 can wear out an SSD quicker than you would expect.
On my small proof-of-concept cluster with three etcd master nodes, at pretty much idle load, Prometheus shows on average 15kB/s of sustained writes. That is about 54MB/hour, or roughly 40GB/month, being written to a very small amount of SSD space. Excessive wear leveling and failed NAND cells are a valid concern.
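The arithmetic behind those numbers, taking 15 kB/s as a decimal 15,000 bytes/second and a 30-day month:

```python
# Sustained write rate scaled up from seconds to a month:
rate = 15_000                        # bytes/second (15 kB/s)
per_hour = rate * 3600
per_month = per_hour * 24 * 30
print(per_hour)                      # 54000000 -> 54 MB/hour
print(round(per_month / 10**9, 1))   # 38.9     -> roughly 40 GB/month
```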
I decided that 240GB of each 1TB SSD was enough local storage for my needs. (I use 2 SSDs per node, over-provisioned identically and configured as a mirror.) That leaves roughly 760GB (about 75% of the storage) allocated to over-provisioning per device. I expect this to provide a few years of service.
Determine Current Over-Provisioning
The smartctl utility, part of the smartmontools package on Linux, is used to interact with the Self-Monitoring, Analysis and Reporting Technology (SMART) system built into many disk devices. Before over-provisioning, I used smartctl to show the device information:
$ sudo smartctl -i /dev/sda
=== START OF INFORMATION SECTION ===
Model Family: Samsung based SSDs
Device Model: Samsung SSD 860 EVO M.2 1TB
Serial Number: [redacted]
LU WWN Device Id: [redacted]
Firmware Version: RVT24B6Q
User Capacity: 1,000,204,886,016 bytes [1.00 TB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
Form Factor: M.2
TRIM Command: Available, deterministic, zeroed
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-4 T13/BSR INCITS 529 revision 5
SATA Version is: SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
- The starting user capacity is 1,000,204,886,016 bytes [1.00 TB]
- Assuming the actual raw capacity is 1,099,511,627,776 bytes [1 TiB]
- The built-in over-provisioning is then 99,306,741,760 bytes [99.3 GB] (about 10% OP)
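Recomputing that built-in figure from the smartctl capacity, assuming the drive carries a full 1 TiB of raw NAND behind its 1.00 TB user capacity (an assumption; vendors are not required to reserve exactly this much):

```python
# Built-in over-provisioning = assumed raw NAND minus reported user capacity.
user_capacity = 1_000_204_886_016    # bytes, from the smartctl output
raw_nand = 2**40                     # 1 TiB = 1,099,511,627,776 bytes (assumed)
built_in = raw_nand - user_capacity
print(built_in)                                   # 99306741760
print(round(built_in / user_capacity * 100, 1))   # 9.9 -> about 10% OP
```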
Calculate Sectors Needed
We need to convert the desired usable space into the number of sectors required to hold that space. As stated above, my requirement is about 240 GiB of disk space.
240 GiB = 257,698,037,760 bytes. As the smartctl output above shows, the sector size of this device is 512 bytes. We then divide the desired byte count by 512 to get 503,316,480 sectors (if your device reports 4096-byte sectors, divide by that instead).
I now have my magic number of 503316480 sectors. Next we instruct the SSD device that this is the total number of sectors available for user storage. All remaining sectors are reserved in the Host Protected Area (HPA) to be used for over-provisioning.
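The conversion in one place, for any desired capacity:

```python
# Desired usable capacity -> sector count to pass to the HPA tooling.
desired = 240 * 2**30        # 240 GiB in bytes
sector_size = 512            # from the smartctl output above
sectors = desired // sector_size
print(desired)               # 257698037760
print(sectors)               # 503316480
```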
How to Adjust Over-Provisioning?
Most SSD manufacturers provide a custom tool for this. However, the standard Linux hdparm disk management tool can be used with SATA-based SSDs, and I will demonstrate it below.
I have seen conflicting information on whether this can be adjusted on an active, working SSD. I personally wouldn’t do that. I suggest adjusting the over-provisioning area before creating partitions and formatting the device for use.
The hdparm utility might not be installed by default. Don’t be surprised if you need to install it manually from your package repository.
$ sudo hdparm -Np503316480 --yes-i-know-what-i-am-doing /dev/sda
/dev/sda:
setting max visible sectors to 503316480 (permanent)
max sectors = 503316480/1953525168, HPA is enabled
- The -N parameter is used to set the maximum visible number of sectors.
- The p prefix makes the change permanent. (You can only issue one permanent change per session; if you need to change this again, power off the computer first before another attempt.)
- The --yes-i-know-what-i-am-doing flag is required since this is a destructive operation. If you do not supply this flag, you will get an error message about it.
- The /dev/sda is the device name to be updated. Your device name may be different.
Over-Provision Verification
At this point it is HIGHLY suggested that you reboot the system and run the hdparm command again to make sure the change is still reported correctly. It’s possible some Linux distributions have HPA support disabled by default, which will cause the changed settings to be ignored.
You should be able to use hdparm with -N to confirm the HPA is enabled:
$ sudo hdparm -N /dev/sda
/dev/sda:
max sectors = 503316480/1953525168, HPA is enabled
Lastly, you can confirm with smartctl that the new User Capacity is reported correctly:
$ sudo smartctl -i /dev/sda
=== START OF INFORMATION SECTION ===
Model Family: Samsung based SSDs
Device Model: Samsung SSD 860 EVO M.2 1TB
Serial Number: [redacted]
LU WWN Device Id: [redacted]
Firmware Version: RVT24B6Q
User Capacity: 257,698,037,760 bytes [257 GB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
Form Factor: M.2
TRIM Command: Available, deterministic, zeroed
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-4 T13/BSR INCITS 529 revision 5
SATA Version is: SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
The User Capacity is now reported as 257,698,037,760 bytes, which is 240 GiB. Now the SSD device is ready to be partitioned and formatted to suit your needs.
Conclusion
Over-provisioning is a proven technique that has been used for many years. I’m expecting this setup to last quite a while.
Notes on NVMe Over-Provisioning
In my opinion, the need to over-provision stems from the use of NAND flash. Changing the disk interface from SATA to NVMe doesn’t change the need to over-provision if you expect to run a write-heavy application.
The hdparm utility does not support NVMe devices, and the instructions above will not work for NVMe storage devices. However, utilities provided by the manufacturer, such as Samsung Magician, do allow you to change the over-provisioning on their NVMe drives.
Resources:
- https://www.atpinc.com/blog/over-provisioning-ssd-benefits-endurance-and-performance
- https://www.seagate.com/tech-insights/ssd-over-provisioning-benefits-master-ti/
- https://www.techtarget.com/searchstorage/definition/overprovisioning-SSD-overprovisioning
- https://semiconductor.samsung.com/resources/white-paper/S190311-SAMSUNG-Memory-Over-Provisioning-White-paper.pdf
- https://www.thomas-krenn.com/en/wiki/SSD_Over-provisioning_using_hdparm
- https://en.wikipedia.org/wiki/Multi-level_cell