Fast Resilvering

Pawan Prakash Sharma
CloudByte

--

Resilvering :-

When a disk goes bad, we replace it with a good device; a resilvering operation is then initiated to reconstruct the data from the redundant devices (raidz or mirror) and write it to the new device. This action is a form of disk scrubbing.

Old logic :-

The old algorithm walked all of the blocks in a pool, file by file and snapshot by snapshot. When it found a block that needed to be resilvered, it issued a resilvering IO to rebuild the missing data on the new disk. The algorithm proceeds in object/block order as below :-

  • Grab next object
  • Read all its blocks in logical object sequence
  • Repeat

[Figure: ZFS Pool Structure]

(Typically we have a couple of top level vdevs in a pool and we replace one physical disk under that vdev which triggers the resilvering.)

The Problem :-

The old resilvering involves walking the entire block tree of the pool and resilvering the blocks that would have been on the missing disk. Without walking through every single txg in the tree, it cannot know which blocks were on the missing disk, so it scans the entire metadata universe of the pool.

It doesn't necessarily read all the data, only enough metadata to determine whether it actually needs to read the corresponding data.

So to replace a single physical disk, we traverse the whole pool looking for blocks belonging to that vdev. The old algorithm also repairs blocks by traversing the block tree, which can degrade into lots of small random I/O and performs poorly. Writes are impacted as well, because the resilvering is done inside spa sync.

VRT Resilvering :-

The new resilvering works by traversing the blocks of only the top-level vdev under which the replacement is going on, which avoids an unnecessary scan of the whole pool. It also issues IOs in a sequential manner, which saves disk seek time.

The way it works is that it creates a VRT map of the blocks for each top-level vdev. The map has all the information needed to issue the resilvering IO: a typical block entry holds all the copies of the data (DVAs), the block checksum, the transaction number in which the block was written, and other details needed for resilvering. The VRT map maintains an AVL tree of entries for each top-level vdev. Resilvering then walks this map in a sequential manner, at the small cost of maintaining the map. The advantages of this are listed below :-

  1. Faster scanning :- We now scan the blocks of only the vdev under which the replacement is going on, as opposed to the old logic, which scanned the entire pool to find this vdev's blocks.
  2. Sequential IOs :- The new algorithm resilvers blocks in LBA order, which saves disk seek time. With lots of random IOs, most of the time is spent seeking to the right location rather than doing the actual IO; VRT resilvering issues its IOs sequentially, which results in fast resilvering.
  3. Better IO aggregation :- The new algorithm traverses the VRT map from the lowest disk offset to the highest. Depending on how the data is laid out, ZFS and the underlying layers can aggregate the IOs and issue them in bulk for better performance.
  4. No impact on existing IOs :- VRT resilvering works outside of spa sync, so there is no impact on ongoing IOs. It takes no spa sync time, so write throughput is better while resilvering is going on.
  5. Performs better with more vdevs :- Since it scans the metadata of only the vdev undergoing the replace, compared to the old logic, which read all the metadata of the pool, VRT resilvering performs much better.
  6. Scales to petabyte pools with thousands of vdevs :- Since we maintain a VRT map per top-level vdev and scan only one vdev's metadata, the approach scales to petabyte pools with thousands of vdevs.
  7. Resumable :- VRT resilvering stores a small amount of state on disk so that it can resume after a reboot or export. The information is minimal: the vdev being resilvered, the in-use VRT map and the relevant transaction numbers. After a reboot we continue from where we left off.
  8. Multiple resilvers :- In the old logic only one resilvering could be active at a time: if a resilvering was running on a particular vdev and you wanted to replace a disk in another vdev group, the running resilvering was cancelled and restarted from the beginning. With VRT resilvering there is no need to cancel, since it operates per vdev; you can run a resilvering on another vdev in parallel without interfering with the first.

Conclusion :- We compared VRT resilvering with the old logic, and VRT resilvering performed very well in comparison; we see a substantial performance improvement with the new approach. We have also provided a way to upgrade old pools to have a VRT map, so existing pools can use fast resilvering for disk replacement as well.

--
