Stupid ZFS tricks — expanding zraid

How do I expand a ZFS zraid?

Right now you cannot add disks to a zraid (raidz) array to expand its volume. In md raid and similar systems you can, say, have 5 x 1 TB disks in a raid6 for a total capacity of 3 x 1 TB, then add one disk and get 4 x 1 TB of capacity. That capability is currently being developed in ZFS, so for now you have to make an entirely new pool. (As a side note, this is not well documented. People who go into the documentation assuming that this is possible might end up expanding by adding a new device in a linear manner, which wrecks the array, since the added capacity sits at redundancy 0.)
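
If you are unsure what a zpool add would do to an existing pool, a dry run is a cheap way to find out. A minimal sketch, assuming a pool called oldarray; the function name preview_add and the device path in the usage line are placeholders, not names from a real setup:

```shell
# preview_add POOL VDEV...
# The -n flag makes "zpool add" a dry run: it reports what would happen
# and changes nothing.
preview_add() {
  zpool add -n "$@"
}

# Usage (placeholder device name):
#   preview_add oldarray /dev/disk/by-id/ata-EXAMPLE-part4
```

If the new device is listed as a top-level entry next to the raidz group rather than inside it, that is the linear add described above. As far as I know zpool may also simply refuse with a "mismatched replication level" complaint unless forced, which is the same warning in another form.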

There is a trick that allows you to do that without having an unlimited number of disks. You can create a degraded array, meaning the new array doesn’t yet have all the disks it will eventually have. In md raid you can do that directly: mdadm has commands that create degraded raid arrays. zpool has no direct support for that, but you can hack it up by faking zpool members with sparse files.
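
In case sparse files are new to you, here is a small demonstration of why they work as stand-ins: the file claims a large size but occupies almost nothing until data is written to it. This is only an illustration with a 1 GiB file in a temporary directory, and it assumes GNU coreutils (truncate, stat -c) on Linux:

```shell
# Demonstration only: a small sparse file in a temporary directory.
demo_dir=$(mktemp -d)
demo_file="$demo_dir/fake-member.img"

# truncate sets the apparent size without writing any data blocks.
truncate -s 1G "$demo_file"

# Size the file claims to have, in bytes.
apparent_bytes=$(stat -c %s "$demo_file")
# Space actually allocated: block count times the size of each block.
allocated_bytes=$(( $(stat -c %b "$demo_file") * $(stat -c %B "$demo_file") ))

echo "apparent:  $apparent_bytes bytes"
echo "allocated: $allocated_bytes bytes"

rm -rf "$demo_dir"
```

The same holds for the multi-terabyte fake members used later: they cost no real space as long as nothing gets written to them.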

Before we get to the commands, let me explain how the trick works:

  • you have 5 x 1 TB disks in a zraid3, for a capacity of 2 x 1 TB
  • you want to add 3 disks
  • which would give you 8 x 1 TB disks for a capacity of 5 x 1 TB
  • since you cannot expand the old zraid you would ordinarily need 13 disks during the transfer

However, you have 3 redundancy disks in the old array and 3 redundancy disks in the new array. So if you were willing to do this with no redundancy at all, you could do it with 7 disks total: 2 in the old array, 5 in the new array. Obviously you have at least 8; let’s say you have 9 (you bought one extra as a shelf spare). So:

  • you have 5 disks in the existing array
  • you have 4 new, empty disks (you bought 3 for capacity and 1 as a spare)
  • you reduce the redundancy of the existing array by one. Now you have 4 disks in the old array (still 2x redundant) and 5 disks for the new array, which you can create in a degraded state with no redundancy.
  • that allows you to copy over all the data. It’ll be slow, but it will work.
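
The disk counting above can be checked with a few lines of shell arithmetic. The function and variable names are made up for this sketch; the numbers are the ones from the example (raidz3, 5 old disks, 8 planned, 9 on hand):

```shell
# data_disks TOTAL PARITY: disks' worth of usable capacity in one raidz group
data_disks() {
  echo $(( $1 - $2 ))
}

old_total=5   # disks in the existing array
new_total=8   # disks in the planned array
parity=3      # raidz3
on_hand=9     # 5 existing + 3 for capacity + 1 shelf spare

old_data=$(data_disks "$old_total" "$parity")
new_data=$(data_disks "$new_total" "$parity")

echo "old capacity: $old_data x 1 TB, new capacity: $new_data x 1 TB"
echo "disks for a fully redundant side-by-side copy: $(( old_total + new_total ))"
echo "bare minimum with no redundancy anywhere: $(( old_data + new_data ))"
echo "redundancy disks left to hand out during the copy: $(( on_hand - old_data - new_data ))"
```

With 9 disks on hand, the leftover redundancy disks are the ones that stay in the old array while the copy runs.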

Let’s look at error scenarios during transfer with reduced redundancy:

  • if a disk in the new array fails you have to start over after buying another disk.
  • if a disk in the old array fails you are reduced to redundancy 1x. You probably want to abort the copy, take down the new array and move 1 or 2 disks back to the original array. You will have to resilver with 1 redundant disk. That is the critical risk here, as disks are more likely to fail during resilver. That is why you might want to buy more disks.
  • the risk from one failure in the old array also depends on how far along the copy is. If it is almost finished you might want to plow through.
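
To keep an eye on how much redundancy the old array has left while the copy runs, something like the following could help. It is a rough sketch that assumes the usual NAME STATE READ WRITE CKSUM table in the zpool status output; the function name missing_members is made up here:

```shell
# missing_members: reads "zpool status" output on stdin and prints how many
# rows of the device table are in a state where the device is not
# contributing. Lines such as "state: ..." are skipped via the colon check.
missing_members() {
  awk '$1 !~ /:$/ && $2 ~ /^(OFFLINE|FAULTED|UNAVAIL|REMOVED)$/ { n++ }
       END { print n + 0 }'
}

# Usage during the copy (raidz3, so parity is 3):
#   missing=$(zpool status oldarray | missing_members)
#   echo "oldarray redundancy left: $(( 3 - missing ))"
```

With one disk already taken out of the old array this should report 2 left; if it drops to 1, you are in the second scenario above.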

Finishing after successful copy:

  • after the copy is finished you move over redundant disks from the old array to the new one.
  • so in the minimal version the copy happens with 4 disks in the old array (2 redundant) and 5 disks in the new array (0 redundant).
  • if the copy worked you move over another disk from the old array, which is now at 3 disks (1 redundant), whereas the new array arrives at 6 disks (1 redundant). The risk here is a bit less than during the copy, since the old array’s disks are not loaded. All activity is on the new array.
  • move another disk; now the old array is at 0 redundant, and the new one moves from 1 to 2 redundant.
  • destroy the old array to move another disk; now the new array is at full redundancy: 8 disks, 5 for capacity, 3 for redundancy.
  • you have one spare disk.
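
The bookkeeping for these moves is easy to get wrong, so here it is as a small simulation. Nothing in it touches a pool; it only prints the disk and redundancy counts after each step, with variable names made up for the sketch:

```shell
old=4; new=5            # disks in each array while the copy runs
old_data=2; new_data=5  # disks each array needs for capacity alone

# report LABEL: print disk counts and redundancy for both arrays
report() {
  printf '%-18s old: %d disks (%d redundant), new: %d disks (%d redundant)\n' \
    "$1" "$old" "$(( old - old_data ))" "$new" "$(( new - new_data ))"
}

report "during the copy"
for step in 1 2; do
  old=$(( old - 1 )); new=$(( new + 1 ))
  report "after move $step"
done

# The old array is destroyed: one of its two remaining disks completes the
# new array, the other becomes the shelf spare.
new=$(( new + 1 )); spare=$(( old - 1 )); old=0
echo "after destroying the old array: new: $new disks ($(( new - new_data )) redundant), spare: $spare"
```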

Preparations. Now, there are a couple of things to reduce the risk:

  • you absolutely need to scrub the old array before doing any of this. A hidden read error would instantly kick out a disk. You don’t want that during this procedure.
  • if you can, set the old array to readonly during this entire procedure. If you set it to readonly you will not end up with a dirty array when disks drop. If you have read errors that degrade the disks below redundancy you can still use hex editing and zdb to put the failed disk back. That is usually possible when no writes were involved.
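
The two preparations could look roughly like this. It is a sketch under assumptions, not a tested recipe: the function names are made up, zpool wait needs OpenZFS 2.0 or later, and "readonly" is taken to mean importing the whole pool with readonly=on. As far as I know a pool imported that way cannot take new snapshots, so a snapshot meant for zfs send would have to exist before the re-import.

```shell
# scrub_old_pool POOL: scrub, wait for the scrub to finish, show the result.
# Read the status output yourself before going any further.
scrub_old_pool() {
  zpool scrub "$1" || return 1
  zpool wait -t scrub "$1" || return 1
  zpool status -v "$1"
}

# freeze_old_pool POOL: re-import the pool read-only.
freeze_old_pool() {
  zpool export "$1" || return 1
  zpool import -o readonly=on "$1"
}

# Usage:
#   scrub_old_pool oldarray
#   freeze_old_pool oldarray
```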

So the command sequence:

# scrub first to make sure you have no looming read errors
zpool scrub oldarray
# set old array to readonly if you can
...
# take one disk out of the existing array
...
# create a couple of sparse files that will be fake members
# of the new array:
truncate -s 8001563222016 /tmp/FD1.img
truncate -s 8001563222016 /tmp/FD2.img
truncate -s 8001563222016 /tmp/FD3.img
# create the new array
zpool create -f newarray \
-o feature@userobj_accounting=disabled \
-o feature@edonr=disabled \
-o feature@project_quota=disabled \
-o feature@allocation_classes=disabled \
-o feature@resilver_defer=disabled \
raidz3 \
ata-TOSHIBA_MD04ACA500_Y5R7K9CXFS9A-part4 \
ata-TOSHIBA_MD04ACA500_Y5N8K655FS9A-part4 \
ata-TOSHIBA_HDWE150_19MRK03SFB9G-part4 \
ata-TOSHIBA_HDWE150_19HRK02NFB9G-part4 \
ata-TOSHIBA_HDWE150_98I4K05WFB9G-part4 \
/tmp/FD[123].img
# that command also turns off features that will make a Linux
# zpool unreadable under FreeBSD and Solaris.
# Now, as fast as possible take the fake devices out of the array:
zpool offline newarray /tmp/FD1.img
zpool offline newarray /tmp/FD2.img
zpool offline newarray /tmp/FD3.img
# you do that as fast as possible since anything written to the pool
# would otherwise fill the sparse files with actual data.
# now you can copy the data:
zfs snapshot -r oldarray@tonewarray
zfs send -R oldarray@tonewarray | zfs recv -Fdu newarray
# if the copy succeeds move on to remove disks from oldarray
# and add them to newarray
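
One way that last step could go, as an untested sketch. It assumes that oldarray is exported once the copy has been verified, so that its disks are no longer in use and their labels can be cleared, and that zpool wait is available (OpenZFS 2.0 or later). The function name move_disk and the device path in the usage lines are placeholders:

```shell
# move_disk DISK FAKE
# Clears the old pool label from DISK, then lets it take the place of the
# sparse file FAKE in newarray and waits for the resilver to finish.
# Assumes oldarray has already been exported.
move_disk() {
  zpool labelclear -f "$1" || return 1
  zpool replace newarray "$2" "$1" || return 1
  zpool wait -t resilver newarray
}

# Usage, one disk at a time (placeholder device path):
#   zpool export oldarray
#   move_disk /dev/disk/by-id/ata-EXAMPLE-part4 /tmp/FD1.img
```

Moving one disk at a time and waiting for each resilver keeps the redundancy of both arrays changing in the single steps described earlier.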