Wandering Thoughts archives

2016-06-19

Why ZFS can't really allow you to add disks to raidz vdevs

Today, the only change ZFS lets you make to a raidz vdev once you've created it is to replace a disk with another one. You can't do things like, oh, adding another disk to expand the vdev, which people wish for every so often. On the surface, this is an artificial limitation that could be bypassed if ZFS wanted to, although it wouldn't really do what you want. Underneath the surface, there is an important ZFS invariant that makes it impossible.

What makes this nominally easy in theory is that ZFS raidz vdevs already use variable width stripes. A conventional RAID system uses full width stripes, where every stripe spans all disks. When you add another disk, the RAID system has to change how all of the existing data is laid out to preserve this full width; you go from having the data and parity striped across N disks to having it striped across N+1 disks. But with variable width stripes, ZFS doesn't have this problem; adding another disk doesn't require touching any of the existing stripes, even the ones that were full width. All that happens is that they go from being full width stripes to being partial width ones.
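
To make the contrast concrete, here is a toy Python sketch (nothing like actual ZFS or RAID code) that models each stripe simply as the set of disks it touches:

    # Toy sketch: with 4 disks, full width stripes cover disks 0-3.
    old_stripes = [{0, 1, 2, 3}, {0, 1, 2, 3}]

    # Conventional RAID growth: every existing stripe must now span all
    # 5 disks, so all existing data has to be re-laid-out ('reshaped').
    reshaped = [{0, 1, 2, 3, 4} for _ in old_stripes]

    # Hypothetical raidz growth by just adding disk 4: the old stripes
    # are left untouched and are now merely partial width; only newly
    # written stripes would span all 5 disks.
    grown = old_stripes + [{0, 1, 2, 3, 4}]   # one new, full width stripe

The point of the sketch is simply that in the second case nothing already on disk has to move.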

However, this is probably not really what you wanted, because it doesn't get you as much new space as adding a disk does in a conventional RAID system. In a conventional RAID system, the reshaping involved both minimizes the RAID overhead and gives you a large contiguous chunk of free space at the end of the RAID array. Simply adding a disk to a ZFS raidz vdev this way would obviously not do that; all of your old 'full width' stripes would now be somewhat inefficient partial width stripes, and much of the new free space would be scattered about in little bits at the end of those partial width stripes.

In fact, the free space issue is the fatal flaw here. ZFS raidz imposes a minimum size on chunks of free space; they must be large enough that ZFS can write one data block plus its parity blocks (ie N+1 blocks, where N is the raidz level). Were we to just add another disk alongside the existing disks, much of the free space on it could in fact violate this invariant. For example, if the vdev previously had two full width stripes right next to each other, adding a new disk creates a single-block chunk of free space on the new disk in between them.
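
As a toy illustration of the invariant (my own sketch, not ZFS source), the smallest possible raidz allocation is one data block plus one parity block per raidz level, so a one-block hole is never big enough:

    def min_allocation(raidz_level):
        # 1 data block plus raidz_level parity blocks.
        return 1 + raidz_level

    def usable(free_chunk_blocks, raidz_level):
        return free_chunk_blocks >= min_allocation(raidz_level)

    # The single-block gap left between two old full width stripes by
    # naively adding a disk:
    print(usable(1, raidz_level=1))   # False: can't hold data + parity
    print(usable(2, raidz_level=1))   # True for raidz1
    print(usable(2, raidz_level=2))   # False for raidz2 (needs 3 blocks)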

You might be able to get around this by immediately marking such space on the new disk as allocated instead of free, but if so you could find that you got almost no extra space from adding the disk. This is probably especially likely on a relatively full pool, which is exactly the situation where you'd like to get space quickly by adding another disk to your existing raidz vdev.

Realistically, adding a disk to a ZFS raidz vdev requires the same sort of reshaping as adding a disk to a normal RAID-5+ system; you really want to rewrite stripes so that they span across all disks as much as possible. As a result, I think we're unlikely to ever see it in ZFS.

ZFSRaidzDiskAddition written at 02:03:01

2016-06-17

Why you can't remove a device from a ZFS pool to shrink it

One of the things about ZFS that bites people every so often is that you can't remove devices from ZFS pools. If you do 'zpool add POOL DEV', congratulations, that device or an equivalent replacement is there forever. More technically, you cannot remove vdevs once they're added, although you can add and remove mirrors from a mirrored vdev. Since people do make mistakes with 'zpool add', this is periodically a painful limitation. At this point you might well ask why ZFS can't do this, especially since many other volume managers do support various forms of shrinking.

The simple version of why not is ZFS's strong focus on 'write once' immutability and being a copy on write filesystem. Once it writes filesystem information to disk, ZFS never changes it; if you change data at the user level (by rewriting a file or deleting it or updating a database or whatever), ZFS writes a new copy of the data to a different place on disk and updates everything that needs to point to it. That disk blocks are not modified once written creates a whole lot of safety in ZFS and is a core invariant in the whole system.
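
A minimal sketch of the copy on write idea, with a plain Python list standing in for the disk (nothing here resembles real ZFS data structures):

    storage = []            # pretend disk: append-only list of blocks

    def write_block(data):
        storage.append(data)
        return len(storage) - 1      # the 'block pointer' is just an index

    def cow_update(old_ptr, new_data):
        # The old block at old_ptr is left untouched (snapshots may still
        # reference it); we write a fresh copy elsewhere and return the
        # new pointer for the parent structure to record.
        return write_block(new_data)

    file_ptr = write_block(b"version 1")
    file_ptr = cow_update(file_ptr, b"version 2")
    assert storage[0] == b"version 1"         # the original is intact
    assert storage[file_ptr] == b"version 2"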

Removing a vdev obviously requires breaking this invariant, because as part of removing vdev A you must move all of the currently in use blocks on A over to some other vdev and then change everything that points to those blocks to use the new locations. You need to do this not just for ordinary filesystem data (which can change anyways) but also for things like snapshots that ZFS normally never modifies once created. This is a lot of work (and code) that breaks a bunch of core ZFS invariants. As a result, ZFS was initially designed without the ability to do this and no one has added it since.

(This is/was known as 'block pointer rewrite' in the ZFS community. ZFS block pointers tell ZFS where to find things on disk (well, on vdevs), so you need to rewrite them if you move those things from one disk to another.)
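
As a rough sketch of the idea (my simplification, not the actual on-disk block pointer format), a block pointer records which vdev a block lives on and where, so moving the block makes every copy of that pointer stale:

    from collections import namedtuple

    BlockPointer = namedtuple("BlockPointer", ["vdev", "offset", "size"])

    bp = BlockPointer(vdev=2, offset=0x4000, size=128 * 1024)
    # Evacuating vdev 2 would require finding and rewriting bp (and every
    # other pointer to this block, including ones inside read-only
    # snapshots) to point at the block's new home:
    moved = bp._replace(vdev=0, offset=0x98000)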

About a year and a half ago, I wrote an entry about how ZFS pool shrinking might be coming. Given what I've written here, you might wonder how it works. The answer is that it cheats. Rather than touch the ZFS block pointers, it adds an extra layer underneath them that maps IO from one vdev to another. I'm sure this works, but it also implies that removing a vdev adds a more or less permanent extra level of indirection for access to all blocks that used to be on the vdev. In effect the removed vdev lingers on as a ghost instead of being genuinely gone.
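
Here is a sketch of how I understand the indirection to work, with invented names and a plain dictionary standing in for the real persistent mapping (this is not the actual implementation):

    class IndirectVdev:
        def __init__(self, mapping):
            # mapping: old offset on the removed vdev -> (new vdev, new offset)
            self.mapping = mapping

        def translate(self, old_offset):
            # Every read of a block that used to live on the removed vdev
            # pays this extra lookup, and the mapping itself has to live
            # somewhere (in RAM, or fetched from disk).
            return self.mapping[old_offset]

    ghost = IndirectVdev({0x4000: (0, 0x98000), 0x8000: (1, 0x1c000)})
    print(ghost.translate(0x4000))    # -> (0, 0x98000)

The block pointers that named the removed vdev are never rewritten; reads through them are simply redirected.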

(This obviously has an effect on, for example, ZFS RAM usage. That mapping data has to live somewhere, and may have to be fetched off disk, and we've seen this show before.)

Having the ability to remove an accidentally added vdev is a good thing, but the more I look at the original Delphix blog entry, the more dubious I am about ever using it for anything big. A quick removal of an accidentally added vdev has the advantage that almost nothing should be on the new vdev, and normal churn might well get rid of the few bits that wound up on it (and so allow the extra indirection to go away). Shrinking an old, well used pool by a vdev or two is not going to be like that, especially if you have things like old snapshots.

ZFSWhyNoVdevRemoval written at 02:10:33
