Wandering Thoughts archives

2016-06-17

Why you can't remove a device from a ZFS pool to shrink it

One of the things about ZFS that bites people every so often is that you can't remove devices from ZFS pools. If you do 'zpool add POOL DEV', congratulations, that device or an equivalent replacement is there forever. More technically, you cannot remove vdevs once they're added, although you can attach and detach devices in a mirrored vdev. Since people do make mistakes with 'zpool add', this is periodically a painful limitation. At this point you might well ask why ZFS can't do this, especially since many other volume managers do support various forms of shrinking.

The simple version of why not is ZFS's strong focus on 'write once' immutability and being a copy on write filesystem. Once it writes filesystem information to disk, ZFS never changes it; if you change data at the user level (by rewriting a file or deleting it or updating a database or whatever), ZFS writes a new copy of the data to a different place on disk and updates everything that needs to point to it. That disk blocks are not modified once written creates a whole lot of safety in ZFS and is a core invariant in the whole system.
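To make the 'write once' idea concrete, here is a minimal toy sketch of a copy on write store. Everything in it (the `ToyCowStore` class, the integer addresses, the single root pointer) is a hypothetical illustration and not how ZFS actually lays things out on disk; it only shows that an update writes to a fresh location and repoints the root, which is why an old pointer kept by something like a snapshot stays valid.

```python
class ToyCowStore:
    """Toy copy-on-write block store; NOT ZFS's real on-disk format."""

    def __init__(self):
        self.blocks = {}     # address -> bytes; an address is written only once
        self.next_addr = 0   # next never-used address
        self.root = None     # address of the current version of the data

    def _write_new(self, data):
        # Always allocate a fresh address; existing blocks are never touched.
        addr = self.next_addr
        self.next_addr += 1
        assert addr not in self.blocks
        self.blocks[addr] = data
        return addr

    def update(self, data):
        """Write the new data somewhere new, then repoint the root to it."""
        old_root = self.root
        self.root = self._write_new(data)
        return old_root


store = ToyCowStore()
store.update(b"version 1")
snapshot_root = store.root   # a 'snapshot' just remembers the old pointer
store.update(b"version 2")

print(store.blocks[snapshot_root])  # the old block is still intact
print(store.blocks[store.root])     # the new data lives at a different address
```

In this toy model nothing ever has to go back and edit a block or a saved pointer, which is the property that vdev removal would have to break.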

Removing a vdev obviously requires breaking this invariant, because as part of removing vdev A you must move all of the currently in use blocks on A over to some other vdev and then change everything that points to those blocks to use the new locations. You need to do this not just for ordinary filesystem data (which can change anyways) but also for things like snapshots that ZFS normally never modifies once created. This is a lot of work (and code) that breaks a bunch of core ZFS invariants. As a result, ZFS was initially designed without the ability to do this and no one has added it since.

(This is/was known as 'block pointer rewrite' in the ZFS community. ZFS block pointers tell ZFS where to find things on disk (well, on vdevs), so you need to rewrite them if you move those things from one disk to another.)
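As a rough sketch of why this is so invasive, the toy model below treats a block pointer as just a (vdev, offset) pair and asks which owners hold pointers into the vdev being removed. The `BlockPointer` class, the `pointers_needing_rewrite` helper, and the sample pool contents are all made up for illustration; real ZFS block pointers carry much more (checksums, birth times and so on) and live inside other blocks on disk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockPointer:
    """Hypothetical, heavily simplified block pointer."""
    vdev: str
    offset: int


def pointers_needing_rewrite(pointer_sets, vdev):
    """Return {owner: [pointers]} for every pointer that refers to `vdev`."""
    return {
        owner: [bp for bp in bps if bp.vdev == vdev]
        for owner, bps in pointer_sets.items()
        if any(bp.vdev == vdev for bp in bps)
    }


# Invented sample data: who points at which blocks.
pool = {
    "live filesystem": [BlockPointer("A", 0), BlockPointer("B", 8)],
    "snapshot @monday": [BlockPointer("A", 0), BlockPointer("A", 16)],
    "snapshot @tuesday": [BlockPointer("B", 8)],
}

affected = pointers_needing_rewrite(pool, "A")
for owner, bps in affected.items():
    print(owner, bps)
```

Even in this tiny example, removing vdev A means rewriting pointers held by a snapshot, which is exactly the sort of thing that is otherwise never modified after creation.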

About a year and a half ago, I wrote an entry about how ZFS pool shrinking might be coming. Given what I've written here, you might wonder how it works. The answer is that it cheats. Rather than touch the ZFS block pointers, it adds an extra layer underneath them that maps IO from one vdev to another. I'm sure this works, but it also implies that removing a vdev adds a more or less permanent extra level of indirection for access to all blocks that used to be on the vdev. In effect the removed vdev lingers on as a ghost instead of being genuinely gone.

(This obviously has an effect on, for example, ZFS RAM usage. That mapping data has to live somewhere, and may have to be fetched off disk, and we've seen this show before.)
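The indirection idea can be sketched as a remap table that is consulted whenever a block pointer names a removed vdev. This is only a toy under my own assumptions: the `RemovedVdevMap` class, the `read_location` helper, and the sample offsets are hypothetical and are not the actual data structures used by the Delphix work. Block pointers stay untouched; only the lookup path changes.

```python
import bisect


class RemovedVdevMap:
    """Toy remap table: offset ranges on a removed vdev -> new locations."""

    def __init__(self):
        self.starts = []    # sorted start offsets on the removed vdev
        self.entries = []   # parallel list of (length, new_vdev, new_offset)

    def add(self, old_offset, length, new_vdev, new_offset):
        # Keep both lists sorted by the old starting offset.
        i = bisect.bisect_left(self.starts, old_offset)
        self.starts.insert(i, old_offset)
        self.entries.insert(i, (length, new_vdev, new_offset))

    def resolve(self, old_offset):
        # Find the last range that starts at or before old_offset.
        i = bisect.bisect_right(self.starts, old_offset) - 1
        if i < 0:
            raise KeyError(old_offset)
        start = self.starts[i]
        length, new_vdev, new_offset = self.entries[i]
        if old_offset >= start + length:
            raise KeyError(old_offset)
        return new_vdev, new_offset + (old_offset - start)


def read_location(bp, removed):
    """bp is an unmodified (vdev, offset) pair; follow the remap if needed."""
    vdev, offset = bp
    if vdev in removed:
        return removed[vdev].resolve(offset)
    return vdev, offset


remap = RemovedVdevMap()
remap.add(0, 128, "B", 4096)   # old A[0..128) now lives at B[4096..)
remap.add(512, 64, "C", 0)     # old A[512..576) now lives at C[0..)
removed = {"A": remap}

print(read_location(("A", 10), removed))
print(read_location(("A", 520), removed))
print(read_location(("B", 8), removed))
```

In this sketch the table needs one entry per relocated range for as long as anything still points into the removed vdev, so the more fragmented the moved data is, the more mapping data has to be kept around.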

Having the ability to remove an accidentally added vdev is a good thing, but the more I look at the original Delphix blog entry, the more dubious I am about ever using it for anything big. A quick removal of an accidentally added vdev has the advantage that almost nothing should be on the new vdev, and normal churn might well get rid of the few bits that wound up on it (and so allow the extra indirection to go away). Shrinking an old, well used pool by a vdev or two is not going to be like that, especially if you have things like old snapshots.

solaris/ZFSWhyNoVdevRemoval written at 02:10:33

