You can't delegate a ZFS administration permission to delete only snapshots
ZFS has a system that lets you selectively delegate administration
permissions from root to other users (exposed through 'zfs allow')
on a per filesystem tree basis. This led to the following interesting
question (and answer) over on the fediverse:
@wxcafe: hey can anyone here confirm that there's no zfs permission for destroying only snapshots?
@cks: I can confirm this based on the ZFS on Linux code. The 'can you destroy a snapshot' code delegates to a general 'can you destroy things' permission check that uses the overall 'destroy' permission.
(It also requires mount permissions, presumably because you have to be able to unmount something that you're about to destroy.)
The requirement for unmount means that delegating 'destroy' permissions may not work on Linux (or may not always work), because only root can unmount things on Linux. I haven't tested to see whether ZFS will let you delegate unmount permission (and thereby pass its internal checks) but then later the unmount operation will fail, or whether the permission cannot be delegated on Linux (which would mean that you can't delegate 'destroy' either).
The inability to let people delete only snapshots is a bit unfortunate, because you can delegate the ability to create them (as the 'snapshot' permission). It would be nice to be able to delegate snapshot management entirely to people (or to an unprivileged account used for automated snapshot management) without letting them destroy the filesystem itself.
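As a sketch of what this asymmetry looks like in practice (the user name 'snapuser' and pool 'tank/data' are hypothetical, and this assumes a system where ZFS delegation works at all):

```shell
# Delegate snapshot creation (plus 'mount', which the zfs-allow
# manpage says snapshot operations also require) to an
# unprivileged account:
zfs allow snapuser snapshot,mount tank/data

# There is no 'destroy snapshots only' permission. To let
# snapuser destroy snapshots, you must grant 'destroy', which
# also covers destroying filesystems:
zfs allow snapuser destroy,mount tank/data

# Inspect what is currently delegated:
zfs allow tank/data
```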
This situation is the outcome of two separate and individually
sensible design decisions, which combine together here in a not
great way. First, ZFS decided that creating snapshots would be a
separate 'zfs snapshot' command but destroying them would be part
of 'zfs destroy' (a decision that I personally dislike because of
how it puts you that much closer to an irreversible error). Then
when it added delegated permissions, ZFS chose to delegate pretty
much by 'zfs' commands, although it could have chosen a different
split. Since destroying snapshots is part of 'zfs destroy', it is
all covered under one 'destroy' permission.
(The code in the ZFS kernel module does not require this; it has a separate permission check function for each sort of thing being destroyed. They all just call a common permission check function.)
The good news is that while writing this entry and reading the
'zfs allow' manpage, I realized that there may sort of be a
workaround in specific situations. I'll just quote myself:
Actually I think it may be possible to do this in practice under selective circumstances. You can delegate a permission only for descendants of a filesystem, not for the filesystem itself, so if a filesystem will only ever have snapshots underneath it, I think that a 'descendants only' destroy delegation will in practice only let people destroy snapshots, because that's all that exists.
Disclaimer: this is untested.
On our fileservers, we don't have nested filesystems (or at least not any that contain data), so we could do this; anything that we'll snapshot has no further real filesystems as children. However in other setups you would have a mixture of real filesystems and snapshots under a top level filesystem, and delegating 'destroy' permission would allow people to destroy both.
(This assumes that you can delegate 'unmount' permission so that the ZFS code will allow you to do destroys in the first place. The relevant ZFS code checks for unmount permission before it checks for destroy permission.)
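My untested workaround would look something like this (user and pool names are hypothetical; '-d' restricts the grant to descendants of the named filesystem, excluding the filesystem itself):

```shell
# Grant 'destroy' (and the 'mount' permission that the ZFS destroy
# code checks first) only on descendants of tank/data. If tank/data
# has no child filesystems, the only destroyable descendants that
# will ever exist are its snapshots.
zfs allow -d snapuser destroy,mount tank/data
```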
Doing frequent ZFS scrubs lets you discover problems close to when they happened
Somewhat recently, the ZFS on Linux mailing list had a discussion of how frequently you should do ZFS scrubs, with a number of people suggesting that modern drives only really need relatively infrequent scrubs. As I was reading through the thread as part of trying to catch up on the list, it struck me that there is a decent reason for scrubbing frequently despite this. If we assume that scrubs surface existing problems that had previously been silent (instead of creating new ones), doing frequent scrubs lowers the mean time before you detect such problems.
Lowering the mean time to detection has the same advantage it does in programming (with things like unit tests), which is that it significantly narrows down when the underlying problem could have happened. If you scrub data once a month and you find a problem in a scrub, the problem could have really happened any time in the past month; if you scrub every week and find a problem, you know it happened in the past week. Relatedly, the sooner you detect that a problem happened in the recent past, the more likely you are to still have logs, traces, metrics, and other information that might let you look for anomalies and find a potential cause (beyond 'the drive had a glitch', because that's not always the problem).
In a modern ZFS environment with sequential scrubs (or just SSDs), scrubs are generally fast and low impact (although it depends on your IO load), so the impact of doing them every week for all of your data is probably low. I try to scrub the pools on my personal machines every week, and I generally don't notice. Now that I'm thinking about scrubs this way, I'm going to try to be more consistent about weekly scrubs.
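One minimal way to get consistent weekly scrubs is a root cron entry along these lines (the pool name is hypothetical, and some distributions ship a packaged cron job or systemd timer for this instead):

```shell
# /etc/cron.d/zfs-scrub (hypothetical file): start a scrub of
# 'tank' every Sunday at 03:00. A scrub of an already-scrubbing
# pool fails harmlessly, so overlap is not a real concern here.
0 3 * * 0  root  /sbin/zpool scrub tank
```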
(Our fileservers scrub each pool once every four weeks on a rotating basis. We could lower this, even down to once a week, but despite what I've written here I suspect that we're not going to bother. We don't see checksum errors or other problems very often, and we probably aren't going to do deep investigation of anything that turns up. If we can trace a problem to a disk IO error or correlate it with an obvious and alarming SMART metric, we're likely to replace the disk; otherwise, we're likely to clear the error and see if it comes back.)
What we do to enable us to grow our ZFS pools over time
In my entry on why ZFS isn't good at growing and reshaping pools, I mentioned that we go to quite some lengths in our ZFS environment to be able to incrementally expand our pools. Today I want to put together all of the pieces of that in one place to discuss what those lengths are.
Our big constraint is that not only do we need to add space to pools over time, but we have a fairly large number of pools and which pools will have space added to them is unpredictable. We need a solution to pool expansion that leaves us with as much flexibility as possible for as long as possible. This pretty much requires being able to expand pools in relatively small increments of space.
The first thing we do, or rather don't do, is that we don't use raidz. Raidz is potentially attractive on SSDs (where the raidz read issue has much less impact), but since you can't expand a raidz vdev, the minimum expansion for a pool using raidz vdevs is at least three or four separate 'disks' to make a new raidz vdev (and in practice you'd normally want to use more than that to reduce the raidz overhead, because a four disk raidz2 vdev is basically a pair of mirrors with slightly more redundancy but more awkward management and some overheads). This requires adding relatively large blocks of space at once, which isn't feasible for us. So we have to do ZFS mirroring instead of the more space efficient raidz.
(A raidz2 vdev is also potentially more resilient than a bunch of mirror vdevs, because you can lose any arbitrary two disks without losing the pool.)
However, plain mirroring of whole disks would still not work for us because that would mean growing pools by relatively large amounts of space at a time (and strongly limit how many pools we can put on a single fileserver). To enable growing pools by smaller increments of space than a whole disk, we partition all of our disks into smaller chunks, currently four chunks on a 2 TB disk, and then do ZFS mirror vdevs using chunks instead of whole disks. This is not how you're normally supposed to set up ZFS pools, and on our older fileservers using HDs over iSCSI it caused visible performance problems if a pool ever used two chunks from the same physical disk. Fortunately those seem to be gone on our new SSD-based fileservers.
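A sketch of the chunk-based approach (device names, partition sizes, and the pool name are all hypothetical; as noted, this is not how you're normally supposed to set up pools):

```shell
# Split each 2 TB disk into four roughly equal partitions,
# here with sgdisk:
sgdisk -n 1:0:+465G -n 2:0:+465G -n 3:0:+465G -n 4:0:0 /dev/sdb
sgdisk -n 1:0:+465G -n 2:0:+465G -n 3:0:+465G -n 4:0:0 /dev/sdc

# Build a pool from one mirrored pair of chunks, keeping the two
# sides of each mirror on different physical disks:
zpool create tank mirror /dev/sdb1 /dev/sdc1

# Later, grow the pool one chunk-pair at a time as space is needed:
zpool add tank mirror /dev/sdb2 /dev/sdc2
```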
Even with all of this we can't necessarily let people expand existing pools by a lot of space, because the fileserver their pool is on may not have enough free space left (especially if we want other pools on that fileserver to still be able to expand). When people buy enough space at once, we generally wind up starting another ZFS pool on a different fileserver, which somewhat cuts against the space flexibility that ZFS offers. People may not have to decide up front how much space they want their filesystems to have, but they may have to figure out which pool a new filesystem should go into and then balance usage across all of their pools (or have us move filesystems).
(Another thing we do is that we sell pool space to people in 1 GB increments, although usually they buy more at once. This is implemented using a pool quota, and of course that means that we don't even necessarily have to grow the pool's space when people buy space; we can just increase the quota.)
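Selling space through quotas rather than raw vdev space might look like this (the pool name and sizes are hypothetical; the quota goes on the pool's top-level filesystem):

```shell
# The pool may hold more raw space than has been bought; the
# quota is what the group actually sees and can use.
zfs set quota=750G tank

# When they buy more space and the pool already has room, only
# the quota needs to change, not the pool's vdevs:
zfs set quota=1000G tank
```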
Although we can grow pools relatively readily (when we need to), we still have the issue that adding a new vdev to a ZFS pool doesn't rebalance space usage across all of the pool's vdevs; it just mostly writes new data to the new vdev. In a SSD world where seeks are essentially free and we're unlikely to saturate the SSD's data transfer rates on any regular basis, this imbalance probably doesn't matter too much. It does make me wonder if nearly full pool vdevs interact badly with ZFS's issues with coming near quota limits (and a followup).