How we guarantee there's always some free space in our ZFS pools
One of the things that we discovered fairly early on in our experience
with ZFS (I think within the lifetime of the first generation
Solaris fileservers) is that ZFS gets very
unhappy if you let a pool get completely full. The situation has
improved since then, but back in the days we couldn't even change
ZFS properties, much less remove files as
root. Being unable to
change properties is a serious issue for us because NFS exports
are controlled by ZFS properties, so if we had a full pool we
couldn't modify filesystem exports to cut off access from client
machines that were constantly filling up the filesystem.
(At one point we resorted to cutting off a machine at the firewall, which is a pretty drastic step. Going this far isn't necessary for machines that we run, but we also NFS export filesystems to machines that other trusted sysadmins run.)
To stop this from happening, we use pool-wide quotas. No matter
how much space people have purchased in a pool or even if this is a system pool
that we operate, we insist that it always have a minimum safety
margin, enforced through a '
quota=' setting on the root of the
pool. When people haven't purchased enough to use all of the pool's
current allocated capacity, this safety margin is implicitly the
space they haven't bought. Otherwise, we have two minimum margins.
The explicit minimum margin is that our scripts that manage pool
quotas always insist on a 10 MByte safety margin. The implicit
minimum margin is that we normally only set pool quotas in full GB,
so a pool can be left with several hundred MB of space between its
real maximum capacity and the nearest full GB.
All of this pushes the problem back one level, which is determining
what the pool's actual capacity is so we can know where this safety
margin is. This is relatively straightforward for us because all
of our pools use mirrored vdevs, which means that the size reported
zpool list' is a true value for the total usable space (people
with raidz vdevs are on their own here). However, we must reduce
this raw capacity a bit, because ZFS reserves 1/32nd of the pool
for its own internal use. We must reserve
at least 10 MB over and above this 1/32nd of the pool in order to
actually have a safety margin.
(All of this knowledge and math is embodied into a local script, so that we never have to do these calculations by hand or even remember the details.)
PS: These days in theory you can change ZFS properties and even remove files when your pool is what ZFS will report as 100% full. But you need to be sure that you really are freeing up space when you do this, not using more because of things like snapshots. Very bad things happen to your pool if it gets genuinely full right up to ZFS's internal redline (which is past what ZFS will normally let you unless you trick it); you will probably have to back it up, destroy it, and recreate it to fully recover.
(This entry was sparked by a question from a commentator on yesterday's entry on how big our fileserver environment is.)
Revisiting what the ZFS recordsize is and what it does
I'm currently reading Jim Salter's ZFS 101—Understanding ZFS
storage and performance,
and got to the section on ZFS's important
where the article attempts to succinctly explain a complicated ZFS
specific thing. ZFS recordsize is hard to explain because it's
relatively unlike what other filesystems do, and looking back I've
never put down a unified view of it in one place.
The simplest description is that ZFS recordsize is the (maximum) logical block size of a filesystem object (a file, a directory, a whatever). Files smaller than recordsize have a single logical block that's however large it needs to be (details here); files of recordsize or larger have some number of recordsize logical blocks. These logical blocks aren't necessarily that large in physical blocks (details here); they may be smaller, or even absent entirely (if you have some sort of compression on and all of the data was zeros), and under some circumstances the physical block can be fragmented (these are 'gang blocks').
(ZFS normally doesn't fragment the physical block that implements your logical block, for various good reasons including that one sequential read or write is generally faster than several of them.)
However, this logical block size has some important consequences because ZFS checksums are for a single logical block. Since ZFS always verifies the checksum when you read data, it must read the entire logical block even if you ask only for a part of it; otherwise it doesn't have all the data it needs to compute the checksum. Similarly, it has to read the entire logical block even when your program is only writing a bit of data to part of it, since it has to update the checksum for the whole block, which requires the rest of the block's data. Since ZFS is a copy on write system, it then rewrites the whole logical block (into however large a physical block it now requires), even if you only updated a little portion of it.
Another consequence is that since ZFS always writes (and reads) a full logical block, it also does its compression at the level of logical blocks (and if you use ZFS deduplication, that also happens on a per logical block basis). This means that a small recordsize will generally limit how much compression you can achieve, especially on disks with 4K sectors.
(Using a smaller maximum logical block size may increase the amount of data that you can deduplicate, but it will almost certainly increase the amount of RAM required to get decent performance from deduplication. ZFS deduplication's memory requirements for good performance are why you should probably avoid it; making them worse is not usually a good idea. Any sort of deduplication is expensive and you should use it only when you're absolutely sure it's worth it for your case.)
Looking back at DTrace from a Linux eBPF world (some thoughts)
As someone who made significant use of locally written DTrace scripts on OmniOS and has since moved from OmniOS to Linux for our current generation of fileservers, I've naturally been watching the growth of Linux's eBPF tooling with significant interest (and some disappointment, since they're still sort of a work in progress). This has left me with some thoughts on the DTrace experience on Solaris and then OmniOS as contrasted with the eBPF experience on Linux.
On Linux, eBPF and the tools surrounding it are a lot more rough and raw than I think DTrace ever was, and certainly than any version of DTrace that I actively used. By the time I started using it, DTrace was basically fully cooked on Solaris and accordingly there was little change between the start of my DTrace experience and the end of it (I think DTrace gained some more convenience functions for things like dealing with IP addresses, but no major shifts). But at the same time, how Linux as a whole has developed eBPF and the tools surrounding it has made Linux eBPF an open system with multiple interfaces (and levels of interface) in a way that DTrace never was, and this has enabled things that at least I never saw people doing on Solaris and Illumos.
In the DTrace world, the interface almost everyone was supposed to
use to deal with DTrace was, well,
dtrace the command and its
language (the DTrace library is explicitly marked as unstable and
private in its manual page).
You might run the command in various ways and generate programs for
it through templates or other things, but that was how you interacted
with things. If other levels of interface were documented (such as
building raw DTrace programs yourself and feeding them to the
kernel), they were definitely not encouraged and as a result I don't
think the community ever did much with them.
(People definitely built tools that used the DTrace system that
didn't produce processed text output from
dtrace, but these were
clearly product level work from a dedicated engineer team, not
anything you would produce for smaller scale things. Often they
came from Sun or a Illumos distribution provider, and so were
entitled to use private interfaces.)
By contrast, the Linux eBPF ecology has created a whole suite of
tools at various levels of the stack. There's an equivalent of
dtrace, but you can also set up eBPF instrumentation of something
from inside a huge number of different programming environments and
do a bunch of things with the information that it produces. This
has led to very flexible things, such as the Cloudflare eBPF
(which lets you surface Prometheus metrics for anything that you
can write an eBPF program for).
I can't help but feel that the Linux eBPF ecology has benefited a
lot from the fractured and separated way that eBPF has been developed.
No single group owned the entire stack, top to bottom, and so the
multiple groups involved were all tacitly forced to provide more
or less documented and more or less stable interfaces to each other.
The existence of these interfaces then allowed other people to come
along and take advantage of them, writing their own tools on top
of one or another bit and re-purposing them to do things like
create useful Grafana dashboards (via
a comment on here). DTrace's single
unified development gave us a much more polished
much sooner (and made it usable on all Solaris versions that supported
DTrace the idea), but that was it.
(I don't fault the DTrace developers for keeping libdtrace private and so on; it really is the obvious thing to do in a unitary development environment. Of course you don't want to lock yourself into backward compatibility with a kernel interface or internal implementation that you now realize is not the best idea.)
ZFS on Linux has now become the OpenZFS ZFS implementation
The other day I needed to link to a specific commit in ZFS on Linux for my entry on how deduplicated ZFS streams are now deprecated, so I went to the ZFS on Linux Github repository, which I track locally. Somewhat to my surprise, I wound up on the OpenZFS repository, which is now described as 'OpenZFS on Linux and FreeBSD' and is linked as such from the open-zfs.org page that links to all of them.
(The OpenZFS repo really is a renaming of the ZFS on Linux repo, because my git pulls have transparently kept working. The git tree in the OpenZFS repo is the same git tree that was previously ZFS on Linux. I believe that this is a change for OpenZFS's own repo, although I don't know where that was.)
I knew this was coming (I believe I've seen it mentioned in passing in various places), but it's still something to see that it's been done now. As I thought last year (in this entry), the center of gravity of ZFS development has shifted from Illumos to Linux. The OpenZFS 'zfs' repository doesn't represent itself as the ZFS upstream, but it certainly has a name that tacitly endorses that view (and the view is pretty much the reality).
Although there are risks to this shift, it feels inevitable. Despite ZFS being a third party filesystem on Linux, Linux is still where the action is. It's certainly where we went for our current generation of ZFS fileservers, because Linux could give us things that OmniOS (sadly) could not (such as working 10G-T Ethernet with Intel hardware). As someone who is on Linux ZFS now, I'm glad that it hasn't stagnated even as I'm sad that Illumos apparently did (for ZFS and other things).
I don't think this means anything different for ZFS on Illumos than it did before, when the Illumos people were talking about increasingly adopting changes from what was then ZFS on Linux (cf). I believe that changes and features are flowing between Illumos and OpenZFS (in both directions), but I don't know if there's any effort to make the OpenZFS repo directly useful for (or on) Illumos.
'Deduplicated' ZFS send streams are now deprecated and on the way out
For a fair while, '
zfs send' has had support for a -D argument, aka
--dedup, that causes it to send what is called a 'deduplicated stream'.
zfs(1) manpage describes this
Generate a deduplicated stream. Blocks which would have been sent multiple times in the send stream will only be sent once. The receiving system must also support this feature to receive a deduplicated stream. This flag can be used regardless of the dataset's
dedupproperty, but performance will be much better if the filesystem uses a dedup-capable checksum (for example,
Dedup send can only deduplicate over the set of blocks in the send command being invoked, and it does not take advantage of the dedup table to do so. This is a very common misconception among not only users, but developers, and makes the feature seem more useful than it is. As a result, many users are using the feature but not getting any benefit from it.
Dedup send requires a nontrivial expenditure of memory and CPU to operate, especially if the dataset(s) being sent is (are) not already using a dedup-strength checksum.
Dedup send adds developer burden. It expands the test matrix when developing new features, causing bugs in released code, and delaying development efforts by forcing more testing to be done.
As a result, we are deprecating the use of `zfs send -D` and receiving of such streams. This change adds a warning to the man page, and also prints the warning whenever dedup send or receive are used.
I actually had the reverse misconception about how deduplicated
sends worked; I assumed that they required deduplication to be on
in the filesystem itself. Since we will never use deduplication, I
never looked any further at the '
zfs send' feature. It probably
wouldn't have been a net win for us anyway, since our OmniOS
fileservers didn't have all that fast CPUs
and we definitely weren't using one of the dedup-strength checksums.
(Our current Linux fileservers have better CPUs, but I think they're still not all that impressive.)
The ZFS people are planning various features to deal with the removal of this feature so that people will still be able to use saved deduplicated send streams. However, if you have such streams in your backup systems, you should probably think about aging them out. And definitely you should move away from generating new ones, even though this change is not yet in any release of ZFS as far as I know (on any platform).
How we set up our ZFS filesystem hierarchy in our ZFS pools
Our long standing practice here, predating even the first generation of our ZFS fileservers, is that we have two main sorts of filesystems, home directories (homedir filesystems) and what we call 'work directory' (workdir) filesystems. Homedir filesystems are called /h/NNN (for some NNN) and workdir filesystems are called /w/NNN; the NNN is unique across all of the different sorts of filesystems. Users are encouraged to put as much stuff as possible in workdirs and can have as many of them as they want, which mattered a lot more in the days when we used Solaris DiskSuite and had fixed-sized filesystems.
(This creates filesystems called things like /h/281 and /w/24.)
When we moved from DiskSuite to ZFS, we made the obvious decision
to keep these user-visible filesystem names and the not entirely
obvious decision that these filesystem names should work even on
the fileservers themselves. This meant using the ZFS
property to set the mount point of all ZFS homedir and workdir
filesystems, which works (and worked fine). However, this raised
another question, that of what the actual filesystem name inside
the ZFS pool should look like (since it no longer has to reflect
the mount point).
There are a number of plausible answers here. For example, because our 'NNN' numbers are unique, we could have made all filesystems be simply '<pool>/NNN'. However, for various reasons we decided that the ZFS pool filesystem should reflect the full name of the filesystem, so /h/281 is '<pool>/h/281' instead of '<pool>/281' (among other things, we felt that this was easier to manage and work with). This created the next problem, which is that if you have a ZFS filesystem of <pool>/h/281, <pool>/h has to exist in some form. I suppose that we could have made these just be subdirectories in the root of the pool, but instead we decided to make them be empty and unmounted ZFS filesystems that are used only as containers:
zfs create -o mountpoint=none fs11-demo-01/h zfs create -o mountpoint=none fs11-demo-01/w
We create these in every pool as part of our pool setup automation, and then we can make, for example, fs11-demo-01/h/281, which will be mounted everywhere as /h/281.
(Making these be real ZFS filesystems means that they can have properties that will be inherited by their children; this theoretically enables us to apply some ZFS properties only to a pool's homedir or workdir filesystems. Probably the only useful one here is quotas.)
You can't delegate a ZFS administration permission to delete only snapshots
ZFS has a system that lets you selectively delegate administration
permissions from root to other users (exposed through '
on a per filesystem tree basis. This led to the following interesting
question (and answer) over on the fediverse:
@wxcafe: hey can anyone here confirm that there's no zfs permission for destroying only snapshots?
@cks: I can confirm this based on the ZFS on Linux code. The 'can you destroy a snapshot' code delegates to a general 'can you destroy things' permission check that uses the overall 'destroy' permission.
(It also requires mount permissions, presumably because you have to be able to unmount something that you're about to destroy.)
The requirement for unmount means that delegating 'destroy' permissions may not work on Linux (or may not always work), because only root can unmount things on Linux. I haven't tested to see whether ZFS will let you delegate unmount permission (and thereby pass its internal checks) but then later the unmount operation will fail, or whether the permission cannot be delegated on Linux (which would mean that you can't delegate 'destroy' either).
The inability to allow people to only delete snapshots is a bit unfortunately, because you can delegate the ability to create them (as the 'snapshot' permission). It would be nice to be able to delegate snapshot management entirely to people (or to an unprivileged account used for automated snapshot management) but not let them destroy the filesystem itself.
This situation is the outcome of two separate and individually
sensible design decisions, which combine together here in a not
great way. First, ZFS decided that creating snapshots would be a
zfs' command but destroying them would be part of '
destroy' (a decision that I personally dislike because of how it
puts you that much closer to an irreversible error). Then when it
added delegated permissions, ZFS chose to delegate pretty much by
zfs' commands, although it could have chosen a different split.
Since destroying snapshots is part of '
zfs destroy', it is all
covered under one 'destroy' permission.
(The code in the ZFS kernel module does not require this; it has a separate permission check function for each sort of thing being destroyed. They all just call a common permission check function.)
The good news is that while writing this entry and reading the
zfs allow' manpage, I realized that there may sort of be a
workaround under specific situations. I'll just quote myself
Actually I think it may be possible to do this in practice under selective circumstances. You can delegate a permission only for descendants of a filesystem, not for the filesystem itself, so if a filesystem will only ever have snapshots underneath it, I think that a 'descendants only' destroy delegation will in practice only let people destroy snapshots, because that's all that exists.
Disclaimer: this is untested.
On our fileservers, we don't have nested filesystems (or at least not any that contain data), so we could do this; anything that we'll snapshot has no further real filesystems as children. However in other setups you would have a mixture of real filesystems and snapshots under a top level filesystem, and delegating 'destroy' permission would allow people to destroy both.
(This assumes that you can delegate 'unmount' permission so that the ZFS code will allow you to do destroys in the first place. The relevant ZFS code checks for unmount permission before it checks for destroy permission.)
Doing frequent ZFS scrubs lets you discover problems close to when they happened
Somewhat recently, the ZFS on Linux mailing list had a discussion of how frequently you should do ZFS scrubs, with a number of people suggesting that modern drives only really need relatively infrequent scrubs. As I was reading through the thread as part of trying to catch up on the list, it struck me that there is a decent reason for scrubbing frequently despite this. If we assume that scrubs surface existing problems that had previously been silent (instead of creating new ones), doing frequent scrubs lowers the mean time before you detect such problems.
Lowering the mean time to detection has the same advantage it does in programming (with things like unit tests), which is that it significantly narrows down when the underlying problem could have happened. If you scrub data once a month and you find a problem in a scrub, the problem could have really happened any time in the past month; if you scrub every week and find a problem, you know it happened in the past week. Relatedly, the sooner you detect that a problem happened in the recent past, the more likely you are to still have logs, traces, metrics, and other information that might let you look for anomalies and find a potential cause (beyond 'the drive had a glitch', because that's not always the problem).
In a modern ZFS environment with sequential scrubs (or just SSDs), scrubs are generally fast and low impact (although it depends on your IO load), so the impact of doing them every week for all of your data is probably low. I try to scrub the pools on my personal machines every week, and I generally don't notice. Now that I'm thinking about scrubs this way, I'm going to try to be more consistent about weekly scrubs.
(Our fileservers scrub each pool once every four weeks on a rotating basis. We could lower this, even down to once a week, but despite what I've written here I suspect that we're not going to bother. We don't see checksum errors or other problems very often, and we probably aren't going to do deep investigation of anything that turns up. If we can trace a problem to a disk IO error or correlate it with an obvious and alarming SMART metric, we're likely to replace the disk; otherwise, we're likely to clear the error and see if it comes back.)
What we do to enable us to grow our ZFS pools over time
In my entry on why ZFS isn't good at growing and reshaping pools, I mentioned that we go to quite some lengths in our ZFS environment to be able to incrementally expand our pools. Today I want to put together all of the pieces of that in one place to discuss what those lengths are.
Our big constraint is that not only do we need to add space to pools over time, but we have a fairly large number of pools and which pools will have space added to them is unpredictable. We need a solution to pool expansion that leaves us with as much flexibility as possible for as long as possible. This pretty much requires being able to expand pools in relatively small increments of space.
The first thing we do, or rather don't do, is that we don't use raidz. Raidz is potentially attractive on SSDs (where the raidz read issue has much less impact), but since you can't expand a raidz vdev, the minimum expansion for a pool using raidz vdevs is at least three or four separate 'disks' to make a new raidz vdev (and in practice you'd normally want to use more than that to reduce the raidz overhead, because a four disk raidz2 vdev is basically a pair of mirrors with slightly more redundancy but more awkward management and some overheads). This requires adding relatively large blocks of space at once, which isn't feasible for us. So we have to do ZFS mirroring instead of the more space efficient raidz.
(A raidz2 vdev is also potentially more resilient than a bunch of mirror vdevs, because you can lose any arbitrary two disks without losing the pool.)
However, plain mirroring of whole disks would still not work for us because that would mean growing pools by relatively large amounts of space at a time (and strongly limit how many pools we can put on a single fileserver). To enable growing pools by smaller increments of space than a whole disk, we partition all of our disks into smaller chunks, currently four chunks on a 2 TB disk, and then do ZFS mirror vdevs using chunks instead of whole disks. This is not how you're normally supposed to set up ZFS pools, and on our older fileservers using HDs over iSCSI it caused visible performance problems if a pool ever used two chunks from the same physical disk. Fortunately those seem to be gone on our new SSD-based fileservers.
Even with all of this we can't necessarily let people expand existing pools by a lot of space, because the fileserver their pool is on may not have enough free space left (especially if we want other pools on that fileserver to still be able to expand). When people buy enough space at once, we generally wind up starting another ZFS pool on a different fileserver, which somewhat cuts against the space flexibility that ZFS offers. People may not have to decide up front how much space they want their filesystems to have, but they may have to figure out which pool a new filesystem should go into and then balance usage across all of their pools (or have us move filesystems).
(Another thing we do is that we sell pool space to people in 1 GB increments, although usually they buy more at once. This is implemented using a pool quota, and of course that means that we don't even necessarily have to grow the pool's space when people buy space; we can just increase the quota.)
Although we can grow pools relatively readily (when we need to), we still have the issue that adding a new vdev to a ZFS pool doesn't rebalance space usage across all of the pool's vdevs; it just mostly writes new data to the new vdev. In a SSD world where seeks are essentially free and we're unlikely to saturate the SSD's data transfer rates on any regular basis, this imbalance probably doesn't matter too much. It does make me wonder if nearly full pool vdevs interact badly with ZFS's issues with coming near quota limits (and a followup).
Some effects of the ZFS DVA format on data layout and growing ZFS pools
One piece of ZFS terminology is DVA and DVAs, which is short for Data Virtual Address. For ZFS, a DVA is the equivalent of a block number in other filesystems; it tells ZFS where to find whatever data we're talking about. The short summary of what fields DVAs have and what they mean is that DVAs tell us how to find blocks by giving us their vdev (by number) and their byte offset into that particular vdev (and then their size). A typical DVA might say that you find what it's talking about on vdev 0 at byte offset 0x53a40ed000. There are some consequences of this that I hadn't really thought about until the other day.
Right away we can see why ZFS has a problem removing a vdev; the vdev's number is burned into every DVA that refers to data on it. If there's no vdev 0 in the pool, ZFS has no idea where to even start looking for data because all addressing is relative to the vdev. ZFS pool shrinking gets around this by adding a translation layer that says where to find the portions of vdev 0 that you care about after it's been removed.
In a mirror vdev, any single disk must be enough by itself to recover all data. Since the DVA simply specifies a byte offset within the vdev, this implies that in ZFS mirror vdevs, all copies of a block are at the same place on each disk, contrary to what I once thought might be the case. If vdev 0 is a mirror vdev, our DVA says that we can find our data at byte offset 0x53a40ed000 on each and every disk.
In a RAID-Z vdev, our data lives across multiple disks (with parity) but we only have the byte offset to its start (and then its size). The first implication of this is that in a RAID-Z vdev, a block is always striped sequentially across your disks at basically the same block offsets. ZFS doesn't find one bit of free space on disk 1, a separate bit on disk 2, a third bit on disk 3, and so on, and join them all together; instead it finds a contiguous stripe of free space starting on some disk, and uses it. This space can be short or long, it doesn't have to start on the first disk in the RAID-Z vdev, and it can wrap around (possibly repeatedly).
(This makes it easier for me to understand why ZFS rounds raidzN write sizes up to multiples of N+1 blocks. Possibly I understood this at some point, but if so I'd forgotten it since.)
Another way to put this is that for RAID-Z vdevs, the DVA vdev byte
addresses snake across all of the vdev's disks in sequence, switching
to a new disk ever
asize bytes. In a vdev with a 4k asize, vdev bytes
0 to 4095 are on the first disk, vdev bytes 4096 to 8191 are on the
the second disk, and so on. The unfortunate implication of this is
that the number of disks in a RAID-Z vdev is an implicit part of the
addresses of data in it. The mapping from vdev byte offset to the disk
and the disk's block where the block's stripe starts depends on how many
disks are in the RAID-Z vdev.
(I'm pretty certain this means that I was wrong in my previous explanation of why ZFS can't allow you to add disks to raidz vdevs. The real problem is not inefficiency in the result, it's that it would blow up your ability to access all data in your vdev.)
ZFS can grow both mirror vdevs and raidz vdevs if you replace the disks with larger ones because in both cases this is just adding more available bytes of space at the top of ZFS's per-vdev byte address range for DVAs. You have to replace all of the disks because in both cases, all disks participate in the addressing. In mirror vdevs this is because you write new data at the same offset into each disk, and in raidz vdevs it's because the addressable space is striped across all disks and you can't have holes in it.
(You can add entire new vdevs because that doesn't change the interpretation of any existing DVAs, since the vdev number is part of the DVA and the byte address is relative to the vdev, not the pool as a whole. This feels obvious right now but I want to write it down for my future self, since someday it probably won't be as clear.)