Some additional information on ZFS performance as you approach quota limits
@alanjude: re: <my entry> - Basically, when you are close to the quota limit, ZFS will rate-limit incoming writes as it has to be sure you won't go over your quota. You end up having to wait for the pending transactions to flush to find out how much room you have left.
I was turned on to the issue by @garrett_wollman who uses quotas at a large institution similar to yours. I expect you won't see the worst of it until you are within 100s of MB of the quota. So it isn't being over 95% or something, so much as being 'a few transactions' from full
@garrett_wollman: Turning off compression when the dataset gets near-full clears the backlog (obviously at a cost), as does increasing the quota if you have the free space for it.
@thatcks: Oh interesting! We have compression off on most of our datasets; does that significantly reduce the issue (although presumably not completely eliminate it)?
(Sadly we have people who (sometimes) run pools and filesystems that close to their quota limits.)
@garrett_wollman: I don't know; all I can say is that turning compression off on a wedged NFS server clears the backlog so requests for other datasets are able to be serviced.
All of this makes a bunch of sense, given the complexity of enforcing filesystem size limits, and it especially makes sense that compression might cause issues here; any sort of compression creates a very uncertain difference between the nominal size and the actual on-disk size, and ZFS quotas are applied to the physical space used, not the logical space.
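As an illustration, the difference between the two sizes is visible through standard ZFS properties (the dataset name here is hypothetical):

```shell
# 'used' is physical (post-compression) space, which is what the quota
# is checked against; 'logicalused' is the nominal, uncompressed size.
zfs get -o name,property,value used,logicalused,compressratio,quota tank/data
```

With compression on, 'logicalused' can be much larger than 'used', which is exactly the uncertain gap between nominal and on-disk size.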
(I took a quick look in the ZFS on Linux source code but I couldn't spot anything that was obviously different when there was a lot of quota room left.)
ZFS performance really does degrade as you approach quota limits
Every so often (currently monthly), there is an "OpenZFS leadership meeting". What this really means is 'lead developers from the various ZFS implementations get together to talk about things'. Announcements and meeting notes from these meetings get sent out to various mailing lists, including the ZFS on Linux ones. In the September meeting notes, I read a very interesting (to me) agenda item:
- Relax quota semantics for improved performance (Allan Jude)
- Problem: As you approach quotas, ZFS performance degrades.
- Proposal: Can we have a property like quota-policy=strict or loose, where we can optionally allow ZFS to run over the quota as long as performance is not decreased.
This is very interesting to me for two reasons. First, in the past we have definitely seen significant problems on our OmniOS machines, both when an entire pool hits a quota limit and when a single filesystem hits a refquota limit. It's nice to know that this wasn't just our imagination and that there is a real issue here. Even better, it might someday be improved (and perhaps in a way that we can use at least some of the time).
Second, any number of people here run very close to and sometimes at the quota limits of both filesystems and pools, fundamentally because people aren't willing to buy more space. We have in the past assumed that this was relatively harmless and would only make people run out of space. If this is a known issue that causes serious performance degradation, well, I don't know if there's anything we can do, but at least we're going to have to think about it and maybe push harder at people. The first step will have to be learning the details of what's going on at the ZFS level to cause the slowdown.
(It's apparently similar to what happens when the pool is almost full, but I don't know the specifics of that either.)
With that said, we don't seem to have seen clear adverse effects on our Linux fileservers, and they've definitely run into quota limits (repeatedly). One possible reason for this is that having lots of RAM and SSDs makes the effects mostly go away. Another possible reason is that we haven't been looking closely enough to see that we're experiencing global slowdowns that correlate to filesystems hitting quota limits. We've had issues before with somewhat subtle slowdowns that we didn't understand (cf), so I can't discount that we're having it happen again.
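If we do want to start looking, one low-tech sketch (my own construction, with an arbitrary threshold) is to parse 'zfs list' output and flag datasets that are within a few hundred MBytes of their quota:

```shell
# -Hp gives script-friendly, tab-separated output with exact byte
# counts; an unset quota shows up as 0 (or '-'), which awk treats as 0.
zfs list -Hp -o name,used,quota |
  awk -F'\t' '$3 > 0 && $3 - $2 < 500 * 1024 * 1024 { print $1 }'
```

The 500 MB threshold is a guess based on the discussion above, where trouble reportedly starts within hundreds of MB of the limit.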
ZFS is not a universal filesystem that is always good for all workloads
Every so often, people show up on various ZFS mailing lists with problems where ZFS is performing not just a bit worse than other filesystems or the raw disks, but a lot worse. Often, although not always, these people are using raidz on hard disks and trying to do random IO, which doesn't work very well because of various ZFS decisions. When this happens, whatever their configuration and workload, the people who are trying out ZFS are surprised, and this surprise is reasonable. Most filesystems today are generally good and also generally have relatively flat performance characteristics, where you can't make them really bad unless you have very unusual and demanding workloads.
Unfortunately, ZFS is not like this today. For all that I like it a lot, I have to accept the reality that ZFS is not a universal filesystem that works fine in all reasonable configurations and under all reasonable workloads. ZFS usually works great for many real world workloads (ours included), but there are perfectly reasonable setups where it will fall down, especially if you're using hard drives instead of SSDs. Raidz is merely an unusually catastrophic case (and an unusually common one, partly because no one expects RAID-5/6 to have that kind of drawback).
(Many of the issues that cause ZFS problems are baked into its fundamental design, but as storage gets faster and faster their effects are likely to diminish a lot for most systems. There is a difference between 10,000 IOPS and 100,000 IOPS, but it may not matter as much as the difference between 100 a second and 1,000. And not all of the issues are about performance; there is also, for example, the fact that there's no great solution to shrinking a ZFS pool. In some environments that will matter a lot.)
People sometimes agonize about this and devote a lot of effort to pushing water uphill. It's a natural reaction, especially among fans of ZFS (which includes me), but I've come to think that it's better to quickly identify situations where ZFS is not a good fit and recommend that people move to another filesystem and storage system. Sometimes we can make ZFS fit better with some tuning, but I'm not convinced that even that is a good idea; tuning is often fragile, partly because it's often relatively specific to your current workload. Sometimes the advantages of ZFS are worth going through the hassle and risk of tuning things like ZFS's recordsize, but not always.
(Having to tune has all sorts of operational impacts, especially since some things can only be tuned on a per-filesystem or even per-pool basis.)
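As a concrete illustration of what that tuning looks like in practice (the dataset name and value here are hypothetical, not a recommendation):

```shell
# recordsize is set per-filesystem and only affects newly written
# data, which is one reason this sort of tuning is fragile.
zfs set recordsize=16K tank/db
zfs get recordsize,compression tank/db
```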
PS: The obvious question is what ZFS is and isn't good for, and that I don't have nice convenient answers for. I know some pain points, such as raidz on HDs with random IO and the lack of shrinking, and others you can spot by looking for 'you should tune ZFS if you're doing <X>' advice, but that's not a complete set. And of course some of the issues today are simply problems with current implementations and will get better over time. Anything involving memory usage is probably one of them, for obvious reasons.
What happens in ZFS when you have 4K sector disks in an ashift=9 vdev
Suppose, not entirely hypothetically, that you've somehow wound up with some 4K 'advanced format' disks (disks with a 4 KByte physical sector size but 512 byte emulated (aka logical) sectors) in a ZFS pool (or vdev) that has an ashift of 9 and thus expects disks with a 512 byte sector size. If you import or otherwise bring up the pool, you get slightly different results depending on the ZFS implementation.
In ZFS on Linux, you'll get one ZFS Event Daemon (ZED) event for each disk, with a class of vdev.bad_ashift. I don't believe this event carries any extra information about the mismatch; it's up to you to use the information on the specific disk and the vdev in the event to figure out who has what ashift values. In the current Illumos source, it looks like you get a somewhat more straightforward message, although I'm not sure how it trickles out to user level. At the kernel level it says:
Disk, '<whatever>', has a block alignment that is larger than the pool's alignment.
This error is not completely correct, since it's the vdev ashift that matters here, not the pool ashift, and it also doesn't tell you what the vdev ashift or the device ashift are; you're once again left to look those up yourself.
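If you have to do that lookup yourself, something like the following works on ZFS on Linux (the pool and device names here are hypothetical):

```shell
# The ashift values ZFS is actually using, per top-level vdev:
zdb -C tank | grep ashift
# What a disk reports (PHY-SEC is the physical sector size,
# LOG-SEC the logical/emulated one):
lsblk -o NAME,PHY-SEC,LOG-SEC /dev/sdb
```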
(I was going to say that the only likely case is a 4K advanced format disk in an ashift=9 vdev, but these days you might find some SSDs or NVMe drives that advertise a physical sector size larger than 4 KBytes.)
This is explicitly a warning, not an error. Both the ZFS on Linux and Illumos code have a comment to this effect (differing only in 'post an event' versus 'issue a warning'):
/*
 * Detect if the alignment requirement has increased.
 * We don't want to make the pool unavailable, just
 * post an event instead.
 */
This is a warning despite the fact that your disks can accept IO for 512-byte sectors, because what ZFS cares about (for various reasons) is the physical sector size, not the logical one. A vdev with ashift=9 really wants to be used on disks with real 512-byte physical sectors, not on disks that just emulate them.
(In a world of SSDs and NVMe drives that have relatively opaque and complex internal sizes, this is rather less of an issue than it is (or was) with spinning rust. Your SSD is probably lying to you no matter what nominal physical sector size it advertises.)
The good news is that as far as I can tell, this warning has no further direct effect on pool operation. At least in ZFS on Linux, the actual disk's ashift is only looked up in one place, when the disk is opened as part of a vdev, and the general 'open a vdev' code discards it after this warning; it doesn't get saved anywhere for later use. So I believe that ZFS IO, space allocations, and even uberblock writes will continue as before.
That ZFS continues operating after this warning doesn't mean that life is great, at least if you're using HDs. Since no ZFS behavior changes here, using disks with 4K physical sectors in an ashift=9 vdev will likely leave your disk (or disks) doing a lot of read/modify/write operations when ZFS does unaligned writes (as it can often do). This both performs relatively badly and leaves you potentially exposed to damage to unrelated data if there's a power loss part way through.
(But, as before, it's a lot better than not being able to replace old dying disks with new working ones. You just don't want to wind up in this situation if you have a choice, which is a good part of why I advocate for creating basically all pools as 'ashift=12' from the start.)
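Setting this at creation time is a one-liner (the pool and device names are hypothetical); the important part is that ashift is fixed per vdev when the vdev is created and can't be changed afterward:

```shell
# Force 4 KByte alignment regardless of what the disks advertise.
zpool create -o ashift=12 tank mirror /dev/sdb /dev/sdc
```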
PS: ZFS events are sort of documented in the zfs-events manpage, but the current description of vdev.bad_ashift is not really helpful. Also, I wish that the ZFS on Linux project itself had the current manpages online (well, apart from as manpage source in the Github repo, since most people find manpages in their raw form to be not easy to read).
Some things on the GUID checksum in ZFS pool uberblocks
When I talked about how 'zpool import' generates its view of a pool's configuration, I mentioned that an additional kernel check of the pool configuration is that ZFS uberblocks have a simple 'checksum' of all of the GUIDs of the vdev tree. When the kernel is considering a pool configuration, it rejects it if the sum of the GUIDs in the vdev tree doesn't match the GUID sum from the uberblock.
(The documentation of the disk format claims that it's only the checksum of the leaf vdevs, but as far as I can see from the code it's all vdevs.)
I was all set to write about how this interacts with the vdev
configurations that are in ZFS labels, but
as it turns out this is no longer applicable. In versions of ZFS
that have better ZFS pool recovery,
the vdev tree that's used is the one that's read from the pool's
Meta Object Set (MOS), not the pool configuration that was passed
in from user level by '
zpool import'. Any mismatch between the
uberblock GUID sum and the vdev tree GUID sum likely indicates a
serious consistency problem somewhere.
(For the user level vdev tree, the difference between having a vdev's configuration and having all of its disks available is potentially important. As we saw yesterday, the ZFS label of every device that's part of a vdev has a complete copy of that vdev's configuration, including all of the GUIDs of its elements. Given a single intact ZFS label for a vdev, you can construct a configuration with all of the GUIDs filled in and thus pass the uberblock GUID sum validation, even if you don't have enough disks to actually use the vdev.)
The ZFS uberblock update sequence guarantees that the ZFS disk labels and their embedded vdev configurations should always be up to date with the current uberblock's GUID sum. Now that I know about the embedded uberblock GUID sum, it's pretty clear why the uberblock must be synced on all vdevs when the vdev or pool configuration is considered 'dirty'. The moment that the GUID sum of the current vdev tree changes, you'd better update everything to match it.
(The GUID sum changes if any rearrangement of the vdev tree happens. This includes replacing one disk with another, since each disk has a unique GUID. In case you're curious, the ZFS disk label always has the full tree for a top level vdev, including the special 'replacing' and 'spare' sub-vdevs that show up during these operations.)
PS: My guess from a not very extensive look through the kernel code is that it's very hard to tell from user level if you have a genuine uberblock GUID sum mismatch or another problem that returns the same extended error code to user level. The good news is that I think the only other case that returns it is if you have missing log device(s).
How 'zpool import' generates its view of a pool's configuration
Full bore ZFS pool import happens in two stages: 'zpool import' puts together a vdev configuration for the pool, passes it to the kernel, and then the kernel reads the real pool configuration from ZFS objects in the pool's Meta Object Set. How 'zpool import' does this is outlined at a high level by a comment in the code; to summarize the comment, the configuration is created by assembling and merging together information from the ZFS label of each device.
There is an important limitation to this process, which is that the
ZFS label only contains information on the vdev configuration, not
on the overall pool configuration.
To show you what I mean, here are the relevant portions of a ZFS label (as dumped by 'zdb -l') for a device from one of our pools:

    txg: 5059313
    pool_guid: 756813639445667425
    top_guid: 4603657949260704837
    guid: 13307730581331167197
    vdev_children: 5
    vdev_tree:
        type: 'mirror'
        id: 3
        guid: 4603657949260704837
        is_log: 0
        children[0]:
            type: 'disk'
            id: 0
            guid: 7328257775812323847
            path: '/dev/disk/by-path/pci-0000:19:00.0-sas-phy3-lun-0-part6'
        children[1]:
            type: 'disk'
            id: 1
            guid: 13307730581331167197
            path: '/dev/disk/by-path/pci-0000:00:17.0-ata-4-part6'
(For much more details that are somewhat out of date, see the ZFS On-Disk Specifications [pdf].)
Based on this label, 'zpool import' knows what the GUID of this vdev is, which disk of the vdev it's dealing with and where the other disk or disks in it are supposed to be found, the pool's GUID, how many vdevs the pool has in total (it has 5), and which specific vdev this is (it's the fourth of five; vdev numbering starts from 0). But it doesn't know anything about the other vdevs, except that they exist (or should exist).
When zpool assembles the pool configuration, it will use the best
information it has for each vdev, where the 'best' is taken to be
the vdev label with the highest
txg (transaction group number).
The label with the highest txg for the entire pool is used to
determine how many vdevs the pool is supposed to have. Note that
there's no check that the best label for a particular vdev has a
txg that is anywhere near the pool's (assumed) current txg. This
means that if all of the modern devices for a particular vdev
disappear and a very old device for it reappears, it's possible for
zpool to assemble a (user-level) configuration that claims that the
old device is that vdev (or the only component available for that
vdev, which might be enough if the vdev is a mirror).
If zpool can't find any labels for a particular vdev, all it can
do in the configuration is fill in an artificial 'there is a vdev
missing' marker; it doesn't even know whether it was a raidz or a
mirrored vdev, or how much data is on it. When 'zpool import'
prints the resulting configuration, it doesn't explicitly show these
missing vdevs; if I'm reading the code right, your only clue as to
where they are is that the pool configuration will abruptly skip
from, eg, 'mirror-0' to 'mirror-2' without reporting 'mirror-1'.
There's an additional requirement for a working pool configuration,
although it's only checked by the kernel, not zpool. The pool
uberblocks have a
ub_guid_sum field, which must match the sum
of all GUIDs in the vdev tree. If the GUID sum doesn't match, you'll
get one of those frustrating 'a device is missing somewhere' errors
on pool import. An entirely missing vdev naturally forces this to
happen, since all of its GUIDs are unknown and obviously not
contributing what they should be to this sum. I don't know how this
interacts with better ZFS pool recovery.
ZFS pool imports happen in two stages of pool configuration processing
The mechanics of how ZFS pools are imported is one of the more obscure areas of ZFS, which is a potential problem given that things can go very wrong (often with quite unhelpful errors). One special thing about ZFS pool importing is that it effectively happens in two stages, first with user-level processing and then again in the kernel, and these two stages use two potentially different pool configurations. My primary source for this is the discussion from Illumos issue #9075:
[...] One of the first tasks during the pool load process is to parse a config provided from userland that describes what devices the pool is composed of. A vdev tree is generated from that config, and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects that are necessary to load the pool are read. The exact configuration of the pool is also stored inside the MOS. Since the configuration provided from userland is external and might not accurately describe the vdev tree of the pool at the txg that is being loaded, it cannot be relied upon to safely operate the pool. For that reason, the configuration in the MOS is read early on. [...]
Here's my translation of that. In order to tell the kernel to load
a pool, '
zpool import' has to come up with a vdev configuration
for the pool and then provide it to the kernel. However, this is
not the real pool configuration; the real pool configuration is
stored in the pool itself (in regular ZFS objects that are part of
the MOS), where the kernel reads it again as the kernel imports the pool.
Although not mentioned explicitly, the pool configuration that
zpool import' comes up with and passes to the kernel is not read
from the canonical pool configuration, because reading those ZFS
objects from the MOS requires a relatively full implementation of
ZFS, which '
zpool import' does not have (the kernel obviously
does). One source of the pool configuration for 'zpool import'
is the ZFS cache file,
/etc/zfs/zpool.cache, which theoretically
contains current pool configurations for all active pools. How
'zpool import' generates a pool configuration for exported or
deleted pools is sufficiently complicated to need an entry of its own.
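You can inspect what's in the cache file with zdb, which reads it directly rather than going through the kernel:

```shell
# Dump the pool configurations from the default cache file:
zdb -C
# Or point zdb at an explicit cache file location:
zdb -U /etc/zfs/zpool.cache -C
```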
This two stage process means that there are at least two different
things that can go wrong with a ZFS pool's configuration information.
First, 'zpool import' may not be able to put together what it
thinks is a valid pool configuration, in which case I believe that
it doesn't even try to pass it to the kernel. Second, the kernel
may dislike the configuration that it's handed for its own reasons.
In older versions of ZFS (before better ZFS pool recovery landed), any mismatch between the actual pool
configuration and the claimed configuration from user level was
apparently fatal; now, only some problems are fatal.
As far as I know, '
zpool import' doesn't clearly distinguish
between these two cases in its error messages when you're actually
trying to import a pool. If you're just running it to see what pools
are available, I believe that all of what '
zpool import' reports
comes purely from its own limited and potentially imperfect
configuration assembly, with no kernel involvement.
(When a pool is fully healthy and in good shape, the configuration
'zpool import' puts together at the user level will completely
match the real configuration in the MOS. When it's not is when you
run into potential problems.)
Our last OmniOS fileserver is now out of production (and service)
On Twitter, I noted a milestone last evening:
This evening we took our last OmniOS fileserver out of production and powered it off (after a great deal of slow work; all told this took more than a year). They've had a good run, so thank you Illumos/OmniOS/OmniTI/etc for the generally quiet and reliable service.
We still haven't turned any of our iSCSI backends off (they're Linux, not OmniOS), but that will be next, probably Friday (the delay is just in case). Then we'll get around to recycling all of the hardware for some new use, whatever it will turn out to be.
When we blank out the OmniOS system disks as part of recycling the hardware, that really will be the end of the line for the whole second generation of our fileserver infrastructure and the last lingering traces of our long association with Sun will be gone, swallowed by time.
It's been pointed out to me by @oclsc that since we're still using ZFS (now ZFS on Linux), we still have a tie to Sun's lineage. It doesn't really feel the same, though; open source ZFS is sort of a lifeboat pushed out of Sun toward the end, not Sun(ish) itself.
(This is probably about as fast as I should have expected from having almost all of the OmniOS fileservers out of production at the end of May. Things always come up.)
Various people and groups at the department have been buying Sun machines and running Sun OSes (first SunOS and then Solaris) almost from the beginning of Sun. I don't know if we bought any Sun 1s, but I do know that some Sun 2s were bought, and Sun 3s and onward were for many years a big presence (eventually only as servers, although we did have some Sun Rays). With OmniOS going out of service, that is the end of our use of that lineage of Unix.
(Of course Sun itself has been gone for some time, consumed by Oracle. But our use of its lineage lived on in OmniOS, since Illumos is more or less Solaris in open source form (and improved from when it was abandoned by its corporate parent).)
I have mixed feelings about OmniOS and I don't have much sentimentality about Solaris itself (it's complicated). But I still end up feeling that there is a weight of history that has shifted here in the department, at the end of a long slow process. Sun is woven through the history of the department's computing, and now all that remains of that is our use of ZFS.
(For all that I continue to think that ZFS is your realistic choice for an advanced filesystem, I also think that we probably wouldn't have wound up using it if we hadn't started with Solaris.)
A hazard of our old version of OmniOS: sometimes powering off doesn't
Two weeks ago, I powered down all of our OmniOS fileservers that
are now out of production, which is
most of them. By that, I mean that I logged in to each of them via
SSH and ran '
poweroff'. The machines disappeared from the network
and I thought nothing more of it.
This Sunday morning we had a brief power failure. In the aftermath of the power failure, three out of four of the OmniOS fileservers reappeared on the network, which we knew mostly because they sent us some email (there were no bad effects of them coming back). When I noticed them back, I assumed that this had happened because we'd set their BIOSes to 'always power on after a power failure'. This is not too crazy a setting for a production server you want up at all costs because it's a central fileserver, but it's obviously no longer the setting you want once they go out of production.
Today, I logged in to the three that had come back, ran 'poweroff'
on them again, and then later went down to the machine room to pull
out their power cords. To my surprise, when I looked at the physical
machines, they had little green power lights that claimed they were
powered on. When I plugged in a roving display and keyboard to check
their state, I discovered that all three were still powered on and
sitting displaying an OmniOS console message to the effect that they
were powering off. Well, they might have been trying to power off,
but they weren't achieving it.
I rather suspect that this is what happened two weeks ago, and why
these machines all sprang back to life after the power failure. If
OmniOS never actually powered the machines off, even a BIOS setting
of 'resume last power state after a power failure' would have powered
the machines on again, which would have booted OmniOS back up again.
Two weeks ago, I didn't go look at the physical servers or check
their power state through their lights out management interface;
it never occurred to me that '
poweroff' on OmniOS sometimes might
not actually power the machine off, especially when the machines
did drop off the network.
(One out of the four OmniOS servers didn't spring back to life after the power failure, and was powered off when I looked at the hardware. Perhaps its BIOS was set very differently, or perhaps OmniOS managed to actually power it off. They're all the same hardware and the same OmniOS version, but the server that probably managed to power off had no active ZFS pools on our iSCSI backends; the other three did.)
At this point, this is only a curiosity. If all goes well, the last OmniOS fileserver will go out of production tomorrow evening. It's being turned off as part of that, which means that I'm going to have to check that it actually powered off (and I'd better add that to the checklist I've written up).
Almost all of our OmniOS machines are now out of production
Last Friday, my co-workers migrated the last filesystem from our HD-based OmniOS fileservers to one of our new Linux fileservers. With this, the only OmniOS fileserver left in production is serving a single filesystem, our central administrative filesystem, which is extremely involved to move because everything uses it all the time and knows where it is (and of course it's where our NFS automounter replacement lives, along with its data files). Moving that filesystem is going to take a bunch of planning and a significant downtime, and it will only happen after I come back from vacation.
(Unlike last time around, we haven't destroyed any pools or filesystems yet in the old world, since we didn't run into any need to.)
This migration has been in process in fits and starts since late last November, so it's taken about seven months to finish. This isn't because we have a lot of data to move (comparatively speaking); instead it's because we have a lot of filesystems with a lot of users. First you have to schedule a time for each filesystem that the users don't object to (and sometimes things come up so your scheduled time has to be abandoned), and then moving each filesystem takes a certain amount of time and boring work (so often people only want to do so many a day, so they aren't spending all of their day on this stuff). Also, our backup system is happier when we don't suddenly give it massive amounts of 'new' data to back up in a single day.
(I think this is roughly comparable to our last migration, which seems to have started at the end of August of 2014 and finished in mid-February of 2015. We've added significantly more filesystems and disk space since then.)
The MVP of the migration is clearly
'zfs send | zfs recv' (as it always has been). Having to do the
migrations with something like
rsync would likely have been much
more painful for various reasons; ZFS snapshots and ZFS send are
things that just work, and they come with solid and extremely
reassuring guarantees. Part of their importance was that the speed
of an incremental ZFS send meant that the user-visible portion of
a migration (where we had to take their filesystem away temporarily)
could be quite short (short enough to enable opportunistic migrations,
if we could see that no one was using some of the filesystems).
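The basic shape of one of these migrations, sketched with hypothetical pool, filesystem, and host names:

```shell
# Initial full copy, done while the filesystem stays in service:
zfs snapshot tank/fs@migrate-1
zfs send tank/fs@migrate-1 | ssh newfs zfs recv -F newtank/fs

# Later, during the short user-visible window: take the filesystem
# away from users, then send only what changed since the full copy.
zfs snapshot tank/fs@migrate-2
zfs send -i tank/fs@migrate-1 tank/fs@migrate-2 | ssh newfs zfs recv newtank/fs
```

It's the incremental send in the second step that keeps the user-visible downtime short.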
At this point we've gotten somewhere around four and a half years of lifetime out of our OmniOS fileservers. This is probably around what we wanted to get, especially since we never replaced the original hard drives and so they're starting to fall out of warranty coverage and hit what we consider their comfortable end of service life. Our first generation Solaris fileservers were stretched much longer, but they had two generations of HDs and even then we were pushing it toward the end of their service life.
(The actual server hardware for both the OmniOS fileservers and the Linux iSCSI backends seems fine, so we expect to reuse it in the future once we migrate the last filesystem and then tear down the entire old environment. We will probably even reuse the data HDs, but only for less important things.)
I think I feel less emotional about this migration away from OmniOS than I did about our earlier migration from Solaris to OmniOS. Moving away from Solaris marked the end of Sun's era here (even if Sun had been consumed by Oracle by that point), but I don't have that sort of feelings about OmniOS. OmniOS was always a tool to me, although unquestionably a useful one.
(I'll write a retrospective on our OmniOS fileservers at some point, probably once the final filesystem has migrated and everything has been shut down for good. I want to have some distance and some more experience with our Linux fileservers first.)
PS: To give praise where it belongs, my co-workers did basically all of the hard, grinding work of this migration, for various reasons. Once things got rolling, I got to mostly sit back and move filesystems when they told me one was scheduled and I should do it. I also cleverly went on vacation during the final push at the end.