How much space ZFS reserves in your pools varies across versions

Back in my entry on the difference in available pool space between zfs list and zpool list, I noted that one of the reasons the two differ is that ZFS reserves some amount of space internally. At the time, the code said that ZFS should be reserving 1/32nd of the pool's size (while still allowing some things, like ZFS property changes, down to 1/64th of the pool), but our OmniOS fileservers seemed to be reserving only 1/64th of the space (and imposing a hard limit at that point). It turns out that this discrepancy has a simple explanation: ZFS has changed its behavior over time.
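To put concrete numbers on this, here is a little Python sketch of the post-change limits as I understand them: the 1/32nd 'slop' that normal writes can't go below, and the 1/64th floor that administrative operations like property changes can still dip down to. The pool size here is just an example, and the shift value mirrors what I believe is the relevant tunable.

    def zfs_reserved_space(pool_size, slop_shift=5):
        """Approximate ZFS's reserved 'slop' space for a pool.

        slop_shift=5 is the post-change default: 1/32nd of the pool.
        Administrative operations (eg property changes) are allowed
        to use up to half of the slop, ie down to 1/64th of the pool.
        """
        slop = pool_size >> slop_shift    # normal writes fail below this
        admin_floor = slop >> 1           # admin operations fail below this
        return slop, admin_floor

    # Example: a 10 TiB pool
    size = 10 * 2**40
    slop, admin_floor = zfs_reserved_space(size)
    print("slop: %d GiB, admin floor: %d GiB" %
          (slop // 2**30, admin_floor // 2**30))
    # -> slop: 320 GiB, admin floor: 160 GiB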

This change is Illumos issue 4951, 'ZFS administrative commands should use reserved space, not fail with ENOSPC', which landed in roughly July of 2014. When I wrote my original entry in late 2014 I looked at the latest Illumos source code at the time and so saw this change, but of course our ZFS fileservers were using a version of OmniOS that predated the change and so were using the old 1/64th of the pool hard limit.

The change has propagated into various Illumos distributions and other ZFS implementations at different points. In OmniOS it's in up-to-date versions of the r151012 and r151014 releases, but not in r151010 and earlier. In ZFS on Linux, it landed in the 0.6.5 release and was not in 0.6.4. In FreeBSD, this change is definitely in -current (and appears to have arrived very close to when it did in Illumos), but it postdates 10.0's release and I think it first arrived in 10.1.

This change has an important consequence: when you update across it, your pools will effectively shrink, because ZFS goes from reserving 1/64th of their space to reserving 1/32nd of it. If your pools have lots of free space, this isn't a problem. If they have only a modest amount, your users may notice them suddenly shrinking (some of our pools will lose half their free space if we don't expand them). And if your pools are sufficiently close to full, they will instantly become over-full and you'll have to delete things to free up space (or expand the pool on the spot).
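To illustrate the shrinkage with made-up numbers: the visible free space drops by 1/64th of the total pool size, regardless of how much of the pool is actually in use.

    # How much visible free space disappears when the reservation
    # goes from 1/64th (old) to 1/32nd (new) of the pool.
    # All of these numbers are made up for illustration.
    pool_size = 10 * 2**40          # a 10 TiB pool
    old_reserved = pool_size >> 6   # 1/64th: 160 GiB
    new_reserved = pool_size >> 5   # 1/32nd: 320 GiB
    lost = new_reserved - old_reserved

    free_before = 200 * 2**30       # say 200 GiB free under the old limit
    free_after = free_before - lost
    print("lost %d GiB; free space goes from %d GiB to %d GiB" %
          (lost // 2**30, free_before // 2**30, free_after // 2**30))
    # -> lost 160 GiB; free space goes from 200 GiB to 40 GiB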

I believe that you can revert to the old 1/64th limit if you really want to, but unfortunately it's a global setting so you can't do it selectively for some pools while leaving others at the default 1/32nd limit. Thus, if you have to do this you might want to do so only temporarily in order to buy time while you clean up or expand pools.
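I believe the global setting involved is the spa_slop_shift tunable, where the new default of 5 means 1/32nd and 6 gets you the old 1/64th behavior. As an untested sketch, on ZFS on Linux it's exposed as a module parameter that you could flip like this; on Illumos I believe you'd use 'set zfs:spa_slop_shift = 6' in /etc/system instead.

    # Untested sketch: revert to the old 1/64th reservation on a
    # ZFS on Linux system by raising spa_slop_shift from 5 (1/32nd)
    # to 6 (1/64th). Needs root, applies to all pools, and only
    # lasts until reboot unless made persistent.
    PARAM = "/sys/module/zfs/parameters/spa_slop_shift"
    with open(PARAM, "w") as f:
        f.write("6\n")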

(Of course, by now most people may have already dealt with this. We're a bit behind the times in terms of what OmniOS version we're using.)

Sidebar: My lesson learned here

The lesson I've learned from this is that I should probably stop reflexively reading code from the Illumos master repo and instead read the OmniOS code for the branch we're using. Going straight to the current 'master' version is a habit I got into in the OpenSolaris days, when there simply was no source tree that corresponded to the Solaris 10 update whatever that we were running. But these days that's no longer the case and I can read pretty much the real source code for what's running on our fileservers. And I should, just to avoid this sort of confusion.

(Perhaps going to the master source and then getting confused was a good thing in this case, since it's made me familiar with the new state of affairs too. But it won't always go so nicely.)

solaris/ZFSReservedSpaceVaries written at 22:23:55

Our low-rent approach to verifying that NFS mounts are there

Our mail system has everyone's inboxes in an old-fashioned /var/mail style single directory; in fact it literally is /var/mail. This directory is NFS mounted from one of our fileservers, which raises a little question: how can we be sure that it's actually there? Well, there's always going to be a /var/mail directory. But what we care about is that this directory is the actual NFS mounted filesystem instead of the directory on the local root filesystem that is the mount point, because we very much do not want to ever deliver email to the latter.

(Some people may say that limited directory permissions on the mount point should make delivery attempts fail. 'Should' is not a word that I like in this situation, either in 'should fail' or 'that failure should be retried'.)

There are probably lots of clever solutions to this problem involving advanced tricks like embedded Perl bits in the mailer that look at NFS mount state and so on. We opted for a simple and low tech approach: we have a magic flag file in the NFS version of /var/mail, imaginatively called .NFS-MOUNTED. If the flag file is not present, we assume that the filesystem is not mounted and stall all email delivery to /var/mail.
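In outline the check is trivial; here's a Python sketch of the idea (the function name is made up, and our real check lives in our mailer configuration, not in Python):

    import os

    # Sketch: /var/mail always exists as a directory, but the flag
    # file only exists in the real NFS filesystem, not in the bare
    # mount point directory on the local root filesystem.
    def varmail_is_mounted():
        return os.path.exists("/var/mail/.NFS-MOUNTED")

    # If this is false, the mailer stalls (defers) delivery to
    # /var/mail instead of writing into the local root filesystem.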

This scheme is subject to various potential issues (like accidentally deleting .NFS-MOUNTED some day), but it has the great virtue that it is simple and relatively bulletproof. It helps that Exim has robust support for checking whether or not a file exists (although we use a hack for various reasons). The whole thing has worked well and basically transparently, and we haven't removed one of those .NFS-MOUNTED files by accident yet.

(We actually use this trick for several NFS-mounted mail related directories that we need to verify are present before we start trying to do things involving them, not just /var/mail.)

(I mentioned this trick in passing here, but today I feel like writing it up explicitly.)

Sidebar: our alternate approach with user home directories

Since user home directories are NFS mounted, you might be wondering if we also use flag files there to verify that the NFS mounts are present before checking things like .forward files. Because of how our NFS mounts are organized, we use an alternate approach instead. In short, our NFS mounts aren't directly for user home directories; instead they're for filesystems with user home directories in them.

(A user has a home directory like /h/281/cks, where /h/281 is the actual NFS mounted filesystem.)

In this situation it suffices to just check that the user's home directory exists. If it does, the NFS filesystem it is in must be mounted (well, unless someone has done something very perverse). As a useful side bonus, this guards against various other errors (eg, 'user home directory was listed wrong in /etc/passwd').
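Here's a Python sketch of this check (the username and path are just examples):

    import os
    import pwd

    # Sketch: if the user's home directory itself exists, the NFS
    # filesystem containing it must be mounted (and their passwd
    # entry must be sane).
    def home_dir_present(user):
        try:
            home = pwd.getpwnam(user).pw_dir    # eg /h/281/cks
        except KeyError:
            return False
        return os.path.isdir(home)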

sysadmin/VerifyingNFSMounts written at 01:35:05
