2014-03-31
I'm done with building tools around 'zpool status' output
Back when our fileserver environment was young,
I built a number of local tools and scripts that relied on 'zpool
status' to get information about pools, pool states, and so on. The
problem with using 'zpool status' is of course that it is not an API,
it's something intended for presentation to users, and as a result
people feel free to change its output from time to time. At the time
using zpool's output seemed like the best option despite this, or more
exactly the best (or easiest) of a bad lot of options.
Well, I'm done with that.
We're in the process of migrating to OmniOS. As I've had to touch
scripts and programs to update them for OmniOS's changes in the output
of 'zpool status', I've instead been migrating them away from using
zpool at all in favour of having them rely on a local ZFS status
reporting tool. This migration isn't complete
(some tools haven't needed changes yet and I'm letting them be), but
it's already simplified my life in various ways.
One of those ways is that now we control the tools. We can guarantee
stable output and we can make them output exactly what we want. We
can even make them output the same thing on both our current Solaris
machines and our new OmniOS machines so that higher level tooling is
insulated from what OS version it's running on. This is very handy and
not something that would be easy to do with 'zpool status'.
The other, more subtle way that this makes my life better is that I now
have much more confidence that things are not going to subtly break on
me. One problem with using zpool's output is that all sorts of things
can change about it and things that use it may not notice, especially
if the output starts omitting things to, for example, 'simplify' the
default output. Since our tools are abusing private APIs they may
well break (and may well break more often than things that parse
zpool's output), but when they
break we can make sure that it's a loud break. The result is much more
binary; if our tools work at all they're almost certainly accurate. A
script's interpretation of zpool's output is not necessarily so.
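As a sketch of what I mean (the 'zfsstatus' tool and its one-line-per-pool
output format here are invented for this entry, not our actual tools), a
consumer of a local status tool can refuse to go on the moment it sees
something it doesn't recognize:

    # hypothetical stable format: one 'pool <name> <health>' line per pool
    zfsstatus | while read kw name health; do
        case "$kw" in
        pool) [ "$health" = "ONLINE" ] || echo "$name is $health" ;;
        *)    echo "unrecognized zfsstatus output, giving up" 1>&2
              exit 1 ;;
        esac
    done

A script doing the equivalent with 'zpool status | grep ...' can't tell the
difference between 'everything is healthy' and 'the output format changed
and my grep no longer matches anything'.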
(Omitting things by default is not theoretical. In between S10U8 and
OmniOS, 'zfs list' went from including snapshots by default to
excluding them by default. This broke some of our code that was parsing
'zfs list' output to identify snapshots, and in a subtle way; the
code just thought there weren't any when there were. This is of course
a completely fair change, since 'zfs list' is not an API and this
probably makes things better for ordinary users.)
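For what it's worth, the robust way to find snapshots on both old and new
systems is to ask for them explicitly instead of relying on the default
('tank' here is just a stand-in pool name):

    zfs list -H -r -t snapshot -o name tank

The -H output is also header-free and tab-separated, which makes it a bit
less likely to shift underneath a script.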
I accept that rolling our own tools has some additional costs and some
risks. But I'd rather own those costs and risks explicitly than have
similar ones arise implicitly because I'm relying on a necessarily
imperfect understanding of zpool's output.
Actually, writing this entry has made me realize that it's only half of the story. The other half is going to take another entry.
2014-03-19
Why I like ZFS's zfs send and zfs receive
Our new fileserver environment to be has reached the point where it needs real testing, so on Monday I took the big leap and moved my home directory filesystem to our first new fileserver in order to give it some real world usage. Actually that's a bit of a misnomer. I'd copied my home directory filesystem over to the new fileserver several weeks ago simply to put some real data into the system; what I did on Monday was re-synchronize the copy with the live version and then switch all of our actual Ubuntu servers to NFS mounting it from the new fileserver.
This is exhibit one of why I like zfs send and zfs receive, or more
specifically why I like incremental zfs send and zfs receive. They
make it both easy and efficient to copy, repeatedly update, and then
synchronize filesystems like this during a move. You can sort of do the
same thing with rsync at the user level but zfs send does it faster
(sometimes drastically so), to some degree more reliably, and certainly
more conveniently.
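The basic pattern looks something like this (the pool, filesystem, and
host names are all made up for illustration):

    # initial full copy to the new fileserver
    zfs snapshot tank/homes/cks@move-1
    zfs send tank/homes/cks@move-1 | ssh newfs zfs receive -u newtank/homes/cks

    # later: catch up by sending only what changed since the last snapshot
    zfs snapshot tank/homes/cks@move-2
    zfs send -i @move-1 tank/homes/cks@move-2 | ssh newfs zfs receive -u newtank/homes/cks

Each incremental send transfers only the blocks that changed between the
two snapshots, which is where the speed difference over rsync comes from.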
(As I've found the hard way, if there is enough of the wrong sort of
activity in a filesystem an incremental rsync can take very close to
the same amount of time as a non-incremental one. That was a painful
experience. This doesn't happen with zfs send; you only ever pay for
the amount of data actually changed.)
The other reason why I'm so fond of zfs send is this magic trick:
incremental zfs send with snapshots is symmetric. Once you've
synchronized two copies this way you can do it in either direction and
you can reverse directions partway through. My home directory's move
is almost certainly temporary (and it's certainly temporary if problems
arise) and as long as I retain the filesystems and snapshots on both
sides I can move back just as fast as I moved in the first place. I
don't have to make a full copy and then synchronize it and so on; I
can just make a new snapshot on the new fileserver and send back the
difference between my last 'move over' snapshot and this first 'move
back' snapshot. Speaking as someone who's currently basically jumping up
and down on a new fileserver that should be good but hasn't been fully
proven yet, that's a rather reassuring thing.
(In fact if I'm cautious I can update my old home directory every night or just periodically, so that I won't lose much even if the new environment goes down in total flames somehow.)
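Concretely, moving back is just another incremental send in the opposite
direction, using the last 'move over' snapshot as the common base (again
with made-up names):

    # on the new fileserver: snapshot the current state and send back
    # only what has changed since the last synchronization point
    zfs snapshot newtank/homes/cks@move-back-1
    zfs send -i @move-2 newtank/homes/cks@move-back-1 | ssh oldfs zfs receive -u tank/homes/cks

This works as long as the old copy hasn't changed since that common
snapshot, which is much of the point of the freeze procedure in the
sidebar below (if it has changed, you need 'zfs receive -F' to roll it
back first).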
PS: Since I've now tested this in both directions, I can say that
you can zfs send a filesystem from Solaris 10 update 8 to OmniOS
and then zfs send snapshots from it from OmniOS back to S10U8
(provided that you don't 'zfs upgrade' the filesystem while it's
on OmniOS; if you do, it'll become incompatible with S10U8's old
ZFS version (or at least so the documentation says, I haven't tested
this personally for the obvious reasons)).
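(One way to check that a filesystem hasn't been quietly upgraded before
you try to send it back is to look at its on-disk version;
'newtank/homes/cks' is a stand-in name:

    zfs get -H -o value version newtank/homes/cks

If this reports a higher number than the S10U8 side's filesystems have,
the stream is presumably not going back.)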
Sidebar: how we freeze filesystems when moving them
The magic sequence we use is to unshare the filesystem (since all
of our access to filesystems is over NFS), set readonly=on, and
then make the final move snapshot. After the newly moved filesystem
is up and running we set canmount=off on the old version and then
let it sit until we have a backup of the moved one and everything
is known to be good and so on.
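In command form the sequence is roughly this (the filesystem and snapshot
names are invented for illustration):

    # freeze the old copy just before the final synchronization
    zfs unshare tank/homes/cks
    zfs set readonly=on tank/homes/cks
    zfs snapshot tank/homes/cks@final-move

    # after the moved filesystem is in service, hide the old one
    zfs set canmount=off tank/homes/cks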
We almost always do zfs receive into unmounted filesystems (ie
'zfs receive -u ...'.)
2014-03-09
Solaris gives us a lesson in how not to write documentation
Here are links to the manpages for reboot in Solaris 8,
Solaris 9, and
Solaris 10 (or
the more readable Illumos version,
which is probably closer to the Solaris 11 version). They are all,
well, manual-pagey, and thus most system administrators have well
honed skills in how to read them. If you read any of these, it
probably looks basically harmless. If you read them in succession
you'll probably wind up feeling that they're all basically the same,
although Solaris 10 has grown some new x86-related stuff.
This is an illusion and a terrible mistake, because at the very bottom of the Solaris 9, Solaris 10, and Illumos versions you will find the following new section (presented in its entirety):
NOTES
     The reboot utility does not execute the scripts in
     /etc/rcnum.d or execute shutdown actions in inittab(4). To
     ensure a complete shutdown of system services, use
     shutdown(1M) or init(1M) to reboot a Solaris system.
Let me translate this for you: since Solaris 9, reboot
shoots your system in the head instead of doing an orderly shutdown.
Despite the wording earlier in the manpage that 'the reboot utility
performs a sync(1M) operation on the disks, and then a multi-user
reboot is initiated. See init(1M) for details', SMF (or the System V
init system in Solaris 9) is not involved in things at all (and thus
no multi-user reboot happens). Reboot instead simply SIGTERMs all
processes. That stuff I quoted from the DESCRIPTION section is now a
flat out lie.
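For contrast, the orderly ways to reboot (the ones the NOTES section
quietly steers you toward) look something like this:

    # run the normal shutdown actions (SMF services or rc scripts) first
    shutdown -y -g0 -i6
    # or equivalently
    init 6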
This is a drastic change in reboot's behavior. It is at odds with
the traditional System V init behavior, with reboot's behavior in
Solaris 8 (as far as I know), and with reboot's behavior on other
systems (including but not limited to Linux). Sun decided to bury
this drastic behavior change in a
casual little note at the bottom of the manpage, so far down that
almost no one reads that far (partly because it is after all of the
really boring boilerplate).
This is truly an epic example of how not to write documentation. Vital changes go at the start of your manpages, not the very end, and they and their effects should be very clearly described instead of hidden behind what is basically obfuscation.
(The right way to do it would have been a complete rewrite of the DESCRIPTION section and perhaps an update to the SYNOPSIS as well.)
By the way, this phrasing for the NOTES section is especially
dangerous in Solaris 10 and onwards where SMF services normally
handle shutdown actions, not /etc/rcnum.d scripts (or inittab
actions). In anything using SMF it's possible to read this section
but still not realize what reboot really does because it doesn't
explicitly say that SMF is bypassed too.
Update: As pointed out in comments by Ade, this appears to be historical Solaris behavior (contrary to what I thought). However, it is not the (documented) behavior of other System V R4 systems such as SGI Irix and it is very likely to surprise people coming from other Unixes.
2014-03-05
ZFS's problem with boot time magic
One of the problems with ZFS (on Solaris et al) is that in practice it involves quite a bit of magic. This magic is great when it works but is terrible when something goes wrong, because it leaves you with very little to work with to diagnose and fix your problems. Most of this magic revolves around the most problematic times in the life of ZFS, those being system shutdown and startup.
I've written before about boot time ZFS pool activation, so let's talk about how it would work in a
non-magical environment. There are essentially two boot time jobs,
activating pools and then possibly mounting filesystems from the
pools. Clearly these should be driven by distinct commands, one
command to activate all non-active pools listed in /etc/zfs/zpool.cache
(if possible) and then maybe one command to mount all unmounted ZFS
filesystems. You don't really need the second command if pool
activation also mounts filesystems the same way ZFS import does,
but maybe you don't want all of that happening during (early) boot and
would rather defer both mounting and sharing until later.
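To make that concrete, here is roughly what a boot method script could
look like in that world; the 'zpool activate' command is entirely
hypothetical (no such subcommand exists), while 'zfs mount -a' and
'zfs share -a' are real commands:

    # hypothetical: bring up every pool recorded in /etc/zfs/zpool.cache,
    # without touching any of their filesystems
    zpool activate -a

    # later, at whatever point in boot you prefer:
    zfs mount -a     # mount all mountable ZFS filesystems
    zfs share -a     # share the ones with sharenfs et al set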
ZFS on Solaris doesn't work this way. There is no pool activation
command; pools just magically activate. And as I've found out,
pools also apparently magically mount all of their filesystems
during activation. While there is a 'zfs mount -a' command that
is run during early boot (via /lib/svc/method/fs-local), it doesn't
actually do what most people innocently think it does.
(What it seems to do in practice is mount additional ZFS filesystems from the root pool, if there is a root pool. Possibly it also mounts other ZFS filesystems that depend on additional root pool ZFS filesystems.)
I don't know where the magic for all of this lives. Perhaps it lives in the kernel. Perhaps it lives in some user level component that's run asynchronously on boot (much like how Linux's udev handles devices appearing). What I do know is that there is magic and this magic is currently causing me a major amount of heartburn.
Magic is a bad idea. Magic makes systems less manageable (and kernel magic is especially bad because it's completely inaccessible). Unix systems have historically got a significant amount of their power by more or less eschewing magic in favour of things like exposing the mechanics of the boot process. I find it sad to have ZFS be a regression on all of this.
(There are also regressions in the user level commands. For example, as
far as I can see there is no good way to import a pool without also
mounting and sharing its filesystems. These are actually three separate
operations at the system level, but the code for 'zpool import'
bundles them all together and provides no options to control this.)
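To illustrate the three separate operations I mean, here's a sketch of
how they could be spelled if they were decoupled; the '-N' style
'import but don't touch filesystems' flag is how I'd want to spell it,
not an option I actually have here:

    # wishful thinking: import the pool and nothing more
    zpool import -N tank
    # then mount its filesystems when you're ready
    zfs mount -a
    # and only then share them to NFS clients
    zfs share -a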