Wandering Thoughts archives

2014-03-31

I'm done with building tools around 'zpool status' output

Back when our fileserver environment was young, I built a number of local tools and scripts that relied on 'zpool status' to get information about pools, pool states, and so on. The problem with using 'zpool status' is of course that it is not an API; it's output intended for presentation to users, and as a result people feel free to change it from time to time. At the time using zpool's output seemed like the best option despite this, or more exactly the best (or easiest) of a bad lot of options.

Well, I'm done with that.

We're in the process of migrating to OmniOS. As I've had to touch scripts and programs to update them for OmniOS's changes in the output of 'zpool status', I've instead been migrating them away from using zpool at all in favour of having them rely on a local ZFS status reporting tool. This migration isn't complete (some tools haven't needed changes yet and I'm letting them be), but it's already simplified my life in various ways.

One of those ways is that now we control the tools. We can guarantee stable output and we can make them output exactly what we want. We can even make them output the same thing on both our current Solaris machines and our new OmniOS machines so that higher level tooling is insulated from what OS version it's running on. This is very handy and not something that would be easy to do with 'zpool status'.

The other, more subtle way that this makes my life better is that I now have much more confidence that things are not going to subtly break on me. One problem with using zpool's output is that all sorts of things can change about it and things that use it may not notice, especially if the output starts omitting things to, for example, 'simplify' the default output. Since our tools are abusing private APIs they may well break (and may well break more than zpool's output), but when they break we can make sure that it's a loud break. The result is much more binary; if our tools work at all they're almost certainly accurate. A script's interpretation of zpool's output is not necessarily so.

(Omitting things by default is not theoretical. In between S10U8 and OmniOS, 'zfs list' went from including snapshots by default to excluding them by default. This broke some of our code that was parsing 'zfs list' output to identify snapshots, and in a subtle way; the code just thought there weren't any when there were. This is of course a completely fair change, since 'zfs list' is not an API and this probably makes things better for ordinary users.)
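
(As an illustration rather than a prescription, since the exact options a given script wants will vary, the defensive approach is to tell 'zfs list' explicitly what you want instead of relying on its defaults:

    # explicitly ask for snapshots, in script-friendly form
    zfs list -H -t snapshot -o name
    # or list everything and carry the type along so the script can filter
    zfs list -H -t all -o name,type

Either way, a change in the default listing can't silently change what the script sees.)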

I accept that rolling our own tools has some additional costs and some risks. But I'd rather own those costs and risks explicitly than have similar ones arise implicitly because I'm relying on a necessarily imperfect understanding of zpool's output.

Actually, writing this entry has made me realize that it's only half of the story. The other half is going to take another entry.

ZFSNoMoreZpoolStatus written at 23:22:29

2014-03-19

Why I like ZFS's zfs send and zfs receive

Our new fileserver environment-to-be has reached the point where it needs real testing, so on Monday I took the big leap and moved my home directory filesystem to our first new fileserver in order to give it some real-world usage. Actually, that's a bit of a misnomer. I'd copied my home directory filesystem over to the new fileserver several weeks ago simply to put some real data into the system; what I did on Monday was re-synchronize the copy with the live version and then switch all of our actual Ubuntu servers to NFS mounting it from the new fileserver.

This is exhibit one of why I like zfs send and zfs receive, or more specifically why I like incremental zfs send and zfs receive. They make it both easy and efficient to copy, repeatedly update, and then synchronize filesystems like this during a move. You can sort of do the same thing with rsync at the user level but zfs send does it faster (sometimes drastically so), to some degree more reliably, and certainly more conveniently.

(As I've found the hard way, if there is enough of the wrong sort of activity in a filesystem an incremental rsync can take very close to the same amount of time as a non-incremental one. That was a painful experience. This doesn't happen with zfs send; you only ever pay for the amount of data actually changed.)
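
To make the mechanics concrete, here is a minimal sketch of the sequence involved. The pool and filesystem name, the snapshot names, and ssh as the transport are illustrative assumptions rather than our actual setup:

    # initial full copy, done well ahead of the move
    zfs snapshot tank/homes/cks@sync1
    zfs send tank/homes/cks@sync1 | ssh newfs zfs receive -u tank/homes/cks

    # at move time, send only what has changed since @sync1
    zfs snapshot tank/homes/cks@sync2
    zfs send -i @sync1 tank/homes/cks@sync2 | ssh newfs zfs receive -u tank/homes/cks

Each incremental send only has to read and transfer the blocks that changed between the two snapshots, which is where the advantage over rsync's file-by-file scanning comes from.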

The other reason why I'm so fond of zfs send is this magic trick: incremental zfs send with snapshots is symmetric. Once you've synchronized two copies this way you can do it in either direction and you can reverse directions partway through. My home directory's move is almost certainly temporary (and it's certainly temporary if problems arise) and as long as I retain the filesystems and snapshots on both sides I can move back just as fast as I moved in the first place. I don't have to make a full copy and then synchronize it and so on; I can just make a new snapshot on the new fileserver and send back the difference between my last 'move over' snapshot and this first 'move back' snapshot. Speaking as someone who's currently basically jumping up and down on a new fileserver that should be good but hasn't been fully proven yet, that's a rather reassuring thing.

(In fact if I'm cautious I can update my old home directory every night or just periodically, so that I won't lose much even if the new environment goes down in total flames somehow.)
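
Here is a sketch of that reversal, continuing the hypothetical names from the earlier sketch and assuming the old copy hasn't changed since its last snapshot (which freezing it read-only, as described in the sidebar below, takes care of):

    # on the new fileserver: snapshot the live copy and send just the delta back
    zfs snapshot tank/homes/cks@back1
    zfs send -i @sync2 tank/homes/cks@back1 | ssh oldfs zfs receive tank/homes/cks

Which way 'zfs send -i' flows simply depends on which side has the newer snapshot; the stream doesn't care which machine was the original.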

PS: Since I've now tested this in both directions, I can say that you can zfs send a filesystem from Solaris 10 update 8 to OmniOS and then zfs send snapshots of it from OmniOS back to S10U8, provided that you don't 'zfs upgrade' the filesystem while it's on OmniOS. If you do, it'll become incompatible with S10U8's old ZFS version (or at least so the documentation says; I haven't tested this personally, for the obvious reasons).
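
If you want to keep an eye on this, one way (with a hypothetical filesystem name) is to compare the filesystem's ZFS version property against what the older system supports:

    # filesystem versions this OS supports
    zfs upgrade -v
    # the version the filesystem itself is at
    zfs get -H -o value version tank/homes/cks

As long as the filesystem's version stays at or below what S10U8 supports, sending it back should keep working.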

Sidebar: how we freeze filesystems when moving them

The magic sequence we use is to unshare the filesystem (since all of our access to filesystems is over NFS), set readonly=on, and then make the final move snapshot. After the newly moved filesystem is up and running we set canmount=off on the old version and then let it sit until we have a backup of the moved one and everything is known to be good and so on.

We almost always do zfs receive into unmounted filesystems (ie with 'zfs receive -u ...').
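
Put together as a sketch, with the same hypothetical names as before:

    # on the old fileserver: stop NFS access, freeze, and take the final snapshot
    zfs unshare tank/homes/cks
    zfs set readonly=on tank/homes/cks
    zfs snapshot tank/homes/cks@final

    # send the last increment, received unmounted on the new fileserver
    zfs send -i @sync2 tank/homes/cks@final | ssh newfs zfs receive -u tank/homes/cks

    # once the new copy is live, backed up, and known good
    zfs set canmount=off tank/homes/cks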

ZFSSendReceiveIsNice written at 00:12:24

2014-03-09

Solaris gives us a lesson in how not to write documentation

Here are links to the manpages for reboot in Solaris 8, Solaris 9, and Solaris 10 (or the more readable Illumos version, which is probably closer to the Solaris 11 version). They are all, well, manual-pagey, and thus most system administrators have well-honed skills in how to read them. If you read any of these, it probably looks basically harmless. If you read them in succession you'll probably wind up feeling that they're all basically the same, although Solaris 10 has grown some new x86-related stuff.

This is an illusion and a terrible mistake, because at the very bottom of the Solaris 9, Solaris 10, and Illumos versions you will find the following new section (presented in its entirety):

NOTES

The reboot utility does not execute the scripts in /etc/rcnum.d or execute shutdown actions in inittab(4). To ensure a complete shutdown of system services, use shutdown(1M) or init(1M) to reboot a Solaris system.

Let me translate this for you: since Solaris 9, reboot shoots your system in the head instead of doing an orderly shutdown. Despite the wording earlier in the manpage that 'the reboot utility performs a sync(1M) operation on the disks, and then a multi-user reboot is initiated. See init(1M) for details', SMF (or the System V init system in Solaris 9) is not involved in things at all (and thus no multi-user reboot happens). Reboot instead simply SIGTERMs all processes. That stuff I quoted from the DESCRIPTION section is now a flat-out lie.

This is a drastic change in reboot's behavior. It is at odds with reboot's behavior in Solaris 8 (as far as I know), the traditional System V init behavior, and reboot's behavior on other systems (including but not limited to Linux). Sun decided to bury this drastic behavior change in a casual little note at the bottom of the manpage, so far down that almost no one reads that far (partly because it is after all of the really boring boilerplate).

This is truly an epic example of how not to write documentation. Vital changes go at the start of your manpages, not the very end, and they and their effects should be very clearly described instead of hidden behind what is basically obfuscation.

(The right way to do it would have been a complete rewrite of the DESCRIPTION section and perhaps an update to the SYNOPSIS as well.)

By the way, this phrasing for the NOTES section is especially dangerous in Solaris 10 and onwards where SMF services normally handle shutdown actions, not /etc/rcnum.d scripts (or inittab actions). In anything using SMF it's possible to read this section but still not realize what reboot really does because it doesn't explicitly say that SMF is bypassed too.
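
For what it's worth, the orderly alternatives that the NOTES section points you at look roughly like this (the zero-second grace period and the choice to reboot rather than halt are just examples):

    # run the SMF stop methods / rc scripts, then go to run level 6 (reboot)
    shutdown -y -g0 -i6
    # or, more tersely
    init 6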

Update: As pointed out in comments by Ade, this appears to be historical Solaris behavior (contrary to what I thought). However, it is not the (documented) behavior of other System V R4 systems such as SGI Irix and it is very likely to surprise people coming from other Unixes.

RebootDangerousManpage written at 22:19:47

2014-03-05

ZFS's problem with boot time magic

One of the problems with ZFS (on Solaris et al) is that in practice it involves quite a bit of magic. This magic is great when it works but is terrible when something goes wrong, because it leaves you with very little to work with to diagnose and fix your problems. Most of this magic revolves around the most problematic times in the life of ZFS, those being system shutdown and startup.

I've written before about boot time ZFS pool activation, so let's talk about how it would work in a non-magical environment. There are essentially two boot-time jobs: activating pools and then possibly mounting filesystems from the pools. Clearly these should be driven by distinct commands, one command to activate all non-active pools listed in /etc/zfs/zpool.cache (if possible) and then maybe one command to mount all unmounted ZFS filesystems. You don't really need the second command if pool activation also mounts filesystems the same way ZFS import does, but maybe you don't want all of that happening during (early) boot and would rather defer both mounting and sharing until later.
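
As a sketch of what that could look like, here is an imaginary early-boot method script. The pool activation command is hypothetical (it doesn't exist, which is the whole point); 'zfs mount -a' and 'zfs share -a' are real commands:

    #!/sbin/sh
    # hypothetical command: explicitly activate every pool recorded in the cache file
    zpool-activate-cached /etc/zfs/zpool.cache || exit 1

    # real commands: mounting and sharing as separate, deferrable steps
    zfs mount -a
    zfs share -a

The exact commands don't matter; what matters is that each step would be a visible, separately runnable and separately debuggable action instead of something that just happens.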

ZFS on Solaris doesn't work this way. There is no pool activation command; pools just magically activate. And as I've found out, pools also apparently magically mount all of their filesystems during activation. While there is a 'zfs mount -a' command that is run during early boot (via /lib/svc/method/fs-local), it doesn't actually do what most people innocently think it does.

(What it seems to do in practice is mount additional ZFS filesystems from the root pool, if there is a root pool. Possibly it also mounts other ZFS filesystems that depend on additional root pool ZFS filesystems.)

I don't know where the magic for all of this lives. Perhaps it lives in the kernel. Perhaps it lives in some user level component that's run asynchronously on boot (much like how Linux's udev handles devices appearing). What I do know is that there is magic and this magic is currently causing me a major amount of heartburn.

Magic is a bad idea. Magic makes systems less manageable (and kernel magic is especially bad because it's completely inaccessible). Unix systems have historically got a significant amount of their power by more or less eschewing magic in favour of things like exposing the mechanics of the boot process. I find it sad to see ZFS be a regression on all of this.

(There are also regressions in the user-level commands. For example, as far as I can see there is no good way to import a pool without also mounting and sharing its filesystems. These are actually three separate operations at the system level, but the code for 'zpool import' bundles them all together and provides no options to control this.)

ZFSBootMagicProblem written at 00:22:15

