Wandering Thoughts archives

2014-06-27

A retrospective on our Solaris ZFS-based NFS fileservers (part 2)

In yesterday's entry I talked about the parts of our Solaris ZFS fileserver environment that worked nicely over the six years we've run them. Today is for the other side, the things about Solaris that didn't go so well. You may have noticed that yesterday I was careful to talk specifically about the basics of ZFS working well. That is because pretty much all of the extra frills we tried failed or outright blew up in our faces.

The largest single thing that didn't work out anywhere near as we planned and wanted is failover. There are contributing factors beyond ZFS (see this for a full overview) but what basically killed even careful manual failover is the problem of very slow zpool imports. The saving grace of the situation is that we've only really needed failover a relatively small number of times because the fileservers have been generally quite reliable. The downside of losing failover is that the other name for failover is 'easy and rapid migration of NFS service' and there have been any number of situations where we could have used that. For example, we recently rebooted all of the fileservers because they'd been up over 650 days and we had some signs they might have latent problems. With fast, good 'failover' we could have done this effectively live without much user-visible impact (shift all NFS fileservice away from a particular machine, reboot it, shift its NFS fileservice back, repeat). Without that failover? A formal downtime.
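To make the mechanics concrete, the failover we're set up for is conceptually just a handful of commands; the pool name here is invented and I'm leaving out all of the checking you'd do in real life:

    # on the fileserver giving up the pool
    zfs unshare -a          # simplified; really just that pool's filesystems
    zpool export fs2-pool
    # on the fileserver taking over (importing normally mounts and shares
    # the pool's filesystems itself)
    zpool import fs2-pool

The commands are trivial; it's the multi-minute 'zpool import' in the middle that makes this impractical to do casually.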

The largest advertised ZFS feature that just didn't work was ZFS's support for spare devices. We wound up feeling that this was completely useless and built our own spares system (part 2, part 3). We also had problems with, for example, zpool status hanging in problem situations or just not being honest with us about the truth of the situation.

It turned out to be a significant issue in practice that ZFS has no API, ie no way for outside systems to reliably extract state information from it (a situation that continues to this day). Because we needed this information, we were forced to develop ad-hoc and non-portable tools to extract it by force from Solaris, and this in turn caused further problems. One significant reason we never upgraded past Solaris 10 update 8, despite the existence of fixes we were interested in, was that upgrading would have required updating and re-validating all of these tools.

(These tools are also a large part of why we wouldn't take Solaris 11 even if Oracle offered it to us for free. We need these tools and these tools require source code access so we can reverse engineer this information.)

Overall, our Solaris experience has left me feeling that we were quite far from the (ZFS) use cases that the Solaris developers expected. A lot of things didn't seem prepared to cope with, for example, how many 'disks' we have. Nothing actually broke significantly (at least once we stopped applying Solaris patches) but the entire environment felt fragile, like a too-tall building swaying as the wind builds up. We also became increasingly dubious about the quality of implementation of the changes that Sun (and then Oracle) was making to Solaris, adding another reason to stop applying patches and to never upgrade past Solaris 10U8.

(Allow me to translate that: Solaris OS developers routinely wrote and released patches and changes with terrible code that broke things for us and didn't work as officially documented. The Sun and Oracle reaction to this was a giant silent shrug.)

While we got away with our 'no patches, no updates, no changes' policy, I'm aware that we were lucky; we simply never hit any of the known S10U8 bugs. I didn't (and don't) like running systems that I feel I can't update because things are sure to break, and we definitely wound up doing that with our Solaris machines. I count that as something that did not go well.

In general, over time I've become increasingly uncomfortable about our default 'no updates on black box appliance style machines' policy, which we've followed on both the Solaris fileservers and the iSCSI backends. I kind of count it as an implicit failure in our current fileserver environment. For the next generation of fileservers and backends I'd really like to figure out a way to apply as many updates as possible in a safe way (I have some ideas but I'll save them for another entry).

None of these things that didn't work so well have been fatal or even painful in day to day usage. Some of them, such as the ZFS spares situation, have forced us to do things that improved the overall environment; having our own spares system has turned out to be a big win because it can be more intelligent and more aggressive than any general ZFS solution could be.

ZFSFileserverRetrospective02 written at 02:07:41; Add Comment

2014-06-25

A retrospective on our Solaris ZFS-based NFS fileservers (part 1)

We're in the slow process of replacing our original Solaris ZFS fileserver environment with a second generation environment. With our current fileservers entering their sunset period, it's a good time to take an uncommon retrospective look back over their six years of operation and talk about what went well and what didn't go quite so well. Today I'm going to lead with the good stuff about our Solaris machines.

(I'm actually a bit surprised that it's been six years, but that's what the dates say. I wrote the fileservers up in October of 2008 and they'd already been in operation for several months at that point.)

The headline result is that our fileserver environment has worked great overall. We've had six years of service with very little disruption and no data loss. We've had many disks die, we've had entire iSCSI backends fail, and through it all ZFS and everything else has kept trucking along. This is actually well above my expectations six years ago, when I had a very low view of ZFS's long-term reliability and expected to someday lose a pool to ZFS corruption over the lifetime of our fileservers.

The basics of ZFS have been great and using ZFS has been a significant advantage for us. From my perspective, the two big wins with ZFS have been flexible space management for actual filesystems and ZFS checksums and scrubs, which have saved us in ways large and small. Flexible space management has sometimes been hard to explain to people in a way that they really get, but it's been very nice to simply be able to make filesystems for logical reasons and not have to ask people to pre-plan how much space they get; they can use more or less as much (or as little) space as they need.

Solaris in general and Solaris NFS in particular have been solid in normal use and we haven't seen any performance issues. We used to have some mysterious NFS mount permission issues (where a filesystem wouldn't mount or work on some systems) but they haven't cropped up on our systems for a few years, from what I remember. Our Solaris 10 update 8 installs may not be the most featureful or up to date systems but in general they've given us no problems; they just sit in their racks and run and run and run (much like the iSCSI backends). I think it says good things that they reached over 650 days of uptime recently before we decided to reboot them as a sort of precaution after one crashed mysteriously.

Okay, I'll admit it: Solaris has not been completely and utterly rock solid for us. We've had one fileserver that just doesn't seem to like life, for reasons that we're not sure about; it is far more sensitive to disk errors and it's locked up several times over the years. Since we've replaced the hardware and reinstalled the software (and the problems have persisted), my vague theory is that it's something to do with the NFS load it gets, the disks it's dealing with, or both (it has most of our flaky 1TB Seagate disks, which fail at rates far higher than the other drives).

One Solaris feature deserves special mention. DTrace (and with it Solaris source code) turned out to be a serious advantage and very close to essential for solving an important performance problem we had. We might have eventually found our issue without DTrace but I'm pretty sure DTrace made it faster, and DTrace has also given us useful monitoring tools in general. I've come around to considering DTrace an important feature and I'm glad I get to keep it in our second generation environment (which will be using OmniOS on the fileservers).

I guess the overall summary is that for six years, our Solaris ZFS-based NFS fileservers have been boring almost all of the time; they work and they don't cause problems, even when crazy things happen. This has been especially true for the last several years, ie after we shook out the initial problems and got used to what to do and not to do.

(We probably could have made our lives more exciting for a while by upgrading past Solaris 10 update 8 but we never saw any reason to do that. After all, the machines worked fine with S10U8.)

That isn't to say that Solaris has been completely without problems and that everything has worked out for us as we planned. But that's for another entry (this one is already long enough).

Update: in the initial version of this entry I completely forgot to mention that the Solaris iSCSI initiator (the client) has been problem free for us (and it's obviously a vital part of the fileserver environment). There are weird corner cases but those happen anywhere and everywhere.

ZFSFileserverRetrospective01 written at 22:17:29; Add Comment

2014-05-21

How I wish ZFS pool importing could work

I've mentioned before that one of our problems is that explicit 'zpool import' commands are very slow in our environment, so slow that we don't try to do failover although we're theoretically set up to do it. At least back in the Solaris era and I assume still in the OmniOS one, this came about because of two reasons. First, when you run 'zpool import' (for basically any reason) it checks every disk you have, one at a time, to build up a mapping of what ZFS labels are where and so on. Back when I timed it, this seemed to take roughly a third of a second per visible 'disk' (a real disk or an iSCSI LUN). Second, when your zpool import command finishes it promptly throws away all of that slowly and expensively gathered information so the next 'zpool import' command you run has to do it all over again. Both of these together combine unpleasantly in typical failover scenarios. You might do one 'zpool import' to confirm that all the pools you want to import are fully visible and then 'zpool import' five or six pools (one at a time, because you can't import multiple pools at once with a normal 'zpool import' command). The resulting time consumption adds up fast.
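(To make the arithmetic concrete with made-up but plausible numbers: at a third of a second per disk, 300 visible 'disks' works out to around 100 seconds for a single 'zpool import' scan. A failover that does one checking scan and then imports six pools one at a time repeats that scan seven times over, which is already more than ten minutes of doing nothing but probing disks.)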

What I would like is for a way to have ZFS pool imports fix both problems. Sequential disk probing is an easy fix; just don't do that. Scanning some number of disks in parallel ought to significantly speed things up and even modest levels of parallelism offer potentially big wins (eg, doing two disks in parallel could theoretically halve the time necessary).
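As a rough sketch of the parallel probing idea (this is an illustration, not anything 'zpool import' actually does; the device glob is invented and I'm relying on GNU xargs' -P option):

    # read the ZFS labels of every visible disk, eight at a time, instead of
    # strictly one after another
    mkdir -p /tmp/zfs-labels
    ls /dev/dsk/*s0 | xargs -n1 -P8 -I{} sh -c \
        'zdb -l "{}" > /tmp/zfs-labels/$(basename "{}") 2>/dev/null'

Even something this crude gets you most of the win, because the per-disk probes are independent of each other and spend most of their time waiting on IO.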

There are two potential fixes for the problem of 'zpool import' throwing away all of that work. The simpler is to make it possible to import multiple pools in a single 'zpool import' operation. There's no fundamental obstacle in the code for this, it's just a small matter of creating a command line syntax for it and then basically writing a loop over the import operation (right now giving two pool names renames a pool on import and giving more than two is a syntax error). The bigger fix is to provide an option for zpool import to not throw away the work, letting it write out the accumulated information to a cache file and then reload it under suitable conditions (both should require a new command line switch). If the import process finds that the on-disk reality doesn't match the cache file's data, it falls back to doing the current full scan (checking disks in parallel, please).
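To be concrete about what I'm wishing for, here is the sort of thing I mean (none of this syntax exists; the option names and pool names are entirely invented):

    # import several pools from one disk scan
    zpool import fs2-data fs2-homes fs2-scratch
    # or: scan once and save the results, then import repeatedly from that scan
    zpool import --save-scan /var/tmp/disk-scan
    zpool import --use-scan /var/tmp/disk-scan fs2-data fs2-homes fs2-scratch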

At this point some people will be tempted to suggest ZFS cache files. Unfortunately these are not a solution for at least two reasons. First, you can't use ZFS cache files to accelerate a scan for what pools are available for import; a plain 'zpool import' doesn't take a '-c cachefile' argument. Second, there's no way to build or rebuild ZFS cache files without actually importing pools. This makes managing them very painful in practice, for example you can't have a single ZFS cache file with a global view of all pools available on your shared storage unless you import them all on one system and then save the resulting cache file.

(Scanning for visible pools matters in failover on shared storage because you really want to make sure that the machine you're failing over to can see all of the shared storage that it should. In fact I'd like a ZFS pool import option for 'do not import pools unless all of their devices are visible'; we'd certainly use it by default because in most situations in our environment we'd rather a pool not import at all than import with mirrors broken because eg an iSCSI target was accidentally not configured on one server.)

ZFSPoolImportWish written at 01:32:27; Add Comment

2014-05-02

An important addition to how ZFS deduplication works on the disk

My entry on how ZFS deduplication works on the disk turns out to have missed one important aspect of how deduplication affects the on-disk ZFS data. Armed with this information we can finally answer some long-standing uncertainties about ZFS deduplication.

As I mentioned in passing earlier, ZFS uses block pointers to describe where the actual data for blocks are. Block pointers have the data virtual addresses of up to three copies of the block's data, the block's checksum, and a number of other bits and pieces. Crucially, block pointers are specially marked if they were written with deduplication on. It is the deduplication flag in any particular block pointer that controls what happens when the block pointer is deleted. If the flag is on, the delete does a DDT lookup so that the reference counts can be maintained; if the flag is off, there's no DDT lookup needed.

(When the reference count of a DDT entry goes to zero, the DDT entry itself gets deleted. A ZFS pool always has DDT tables, even if they're empty.)

As mentioned in the first entry, deduplication has basically no effects on reads because reads of a dedup'd BP don't normally involve the DDT since the BP contains the DVAs of some copies of the block and ZFS will just read directly from these. However if there is a read error on a dedup'd BP, ZFS does a DDT lookup to see if there's another copy of the block available (for example in the 'ditto' copies).

(I'm waving my hands about deduplication's potential effects on how fragmented a file's data gets on the disk.)

Only file data is deduplicated. ZFS metadata like directories is not subject to deduplication and so block pointers for metadata blocks will never be dedup'd BPs. This is pretty much what you'd expect but I feel like mentioning it explicitly since I just checked this in the code.

So turning ZFS deduplication on does not irreversibly taint anything as far as I can see. Any data written while deduplication is on will be marked as a dedup'd BP and then when it's deleted you'll hit the DDT, but after deduplication is turned off and all of that data is deleted the DDT should be empty again. And if you never delete any of the data the only effect is that the DDT will sit there taking up some extra space. But you will take the potential deduplication hit when you delete data written while deduplication is on, even if you later turn it off, and this includes deleting snapshots.
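One way to keep an eye on this is zdb, which can report on a pool's DDT; as far as I can see, something like the following will show you whether a pool still has DDT entries hanging around (the pool name is a stand-in, and repeating -D gets you progressively more detail):

    # summary deduplication statistics for the pool, including DDT entry counts
    zdb -D tank
    # a histogram of how many references the DDT entries have
    zdb -DD tank

If I'm right that the DDT empties out once dedup is off and all the dedup'd data is deleted, these numbers should eventually drop back to nothing.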

Sidebar: Deduplication and ZFS scrubs

As you'd expect, ZFS scrubs and resilvers do check and correct DDT entries, and they check all DVAs that DDT entries point to (even ditto blocks, which are not directly referred to by any normal data BPs). The scanning code tries to do DDT and file data checks efficiently, basically checking DDT entries and the DVAs they point to once no matter how many references they have. The exact mechanisms are a little bit complicated.

(My paranoid instincts see corner cases with this code, but I'm probably wrong. And if they happened they would probably be the result of ZFS code bugs, not disk IO errors.)

ZFSDedupStorageII written at 01:49:29; Add Comment

2014-04-26

What I can see about how ZFS deduplication seems to work on disk

There is a fair amount of high level information about how ZFS deduplication works. There is much less that I could find about some low level details of how deduplicated blocks exist on disk and some implications of how the on-disk data structures are stored. Since I was just looking this up in the current Illumos source code, I want to jot down some notes before it all falls out of my head again.

The core dedup data structure is the DDT, which holds the core information for each deduplicated block: the block's checksum, the number of references to the block, and the on-disk addresses of some number of copies of the block. Don't ask me exactly how many copies of the block there can be out there in the world; my head gets confused trying to follow the code. The DDT is stored as part of the overall pool metadata (via the ZAP) and as such the DDT is copy on write, just like pretty much everything else in ZFS. This makes total sense and is what you need.

Note that the DDT is global to the pool; it is not tied to any particular filesystem. As a pool level object it is not captured in filesystem snapshots any more than, say, information about which disk blocks are free is.

ZFS records where blocks are on disk through the use of 'block pointers' (which also include things like the block's checksum). An interesting question is whether the block pointer for a dedup'd block refers to the DDT in any way. The answer is that it doesn't; it points directly to the on-disk addresses of up to three copies of the block. So files with deduplicated blocks are read without referring to the DDT, at least if all goes well.
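If you want to see this for yourself, zdb can dump a file's block pointers; as I understand the options, something like the following (with an invented pool, filesystem, and object number) will show you the DVAs that each block pointer carries directly:

    # dump object 1234 of tank/fs in enough detail to show its block
    # pointers and their DVAs
    zdb -ddddd tank/fs 1234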

If configured to do so, ZFS can store more than one copy of a sufficiently highly referenced data block. As more and more references add up, ZFS will sooner or later create a second and perhaps a third and fourth copy. I believe that these additional copies will be used to recover from failed reads of the original copy of the data block, even for blocks whose block pointers were written before the extra copies existed, although those block pointers don't directly contain references to the on-disk addresses of these additional 'ditto' copies. If I'm correct, this implies that failed reads may cause DDT access in order to see if such ditto blocks exist. Of course you probably don't care about any extra overhead from this if it saves your data.

In general, the more I look at this code the less confident I am that I have any understanding of the effects and consequences of turning off ZFS deduplication on a filesystem after you've turned it on. I suppose this just echoes what I said back in ZFSDedupBadDocumentation.

(There are other things about ZFS dedup that I don't understand after reading the code, but I'm going to save them for an appropriate ZFS mailing list.)

ZFSDedupStorage written at 03:33:19; Add Comment

2014-04-02

I'm angry that ZFS still doesn't have an API

Yesterday I wrote a calm rational explanation for why I'm not building tools around 'zpool status' any more and said that it ended up being only half of the story. The other half is that I am genuinely angry that ZFS still does not have any semblance of an API, so angry that I've decided to stop cooperating with ZFS's non-API and make my own.

(It's not the hot anger of swearing, it's the slow anger of a blister that keeps reminding you about its existence with every step you take.)

For at least the past six years it has been blindingly obvious that ZFS should have an API so that people could build additional tools and solutions on top of it. For all that is sane, stock ZFS doesn't even have an alerting solution for pool problems. You can't miss that unless you're blind, and say whatever you want about the ZFS developers, I'm sure that they're not blind. I am and have been completely agnostic about the exact format that this API could have taken, so long as it existed. Stable, documented, script-friendly output from ZFS tools? A documented C level library API? XML information dumps because everyone loves XML? A web API? Whatever. I could have worked with any of them.

Instead we got nothing. We got nothing when ZFS was with Sun and despite some vague signs of care we continue to get exactly nothing now that ZFS is effectively with Illumos (and I'm pretty sure that Oracle hasn't fixed the situation either). At this point it is clear that the ZFS developers have different priorities and in an objective sense do not care about this issue.

(Regardless of what you say, what you actually care about is shown by what you work on.)

This situation has thoroughly gotten under my skin now that moving to OmniOS is rubbing my nose in it again. So now I'm through with tacitly cooperating with it by trying to wrestle and wrangle the ZFS commands to do what I want. Instead I feel like giving 'zpool status' and its friends a great big middle finger and then throwing them down a well. The only thing I want to use them for now is as a relatively authoritative source of truth if I suspect that something is wrong with what my own tools are showing me.

(I call zpool status et al 'relatively authoritative' because it and other similar commands leave things out and otherwise mangle what you are seeing, sometimes in ways that cause real problems.)

I will skip theories about why the ZFS developers did not develop an API (either in Sun or later), partly because I am in a bad mood after writing this and so am inclined to be extremely cynical.

ZFSNoAPIAnger written at 00:12:03; Add Comment

2014-03-31

I'm done with building tools around 'zpool status' output

Back when our fileserver environment was young, I built a number of local tools and scripts that relied on 'zpool status' to get information about pools, pool states, and so on. The problem with using 'zpool status' is of course that it is not an API, it's something intended for presentation to users, and so as a result people feel free to change its output from time to time. At the time using zpool's output seemed like the best option despite this, or more exactly the best (or easiest) of a bad lot of options.

Well, I'm done with that.

We're in the process of migrating to OmniOS. As I've had to touch scripts and programs to update them for OmniOS's changes in the output of 'zpool status', I've instead been migrating them away from using zpool at all in favour of having them rely on a local ZFS status reporting tool. This migration isn't complete (some tools haven't needed changes yet and I'm letting them be), but it's already simplified my life in various ways.

One of those ways is that now we control the tools. We can guarantee stable output and we can make them output exactly what we want. We can even make them output the same thing on both our current Solaris machines and our new OmniOS machines so that higher level tooling is insulated from what OS version it's running on. This is very handy and not something that would be easy to do with 'zpool status'.

The other, more subtle way that this makes my life better is that I now have much more confidence that things are not going to subtly break on me. One problem with using zpool's output is that all sorts of things can change about it and things that use it may not notice, especially if the output starts omitting things to, for example, 'simplify' the default output. Since our tools are abusing private APIs they may well break (and may well break more often than zpool's output changes), but when they break we can make sure that it's a loud break. The result is much more binary; if our tools work at all they're almost certainly accurate. A script's interpretation of zpool's output is not necessarily so.

(Omitting things by default is not theoretical. In between S10U8 and OmniOS, 'zfs list' went from including snapshots by default to excluding them by default. This broke some of our code that was parsing 'zfs list' output to identify snapshots, and in a subtle way; the code just thought there weren't any when there were. This is of course a completely fair change, since 'zfs list' is not an API and this probably makes things better for ordinary users.)
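As a concrete illustration of the shape of what broke (this isn't our actual code):

    # on S10U8 this printed snapshot names; on OmniOS it silently prints
    # nothing, because 'zfs list' stopped including snapshots by default
    zfs list -H -o name | grep '@'
    # the version that keeps working asks for snapshots explicitly
    zfs list -H -t snapshot -o name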

I accept that rolling our own tools has some additional costs and has some risks. But I'd rather own those costs and those risks explicitly rather than have similar ones arise implicitly because I'm relying on a necessarily imperfect understanding of zpool's output.

Actually, writing this entry has made me realize that it's only half of the story. The other half is going to take another entry.

ZFSNoMoreZpoolStatus written at 23:22:29; Add Comment

2014-03-19

Why I like ZFS's zfs send and zfs receive

Our new fileserver environment-to-be has reached the point where it needs real testing, so on Monday I took the big leap and moved my home directory filesystem to our first new fileserver in order to give it some real world usage. Actually that's a bit of a misnomer. I'd copied my home directory filesystem over to the new fileserver several weeks ago simply to put some real data into the system; what I did on Monday was re-synchronize the copy with the live version and then switch all of our actual Ubuntu servers to NFS mounting it from the new fileserver.

This is exhibit one of why I like zfs send and zfs receive, or more specifically why I like incremental zfs send and zfs receive. They make it both easy and efficient to copy, repeatedly update, and then synchronize filesystems like this during a move. You can sort of do the same thing with rsync at the user level but zfs send does it faster (sometimes drastically so), to some degree more reliably, and certainly more conveniently.
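The basic shape of it is simple (the filesystem and host names here are invented, and I'm leaving out the snapshot housekeeping you'd do in real life):

    # the initial full copy, done well before the actual move
    zfs snapshot fs0-pool/homes/cks@move-1
    zfs send fs0-pool/homes/cks@move-1 | ssh newserver zfs receive -u fs9-pool/homes/cks
    # on the day of the move, send only what has changed since then
    zfs snapshot fs0-pool/homes/cks@move-2
    zfs send -i @move-1 fs0-pool/homes/cks@move-2 | ssh newserver zfs receive -u fs9-pool/homes/cks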

(As I've found the hard way, if there is enough of the wrong sort of activity in a filesystem an incremental rsync can take very close to the same amount of time as a non-incremental one. That was a painful experience. This doesn't happen with zfs send; you only ever pay for the amount of data actually changed.)

The other reason why I'm so fond of zfs send is this magic trick: incremental zfs send with snapshots is symmetric. Once you've synchronized two copies this way you can do it in either direction and you can reverse directions partway through. My home directory's move is almost certainly temporary (and it's certainly temporary if problems arise) and as long as I retain the filesystems and snapshots on both sides I can move back just as fast as I moved in the first place. I don't have to make a full copy and then synchronize it and so on; I can just make a new snapshot on the new fileserver and send back the difference between my last 'move over' snapshot and this first 'move back' snapshot. Speaking as someone who's currently basically jumping up and down on a new fileserver that should be good but hasn't been fully proven yet, that's a rather reassuring thing.
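In command terms the reversal is just the same operation pointed the other way (same invented names as before; the old copy was left alone, so it hasn't diverged from the last snapshot):

    # on the new fileserver: snapshot, then send the delta back to the old copy
    zfs snapshot fs9-pool/homes/cks@back-1
    zfs send -i @move-2 fs9-pool/homes/cks@back-1 | ssh oldserver zfs receive -u fs0-pool/homes/cks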

(In fact if I'm cautious I can update my old home directory every night or just periodically, so that I won't lose much even if the new environment goes down in total flames somehow.)

PS: Since I've now tested this in both directions, I can say that you can zfs send a filesystem from Solaris 10 update 8 to OmniOS and then zfs send snapshots from it from OmniOS back to S10U8 (provided that you don't 'zfs upgrade' the filesystem while it's on OmniOS; if you do, it'll become incompatible with S10U8's old ZFS version (or at least so the documentation says, I haven't tested this personally for the obvious reasons)).

Sidebar: how we freeze filesystems when moving them

The magic sequence we use is to unshare the filesystem (since all of our access to filesystems is over NFS), set readonly=on, and then make the final move snapshot. After the newly moved filesystem is up and running we set canmount=off on the old version and then let it sit until we have a backup of the moved one and everything is known to be good and so on.
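In command terms that's roughly the following (with an invented filesystem name):

    zfs unshare fs0-pool/homes/cks
    zfs set readonly=on fs0-pool/homes/cks
    zfs snapshot fs0-pool/homes/cks@final-move
    # ... send @final-move over and repoint the NFS clients at the new fileserver ...
    zfs set canmount=off fs0-pool/homes/cks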

We almost always do zfs receive into unmounted filesystems (ie 'zfs receive -u ...'.)

ZFSSendReceiveIsNice written at 00:12:24; Add Comment

2014-03-09

Solaris gives us a lesson in how not to write documentation

Here are links to the manpages for reboot in Solaris 8, Solaris 9, and Solaris 10 (or the more readable Illumos version, which is probably closer to the Solaris 11 version). They are all, well, manual-pagey, and thus most system administrators have well honed skills in how to read them. If you read any of these, it probably looks basically harmless. If you read them in succession you'll probably wind up feeling that they're all basically the same, although Solaris 10 has grown some new x86-related stuff.

This is an illusion and a terrible mistake, because at the very bottom of the Solaris 9, Solaris 10, and Illumos versions you will find the following new section (presented in its entirety):

NOTES

The reboot utility does not execute the scripts in /etc/rc<num>.d or execute shutdown actions in inittab(4). To ensure a complete shutdown of system services, use shutdown(1M) or init(1M) to reboot a Solaris system.

Let me translate this for you: since Solaris 9, reboot shoots your system in the head instead of doing an orderly shutdown. Despite the wording earlier in the manpage that 'the reboot utility performs a sync(1M) operation on the disks, and then a multi-user reboot is initiated. See init(1M) for details', SMF (or the System V init system in Solaris 9) is not involved in things at all (and thus no multi-user reboot happens). Reboot instead simply SIGTERMs all processes. That stuff I quoted from the DESCRIPTION section is now a flat out lie.
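For the record, the orderly alternatives that the NOTES section is obliquely pointing you at look like this:

    # run the SMF (or rc script) shutdown work and then reboot
    shutdown -y -g0 -i6
    # or
    init 6
    # plain 'reboot' skips all of that and just SIGTERMs everything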

This is a drastic change in reboot's behavior. It is at odds with reboot's behavior in Solaris 8 (as far as I know), the traditional System V init behavior, and reboot's behavior on other systems (including but not limited to Linux). Sun decided to bury this drastic behavior change in a casual little note at the bottom of the manpage, so far down that almost no one reads that far (partly because it is after all of the really boring boilerplate).

This is truly an epic example of how not to write documentation. Vital changes go at the start of your manpages, not the very end, and they and their effects should be very clearly described instead of hidden behind what is basically obfuscation.

(The right way to do it would have been a complete rewrite of the DESCRIPTION section and perhaps an update to the SYNOPSIS as well.)

By the way, this phrasing for the NOTES section is especially dangerous in Solaris 10 and onwards where SMF services normally handle shutdown actions, not /etc/rc<num>.d scripts (or inittab actions). In anything using SMF it's possible to read this section but still not realize what reboot really does because it doesn't explicitly say that SMF is bypassed too.

Update: As pointed out in comments by Ade, this appears to be historical Solaris behavior (contrary to what I thought). However, it is not the (documented) behavior of other System V R4 systems such as SGI Irix and it is very likely to surprise people coming from other Unixes.

RebootDangerousManpage written at 22:19:47; Add Comment

2014-03-05

ZFS's problem with boot time magic

One of the problems with ZFS (on Solaris et al) is that in practice it involves quite a bit of magic. This magic is great when it works but is terrible when something goes wrong, because it leaves you with very little to work with to diagnose and fix your problems. Most of this magic revolves around the most problematic times in the life of ZFS, that being system shutdown and startup.

I've written before about boot time ZFS pool activation, so let's talk about how it would work in a non-magical environment. There are essentially two boot time jobs, activating pools and then possibly mounting filesystems from the pools. Clearly these should be driven by distinct commands, one command to activate all non-active pools listed in /etc/zfs/zpool.cache (if possible) and then maybe one command to mount all unmounted ZFS filesystems. You don't really need the second command if pool activation also mounts filesystems the same way ZFS import does, but maybe you don't want all of that happening during (early) boot and would rather defer both mounting and sharing until later.
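To be explicit about what I mean, the boot sequence in that world might look something like this (the first command is hypothetical; nothing like it exists today):

    # early in boot: activate every pool recorded in the cache file and no more
    zpool activate -c /etc/zfs/zpool.cache -a
    # later in boot, when you're ready for it, do the actual mounting and sharing
    zfs mount -a
    zfs share -a

In this hypothetical world 'zfs mount -a' really would be the thing that mounts pool filesystems, instead of the rather different role it has today.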

ZFS on Solaris doesn't work this way. There is no pool activation command; pools just magically activate. And as I've found out, pools also apparently magically mount all of their filesystems during activation. While there is a 'zfs mount -a' command that is run during early boot (via /lib/svc/method/fs-local), it doesn't actually do what most people innocently think it does.

(What it seems to do in practice is mount additional ZFS filesystems from the root pool, if there is a root pool. Possibly it also mounts other ZFS filesystems that depend on additional root pool ZFS filesystems.)

I don't know where the magic for all of this lives. Perhaps it lives in the kernel. Perhaps it lives in some user level component that's run asynchronously on boot (much like how Linux's udev handles devices appearing). What I do know is that there is magic and this magic is currently causing me a major amount of heartburn.

Magic is a bad idea. Magic makes systems less manageable (and kernel magic is especially bad because it's completely inaccessible). Unix systems have historically got a significant amount of their power by more or less eschewing magic in favour of things like exposing the mechanics of the boot process. I find it sad to have ZFS be a regression on all of this.

(There are also regressions in the user level commands. For example, as far as I can see there is no good way to import a pool without also mounting and sharing its filesystems. These are actually three separate operations at the system level, but the code for 'zpool import' bundles them all together and provides no options to control this.)

ZFSBootMagicProblem written at 00:22:15; Add Comment

