2013-03-26
Reconsidering a ZFS root filesystem
A Twitter conversation from today:
@thatcks: Let's see if fsck can heal this Solaris machine or if I get to reinstall it from scratch. (Thanks to ILOMs I can do this from my desk.)
@bdha: fsck? Solaris? Sadface.
@thatcks: I have horror stories of corrupted zpool.cache files too. I don't know if you can boot a ZFS-root machine in that situation.
@bdha: I've been there. zpool.cache backups saved my ass.
Right now all of our Solaris fileservers have (mirrored) UFS root filesystems instead of ZFS root filesystems, and in the past I've expressed some desire to see that continue in any future ZFS fileservers we build. I've written about why before; the short version is that I've seen situations where /etc/zfs/zpool.cache had to be deleted and recreated, and I'm not sure this is even possible if your root filesystem is a ZFS filesystem. Using UFS for root filesystems avoids this chicken-and-egg problem.
(Of course the whole situation around zpool.cache and ZFS pool activation is a little bit mysterious, at least in Solaris.)
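The zpool.cache backups mentioned in the Twitter conversation could be as simple as a periodic copy of the file that only fires when its contents change. What follows is a minimal sketch of that idea in Python, not something we actually run: the function names, the numbered naming scheme, and the retention count are all assumptions made for illustration, and it says nothing about whether a restored copy is enough to get a ZFS-root machine booting again.

```python
import hashlib
import shutil
from pathlib import Path

PREFIX = "zpool.cache."  # hypothetical naming scheme: zpool.cache.000001, ...


def file_digest(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def list_backups(backup_dir):
    """Return existing backups as (sequence, path) pairs, oldest first."""
    found = []
    for p in Path(backup_dir).glob(PREFIX + "*"):
        suffix = p.name[len(PREFIX):]
        if suffix.isdigit():
            found.append((int(suffix), p))
    return sorted(found)


def backup_if_changed(cache_path, backup_dir, keep=10):
    """Copy cache_path into backup_dir unless the newest backup is identical.

    Returns the path of the new backup, or None if nothing was copied.
    Only the newest `keep` backups are retained.
    """
    if keep < 1:
        raise ValueError("keep must be at least 1")
    cache = Path(cache_path)
    dest_dir = Path(backup_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)

    backups = list_backups(dest_dir)
    if backups and file_digest(backups[-1][1]) == file_digest(cache):
        return None  # unchanged since the last backup

    next_seq = backups[-1][0] + 1 if backups else 1
    target = dest_dir / f"{PREFIX}{next_seq:06d}"
    shutil.copy2(cache, target)  # copy2 also preserves the modification time
    backups.append((next_seq, target))

    # Prune everything older than the newest `keep` copies.
    for _, old in backups[:-keep]:
        old.unlink()
    return target
```

Run from cron, this would be called as something like `backup_if_changed("/etc/zfs/zpool.cache", "/var/backups/zpool-cache")`, where the backup directory is a made-up location. A blindly overwritten single copy would faithfully preserve a corrupted file, which is why the sketch keeps several numbered generations instead.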
Well, actually, that's not the only reason. The other reason is that I still think of ZFS as fragile, as something that will go from 'fine' to 'panics your system' under remarkably little provocation. UFS is much more old-fashioned and will soldier on even under relatively extreme circumstances (whether that's a wise idea is another question). Under most circumstances I would rather have our fileservers limping along than dead (even if the entire root filesystem becomes inaccessible, as an extreme example).
But all of this is basically supposition (and thus superstition). UFS certainly has its own problems (one of which I ran into today on our test server), and I've never actually tried out a modern Illumos-based system with ZFS root, either in normal operation or when I deliberately start breaking stuff (and I certainly hope that some of the problems I heard about years ago have been dealt with). It may well turn out that ZFS root based systems are easier to deal with and recover than I expect. They certainly have their own benefits (periodic scrubs are reassuring, for example).
(And to be honest, I think it's quite possible that Illumos will only really support a ZFS root well by the time we get to it. It's clear that ZFS root is where all of the enthusiasm is and where most people think we should be going.)
PS: root filesystem snapshots and snapshot rollback are not particularly an advantage in our particular environment, since we basically don't patch our fileservers. Of course periodic snapshots might save us in the face of a corrupt zpool.cache in the live filesystem.