A retrospective on our Solaris ZFS-based NFS fileservers (part 2)
In yesterday's entry I talked about the parts of our Solaris ZFS fileserver environment that worked nicely over the six years we've run them. Today is for the other side, the things about Solaris that didn't go so well. You may have noticed that yesterday I was careful to talk specifically about the basics of ZFS working well. That is because pretty much all of the extra frills we tried failed or outright blew up in our faces.
The largest single thing that didn't work out anywhere near as well as we
planned and wanted is failover. There are contributing factors
beyond ZFS (see this for a
full overview) but what basically killed even careful manual failover
is the problem of very slow
The saving grace of the situation is that we've only really needed
failover a relatively small number of times because the fileservers
have been generally quite reliable. The downside of losing failover
is that the other name for failover is 'easy and rapid migration
of NFS service' and there have been any number of situations where
we could have used that. For example, we recently rebooted all of
the fileservers because they'd been up over 650 days and we had
some signs they might have latent problems. With fast, good 'failover'
we could have done this effectively live without much user-visible
impact (shift all NFS fileservice away from a particular machine,
reboot it, shift its NFS fileservice back, repeat). Without that
failover? A formal downtime.
The largest advertised ZFS feature that just didn't work was ZFS's
support for spare devices. We wound up feeling
that this was completely useless and built our own spares system (part 2, part 3). We also had problems with, for example,
zpool status hanging in problem situations
or just not being honest with us about the truth of the situation.
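One generic defence against a status command that hangs is to run it under a hard timeout, so that whatever is polling it does not hang as well. What follows is a minimal Python sketch of that idea, not the tooling described in this entry. The name run_with_timeout and the 30 second default are assumptions made for the example.

```python
import subprocess


def run_with_timeout(argv, timeout_s=30.0):
    """Run argv and return (status, output).

    status is 'ok' (exit code 0), 'failed' (non-zero exit or the
    command could not be started) or 'timeout' (no exit in time).
    """
    try:
        proc = subprocess.Popen(
            argv,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
        )
    except OSError as exc:
        # For example, the command does not exist on this machine.
        return "failed", str(exc)

    try:
        output, _ = proc.communicate(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Best-effort kill. We deliberately do not wait for the child
        # afterwards: a process stuck inside the kernel may never exit,
        # and waiting on it would just move the hang into this script.
        proc.kill()
        return "timeout", ""

    return ("ok" if proc.returncode == 0 else "failed"), output
```

A caller would use something like run_with_timeout(["zpool", "status", "-x"]) and treat a 'timeout' result as a problem report in its own right. This only keeps the monitoring side responsive; it does nothing about output that is simply wrong.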
It turned out to be a significant issue in practice that ZFS has no API, i.e. no way for outside systems to reliably extract state information from it (a situation that continues to this day). Because we needed this information, we were forced to develop ad-hoc and non-portable tools to extract it by force from Solaris, and this in turn caused further problems. One significant reason we never upgraded past Solaris 10 update 8, despite the existence of fixes we were interested in, was that upgrading would have required updating and re-validating all of these tools.
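To make the 'no API' problem concrete: without a programmatic interface, the most obvious fallback is scraping the human-readable output of zpool status. The Python sketch below shows what that looks like. It is an illustration only, not the tools mentioned above (which pulled the information out of Solaris by other means). The function names and the sample text in the usage note are assumptions, and the exact output layout differs between releases, which is precisely why this approach is fragile.

```python
def parse_config_rows(text):
    """Pull (name, state) pairs out of the 'config:' table of
    zpool-status-style text.

    Assumes the table starts at a 'NAME STATE ...' header line and
    ends at the next blank line. Rows with fewer than two columns
    (such as a bare section label) are skipped.
    """
    rows = []
    in_table = False
    for line in text.splitlines():
        fields = line.split()
        if fields[:2] == ["NAME", "STATE"]:
            in_table = True
            continue
        if not in_table:
            continue
        if not fields:
            # A blank line ends this pool's table; a later header
            # line (another pool) will start a new one.
            in_table = False
            continue
        if len(fields) >= 2:
            rows.append((fields[0], fields[1]))
    return rows


def unhealthy(rows):
    """Return the rows whose state is anything other than ONLINE."""
    return [(name, state) for name, state in rows if state != "ONLINE"]
```

Fed a hypothetical table containing a pool 'tank' in state DEGRADED with a mirror of c1t0d0 (ONLINE) and c1t1d0 (FAULTED), unhealthy(parse_config_rows(text)) reports the pool, the mirror and c1t1d0. Everything here depends on column order and spacing staying the same, so a tool like this has to be re-checked on every update.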
(These tools are also a large part of why we wouldn't take Solaris 11 even if Oracle offered it to us for free. We need these tools and these tools require source code access so we can reverse engineer this information.)
Overall our Solaris experience has left me feeling that we were quite far from the (ZFS) usage cases that the Solaris developers expected. A lot of things didn't seem prepared to cope with, for example, how many 'disks' we have. Nothing actually broke significantly (at least once we stopped applying Solaris patches) but the entire environment felt fragile, like a too-tall building swaying as the wind builds up. We also became increasingly dubious about the quality of implementation of the changes that Sun (and then Oracle) was making to Solaris, adding another reason to stop applying patches and to never upgrade past Solaris 10U8.
(Allow me to translate that: Solaris OS developers routinely wrote and released patches and changes with terrible code that broke things for us and didn't work as officially documented. The Sun and Oracle reaction to this was a giant silent shrug.)
While we got away with our 'no patches, no updates, no changes' policy I'm aware that we were lucky; we simply never hit any of the known S10U8 bugs. I didn't (and don't) like running systems that I feel I can't update because things are sure to break and we definitely wound up doing that with our Solaris machines. I count that as something that did not go well.
In general, over time I've become increasingly uncomfortable about our default 'no updates on black box appliance style machines' policy, which we've followed on both the Solaris fileservers and the iSCSI backends. I kind of count it as an implicit failure in our current fileserver environment. For the next generation of fileservers and backends I'd really like to figure out a way to apply as many updates as possible in a safe way (I have some ideas but I'll save them for another entry).
None of these things that didn't work so well have been fatal or even painful in day to day usage. Some of them, such as the ZFS spares situation, have forced us to do things that improved the overall environment; having our own spares system has turned out to be a big win because it can be more intelligent and more aggressive than any general ZFS solution could be.