2012-05-31
Thinking about why Solaris has failed here
First off, as I rambled yesterday, to say that Solaris has failed here is not to say that our use of Solaris has been a failure (or that our Solaris machines have been); our fileservers are stable and (generally) run without problems. But at the same time it's clear that Solaris has failed to catch on here. The only strong reason our fileservers run Solaris is ZFS, and we're not even running anything close to a current version of Solaris at that (cf); at this point our fileservers are less servers and more sealed black-box appliances. We've never had any interest in using Solaris for anything else and we're not very enthused about Solaris even for ZFS; to put it one way, we pretty much use ZFS despite Solaris, not because of it.
(There are good alternatives to Solaris for ZFS nowadays, but as it is our fileservers work and switching just to get away from Solaris seems uncompelling. Or even insane.)
Recently I've been thinking about the question of why Solaris failed here. At one level (as Solaris fans will tell you) Solaris is not unattractive; it has a number of nice features, including ZFS. But it clearly did not hit it off with us. Why? I think there are two major reasons.
The first is that Solaris is, frankly, somewhat erratic, flaky and buggy; it's not the kind of simple and solid system that, say, OpenBSD is. OpenBSD doesn't do very much but what it does just works and works reliably, which makes it a pleasure to use within its areas. Our Solaris machines are pretty stable but getting them there took a significant amount of work and local invention, and the reason they're stable is that we don't change anything on them and we don't ask them to do things that we know are problematic. Since Solaris has problems where we have looked, it probably also has problems where we haven't had to look yet.
The second is that Solaris doesn't have the package availability of
many Linux distributions or FreeBSD, where you have a wide selection
of relatively current open source software that is maintained by the
vendor. There are outside efforts to provide open source packages for
Solaris (and we use one of them), but outside efforts can disappear
(and are not official) and so are intrinsically less trustworthy and
confidence-inducing than something the vendor itself supports. Unless
Ubuntu absolutely explodes, a relatively current Apache is always only
going to be an apt-get away and someone else will always handle
security fixes for it. This is not something we could ever say about
Solaris.
(Part of the problem of third-party packaging is that Solaris itself ships with any number of very old versions of things as official packages. All of the resulting options are bad ones.)
A significant contributing factor is that Solaris is simply not that pleasant to administer, partly because it is almost totally without modern niceties and partly because what it has in the way of administrative tools is often terribly bad (try to tell me with a straight face that Solaris 10's patching system is at all tolerable). Solaris administration is full of things that might have been good ideas if they had been competently implemented, but, well, they weren't.
(I understand that some of this has changed in modern Solaris, but by this point that's too little, too late (for us).)
The combination of these attributes in a single OS means there's nothing here to make us really enthused about using Solaris. It lacks a speciality and a niche that it can really dominate in the way that OpenBSD dominates firewalls and related networking stuff, and it lacks the wide package availability (and administrative convenience) that would make it easy for us to deploy a Solaris machine to do <X> for some <X> like 'provide IMAP'.
(Our directly user accessible machines are essentially forced to run Linux (and Ubuntu at that), but we have plenty of servers that users don't log in to and that could theoretically run almost anything.)
As always, everyone's circumstances are different. For instance, if you build your entire software stack from source yourself in order to maintain full control over it you don't care in the least about Solaris's lack of prepackaged open source software; if it existed, you wouldn't use it anyways.
2012-05-22
Our pragmatic experiences with (ZFS) disk errors in our infrastructure
I wrote before about when we replace disks based on errors (and then more on ZFS read errors). Today I want to talk about our pragmatic experiences in our fileserver infrastructure. The first and most important thing to understand about our experiences is that in our environment disk errors are indirect things. Because we are using iSCSI backends, ZFS does not have access to the actual SATA disk status; instead, all it gets is whatever the iSCSI backends report.
(I find it plausible and indeed likely that ZFS could behave somewhat differently if it was dealing directly with SATA disks and had better error reports available to it.)
On the backends themselves we see two levels of read errors, what I will call soft read errors and hard read errors. Both soft and hard read errors seem to generally result in SATA channel resets (which affect all disks on the channel); the difference between the two is that at the end of a soft error the read appears to succeed, while at the end of a hard error we see the Linux kernel log an actual read error (and then iSCSI relays the read error to Solaris and ZFS). On the backends, soft disk errors only report the ATA device name for the disk involved, which can make finding it a little bit interesting; hard read errors report the full name. Handling soft read errors can sometimes take long enough that Solaris sees an IO timeout and retries the IO (and logs a message about it), but usually the only sign on the fileservers themselves is slow IO.
(It's possible that some reads from soft errors are actually returning corrupted data and this is the cause of some of our checksum errors. However, I don't think we've seen a strong correlation between reported checksum errors in ZFS and soft read errors on the backends.)
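To make the soft versus hard distinction concrete, here is a minimal sketch of the sort of backend log scanning involved. The kernel message patterns ('ataN.N: exception ...', 'end_request: I/O error ...') are assumptions about the general shape of Linux kernel logs from around this era, not something lifted from our actual scripts.

    #!/usr/bin/env python
    # Sketch: classify backend kernel log lines into ATA-level exceptions
    # (which is all a soft read error leaves behind) and actual block-layer
    # read errors (hard read errors). The message patterns are assumptions
    # about roughly this era's Linux kernel log formats.
    import re
    import sys

    # Soft errors only identify the disk by its ATA name (eg 'ata3.00'),
    # which is what makes tracking down the physical disk interesting.
    ATA_EXC = re.compile(r'(ata\d+(?:\.\d+)?): exception')
    # Hard read errors get a block-layer error that names the sd device.
    IO_ERR = re.compile(r'end_request: I/O error, dev (\w+), sector \d+')

    def classify(logfile):
        exceptions, hard = {}, {}
        for line in open(logfile):
            m = ATA_EXC.search(line)
            if m:
                exceptions[m.group(1)] = exceptions.get(m.group(1), 0) + 1
                continue
            m = IO_ERR.search(line)
            if m:
                hard[m.group(1)] = hard.get(m.group(1), 0) + 1
        # Note: the exception counts include exceptions that went on to
        # become hard read errors; separating the two out properly needs
        # the ata -> sd device mapping.
        return exceptions, hard

    if __name__ == '__main__':
        logfile = sys.argv[1] if len(sys.argv) > 1 else '/var/log/kern.log'
        exceptions, hard = classify(logfile)
        for dev in sorted(hard):
            print('hard read errors on %s: %d' % (dev, hard[dev]))
        for dev in sorted(exceptions):
            print('ATA exceptions on %s: %d' % (dev, exceptions[dev]))

The annoying part in practice is the last step that this sketch doesn't attempt: mapping the bare ATA name from a soft error back to an sd device and then to a physical disk slot.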
Our experience is that SMART error reports (on the backends) are all but
useless. We do not always see SMART errors for hard read errors (much
less soft ones) and we see SMART errors reported on disks that have no
observable problems. At this point SMART reports are mostly useful for
catastrophic things like 'the disk disappeared'; however, we've seen
spurious reports even for those (our current theory is that a smartd
check at the wrong time during a SATA channel reset can fail to see
the disk).
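As an illustration of the kind of defensive check this pushes you towards, here is a hedged sketch that asks smartctl for a disk's health but retries before believing that the disk has disappeared, on the theory that the check may simply have landed in the middle of a SATA channel reset. The output parsing is an assumption about smartctl's format; we actually rely on smartd and its own reporting.

    #!/usr/bin/env python
    # Sketch: query SMART health with smartctl, retrying once after a delay
    # before concluding that a disk has disappeared, since a check that
    # lands in the middle of a SATA channel reset can transiently fail to
    # see the disk. Parsing smartctl's output this way is an assumption.
    import subprocess
    import time

    def smart_health(dev, retries=1, delay=30):
        for attempt in range(retries + 1):
            try:
                out = subprocess.check_output(['smartctl', '-H', dev],
                                              stderr=subprocess.STDOUT)
            except subprocess.CalledProcessError as e:
                # smartctl exits non-zero for various problems; its output
                # is still worth looking at.
                out = e.output
            text = out.decode('ascii', 'replace') if isinstance(out, bytes) else out
            if 'PASSED' in text:
                return 'ok'
            if 'FAILED' in text:
                return 'failing'
            # Couldn't see the disk (or couldn't read its health); wait and
            # retry in case we hit a channel reset in progress.
            if attempt < retries:
                time.sleep(delay)
        return 'missing or unreadable'

    if __name__ == '__main__':
        print(smart_health('/dev/sdb'))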
As far as we've been able to see, hard read errors do get reported to Solaris and ZFS and do result in ZFS read errors. However, I admit that we haven't generally done forward checks here (noticing hard read errors on the backends and then seeing that the Solaris fileservers reported hard read errors at the same time); instead, we have tended to work backwards from ZFS read errors on the fileservers to see that they are mostly hard read errors on the backends.
(Offhand, I'm not sure if we've seen ZFS read errors without hard read errors on the backends. It's a good question and we have some records, but I'm going to defer carefully checking them to a potential future entry.)
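When we work backwards this way, the starting point is the per-device error counters in 'zpool status'. A rough sketch of pulling those out, so that devices showing read errors can then be matched against backend logs, might look like the following; the columnar 'NAME STATE READ WRITE CKSUM' parsing is an assumption rather than our actual tooling, and large counts that zpool abbreviates (eg '1.2K') aren't handled.

    #!/usr/bin/env python
    # Sketch: pull per-device READ/WRITE/CKSUM error counters out of
    # 'zpool status' so that devices with read errors can be chased back to
    # hard read errors on the iSCSI backends. Assumes the usual columnar
    # 'NAME STATE READ WRITE CKSUM' layout and plain numeric counters.
    import re
    import subprocess

    DEVLINE = re.compile(r'^\s+(\S+)\s+'
                         r'(ONLINE|DEGRADED|FAULTED|OFFLINE|UNAVAIL)\s+'
                         r'(\d+)\s+(\d+)\s+(\d+)')

    def devices_with_errors():
        out = subprocess.check_output(['zpool', 'status'])
        if isinstance(out, bytes):
            out = out.decode('ascii', 'replace')
        errors = []
        for line in out.splitlines():
            m = DEVLINE.match(line)
            if not m:
                continue
            name, state = m.group(1), m.group(2)
            rd, wr, ck = (int(m.group(i)) for i in (3, 4, 5))
            # This also matches pool and vdev summary lines; their
            # aggregated counters are harmless for our purposes.
            if rd or wr or ck:
                errors.append((name, state, rd, wr, ck))
        return errors

    if __name__ == '__main__':
        for name, state, rd, wr, ck in devices_with_errors():
            print('%s (%s): read=%d write=%d cksum=%d' % (name, state, rd, wr, ck))

From there it's a matter of mapping each ZFS device name back to the relevant iSCSI backend and disk and then looking through that backend's logs for hard read errors around the same time.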
We haven't seen ZFS write errors unless the actual disks go away entirely (eg, if we pull a live disk ZFS will light up with write errors in short order). I don't think we've noticed any backend reports about write errors on running disks.
Our old version of Solaris is generally okay with both soft and hard
read errors; soft errors sometimes cause IO timeouts and hard read
errors wind up with actual ZFS-visible read errors (sometimes after
timeouts), but that has mostly been it. The one exception is a single
Solaris fileserver install that got itself into an odd state that we
don't understand. Although it was theoretically identical to all of
our other fileservers, this single fileserver had a very bad reaction
to read errors at the ZFS level; after a while NFS became very slow or
non-responsive and all ZFS operations would usually eventually start
locking up entirely (even things like 'zpool status' for a pool not
experiencing IO problems). Once we identified the cause of its lockups,
we started aggressively replacing its backend disks the moment they
reported hard read errors. This machine had other iSCSI anomalies (eg,
it established iSCSI connections at boot very slowly) and we eventually
replaced its Solaris install, which seems to have made the problem go
away.
(Our troubleshooting was complicated by the fact that this is our only fileserver that uses 1.5 TB disks instead of 750 GB disks on the backends, and almost all of our problem disks have been 1.5 TB disks. We weren't clear if it was just how ZFS reacted to this sort of slow hard read error over iSCSI, something different about the disks, some hardware problem on the fileserver itself, something different about the iSCSI backends it used, and so on.)