2013-11-10
Are those chassis fans actually still spinning?
This is kind of a sysadmin horror story. We have a number of external port-multiplier eSATA enclosures, used both for our iSCSI backends and for our disk-based backup system. These enclosures are now far from new, so we've had some mysterious failures with some of them. As a result we recently opened up a couple of them to poke around, in the hope of reviving at least one chassis to full health.
These chassis have a number of internal fans for ventilation. What my co-workers found when they opened up these chassis was that some of these fans had seized up completely. At some point in the four or so years these enclosures have been operating, most of their fans had quietly died. We hadn't gotten any warnings about this because these enclosures are very basic and don't have any sort of overall health monitoring (if the fans themselves had been making noises before they died, we never noticed it over the general din of the machine room).
This is what you call an unfortunate unanticipated failure mode. In theory I suppose we should have anticipated it; we knew from the start that there was no chassis health monitoring, and fans do die eventually. In practice fan failures have been very uncommon (at least on hardware where they actually get monitored, either directly or through 'fan failures are so bad the machine explodes'), so we hadn't really thought about this before now.
Now, of course, we have a problem. We have a number of these chassis in live production service and we can't directly check the fans on a chassis without opening it up, which means taking them out of service. We may be able to indirectly observe the state of the fans by looking at hard drive temperatures, but there are a number of potential confounding effects there.
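(As a sketch of what that indirect monitoring might look like: something along these lines, assuming smartmontools is installed, the script runs as root, and the drives report the usual Temperature_Celsius SMART attribute. The device names and the alert threshold are just placeholders, not our real setup.)

  #!/usr/bin/python
  # Watch drive temperatures as an indirect hint that chassis fans
  # may have died. smartctl's exit status is a bitmask, so we ignore
  # it and just parse the attribute table.
  import subprocess

  DEVICES = ["/dev/sda", "/dev/sdb"]   # placeholder device names
  ALERT_AT = 45                        # degrees C; pick your own limit

  def drive_temp(dev):
      p = subprocess.Popen(["smartctl", "-A", dev], stdout=subprocess.PIPE)
      out, _ = p.communicate()
      for line in out.decode("ascii", "replace").splitlines():
          fields = line.split()
          if len(fields) >= 10 and fields[1] == "Temperature_Celsius":
              return int(fields[9])    # the RAW_VALUE column
      return None

  for dev in DEVICES:
      temp = drive_temp(dev)
      if temp is None:
          print("%s: no temperature attribute found" % dev)
      elif temp >= ALERT_AT:
          print("%s: %d C, possibly a cooling problem" % (dev, temp))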
The larger scale effect of this is that I'm now nervously trying to think about any other fans that we're not directly monitoring and that we're just assuming are fine because the machine they're in hasn't died.
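(For machines that do expose fan sensors through IPMI, even a crude periodic check beats assuming. Here's a minimal sketch with ipmitool, assuming servers with a BMC that reports fan sensors; the exact output format varies by vendor, so this is illustration rather than something we actually run.)

  #!/usr/bin/python
  # Ask the BMC about fan sensors so that 'is this fan still spinning?'
  # actually gets checked somewhere. Flags any fan whose status isn't
  # 'ok' or whose reading is 0 RPM.
  import subprocess

  out = subprocess.check_output(["ipmitool", "sdr", "type", "Fan"])
  for line in out.decode("ascii", "replace").splitlines():
      fields = [f.strip() for f in line.split("|")]
      if len(fields) >= 5:
          name, status, reading = fields[0], fields[2], fields[4]
          if status != "ok" or reading.startswith("0 "):
              print("%s: status %s, reading %s" % (name, status, reading))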
(Of course there's nothing new under the sun here; this is a variant of the well known 'do you actually get told if a disk dies in your redundant RAID array and your array stops being redundant any more' issue.)
My views on network booting as an alternative to system disks
In a comment on my entry on the potential downsides of SSDs as system disks, zwd asked if I'd considered skipping the need for system disks by just PXE booting the systems instead (as some Illumos distributions are now recommending). The short answer is no, but I have enough thoughts about this to warrant a long answer.
My view is that network booting systems is at its best where you have a large and mostly homogeneous set of servers that basically run a constant set of things with little local state or local configuration. In this environment you don't want to bother taking the time to install to the local disks of whatever server you're setting up today, and it simplifies life if you can upgrade machines just by rebooting them. With little local state, the difficulty of having state in a diskless environment doesn't cause too much heartburn in practice, and running a constant set of programs generally reduces the load on your 'system filesystem' fileserver and may make it practical to have an all-in-RAM system image.
With that said, in general a diskless environment is almost intrinsically more complicated than local disks in theory and definitely more complicated in practice today. While you have a spectrum of options none of them are as simple and as resilient as local disks; they all require some degree of external support and create complications around things like software upgrades. Some of the options require significant infrastructure. All of them create additional dependencies before your servers will boot. In a large environment the simplifications elsewhere make up for this.
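(To make the 'additional dependencies' concrete: even a minimal PXE plus NFS-root setup needs a DHCP/TFTP service and an NFS server answering before a machine can so much as start booting. A sketch of the two config fragments involved, using dnsmasq and pxelinux with entirely made-up addresses and paths:)

  # dnsmasq.conf fragment: DHCP plus TFTP for PXE clients
  dhcp-range=192.168.1.100,192.168.1.200,12h
  dhcp-boot=pxelinux.0
  enable-tftp
  tftp-root=/srv/tftp

  # /srv/tftp/pxelinux.cfg/default: boot a kernel with an NFS root
  DEFAULT linux
  LABEL linux
    KERNEL vmlinuz
    APPEND initrd=initrd.img root=/dev/nfs nfsroot=192.168.1.10:/srv/nfsroot ip=dhcp ro

If the dnsmasq machine or the NFS server is down, the diskless servers behind them can't come back from a reboot.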
We aren't a large environment. In fact we're a very bad case for netbooting. Our modest number of systems are significantly heterogeneous, they have potentially significant local state, a given system often runs a wide variety of software (very wide, for systems users log in to), and we don't want to reboot them at all in normal conditions. Some servers are already dependent on central NFS fileservers, but other servers we very much want to keep working even if the fileserver environment has problems, and of course the components of the fileserver environment are a crucial central point that we want to work almost no matter what, with as few external dependencies as possible (ideally none beyond 'there is a network'). Single points of failure that can potentially take down much of our infrastructure give us heartburn. On top of this, diskless booting is not something that I believe is well supported by the majority of the OSes and Linux distributions that we use; we'd almost certainly be going off the beaten and fully supported path in terms of installation and system management (and might have to build some tools of our own).
In short: we'd save very little (or basically nothing) by using network-booted diskless servers and we'd get a whole bunch of problems to go with it. We'd need additional boot servers and relatively heavy-duty fileservers to serve up the system filesystems and store 'local' state, and we'd have non-standard system management that would be more difficult than what we have today. Even if I felt enthused about this (which I don't), it would be a very hard sell to my co-workers; they would rationally ask 'what are we getting for all of this extra complexity and overhead?' and I would have no good answer.
(We don't install or reinstall systems anywhere near often enough that 'faster and easier installs' would be a good answer.)