Are those chassis fans actually still spinning?

November 10, 2013

This is kind of a sysadmin horror story. We have a number of external port-multiplier eSATA enclosures, used both for our iSCSI backends and for our disk-based backup system. These enclosures are now far from new and we've had mysterious failures with some of them. As a result we recently opened up a couple of them to poke around, in the hope of reviving at least one chassis to full health.

These chassis have a number of internal fans for ventilation. What my co-workers found when they opened up these chassis was that some of these fans had seized up completely. At some point in the four or so years these enclosures have been operating, most of their fans had quietly died. We hadn't gotten any warnings about this because these enclosures are very basic and don't have any sort of overall health monitoring (if the fans themselves had been making noises before they died, we never noticed it over the general din of the machine room).

This is what you call an unfortunate unanticipated failure mode. In theory I suppose we should have anticipated it; we knew from the start that there was no chassis health monitoring, and fans do die eventually. In practice fan failures have been very uncommon on hardware where they actually get monitored (either directly or through 'fan failures are so bad the machine explodes'), so we hadn't really thought about this before now.

Now, of course, we have a problem. We have a number of these chassis in live production service and we can't directly check the fans on a chassis without opening it up, which means taking it out of service. We may be able to indirectly observe the state of the fans by looking at hard drive temperatures, but there are a number of potential confounding effects there.
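
(As a rough illustration of the indirect approach, here is a minimal Python sketch that compares drive temperatures across enclosures and flags any enclosure that runs noticeably hotter than the rest. Everything in it is an assumption made for illustration: the enclosure names, the sample readings, the `flag_hot_enclosures` function, and the 5 degree margin are all hypothetical, and it assumes you already have some way of collecting per-drive temperatures, for instance from SMART data. It does nothing about the confounding effects; a hot enclosure may simply be busier or sitting in a warmer part of the rack.)

```python
from statistics import median


def flag_hot_enclosures(temps_by_enclosure, margin_c=5.0):
    """Return the names of enclosures whose median drive temperature is
    more than margin_c degrees Celsius above the median of all enclosures.

    temps_by_enclosure maps an enclosure name to a list of drive
    temperatures in Celsius. Both the input layout and the default
    margin are illustrative assumptions, not measured thresholds.
    """
    # Use the median per enclosure so one odd drive doesn't dominate.
    medians = {
        name: median(temps)
        for name, temps in temps_by_enclosure.items()
        if temps
    }
    if not medians:
        return []
    # The baseline is the median of the per-enclosure medians.
    baseline = median(medians.values())
    return sorted(
        name for name, value in medians.items() if value - baseline > margin_c
    )


# Hypothetical readings; real ones would come from your own monitoring.
readings = {
    "enclosure-a": [34, 35, 33, 36],
    "enclosure-b": [35, 34, 36, 35],
    "enclosure-c": [44, 46, 45, 47],
}
hot = flag_hot_enclosures(readings)
print(hot)
```

(Comparing enclosures against each other instead of against a fixed temperature at least cancels out the machine room getting warmer as a whole, but it will miss the case where most of the enclosures have lost their fans.)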

The larger scale effect of this is that I'm now nervously trying to think about any other fans that we're not directly monitoring and that we're just assuming are fine because the machine they're in hasn't died.

(Of course there's nothing new under the sun here; this is a variant of the well-known 'do you actually get told if a disk dies in your redundant RAID array and the array stops being redundant' issue.)

Written on 10 November 2013.
Last modified: Sun Nov 10 23:22:28 2013