Are those chassis fans actually still spinning?

November 10, 2013

This is kind of a sysadmin horror story. We have a number of external port-multiplier eSATA enclosures, used both for our iSCSI backends and for our disk-based backup system. These enclosures are now far from new, so we've had some mysterious failures with some of them. As a result of this we recently opened up a couple of them to poke around, with the hopes of reviving at least one chassis to full health.

These chassis have a number of internal fans for ventilation. What my co-workers found when they opened up these chassis was that some of these fans had seized up completely. At some point in the four or so years these enclosures have been operating, most of their fans had quietly died. We hadn't gotten any warnings about this because these enclosures are very basic and don't have any sort of overall health monitoring (if the fans themselves had been making noises before they died, we never noticed it over the general din of the machine room).

This is what you call an unfortunate unanticipated failure mode. In theory I suppose that we should have anticipated it; we knew from the start that there was no chassis health monitoring, and fans do die eventually. In practice fan failures have been very uncommon, at least on hardware where they actually get monitored (either directly or through 'fan failures are so bad the machine explodes'), so we hadn't really thought about this before now.

Now, of course, we have a problem. We have a number of these chassis in live production service, and we can't directly check the fans on a chassis without opening it up, which means taking it out of service. We may be able to indirectly observe the state of the fans by looking at hard drive temperatures, but there are a number of potential confounding effects there.
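One way to do this indirect observation is to poll each drive's SMART temperature attribute and alert when it climbs past a baseline. As a minimal sketch (assuming smartmontools is installed; the threshold and device name are illustrative, not real configuration), something like this could parse `smartctl -A` output for the Temperature_Celsius attribute:

```python
# Hypothetical sketch: watch drive temperatures as a proxy for fan
# health. Assumes smartmontools is installed; the threshold below is
# an example value, not a recommendation.
import subprocess

TEMP_THRESHOLD_C = 50  # example alert threshold; tune for your hardware


def parse_smart_temperature(smartctl_output):
    """Extract the Temperature_Celsius raw value from 'smartctl -A' output.

    Returns the temperature as an int, or None if the attribute is
    not reported.
    """
    for line in smartctl_output.splitlines():
        parts = line.split()
        # SMART attribute 194 is Temperature_Celsius; RAW_VALUE is the
        # 10th column and may be followed by extras like '(Min/Max 21/47)'.
        if len(parts) >= 10 and parts[1] == "Temperature_Celsius":
            return int(parts[9])
    return None


def check_drive(device):
    """Run smartctl against one device; return (device, temperature)."""
    out = subprocess.run(
        ["smartctl", "-A", device],
        capture_output=True, text=True,
    ).stdout
    return device, parse_smart_temperature(out)
```

Run periodically against each drive behind the enclosure (for example, `check_drive("/dev/sdb")`), a sustained rise across all drives in one chassis would suggest its fans have died, though ambient machine-room temperature and workload changes are obvious confounders.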

The larger scale effect of this is that I'm now nervously trying to think about any other fans that we're not directly monitoring and that we're just assuming are fine because the machine they're in hasn't died.

(Of course there's nothing new under the sun here; this is a variant of the well known 'do you actually get told when a disk dies in your redundant RAID array and the array stops being redundant' issue.)


Comments on this page:

By Baruch Even at 2013-11-11 01:11:36:

What failed? The enclosure itself or the disk inside it?

As an alternative to the fan monitoring which you are missing you could monitor the disk temperatures. It's not a direct replacement but at least it can give out some sort of warning.

Baruch

From 46.144.78.131 at 2013-11-11 03:43:13:

As usual, cheap has potential hidden costs and you just found one in your enclosures. I really feel for you; not a nice situation to be in.

I am all for lower costs but in cases like this I inform my manager(s) (in writing) I cannot vouch for the state of the hardware so when this stuff happens the blaming game does not pick me as their favourite target ;-) (I know this is not the goal of your post, but it usually is a consequence of situations like the one you describe).

I hope you can find a working solution without too much effort.

By cks at 2013-11-11 17:33:43:

While we've had disk failures in enclosures, what failed this time was the enclosure. In one case it was the power supply apparently getting flaky; in another case an eSATA port multiplier channel seems to be flaky. Opening up both cases to transplant the power supply from the latter to the former is what resulted in our discovery.

