The cost of expensive hardware and the benefit of hindsight

November 13, 2013

One possible response to my entry about our discovery that we had dead chassis fans in our disk enclosures is to say that this is the cost of buying inexpensive enclosures; clearly we should have gotten a better grade of disk enclosures. This is literally true, but to just leave it at that is to miss the big picture, or actually several of them.

The first big picture is that we did not buy these disk enclosures blindly. We bought one, opened it up, looked it over, tested it out, and when we liked what we saw during all of this we bought more. At no time during this evaluation process did it occur to anyone to say 'wait a minute, what happens if the fans start dying? how will we find out?'. As a result we didn't make any sort of conscious choice to live without chassis monitoring; instead it never crossed our minds that we might need it. I rather expect that this is a common thing; among other things, people have all sorts of cognitive biases that make us think that of course things always work.

(If you want to say that you always consider and monitor for fan failure, I'm going to ask you if all of your network switches alarm or report on fan failures, even the little ones and your bulk basic top of rack aggregator switches. Perhaps you buy expensive enough switches that this is true.)

The second big picture is that our disk enclosures work. They have worked for years and they all continue to work even today, seized fans and all. It's hard to avoid the objective conclusion that we made a good choice and possibly even the right choice, despite this silent fan failure issue. Have we had hard drives die in these disk enclosures? Yes, of course, but it's not clear whether any seized fans have caused more drives than usual to fail.

(Google's famous disk study actually found less correlation between drive temperature and failure rates than you might expect.)

But the most important big picture is about the costs of buying expensive hardware instead of less expensive hardware. Budgets are almost never infinite, so the real cost of more expensive hardware is the opportunity cost. The money you spent on it is money that is not being spent elsewhere; it translates to buying less capacity, or not buying another piece of hardware for something else important, or fewer spares, or if your budget can be grown to cover all of that it means that someone else's budget gets cut to make up the shortfall and they get to do without something. More expensive hardware is never free; you are always giving up something to get it even if you can't see it. This is true even or especially if the more expensive hardware is objectively better than the cheaper hardware and objectively worth the extra cost. It's not just about whether you're paying a fair price for the extra benefits, it's about what you're giving up elsewhere to get them. And this is almost always situational; from the outside, you don't necessarily know what someone else's opportunity costs are.

(The most extreme version of this is when you have a fixed budget and a fixed objective, and if you can't hit the objective within the budget there's no point in doing anything. At that point the perfect but over budget option is very much the enemy of the just good enough but in budget one.)

So even if we got to make the decision all over again, this time thinking about and knowing about the potential for fan failure, I'm not sure we'd choose any differently. It would be nice to have monitoring for the fans but I'm not sure it would be worth whatever else it would have cost us at the time.

(This is of course annoying to technical people, especially if we can see a risk that we are not mitigating. We want nice, good hardware. But our feelings of elegance play second fiddle to the needs of the organization and we must never forget that.)


Comments on this page:

By Vlad at 2013-11-20 15:49:24:

I was using S.M.A.R.T. to monitor the internal temperature of the hard drives. It's relatively easy to implement some sort of alerts then. But this approach only works if you have access to the S.M.A.R.T. data on the disks.
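
(As a rough sketch of the sort of thing this comment describes, here is one way to pull a drive temperature out of `smartctl -A` output in Python and check it against a limit. This is an illustration under assumptions, not anything we actually run: it assumes smartmontools is installed, that the drive reports the common ATA attribute 194 (Temperature_Celsius) with the temperature as the first number of its raw value, and the 50 C limit and the function names are made up for the example.)

```python
import subprocess

# Assumed alert threshold in degrees Celsius; pick one that suits your drives.
TEMP_LIMIT_C = 50


def parse_temperature(smartctl_output):
    """Return the drive temperature in Celsius from `smartctl -A` text, or None.

    Assumes the usual ATA attribute table, where attribute 194
    (Temperature_Celsius) has the temperature as the first number of
    its raw value, which is the tenth whitespace-separated column.
    """
    for line in smartctl_output.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[0] == "194":
            try:
                return int(fields[9])
            except ValueError:
                return None
    return None


def read_temperature(device):
    """Run `smartctl -A` on a device such as /dev/sda (usually needs root)."""
    result = subprocess.run(["smartctl", "-A", device],
                            capture_output=True, text=True)
    return parse_temperature(result.stdout)


def too_hot(temp_c, limit=TEMP_LIMIT_C):
    """True only if we have a reading and it is above the limit."""
    return temp_c is not None and temp_c > limit
```

(Note that a missing reading comes back as None and does not count as too hot, so a real alerting setup would probably want to treat 'no temperature available' as its own condition. And as the comment says, none of this helps if the enclosure doesn't let you get at the S.M.A.R.T. data in the first place.)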

Written on 13 November 2013.

Last modified: Wed Nov 13 01:21:03 2013