Hard drives really do wear out, so you need a hardware budget

January 5, 2014

Some people are already laughing at the title of this entry but really, hear me out. Both I individually and where I work collectively have generally had really good luck with hardware. For the most part things just haven't broken and as a result we run both servers and hard drives well beyond what people consider their normal sane lifetimes. Generally everything has kept going and going and going, which is very convenient for a place that has traditionally had essentially no hardware budget.

Our current fileservers, their iSCSI backends, and most importantly basically all of their data disks date from mid-2008 and mid-2009. These disks are consumer 7200 RPM SATA drives, although ones we bought with five year warranties. You'll notice that the mid-2008 drives are definitely out of warranty and the mid-2009 drives are approaching the edge of it. Vaguely based on general experiences we've been sort of expecting things to tick on even (well) past that five year date; sure, we ought to replace the drives on general principles but that was a vague thing.

It hasn't worked out that way. Our drives have been dropping like flies lately. A significant part of this is with the mid-2009 drives, where we had the misfortune to buy the now infamously unreliable Seagate 1TB ST31000340AS model, but even our older mid-2008 drives are starting to die at an alarming rate. And our 1TB loss rate is not really sustainable either.

(By 'dropping like flies' I mean 'we lost two drives over the past two-week Christmas break and this has stopped being at all exceptional'. We currently have 72 drives in fileserver production.)
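To put a number on 'dropping like flies', here is a back-of-the-envelope annualized failure rate using the figures from the entry (two drives lost in two weeks, 72 drives in production); the linear extrapolation is mine and is only illustrative, since short windows are noisy:

```python
# Back-of-the-envelope annualized failure rate (AFR) estimate.
# Figures from the entry: 2 drives lost over a 2-week break,
# 72 drives currently in fileserver production.
drives_lost = 2
period_weeks = 2
pool_size = 72

# Naive linear extrapolation of the observed loss rate to a full year.
failures_per_year = drives_lost * (52 / period_weeks)
afr = failures_per_year / pool_size
print(f"Extrapolated AFR: {afr:.0%}")  # roughly 72% per year if the rate held
```

Even granting that a two-week sample is a small one, anything in this neighborhood is far beyond the low single-digit percent AFRs people normally expect from drives, which is what makes the rate unsustainable.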

The good and lucky news is that this summer we got a budget for replacement hardware more or less out of the blue and as a result we're well along with work to replace all of the hardware involved, most especially the disks. But that's almost pure luck because until very recently we haven't been feeling any particular sense of urgency about the replacement project. Certainly it wasn't a 'things on fire' priority project; we assumed that we had plenty of time and the whole environment was in good shape and so on.

Obviously we were wrong, very nearly disastrously so. The lesson this teaches me is that we can't get this close to the edge the next time around. We need to maintain an ongoing hardware budget (hopefully achieved) and we really do have to renew our major hardware in advance of its nominal probable end-of-life date, never mind how much luck we've had in the past with eg SATA drives in individual servers. This requires advance planning and preparation and politics and most of all it calls for making a schedule well in advance.

(It needs politics because you have to argue, in what is always a time of tight budgets, for a merely precautionary expenditure of cash. We'll be buying new hardware not because we clearly have to but because it's probably a good idea.)

So let me put it in writing: we should be turning over our fileserver disks by the end of 2018, four years after we turn them over now in 2014. That isn't running them to the ragged edge of their warranty period (we're getting drives with five year warranties), and it's 'wasting' some amount of money by replacing what will probably be perfectly operable disks before they die, but I never again want to be in a situation where I'm racing against disk failures.

(I'm currently pondering applications of this idea to my home machine, where I am partly running on a SATA drive that is over five years old. It's in a mirror pair, but still.)

Comments on this page:

By Frank Ch. Eigler at 2014-01-05 08:25:27:

Chris, is pre-emptive replacement of the drives indicated by anything other than calendar dates, such as SMART data? Why is maintaining a deep pool of spares not satisfactory?

By dozzie at 2014-01-05 08:36:48:

We need to maintain an ongoing hardware budget (hopefully achieved) and we really do have to renew our major hardware in advance of its nominal probable end-of-life date, [...]

Which, once achieved, will be an excellent source of hardware for small side projects that are not critical and whose data could be lost without causing much trouble -- but this will require some discipline.

A thing to think about in the future, after getting the ongoing budget for refreshing hardware.

By cks at 2014-01-05 15:14:25:

There are three problems with a spares pool. First, a deep spares pool can soon become a shallow spares pool and then no spares pool, which is what's happening with us (we simply can't buy the same or equivalent drives any more because of the industry-wide shift to 4K sector drives). Second, the more of your drives are failing the more you risk a double failure on a mirror pair or a double or triple failure on a RAID-[56] array. Third, replacing failed drives all of the time is disruptive in various ways including through the additional IO load of mirror or RAID resynchronization.
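The second problem above can be made concrete with a rough sketch of the double-failure risk: the chance that a mirror's surviving drive dies during resynchronization, under an exponential failure model. Both the AFR values and the 12-hour resync window below are made-up illustrative parameters, not figures from the entry:

```python
import math

def p_failure_during_window(afr, window_hours):
    """Probability that one drive fails within window_hours, assuming an
    exponential failure model with annualized failure rate `afr`."""
    # Convert the annual failure probability into a per-hour hazard rate.
    rate_per_hour = -math.log(1 - afr) / (365 * 24)
    return 1 - math.exp(-rate_per_hour * window_hours)

# Illustrative numbers only: a healthy ~3% AFR versus an aging pool at
# 30% AFR, with a 12-hour mirror resynchronization window.
for afr in (0.03, 0.30):
    p = p_failure_during_window(afr, 12)
    print(f"AFR {afr:.0%}: P(mirror partner dies during 12h resync) = {p:.4%}")
```

The absolute probabilities stay small for a single resync, but the aging-pool case is roughly ten times riskier per event, and a pool that is constantly failing and resyncing multiplies that exposure across many such windows.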

Even without SMART data, an increased drive failure rate in aging drives in a pool of them is in my opinion a big flashing warning sign. You're probably ramping up the 'increased failure rate close to X age' part of the failure curves.


