Hard drives really do wear out, so you need a hardware budget

January 5, 2014

Some people are already laughing at the title of this entry but really, hear me out. Both I individually and where I work collectively have generally had really good luck with hardware. For the most part things just haven't broken and as a result we run both servers and hard drives well beyond what people consider their normal sane lifetimes. Generally everything has kept going and going and going, which is very convenient for a place that has traditionally had essentially no hardware budget.

Our current fileservers, their iSCSI backends, and most importantly basically all of their data disks date from mid-2008 and mid-2009. These disks are consumer 7200 RPM SATA drives, although ones we bought with five year warranties. You'll notice that the mid-2008 drives are definitely out of warranty and the mid-2009 drives are approaching the edge of it. Based vaguely on our general experience, we'd been sort of expecting things to tick on even (well) past that five year date; sure, we ought to replace the drives on general principles but that was a vague thing.

It hasn't worked out that way. Our drives have been dropping like flies lately. A significant part of this is with the mid-2009 drives, where we had the misfortune to buy the now infamously unreliable Seagate 1TB ST31000340AS model, but even our older mid-2008 drives are starting to die at an alarming rate. And our 1TB loss rate is not really sustainable either.

(By 'dropping like flies' I mean 'we lost two drives over the past two-week Christmas break and this has stopped being at all exceptional'. We currently have 72 drives in fileserver production.)
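
(To put a rough number on that, here is a back-of-the-envelope sketch in Python. The function name is made up for illustration, and a two-week window is far too small a sample to treat the result as a real failure rate; it only shows what 'two drives in two weeks out of 72' would mean if the pace held for a year.)

```python
def annualized_failure_rate(failures, drives, window_days):
    """Failures per drive-year, naively extrapolated from a short window."""
    # Total observation time across the whole fleet, in drive-years.
    drive_years = drives * window_days / 365.0
    return failures / drive_years

# Figures from this entry: 2 failures, 72 production drives, 14 days.
rate = annualized_failure_rate(failures=2, drives=72, window_days=14)
print(f"{rate:.0%} of the fleet per year at this pace")
```

That works out to roughly 72% of the fleet per year, which is obviously not a pace we could live with for long.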

The good and lucky news is that this summer we got a budget for replacement hardware more or less out of the blue and as a result we're well along with work to replace all of the hardware involved, most especially the disks. But that's almost pure luck because until very recently we haven't been feeling any particular sense of urgency about the replacement project. Certainly it wasn't a 'things on fire' priority project; we assumed that we had plenty of time and the whole environment was in good shape and so on.

Obviously we were wrong, very nearly disastrously so. The lesson this teaches me is that we can't get this close to the edge the next time around. We need to maintain an ongoing hardware budget (hopefully achieved) and we really do have to renew our major hardware in advance of its nominal probable end-of-life date, never mind how much luck we've had in the past with eg SATA drives in individual servers. This requires advance planning and preparation and politics and most of all it calls for making a schedule well in advance.

(It needs politics because you have to argue, in what is always a time of tight budgets, for a merely precautionary expenditure of cash. We'll be buying new hardware not because we clearly have to but because it's probably a good idea.)

So let me put it in writing: we should be turning over our fileserver disks by the end of 2018, four years after we turn them over now in 2014. That isn't running them to the ragged edge of their warranty period (we're getting drives with five year warranties); it's 'wasting' some amount of money by replacing what will probably be perfectly operable disks before they die, but I never again want to be in a situation where I'm racing against disk failures.
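
(The arithmetic behind that schedule is trivial, but writing it down makes the safety margin explicit. This is only a sketch; the constant names are invented for illustration and the years are the ones in this entry.)

```python
DEPLOY_YEAR = 2014        # when the new disks go into production
WARRANTY_YEARS = 5        # warranty on the drives we're buying
REPLACE_AFTER_YEARS = 4   # planned service life before turnover

replace_by = DEPLOY_YEAR + REPLACE_AFTER_YEARS
warranty_ends = DEPLOY_YEAR + WARRANTY_YEARS
# Years of warranty still left when the disks are scheduled to be retired.
margin_years = warranty_ends - replace_by

print(f"replace by end of {replace_by}; "
      f"warranty runs to {warranty_ends} ({margin_years} year of margin)")
```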

(I'm currently pondering applications of this idea to my home machine, where I am partly running on a SATA drive that is over five years old. It's in a mirror pair, but still.)

Written on 05 January 2014.

Last modified: Sun Jan 5 02:50:25 2014