Track your disk failures

November 25, 2013

Here is something that we've been learning the hard way: if you have any sort of fileserver environment with a significant number of disks (and maybe even if you don't), you should be tracking all of your disk failures. The point of this tracking is to identify failure patterns in your environment, such as whether certain sorts of disks fail more often, or disks in certain enclosures, and so on.

The most basic thing to record is the full details of every disk failure. What I'd record today is when it happened, what sort of disk it was, what enclosure and bay it failed in, and how it failed (read errors, write errors, total death, IO got really slow, or however it happened). You might also want to track SMART attributes and note if you got any sort of SMART notices beforehand (in the extreme, you'd track SMART notices too). You might also be able to record how old the disk was (based on warranty status and perhaps date of manufacture information). This doesn't need any sort of complicated database system; a text file is fine. But you should record the main information in a way that lets it be extracted with grep and awk.
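As a rough illustration of what a grep- and awk-friendly log might look like, here is a minimal Python sketch that writes and reads one failure per line. The pipe-separated layout, the field names and the `DiskFailure` class are all assumptions made up for this example, not a format we actually use; any consistent one-line-per-failure layout would do.

```python
from dataclasses import dataclass

# Hypothetical field order for one log line; this layout is an assumption.
FIELDS = ("date", "model", "enclosure", "bay", "mode", "notes")
SEP = "|"


@dataclass
class DiskFailure:
    date: str        # when it happened, e.g. "2013-11-20"
    model: str       # what sort of disk it was
    enclosure: str   # which enclosure it failed in
    bay: str         # which bay it failed in
    mode: str        # how it failed: read-errors, write-errors, dead, slow-io, ...
    notes: str = ""  # free text, e.g. SMART notices seen beforehand

    def to_line(self) -> str:
        """Format the record as a single separator-delimited line."""
        values = [getattr(self, name) for name in FIELDS]
        for value in values:
            # Keep every record on one line with a fixed number of fields,
            # so that grep and awk -F'|' keep working on the file.
            if SEP in value or "\n" in value:
                raise ValueError(f"field may not contain {SEP!r} or a newline: {value!r}")
        return SEP.join(values)


def parse_line(line: str) -> DiskFailure:
    """Parse one log line back into a DiskFailure record."""
    parts = line.rstrip("\n").split(SEP)
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    return DiskFailure(*parts)
```

With a layout like this, something along the lines of `awk -F'|' '$3 == "encl-3"'` run over the log file would pull out every failure in one (made up) enclosure.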

(If you have external disk enclosures, keeping such a log may also raise the issue of consistent identification for them. Locally we have swapped some enclosures around when various things happen, which at the very least means you're going to want to note in the log that 'host X had its enclosure swapped here'.)

Once you have the core information logged you should also keep track of some aggregated failure information (instead of just having people generate it on demand from the log). I would track at least failures by disk type and failures by enclosure, because these are the two things that are most likely to be correlated (i.e., where one sort of disk is bad or one enclosure has a problem you may have overlooked). Update this aggregated information any time you add something to the log, either by hand or by auto-generating the aggregated stats from the log.
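Auto-generating the aggregated stats can be very simple. The following Python sketch counts failures by disk type and by enclosure, assuming a hypothetical text log with one pipe-separated line per failure (date, disk model, enclosure, bay, failure mode) and with free-form notes such as enclosure swaps kept on lines starting with `#`. The layout, the model names and the enclosure names are invented for illustration.

```python
from collections import Counter


def aggregate(lines):
    """Count failures by disk model and by enclosure.

    Assumes each failure is one line of the (hypothetical) form
    date|model|enclosure|bay|mode; blank lines and '#' notes are skipped.
    """
    by_model = Counter()
    by_enclosure = Counter()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # a blank line or a free-form note, not a failure
        fields = line.split("|")
        if len(fields) < 3:
            continue  # malformed line; a real script might report these
        by_model[fields[1]] += 1
        by_enclosure[fields[2]] += 1
    return by_model, by_enclosure


# Made-up sample data, purely to show the shape of the input.
SAMPLE_LOG = [
    "2013-10-02|ExampleDisk-1TB|encl-3|7|read-errors",
    "# 2013-10-15: host X had its enclosure swapped here",
    "2013-11-01|ExampleDisk-1TB|encl-3|2|dead",
    "2013-11-20|ExampleDisk-2TB|encl-1|4|slow-io",
]

if __name__ == "__main__":
    models, enclosures = aggregate(SAMPLE_LOG)
    for name, count in models.most_common():
        print(f"{count:3d}  {name}")
    for name, count in enclosures.most_common():
        print(f"{count:3d}  {name}")
```

Running something like this every time the log changes keeps the summary current without anyone having to remember to do it by hand.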

(This may sound obvious to some people but trust me, it's an easy thing to overlook or just not think about when you're starting out on a grand fileserver adventure.)

Comments on this page:

By Colin at 2013-11-25 18:31:32:

Check your hard drive/SSD model numbers and shuffle them around.

I have heard of a situation where someone ordered 20 hard drives for a new server. They all came from the same batch and that batch had a fault.

So all 20 failed within a matter of two or three days. They almost lost the array, but were very diligent in swapping.

The moral of this story: if you are buying multiple hard drives, grab them from different batch numbers if possible. If not, at least shuffle the drives around so that the first 4 are not part of the same VG.

Written on 25 November 2013.

Last modified: Mon Nov 25 00:34:42 2013