Wandering Thoughts archives


It's a good idea to test your spare disks every so often

We have a reasonably sized fileserver infrastructure and as part of that, we have some hot spare disks sitting in iSCSI backends, waiting for the day that things will go wrong and they'll be needed (and that day does come). However, it's possible to be reasonably lucky and have that day not come for a fairly long time, especially for parts of your fleet.

There's a rule of thumb in programming to the effect that if you're not testing it, it's broken. Now, consider all of those spare disks, which are sitting quietly idle in your fileservers and iSCSI backends and storage pods and whatnot (for us, this includes the emergency spare supply of disks that are in our test environment, which sits idle most of the time). Are you testing them? If you aren't testing them, how confident are you that they haven't quietly gone bad at some point?

If life is nice, the disks will politely tell you with SMART errors or just stop some day. But we've had disks that weren't that nice; they sat in their drive bays quietly not reporting any problems right up until the point where we tried to do IO to them. Then and only then did they notice that they had problems and explode. In some cases it was worse, because the problems the drive had developed were things like really slow IO instead of any outright errors.

(Part of the fun of a spare drive developing really slow IO is that in the semi-chaos of a spare drive being activated, whether automatically or manually, you may not realize that the slowness is unnatural and due to the new drive instead of just being a product of your regular IO load plus the disruption of a spare being pulled in.)

I don't have any good answers on how to check your spare drives in any sort of easy and automated way (especially for slow IO as opposed to outright failures), but I do think that you should try to poke at them every so often even if it's by hand. Even a periodic dd read of a GB or so from each disk is better than nothing, since it does force the disk to be active, move the drive heads (for HDs), and so on.
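(As a rough illustration of the 'dd read' idea, here is a minimal Python sketch that reads a fixed amount from a disk and reports both outright errors and the achieved read rate. The function name `read_check`, the 1 GiB default, and especially the 20 MB/s 'too slow' threshold are all made up for this example; a sensible threshold depends on your disks and on what else is going on. It also reads through the normal page cache, which I'm assuming doesn't matter much for a spare that has been sitting idle.)

```python
import time


def read_check(path, total_bytes=1 << 30, chunk_size=1 << 20, min_mb_per_sec=20.0):
    """Sequentially read up to total_bytes from path and report what happened.

    The defaults (1 GiB read, 1 MiB chunks, 20 MB/s floor) are illustrative
    assumptions, not recommendations.
    """
    bytes_read = 0
    error = None
    start = time.monotonic()
    try:
        # buffering=0 gives us a raw file object, so each read() is one
        # read request to the OS instead of going through Python's buffer.
        with open(path, "rb", buffering=0) as dev:
            while bytes_read < total_bytes:
                data = dev.read(min(chunk_size, total_bytes - bytes_read))
                if not data:
                    # End of device or file before total_bytes was reached.
                    break
                bytes_read += len(data)
    except OSError as e:
        # Outright IO errors (or a missing device) land here.
        error = str(e)
    elapsed = time.monotonic() - start
    rate = (bytes_read / 1e6) / elapsed if elapsed > 0 else 0.0
    return {
        "path": path,
        "bytes_read": bytes_read,
        "seconds": elapsed,
        "mb_per_sec": rate,
        "error": error,
        # Only call a disk slow if the read itself worked and took measurable time.
        "slow": error is None and elapsed > 0 and rate < min_mb_per_sec,
    }
```

(You would run this as root against each spare, for example `read_check("/dev/sdX")` with `/dev/sdX` standing in for whatever your spare disks are called, and then complain about any result with `error` set or `slow` true. This only catches problems in the region it reads, so it's a smoke test and not a full surface scan.)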

(For local reasons beyond the scope of this entry, this mostly hasn't been an issue for us in our current fileserver infrastructure, but it's just become one again so it's on my mind a bit.)

PS: I believe that the better sort of RAID controller and SAN storage appliance does explicitly exercise spare drives every so often, exactly for this reason. Hopefully they also report any failures and you're monitoring for those reports (as opposed to the unit, say, simply turning on a buzzer and a blinking red light on the drive bay and then hoping that someone wanders through the machine room and spots it).

sysadmin/TestYourSpareDisks written at 00:03:15
