Forecasting drive failures is not always as useful as it sounds

January 29, 2021

Recently, I said that we've found a SMART attribute that can predict SSD failures in our environment (and it later did predict the failure of one more SSD). This sounds great, but in practice it's turned out to be less useful than it might seem. The reason is pretty simple: suppose that we have a good indication that a drive is going to fail at some time in the future, but not when. What should we do about it?

(I'm going to assume the drive doesn't have any actual problems, like periodic read errors, just some SMART attributes that make you expect it will fail in the future.)

The core issue is that in a well run server environment, preemptively replacing a probably failing disk is a tradeoff. It effectively moves a 'failure' forward in time and puts it in a situation where you (probably) control things. If you can't forecast when the disk is likely to fail (just that it's probably going to at some indefinite point in the future), you don't know how much you've moved the failure forward; you might have moved it a lot, losing a lot of useful life from the disk.

(In many cases you won't be able to return the disk for a warranty replacement merely because you don't like some SMART attributes. You may be able to put the disk in a test or scratch system and let it run to failure, then get it replaced under warranty.)

I say 'in a well run server environment' because in such an environment, the failure of a single disk should never cause serious problems like the loss of important data or having a vital system down. You have to be prepared for sudden disk failures anyway. As has been observed many times, plenty of disks fail with no changes to SMART data at all and many more fail with no definite warning signs; you cannot count on SMART to let you replace a failing disk before it causes an explosion.

(I'll admit that we're not perfect here.)

On the flip side of this, the less certain it is that the drive is going to fail in the reasonably near future, the higher the costs of a preemptive replacement are. We might even replace a perfectly serviceable disk, costing ourselves time and money for nothing (and perhaps some disruption too, since not all of our systems have hot swap disks). And the easier it is to preemptively replace a probably failing disk without disruption or much cost, the easier it generally is to do the same thing when the disk actually does fail.
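To make the tradeoff a bit more concrete, here is a rough back-of-the-envelope sketch in Python. Everything in it is an assumption made up for illustration: the `Scenario` fields, the two cost functions, and all of the numbers are hypothetical, not measurements from our environment. It also leaves out the price of the new drive when waiting, on the theory that you buy that drive sooner or later either way; what preemptive replacement adds is the unused life you throw away.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    """Hypothetical inputs; every value is a guess, not a measurement."""
    drive_price: float          # cost of a replacement drive
    planned_swap_cost: float    # staff time and disruption for a scheduled swap
    unplanned_swap_cost: float  # the same, for a swap after an actual failure
    p_fail: float               # guessed chance the drive really does fail
    life_lost_fraction: float   # guessed share of drive life discarded by swapping now


def cost_replace_now(s: Scenario) -> float:
    # We always pay for the swap, plus the value of the life we threw away.
    return s.planned_swap_cost + s.drive_price * s.life_lost_fraction


def cost_wait(s: Scenario) -> float:
    # We only pay for an unplanned swap if the drive actually fails.
    return s.p_fail * s.unplanned_swap_cost


# A well run environment: a real failure costs about as much as a planned swap.
well_run = Scenario(drive_price=150.0, planned_swap_cost=50.0,
                    unplanned_swap_cost=60.0, p_fail=0.8,
                    life_lost_fraction=0.3)

# A fragile environment: a real failure means serious downtime.
fragile = Scenario(drive_price=150.0, planned_swap_cost=50.0,
                   unplanned_swap_cost=2000.0, p_fail=0.8,
                   life_lost_fraction=0.3)

for name, s in (("well run", well_run), ("fragile", fragile)):
    print(f"{name}: replace now {cost_replace_now(s):.2f}, wait {cost_wait(s):.2f}")
```

With these made-up numbers, waiting comes out cheaper in the well run case and preemptive replacement comes out cheaper in the fragile one. The catch is that `p_fail` and `life_lost_fraction` are exactly the things we can't forecast.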

In concrete terms, the SSD that failed recently had SMART signs we recognized at the end of November and it started triggering a warning at the start of January. We still let it run into the ground on its own. It simply didn't seem worth it to act earlier, and we weren't completely sure that the SSD would actually fail (and we certainly couldn't have forecast when, for various reasons).

(If we had lots of money, sure, we would replace drives preemptively in this sort of situation and eat the cost. But system administration is partly about prioritization, about getting the most value for your limited resources.)

Written on 29 January 2021.