SMART attributes can predict SSD failures under the right circumstances

January 15, 2021

In theory, SMART attributes should tell us how our disk drives are doing and how likely they are to fail (or how close they are to total failure). Whether this happens in practice is a somewhat open question, although in 2016 Backblaze described the SMART attributes they found useful for hard drives. Whether SMART means much of anything for consumer SSDs is even less clear; we have seen both SMART errors that meant effectively nothing and SSDs that failed abruptly with no warning. However, we recently found a correlation that appears to be real for some of the SSDs in our fileservers, and that one can tell a plausible story about.

We started out with Crucial/Micron 2 TB SSDs in our fileservers in a mixture of models (recent replacements have been WD SSDs). We've now had a (slow) series of failures of Crucial MX500s where all of the drives had a steadily rising count for SMART attribute 172, which for these drives is 'Erase Fail Count'. The count for this attribute starts out at zero, ticks up steadily for a while in twos and fours, and then starts escalating rapidly in the week or even day before the drive fails completely. Unlike other SMART attributes, which can be non-zero on healthy drives, this one is zero on all of our normal drives. The exact number for the erase fail count hasn't been strongly predictive, which is to say that drives have failed with various numbers in it, but the acceleration of the increase is.
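As a rough illustration of watching the rate of increase rather than the absolute count, here is a minimal Python sketch. It assumes you already collect periodic (date, raw value) readings of attribute 172 somehow; the function names, the `recent`, `factor` and `min_rate` knobs, and the sample readings are all made up for illustration and aren't taken from our actual monitoring or our actual drives.

```python
from datetime import date


def daily_rates(readings):
    """Per-day increase between consecutive (date, raw_count) readings."""
    rates = []
    for (d0, c0), (d1, c1) in zip(readings, readings[1:]):
        days = (d1 - d0).days
        if days <= 0:
            continue  # skip duplicate or out-of-order samples
        rates.append((c1 - c0) / days)
    return rates


def is_escalating(readings, recent=3, factor=4.0, min_rate=1.0):
    """True if the latest growth rate is far above the earlier one.

    recent, factor and min_rate are hypothetical tuning knobs:
    compare the mean of the last `recent` rates against `factor`
    times the earlier mean (but never less than `min_rate` per day).
    """
    rates = daily_rates(readings)
    if len(rates) <= recent:
        return False  # not enough history to compare against
    earlier = rates[:-recent]
    latest = rates[-recent:]
    baseline = sum(earlier) / len(earlier)
    current = sum(latest) / len(latest)
    return current > factor * max(baseline, min_rate)


# Invented readings: a slow, steady tick upward in twos and fours.
steady = [
    (date(2021, 1, 1), 0), (date(2021, 1, 8), 2), (date(2021, 1, 15), 4),
    (date(2021, 1, 22), 8), (date(2021, 1, 29), 10), (date(2021, 2, 5), 12),
    (date(2021, 2, 12), 14),
]

# Invented readings: the same slow start, then a rapid climb over days.
failing = [
    (date(2021, 1, 1), 0), (date(2021, 1, 8), 2), (date(2021, 1, 15), 6),
    (date(2021, 1, 22), 8), (date(2021, 1, 23), 20), (date(2021, 1, 24), 60),
    (date(2021, 1, 25), 150),
]

print(is_escalating(steady), is_escalating(failing))
```

The `min_rate` floor is there so that a drive going from nothing to a slow tick isn't reported as "escalating"; whether you'd want a separate alert for any non-zero value at all is a different question.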

(On drives with this attribute above zero, it's correlated with the values of attribute 5, 'Reallocated NAND Blk Count', and attribute 196, 'Reallocated Event Count'. The correlation doesn't go the other way; we have MX500s with non-zero counts in either or both that still have zero for the erase fail count and don't seem to be failing.)
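To make the one-way nature of this concrete, here is a small Python sketch that pulls raw values out of `smartctl -A` style table text and sorts a drive into one of three buckets. The sample text is invented (it only mimics the usual column layout), the bucket labels are hypothetical, and the parser assumes the raw value is a plain integer in the tenth column, which won't be true for every attribute on every drive.

```python
def parse_raw_values(smartctl_output):
    """Map attribute ID -> raw value from 'smartctl -A' style table text."""
    values = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        try:
            values[int(fields[0])] = int(fields[9])
        except ValueError:
            continue  # raw value is not a plain integer; ignore this row
    return values


def classify(values):
    """Bucket a drive by attributes 172, 5 and 196 (labels are made up)."""
    erase_fail = values.get(172, 0)
    reallocated = values.get(5, 0) + values.get(196, 0)
    if erase_fail > 0:
        return "erase failures"
    if reallocated > 0:
        return "reallocations only"
    return "clean"


# Invented sample output for illustration, not from a real drive.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocate_NAND_Blk_Cnt 0x0032   100   100   010    Old_age   Always       -       12
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       12
"""

print(classify(parse_raw_values(SAMPLE)))
```

A drive in the "reallocations only" bucket is the case described above: non-zero reallocation counts but a zero erase fail count, and no sign of failing.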

If this SMART attribute's name and value are both honest, there's an obvious story about why this would be relatively predictive. If an SSD is trying to erase a NAND block and this operation fails to work correctly, the NAND block is now useless. Lose too many NAND blocks and your SSD has problems, and if there is an escalating rate of NAND block erase failures this is probably a bad sign.

Regardless of what this attribute really means and how it works in Crucial's MX500s, it's reassuring to at least have turned up some SMART attribute that predicts failures (at least on those of our SSDs that report anything for it). This is how SMART is supposed to work but so often doesn't.

(Like most places, our ability to find useful SMART attributes or clusters of them that predict drive failure is limited partly because we don't have many drive failures. This limits the data we have available and how much value there is in spending a lot of time analyzing it.)

Written on 15 January 2021.
