SMART drive self-tests seem potentially useful, but not too much

July 8, 2019

I've historically ignored all aspects of hard drive SMART apart, perhaps, from how smartd would occasionally email us to complain about things and sometimes those things would even be useful. There is good reason to be a SMART sceptic, seeing as many of the SMART attributes are underdocumented, SMART itself is peculiar and obscure, hard drive vendors have periodically had their drives outright lie about SMART things, and SMART attributes are not necessarily good predictors of drive failures (plenty of drives die abruptly with no SMART warnings, which can be unnerving). Certain sorts of SMART warnings are usually indicators of problems (but not always), but the absence of SMART warnings is no safety (see eg Backblaze from 2016). Also, the smartctl manpage is very long.

But, in the wake of our flaky SMART errors and some other events with Crucial SSDs here, I wound up digging deeper into the smartctl manpage and experimenting with SMART self-tests, where the hard drive tries to test itself, and SMART logs, where the hard drive may record various useful things like read errors or other problems, and may even include the sector number involved (which can be useful for various things). Like much of the rest of SMART, what SMART self-tests do is not precisely specified or documented by drive vendors, but generally it seems that the 'long' self-test will read or scan much of the drive.
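
As a rough illustration of what the self-test log holds, here is a minimal Python sketch that parses the text that 'smartctl -l selftest' prints for ATA drives. The column layout is an assumption based on typical smartctl output, the sample log is made up, and parse_selftest_log is a hypothetical helper name, not something from smartmontools.

```python
import re

# Made-up sample in the layout smartctl commonly prints for ATA drives.
SAMPLE_LOG = """\
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%     20133         1234567
# 2  Short offline       Completed without error       00%     20100         -
"""


def parse_selftest_log(text):
    """Return one dict per self-test entry found in smartctl's text output."""
    entries = []
    for line in text.splitlines():
        if not line.startswith("#"):
            continue  # skip headers and anything that is not a log entry
        # Assumption: columns are separated by runs of two or more spaces.
        fields = re.split(r"\s{2,}", line.strip())
        if len(fields) != 6:
            continue  # unexpected layout; ignore it rather than guess
        _num, test, status, _remaining, hours, lba = fields
        entries.append({
            "test": test,
            "status": status,
            "lifetime_hours": int(hours),
            # "-" means the drive recorded no failing sector for this test.
            "first_error_lba": None if lba == "-" else int(lba),
        })
    return entries


if __name__ == "__main__":
    for entry in parse_selftest_log(SAMPLE_LOG):
        print(entry)
```

Other drive types and other smartctl versions may lay this out differently, so treat the parsing as a starting point rather than something to rely on.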

By itself, this probably isn't much different than what you could do with dd or a software RAID scan. From my perspective, what's convenient about SMART self-tests is that you can kick them off in the background regardless of what the drive is being used for (if anything), they probably won't get too much in the way of your regular IO, and after they're done they automatically leave a record in the SMART log, which will probably persist for a fair while (depending on how frequently you run self-tests and so on).
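
To make the 'kick them off in the background' part concrete, here is a small Python sketch around 'smartctl -t' (start a self-test) and 'smartctl -l selftest' (print the self-test log). The function names and the /dev/sdX placeholder are made up for illustration, and actually running smartctl generally needs root.

```python
import subprocess


def selftest_command(device, kind="long"):
    """Build the smartctl command line that asks a drive to start a self-test."""
    if kind not in ("short", "long"):
        raise ValueError("unsupported self-test kind: %r" % (kind,))
    return ["smartctl", "-t", kind, device]


def selftest_log_command(device):
    """Build the command that prints the drive's self-test log."""
    return ["smartctl", "-l", "selftest", device]


def start_selftest(device, kind="long", run=subprocess.run):
    """Ask the drive to start a self-test and return whatever `run` returns.

    smartctl returns promptly; the drive carries out the test on its own.
    `run` is injectable so this can be exercised without real hardware.
    smartctl's exit status is a bit mask, so we do not treat nonzero as fatal.
    """
    return run(selftest_command(device, kind))


if __name__ == "__main__":
    # /dev/sdX is a placeholder; we only print the commands here.
    print(" ".join(selftest_command("/dev/sdX")))
    print(" ".join(selftest_log_command("/dev/sdX")))
```

Checking on the result later is then a matter of running the second command and looking at the most recent log entry.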

On the flipside, SMART self-tests have the disadvantage that you don't really know what they're doing. If they report a problem, it's real, but if they don't report a problem you may or may not have one. A SMART self-test is better than nothing for things like testing your spare disks, but it's not the same as actually using them for real.

On the whole, my experimentation with SMART self-tests leaves me feeling that they're useful enough that I should run them more often. If I'm wondering about a disk and it's not being used in a way where all of it gets scanned routinely, I might as well throw a self-test at it to see what happens.

(They probably aren't useful and trustworthy enough to be worth scripting something so that we routinely run self-tests on drives that aren't already in software RAID arrays.)

PS: Much but not all of my experimentation so far has been on hard drives, not SSDs. I don't know if the 'long' SMART self-test on an SSD tests more thoroughly and reaches more bits of the drive internals than you can with just an external read test like dd, or conversely if it's less thorough than a full read scan.
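
For comparison, the sort of external read test meant here can be sketched in Python as a sequential read of the whole device that notes where reads fail. This is only an approximation of what dd does: read_scan is a hypothetical name, the skip-a-chunk error handling is my assumption, and since it reads through the page cache it may not always touch the actual disk.

```python
import os
import tempfile


def read_scan(path, chunk_size=1024 * 1024):
    """Read `path` start to end; return (bytes_read, offsets_of_failed_chunks)."""
    bad_offsets = []
    total = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        # Seeking to the end gives the size for regular files and block devices.
        size = os.lseek(fd, 0, os.SEEK_END)
        offset = 0
        while offset < size:
            want = min(chunk_size, size - offset)
            try:
                data = os.pread(fd, want, offset)
            except OSError:
                # A read error: note where it happened and skip this chunk.
                bad_offsets.append(offset)
                offset += want
                continue
            if not data:
                break  # unexpected end of data
            total += len(data)
            offset += len(data)
    finally:
        os.close(fd)
    return total, bad_offsets


if __name__ == "__main__":
    # Demonstrate on a scratch file instead of a real disk.
    with tempfile.NamedTemporaryFile(delete=False) as scratch:
        scratch.write(b"\0" * 300000)
    try:
        print(read_scan(scratch.name, chunk_size=65536))
    finally:
        os.unlink(scratch.name)
```

Unlike a SMART self-test, this only sees what the drive returns over its normal interface, which is exactly the open question for SSDs.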

Written on 08 July 2019.

Last modified: Mon Jul 8 21:07:18 2019