SMART drive self-tests seem potentially useful, but not too much
I've historically ignored all aspects of hard drive SMART apart,
perhaps, from how smartd would occasionally email us to complain
about things and
sometimes those things would even be useful. There is good reason
to be a SMART sceptic, seeing as many of the SMART attributes are
underdocumented, SMART itself is peculiar and obscure, hard drive
vendors have periodically had their drives outright lie about SMART
things, and SMART attributes are not necessarily good predictors
of drive failures (plenty of drives die abruptly with no SMART
warnings, which can be unnerving). Certain sorts
of SMART warnings are usually indicators of problems (but not
always), but the absence of SMART warnings
is no guarantee of safety (see eg Backblaze from 2016).
Also, the smartctl manpage is very long.
But, in the wake of our flaky SMART errors
and some other events with Crucial SSDs here, I wound up digging
deeper into the smartctl manpage and experimenting with SMART self-tests,
where the hard drive tries to test itself, and SMART logs, where
the hard drive may record various useful things like read errors
or other problems, and may even include the sector number involved
(which can be useful for various things). Like much of the rest of
SMART, what SMART self-tests do is not precisely specified or
documented by drive vendors, but generally it seems that the 'long'
self-test will read or scan much of the drive.
By itself, this probably isn't much different than what you could
do with dd or a software RAID scan. From my perspective,
what's convenient about SMART self-tests is that you can kick them
off in the background regardless of what the drive is being used
for (if anything), they probably won't get too much in the way of
your regular IO, and after they're done they automatically leave a
record in the SMART log, which will probably persist for a fair
while (depending on how frequently you run self-tests and so on).
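
As a rough illustration of the 'kick it off, then look at the log later' workflow, here is a minimal Python sketch that wraps smartctl. The '-t long' and '-l selftest' options are from the smartctl manpage; the function names are my own invention for this example, and I'm assuming smartctl is on your $PATH and that you're running as root.

```python
import subprocess

def selftest_command(device, kind="long"):
    """Build the smartctl command that starts a background self-test."""
    if kind not in ("short", "long"):
        raise ValueError(f"unsupported self-test kind: {kind!r}")
    return ["smartctl", "-t", kind, device]

def selftest_log_command(device):
    """Build the smartctl command that prints the drive's self-test log."""
    return ["smartctl", "-l", "selftest", device]

def run(cmd):
    """Run a command and return whatever it printed on standard output."""
    # smartctl's exit status is a bitmask of conditions, so a non-zero
    # status is not treated as a failure here (no check=True).
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.stdout
```

You would call run(selftest_command("/dev/sdX")) to start the test (with a real device name in place of the placeholder /dev/sdX) and then, some hours later, run(selftest_log_command("/dev/sdX")) to see what the drive recorded. What the drive actually did in between remains as opaque as ever.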
On the flipside, SMART self-tests have the disadvantage that you don't really know what they're doing. If they report a problem, it's real, but if they don't report a problem you may or may not have one. A SMART self-test is better than nothing for things like testing your spare disks, but it's not the same as actually using them for real.
On the whole, my experimentation with SMART self-tests leaves me feeling that they're useful enough that I should run them more often. If I'm wondering about a disk and it's not being used in a way where all of it gets scanned routinely, I might as well throw a self-test at it to see what happens.
(They probably aren't useful and trustworthy enough to be worth scripting something so that we routinely run self-tests on drives that aren't already in software RAID arrays.)
PS: Much but not all of my experimentation so far has been on hard
drives, not SSDs. I don't know if the 'long' SMART self-test on a
SSD tests more thoroughly and reaches more bits of the drive internals
than you can with just an external read test like dd, or conversely
if it's less thorough than a full read scan.
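
For comparison, the sort of external read test I have in mind is nothing more than reading the whole device from start to end and seeing whether the kernel reports an IO error. A minimal Python sketch of that idea follows; read_scan is a hypothetical name of my own, not an existing tool, and this version reads through the operating system's page cache rather than bypassing it.

```python
def read_scan(path, chunk_size=1024 * 1024):
    """Read path sequentially from start to end.

    Returns (bytes_read, error): error is None if the whole thing was
    readable, otherwise the OSError raised by the first failing read,
    in which case bytes_read is the offset where that read started.
    Errors from opening the path are not caught.
    """
    offset = 0
    # buffering=0 gives us a raw file object, so each read() call asks
    # the kernel for at most chunk_size bytes.
    with open(path, "rb", buffering=0) as f:
        while True:
            try:
                data = f.read(chunk_size)
            except OSError as e:
                return offset, e
            if not data:
                # An empty read means end of file or device.
                return offset, None
            offset += len(data)
```

Calling read_scan("/dev/sdX") (again a placeholder device name, and again as root) tells you only whether every chunk could be read from the outside. It says nothing about whatever drive internals a self-test may or may not exercise.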