SMART Threshold numbers turn out to not be useful for us in practice
I was recently reading Rachel Wenzel's Predicting Hard Drive Failure with Machine Learning, and one of the things it taught me is that drive vendors, as part of their drive SMART data, provide a magical 'threshold' number that is supposed to indicate when a SMART attribute has reached a bad value. This is not compared to the raw SMART value, but instead a normalized 'value' that is between 253 (best) and 1 (worst). We collect SMART data for all of our drives, but so far we only use the raw value, which is much more directly usable in almost all cases.
(For example, the raw SMART value for the drive's temperature is
almost always its actual temperature in Celsius. The normalized value
is not, and the mapping between actual temperature and the normalized
value varies by vendor and perhaps drive model. The raw value that
smartctl displays can also include more information; the drive
temperature for one of my SSDs shows '36 (Min/Max 15/57)', for
example. As the smartctl manpage documents, the conversion between
the raw value and the normalized version is done by the drive, not
by your software.)
The obvious thing to do was to extend our SMART data collection to include both the vendor provided threshold value and the normalized value (and then also perhaps the 'worst' value seen, which is also reported in the SMART data). So today I spent a bit of time working out how to do that in our data collection script, and then before I enabled it (and quadrupled our SMART metrics count in Prometheus), I decided to see if it was useful and could provide us good information. Unfortunately, the answer is no. There are a number of problems with the normalized SMART data in practice.
- Sometimes there simply is no threshold value provided by the drive;
smartctl will display these as '---' values in its output.
- Uncommonly, all of the normalized numbers are 0 (for the current
value, the worst value, and the threshold). Since the minimum for
the current value is supposed to be 1, this is a signal that all
of them are not useful. This happens both for attributes that
don't really have a range, such as the drive's power on time, as
well as ones where I'd expect there to be a broad 'good to bad'
range, like the 'lifetime writes' on a SSD.
A closely related version of this is where the current value and the threshold are both zero, but the 'worst' value is some non-zero number (a common one on our drives is 100, apparently often the starting value). This is what a small number of our drives do for 'power on hours'.
- Not infrequently the threshold is 0, which normally should be the same
as 'there is no threshold', since in theory the current value can
never reach 0. Even if we ignore that, it's difficult to distinguish
a drive with a 0 threshold and a current value dropping to it from a
drive where everything is 0 to start with.
- Sometimes there's a non-zero current value along with a 'worst' value
that's larger than it. This 'worst' value can be 253 or another
default value like 100 or 200. It seems pretty clear that this is the
drive deciding it's not going to store any sort of 'worst value seen',
but it has to put something in that data field.
We also have some drives where the current and worst normalized values for the drive's temperature appear to just be the real temperature in Celsius, which naturally puts the worst above the current.
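To make these cases concrete, here is a minimal Python sketch of how one might sort a (current, worst, threshold) triple into the oddities above. The function name and category labels are my own inventions for illustration, a missing threshold ('---') is assumed to have been turned into None already, and the first matching case wins, so a drive that shows several oddities at once only gets one label.

```python
def classify(value, worst, thresh):
    """Label one attribute's normalized numbers with the first oddity that applies.

    value, worst: ints as reported by the drive.
    thresh: int, or None if the drive provided no threshold ('---').
    The labels are hypothetical; they are not anything smartctl reports.
    """
    if thresh is None:
        return "no-threshold"
    if value == 0 and worst == 0 and thresh == 0:
        return "all-zero"
    if value == 0 and thresh == 0:
        # Current and threshold are zero, but 'worst' is something like 100.
        return "zero-value-and-threshold"
    if thresh == 0:
        # In theory the current value can never drop to 0.
        return "zero-threshold"
    if worst > value:
        # The 'worst' field is a default or the drive isn't tracking it.
        return "worst-above-current"
    return "plausible"
```

Only attributes that come back as 'plausible' would be candidates for comparing the current value against the threshold.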
Once I ran through all of our machines with a data collection script, I found no drives in our entire current collection where the current value was above 0 but below or at the threshold value. We currently have one drive that we very strongly believe is actively failing based on SMART data, so I would have hoped that it turned up in this.
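The check itself is simple. A sketch of it in Python, assuming the attributes have already been reduced to hypothetical (name, current value, threshold) tuples with None standing in for a missing threshold:

```python
def crossed_threshold(attrs):
    """Return names of attributes whose current value is above 0
    but at or below the vendor threshold.

    attrs: iterable of (name, value, thresh) tuples; thresh is None
    when the drive reports no threshold. This layout is an assumption
    made for illustration.
    """
    hits = []
    for name, value, thresh in attrs:
        if thresh is None:
            continue
        # A zero threshold can never match here, because value must be > 0.
        if 0 < value <= thresh:
            hits.append(name)
    return hits
```

Requiring the current value to be above 0 is what excludes the 'everything is zero' drives, at the cost of also missing any drive whose value has genuinely dropped to 0.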
Given the lack of any visible signal combined with all of the oddities and uncertainties I've already discovered, I decided that it wasn't worth collecting this information. We have a low drive failure rate to start with (and a lot of people's experiences suggest that a decent percentage of drives fail with no SMART warnings at all), so the odds that this would ever yield a useful signal that wasn't a false alarm seemed low.
(Sadly this seems to be the fate of most efforts to use SMART attributes, including Wenzel's; it's worth reading the conclusions section in that article.)
On the positive side, now I have a script to print out all of the
SMART attributes for all of the drives on a system in a format I can
process with standard Unix tools. Being able to just use awk on SMART
attributes is already a nice step forward.
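As an illustration of the general idea (this is not my actual script), here is a rough Python sketch that flattens the attribute table from 'smartctl -A' into one space-separated line per attribute. It assumes the usual ten-column ATA attribute table, and the output layout (device, ID, name, value, worst, threshold, raw) and the function name are choices made up for this example. The raw value goes last because it can contain spaces.

```python
def flatten(device, text):
    """Turn 'smartctl -A' style text into 'device id name value worst thresh raw' lines.

    Assumes ten whitespace-separated columns per attribute row:
    ID, name, flag, value, worst, thresh, type, updated, when_failed, raw.
    """
    rows = []
    for line in text.splitlines():
        # Split into at most 10 fields so a raw value with spaces stays whole.
        fields = line.split(None, 9)
        if len(fields) != 10 or not fields[0].isdigit():
            continue  # header, banner, or blank line
        attr_id, name, _flag, value, worst, thresh, _type, _upd, _failed, raw = fields
        rows.append(" ".join([device, attr_id, name, value, worst, thresh, raw.strip()]))
    return rows


# Made-up sample input for illustration; the numbers are not from a real drive.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       12345
194 Temperature_Celsius     0x0022   064   053   ---    Old_age   Always       -       36 (Min/Max 15/57)
"""

for row in flatten("sda", SAMPLE):
    print(row)
```

With output in this shape, something like awk '$6 == "---"' would pick out the attributes that have no threshold.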