A spate of somewhat alarming flaky SMART errors on Crucial MX500 SSDs
We've been running Linux's smartd on all of our Linux machines for a long time now, and over that time it's been solidly reliable (with a few issues here and there, like not always handling disk removals and (re)insertions properly). SMART attributes themselves may or may not be indicative of anything much, but smartd does reliably alert on the ones that it monitors.
Except on our new Linux fileservers. For a significant amount of time now, smartd has periodically been sending us email about various drives now having one 'currently unreadable (pending) sectors' (which is SMART attribute 197). When we go look at the affected drive with smartctl, even within 60 seconds of the event being reported, the drive has always reported that it now has no unreadable pending sectors; the attribute is once again 0.
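For illustration, the manual check described above can be approximated with a short script. This is only a sketch under assumptions: it assumes the classic column layout of `smartctl -A` output (with the raw value in the tenth column), and the function names are made up for this example rather than being anything we actually run.

```python
import subprocess
from typing import Optional


def parse_raw_attribute(smartctl_output: str, attr_id: int) -> Optional[int]:
    """Return the RAW_VALUE of one attribute from `smartctl -A` text, or None."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with the numeric ID and have at least 10 columns:
        # ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) >= 10 and fields[0] == str(attr_id):
            try:
                return int(fields[9])
            except ValueError:
                # Some attributes have non-integer raw values; give up on those.
                return None
    return None


def read_pending_sectors(device: str) -> Optional[int]:
    """Run `smartctl -A` on a device and return the raw value of attribute 197."""
    # smartctl's exit status is a bitmask that can be non-zero even when the
    # attribute table was printed, so a non-zero status is not treated as fatal.
    result = subprocess.run(
        ["smartctl", "-A", device], capture_output=True, text=True
    )
    return parse_raw_attribute(result.stdout, 197)
```

Calling `read_pending_sectors("/dev/sda")` (the device name is just an example) right after a smartd email would, in our experience so far, give 0 every time.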
These fileservers use both SATA and SAS for connecting the drives, and we have an assorted mixture of 2TB SSDs: some Crucial MX300s, some Crucial MX500s, and some Micron 1100s. The errors happen for drives connected through both SATA and SAS, but what we hadn't noticed until now is that all of the errors are from Crucial MX500s. All of these have the same firmware version, M3CR010, which appears to be the only currently available one (although at one point Crucial apparently released version M3CR022, but appears to have quietly pulled it since then).
These reported errors are genuine in one sense, in that it's not smartd being flaky. We also track all the SMART attributes through our new Prometheus system, and it also periodically reports a temporary '1' value for various MX500s. However, as far as I can see the Prometheus-noted errors always go away right afterward, just as the smartd ones do. In addition, no other SMART attributes on an affected drive show any unexpected changes (we see increases in eg 'power on hours' and other things that always count up). We've also done mass reads, SMART self-tests, and other things on these drives, always without problems reported, and there are no actual reported read errors at the Linux kernel level.
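To make 'a temporary 1 that goes away right afterward' concrete, here is a rough sketch of how such blips could be picked out of a series of sampled attribute values. The `(timestamp, value)` sample format and the function name are assumptions for this example; this is not how our Prometheus setup actually does it.

```python
from typing import Iterable, List, Tuple

Sample = Tuple[float, int]  # (unix timestamp, raw value of attribute 197)


def find_transient_blips(samples: Iterable[Sample]) -> List[Tuple[float, float, int]]:
    """Return (first_seen, last_seen, peak) for each non-zero run that
    later dropped back to 0. Samples must be in time order."""
    blips: List[Tuple[float, float, int]] = []
    start = None
    last = None
    peak = 0
    for ts, value in samples:
        if value > 0:
            if start is None:
                # A new non-zero run begins here.
                start = ts
                peak = 0
            last = ts
            peak = max(peak, value)
        elif start is not None:
            # The value is back to 0, so the run was transient.
            blips.append((start, last, peak))
            start = None
    # A run that is still non-zero at the end is deliberately not reported,
    # since we can't yet say that it went away.
    return blips
```

A drive with genuinely persistent pending sectors would never show up in this list, which is the distinction that matters here.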
(And these drives are in use in ZFS pools, and we haven't seen any ZFS checksum errors. I'm pretty confident that ZFS would catch any corrupted data the drives were returning, if they were.)
Although I haven't done extensive hand checking, the reported errors do appear to correlate with read and write IO happening on the drive. In spot checks using Prometheus disk metrics, none of the drives appeared to be inactive at the times that smartd emailed us, and they may all have been seeing a combination of read and write IO at the time. Almost all of our MX500 SSDs are in the two in-production fileservers that have been reporting errors; we have one that's in a test machine that's now basically inactive, and while I believe it reported errors in the past (when we were testing things with it), it hasn't for a while.
(Update: It turns out that I was wrong; we've never had errors reported on the MX500 in our test machine.)
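The spot checks on IO activity were done by hand, but the idea can be sketched in code. In this sketch the IO samples are assumed to be a time-sorted list of `(timestamp, reads per second, writes per second)` tuples and the 60-second window is an arbitrary choice; both are made up for the example.

```python
from bisect import bisect_left, bisect_right
from typing import Sequence, Tuple

IOSample = Tuple[float, float, float]  # (unix timestamp, reads/sec, writes/sec)


def io_near_event(
    event_ts: float, io_samples: Sequence[IOSample], window: float = 60.0
) -> Tuple[bool, bool]:
    """Return (saw_reads, saw_writes) for samples within +/- window seconds
    of event_ts. io_samples must be sorted by timestamp."""
    times = [sample[0] for sample in io_samples]
    lo = bisect_left(times, event_ts - window)
    hi = bisect_right(times, event_ts + window)
    nearby = io_samples[lo:hi]
    saw_reads = any(reads > 0 for _, reads, _ in nearby)
    saw_writes = any(writes > 0 for _, _, writes in nearby)
    return saw_reads, saw_writes
```

A result of `(True, True)` for every reported event would match what the spot checks suggested, although it wouldn't prove that the IO is what triggers the reports.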
I see at least two overall possibilities and neither of them is entirely reassuring. One possibility is that the MX500s have a small firmware bug where occasionally, under the right circumstances, they report an incorrect 'currently unreadable (pending) sectors' value for some internal reason (I can imagine various theories). The second is that our MX500s are detecting genuine unreadable sectors, but then quietly curing them somehow. This is worrisome because it suggests that the drives are actually suffering real errors or already starting to wear out, despite a quite light IO load and being in operation for less than a year.
We don't have any solutions or answers, so we're just going to have to keep an eye on the situation. All in all it's a useful reminder that modern SSDs are extremely complicated things that are quite literally small computers (multi-core ones at that, these days), running complex software that's entirely opaque to us. All we can do is hope that they don't have too many bugs (either software or hardware or both).
(I have lots of respect for the people who write drive firmware. It's a high-stakes environment and one that's probably not widely appreciated. If it all works people are all 'well, of course', and if any significant part of it doesn't work, there will be hell to pay.)