Why I'm usually unnerved when modern SSDs die on us
Tonight, one of the SSDs on our new Linux fileservers died. It's not the first SSD death we've seen and probably not the last one, but as almost always, I found it an unnerving experience because of a combination of how our SSDs tend to die, how much of a black box they are, and how they're solid-state devices.
Like most of the SSDs deaths that we've had, this one was very abrupt; the drive went from perfectly fine to completely unresponsive in at most 50 seconds or so, with no advance warning in SMART or anything else. One moment it was serving read and write IO perfectly happily (from all external evidence, and ZFS wasn't complaining about read checksums) and the next moment there was no Crucial MX300 at that SAS port any more. Or at least at very close to the next moment.
(The first Linux kernel message about failed IO operations came at 20:31:34 and the kernel seems to have declared the drive officially vanished at 20:32:15. But the actual drive may have been unresponsive from the start; the driver messages aren't clear to me.)
What unnerves me about these sorts of abrupt SSD failures is how inscrutable they are and how I can't construct a story in my head of what went wrong. With spinning HDs, drives might die abruptly but you could at least construct narratives about what could have happened to do that; perhaps the spindle motor drive seized or the drive had some other gross mechanical failure that brought everything to a crashing halt (perhaps literally). SSDs are both solid state and opaque, so I'm left with no story for what went wrong, especially when a drive is young and isn't supposed to have come anywhere near wearing out its flash cells (as this SSD was).
(When a HD died early, you could also imagine undetected manufacturing flaws that finally gave way. With SSDs, at least in theory that shouldn't happen, so early death feels especially alarming. Probably there are potential undetected manufacturing flaws in the flash cells and so on, though.)
When I have no story, my thoughts turn to unnerving possibilities, like that the drive was lying to us about how healthy it was in SMART data and that it was actually running through spare flash capacity and then just ran out, or that it had a firmware flaw that we triggered that bricked it in some way.
(We had one SSD fail in this way and then come back when it was pulled out and reinserted, apparently perfectly healthy, which doesn't inspire confidence. But that was a different type of SSD. And of course we've had flaky SMART errors with Crucial MX500s.)
Further, when I have no narrative for what causes SSD failures, it feels like every SSD is an unpredictable time bomb. Are they healthy or are they going to die tomorrow? It feels like I really have to hope in statistics, namely that not too many will fail not too fast before they can be replaced. And even that hope relies on an assumption that failures are uncorrelated, that what happened to this SSD isn't likely to happen to the ones on either side of it.
(This isn't just an issue in our fileservers; it's also something I worry about for the SSDs in my home machine. All my data is mirrored, but what are the chances of a dual SSD failure?)
In theory I know that SSDs are supposed to be much more reliable than spinning rust (and we have lots of SSDs that have been ticking along quietly for years). But after mysterious abrupt death failures like this, it doesn't feel like it. I really wish we generally got some degree of advance warning about SSDs failing, the way we not infrequently did with HDs (for instance, with one HD in my office machine, even though I ignored its warnings).
A spate of somewhat alarming flaky SMART errors on Crucial MX500 SSDs
We've been running Linux's smartd on all of our Linux machines for a long time now, and over that time it's been solidly reliable (with a few issues here and there, like not always handling disk removals and (re)insertions properly). SMART attributes themselves may or may not be indicative of anything much, but smartd does reliably alert on the ones that it monitors.
Except on our new Linux fileservers. For a significant amount
of time now,
smartd has periodically been sending us email about
various drives now having one 'currently unreadable (pending)
sectors' (which is SMART attribute 197). When we go look at the
affected drive with
smartctl, even within 60 seconds of the event
being reported, the drive has always reported that it now has no
unreadable pending sectors; the attribute is once again 0.
These fileservers use both SATA and SAS for connecting the drives, and we have an assorted mixture of 2TB SSDs; some Crucial MX300s, some Crucial MX500s, and some Micron 1100s. The errors happen for drives connected through both SATA and SAS, but what we hadn't noticed until now is that all of the errors are from Crucial MX500s. All of these have the same firmware version, M3CR010, which appears to be the only currently available one (although at one point Crucial apparently released version M3CR022, cf, but Crucial appears to have quietly pulled it since then).
These reported errors are genuine in one sense, in that it's not
smartd being flaky. We also track all the SMART attributes
through our new Prometheus system, and
it also periodically reports a temporary '1' value for various
MX500s. However, as far as I can see the Prometheus-noted errors
always go away right afterward, just as the
smartd ones do. In
addition, no other SMART attributes on an affected drive show any
unexpected changes (we see increases in eg 'power on hours' and
other things that always count up). We've also done mass reads,
SMART self-tests, and other things on these drives, always without
problems reported, and there are no actual reported read errors at
the Linux kernel level.
(And these drives are in use in ZFS pools, and we haven't seen any ZFS checksum errors. I'm pretty confident that ZFS would catch any corrupted data the drives were returning, if they were.)
Although I haven't done extensive hand checking, the reported errors
do appear to correlate with read and write IO happening on the
drive. In spot checks using Prometheus disk metrics, none of the
drives appeared to be inactive at the times that
us, and they may all have been seeing a combination of read and
write IO at the time. Almost all of our MX500 SSDs are in the two
in-production fileservers that have been reporting errors; we have
one that's in a test machine that's now basically inactive, and
while I believe it reported errors in the past (when we were testing
things with it), it hasn't for a while.
(Update: It turns out that I was wrong; we've never had errors reported on the MX500 in our test machine.)
I see at least two overall possibilities and neither of them are entirely reassuring. One possibility is that the MX500s have a small firmware bug where occasionally, under the right circumstances, they report an incorrect 'currently unreadable (pending) sectors' value for some internal reason (I can imagine various theories). The second is that our MX500s are detecting genuine unreadable sectors, but then quietly curing them somehow. This is worrisome because it suggests that the drives are actually suffering real errors or already starting to wear out, despite a quite light IO load and being in operation for less than a year.
We don't have any solutions or answers, so we're just going to have to keep an eye on the situation. All in all it's a useful reminder that modern SSDs are extremely complicated things that are quite literally small computers (multi-core ones at that, these days), running complex software that's entirely opaque to us. All we can do is hope that they don't have too many bugs (either software or hardware or both).
(I have lots of respect for the people who write drive firmware. It's a high-stakes environment and one that's probably not widely appreciated. If it all works people are all 'well, of course', and if any significant part of it doesn't work, there will be hell to pay.)