Why I'm usually unnerved when modern SSDs die on us
Tonight, one of the SSDs on our new Linux fileservers died. It's not the first SSD death we've seen and probably not the last one, but as almost always, I found it an unnerving experience because of a combination of how our SSDs tend to die, how much of a black box they are, and how they're solid-state devices.
Like most of the SSDs deaths that we've had, this one was very abrupt; the drive went from perfectly fine to completely unresponsive in at most 50 seconds or so, with no advance warning in SMART or anything else. One moment it was serving read and write IO perfectly happily (from all external evidence, and ZFS wasn't complaining about read checksums) and the next moment there was no Crucial MX300 at that SAS port any more. Or at least at very close to the next moment.
(The first Linux kernel message about failed IO operations came at 20:31:34 and the kernel seems to have declared the drive officially vanished at 20:32:15. But the actual drive may have been unresponsive from the start; the driver messages aren't clear to me.)
What unnerves me about these sorts of abrupt SSD failures is how inscrutable they are and how I can't construct a story in my head of what went wrong. With spinning HDs, drives might die abruptly but you could at least construct narratives about what could have happened to do that; perhaps the spindle motor drive seized or the drive had some other gross mechanical failure that brought everything to a crashing halt (perhaps literally). SSDs are both solid state and opaque, so I'm left with no story for what went wrong, especially when a drive is young and isn't supposed to have come anywhere near wearing out its flash cells (as this SSD was).
(When a HD died early, you could also imagine undetected manufacturing flaws that finally gave way. With SSDs, at least in theory that shouldn't happen, so early death feels especially alarming. Probably there are potential undetected manufacturing flaws in the flash cells and so on, though.)
When I have no story, my thoughts turn to unnerving possibilities, like that the drive was lying to us about how healthy it was in SMART data and that it was actually running through spare flash capacity and then just ran out, or that it had a firmware flaw that we triggered that bricked it in some way.
(We had one SSD fail in this way and then come back when it was pulled out and reinserted, apparently perfectly healthy, which doesn't inspire confidence. But that was a different type of SSD. And of course we've had flaky SMART errors with Crucial MX500s.)
Further, when I have no narrative for what causes SSD failures, it feels like every SSD is an unpredictable time bomb. Are they healthy or are they going to die tomorrow? It feels like I really have to hope in statistics, namely that not too many will fail not too fast before they can be replaced. And even that hope relies on an assumption that failures are uncorrelated, that what happened to this SSD isn't likely to happen to the ones on either side of it.
(This isn't just an issue in our fileservers; it's also something I worry about for the SSDs in my home machine. All my data is mirrored, but what are the chances of a dual SSD failure?)
In theory I know that SSDs are supposed to be much more reliable than spinning rust (and we have lots of SSDs that have been ticking along quietly for years). But after mysterious abrupt death failures like this, it doesn't feel like it. I really wish we generally got some degree of advance warning about SSDs failing, the way we not infrequently did with HDs (for instance, with one HD in my office machine, even though I ignored its warnings).