Why I'm usually unnerved when modern SSDs die on us

December 10, 2018

Tonight, one of the SSDs on our new Linux fileservers died. It's not the first SSD death we've seen and probably not the last one, but as almost always, I found it an unnerving experience because of a combination of how our SSDs tend to die, how much of a black box they are, and how they're solid-state devices.

Like most of the SSD deaths that we've had, this one was very abrupt; the drive went from perfectly fine to completely unresponsive in at most 50 seconds or so, with no advance warning in SMART or anything else. One moment it was serving read and write IO perfectly happily (from all external evidence, and ZFS wasn't complaining about checksum errors on reads) and the next moment there was no Crucial MX300 at that SAS port any more. Or at least by very close to the next moment.

(The first Linux kernel message about failed IO operations came at 20:31:34 and the kernel seems to have declared the drive officially vanished at 20:32:15. But the actual drive may have been unresponsive from the start; the driver messages aren't clear to me.)

What unnerves me about these sorts of abrupt SSD failures is how inscrutable they are and how I can't construct a story in my head of what went wrong. With spinning HDs, drives might die abruptly, but you could at least construct narratives about what could have caused it; perhaps the spindle motor seized or the drive had some other gross mechanical failure that brought everything to a crashing halt (perhaps literally). SSDs are both solid state and opaque, so I'm left with no story for what went wrong, especially when a drive is young and isn't supposed to have come anywhere near wearing out its flash cells (as this SSD was).

(When a HD died early, you could also imagine undetected manufacturing flaws that finally gave way. With SSDs, at least in theory that shouldn't happen, so early death feels especially alarming. Probably there are potential undetected manufacturing flaws in the flash cells and so on, though.)

When I have no story, my thoughts turn to unnerving possibilities, like that the drive was lying to us about how healthy it was in SMART data and was actually running through its spare flash capacity until it simply ran out, or that it had a firmware flaw that we triggered that bricked it in some way.

(We had one SSD fail in this way and then come back when it was pulled out and reinserted, apparently perfectly healthy, which doesn't inspire confidence. But that was a different type of SSD. And of course we've had flaky SMART errors with Crucial MX500s.)

Further, when I have no narrative for what causes SSD failures, it feels like every SSD is an unpredictable time bomb. Are they healthy, or are they going to die tomorrow? It feels like all I can really do is put my hope in statistics, namely that not too many of them will fail too quickly to be replaced. And even that hope relies on an assumption that failures are uncorrelated, that what happened to this SSD isn't likely to happen to the ones on either side of it.

(This isn't just an issue in our fileservers; it's also something I worry about for the SSDs in my home machine. All my data is mirrored, but what are the chances of a dual SSD failure?)
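
(Just to put rough numbers on that worry, here's a back-of-the-envelope sketch in Python. The 1% annual failure rate per drive and the three-day replace-and-resilver window are purely illustrative assumptions, and the whole calculation assumes failures are independent, which is exactly the assumption I can't fully trust.)

    # Back-of-the-envelope odds of losing a two-way SSD mirror, assuming
    # independent failures. Both numbers below are illustrative assumptions,
    # not measurements from any real population of drives.
    AFR = 0.01          # assumed annual failure rate per SSD (1%)
    WINDOW_DAYS = 3.0   # assumed time to notice, replace, and resilver

    p_first = 2 * AFR                       # either drive can be the first to go
    p_second = AFR * (WINDOW_DAYS / 365.0)  # the survivor dies during the window
    print("P(mirror lost this year) ~= %.1e" % (p_first * p_second))

With those assumed numbers the chance comes out tiny, which is comforting right up until the independence assumption breaks down.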

In theory I know that SSDs are supposed to be much more reliable than spinning rust (and we have lots of SSDs that have been ticking along quietly for years). But after mysterious abrupt death failures like this, it doesn't feel like it. I really wish we generally got some degree of advance warning about SSDs failing, the way we not infrequently did with HDs (for instance, with one HD in my office machine, even though I ignored its warnings).


Comments on this page:

SSDs are indeed inscrutable. Worse: they are autonomous devices that do stuff behind the scenes no matter what they're told (or not told) to do by your drive controller. I admit, I use plenty of them, in pretty much every laptop, desktop, and server I own. But I'll never entirely trust them. I've been writing about this (and related issues) in my own blog: https://coverclock.blogspot.com/search/label/Storage .

If you saw how the sausage was made, you would be a bit horrified. Solid-state drives work by trapping little bits of charge inside silicon nitride / silicon dioxide layers over a transistor. As the devices have scaled smaller and have now gone 3D, manufacturers have to use all sorts of tricks to make them work -- data whitening so there is no local bias in charge, error correction, extra bits, etc. The actual data stored on the drive looks like noise, and is extracted through the magic of math.

It's amazing that they work at all!

Congrats on /.

Reading more... I strongly suspect it's controller failure. A number of cheap spinning drives in my ghetto home array have died suddenly with no SMART warning. Some seem to be unhappy with a SAS controller (rather than SATA ... repurposing them seems to be fine)... Others just seem to die.

Bad cold solder joints? Excess heat somewhere bad?

By Greg A. Woods at 2018-12-11 16:00:53:

Once upon a time it seemed to me as if the controller on a hard drive was almost more likely to fail than the mechanical bits (once head crashes became less likely). Indeed the drive would get spots where the rust was unreliable for whatever reason, but of course that doesn't cause a full-on failure -- just a bad sector. However a bad capacitor or weak diode somewhere in the digital logic could cause what appeared to be a complete failure of the drive, yet by simply replacing the controller board, the drive could continue to be used without any data loss. I rebuilt more than one MFM drive by just swapping in the controller board from a drive with crashed heads.

Now that the whole drive is electronics through and through, failures are perhaps more apt to look like controller failures, even if strictly speaking that's not the actual cause but rather just a symptom, e.g. the controller locked up because it can't talk to one of the memory chips any more, and the controller firmware author didn't take that particular failure mode into consideration. Controller firmware shouldn't fail badly, but there's a lot of firmware in all modern drives, spinning rust or solid state, and I wouldn't be surprised if it was buggier now than ever before.

By Shawn M. at 2018-12-11 16:08:02:

Unfortunately, SSDs are real, physical objects and they fail. :(

I know that sounds glib, but physical objects fail. CPUs and RAM fail even when they physically look fine; a microscopic investigation MIGHT yield the cause, but the effort outweighs the knowledge.

The value in this event is the reminder of the value and complexity of redundancy: Keep backups. RAID for uptime is a good idea. Source good components.

It sounds like you already do the last one. As an example, early SSDs were even less reliable, and I had an early run of OCZ SSDs fail - in a handful of consumer desktops and laptops, 5 over the course of a year.

I learned the same lessons then. And, to be fair, I can still always come up with an external circumstance for my data to be destroyed. All I can do is backup, failover, and monitor. (And make it stupidly easy to do those things, so I keep doing them!)

By Ryan P at 2018-12-11 20:34:53:

Apparently SpinRite from www.grc.com can sometimes resurrect SSDs also.

My guess? Go to https://www.swpc.noaa.gov/ and check the planetary k-index on December 10 :)

If I was running a data center, I would invest in electromagnetic shielding. IMHO solar activity accounts for a lot of electronic damage that gets dismissed as random accidents.

By A grumpy ex-sysadmin at 2018-12-12 09:42:42:

I use the same workaround to make SSDs reliable that I used for mechanical HDs: Use software RAID (firmware RAID is opaque as it's not free software), and try to only have one of each make and model of device in each array.

You can't predict defects but you can reduce the chance of multiple simultaneous failures crashing an important system by not having a storage monoculture.
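
(For what it's worth, here's a rough Python sketch of how you might audit an existing Linux md array for a monoculture. The sysfs paths are the standard ones, but the default array name "md0" and the exact layout on any particular system are assumptions, so treat it as illustrative rather than a finished tool.)

    # Rough sketch: list the vendor/model string of every member of a Linux
    # md array so a storage monoculture is easy to spot. Assumes the usual
    # sysfs layout; the default array name "md0" is just an example.
    import os
    import sys
    from collections import Counter

    def member_models(md="md0"):
        models = []
        for slave in os.listdir("/sys/block/%s/slaves" % md):
            # A member may be a partition (e.g. sda1); resolve it to its
            # parent disk so we can read the disk's model string.
            real = os.path.realpath(os.path.join("/sys/class/block", slave))
            parent = os.path.basename(os.path.dirname(real))
            disk = slave if parent == "block" else parent
            try:
                with open("/sys/block/%s/device/model" % disk) as f:
                    models.append(f.read().strip())
            except OSError:
                models.append("(unknown)")
        return models

    if __name__ == "__main__":
        md = sys.argv[1] if len(sys.argv) > 1 else "md0"
        for model, count in Counter(member_models(md)).items():
            warn = "  <-- more than one of the same model" if count > 1 else ""
            print("%d x %s%s" % (count, model, warn))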

On the topic of opaque SSD issues, Intel's powertop recommends settings which can cause Intel SSDs in laptops with Intel chipsets to hang. The kernel repeatedly resets the SATA link for reasons seemingly unrelated to power management.

Specifically, powertop considers any setting for "link_power_management_policy" other than "min_power" to be "Bad". But, "min_power" is the setting that causes the problems. In 4.19 the default can be changed in .config, with CONFIG_SATA_MOBILE_LPM_POLICY=3 or "med_power_with_dipm" being the best setting for my laptops.
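
(As a concrete illustration of where that knob lives at runtime, here is a small Python sketch that reads, and optionally sets, the link_power_management_policy attribute for each SATA host via sysfs. It assumes the standard /sys/class/scsi_host layout and needs root to change anything; consider it a sketch under those assumptions rather than a polished tool.)

    # Small sketch: show (and optionally set) the SATA link power management
    # policy for every SCSI host. Assumes the standard Linux sysfs attribute;
    # writing a new policy requires root. Typical values include "min_power",
    # "med_power_with_dipm", "medium_power", and "max_performance".
    import glob
    import sys

    ATTR = "/sys/class/scsi_host/host*/link_power_management_policy"

    def show_policies():
        for path in sorted(glob.glob(ATTR)):
            with open(path) as f:
                print("%s = %s" % (path, f.read().strip()))

    def set_policies(policy):
        for path in sorted(glob.glob(ATTR)):
            with open(path, "w") as f:
                f.write(policy)

    if __name__ == "__main__":
        if len(sys.argv) > 1:
            set_policies(sys.argv[1])   # e.g. "med_power_with_dipm"
        show_policies()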

Written on 10 December 2018.