A spate of somewhat alarming flaky SMART errors on Crucial MX500 SSDs

December 10, 2018

We've been running smartd on all of our Linux machines for a long time now, and over that time it's been solidly reliable, with only a few issues here and there (like not always handling disk removals and (re)insertions properly). SMART attributes themselves may or may not be indicative of anything much, but smartd does reliably alert on the ones that it monitors.
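
As a concrete sketch, a minimal smartd configuration along the lines of what we do might look like this (the mail address is a placeholder, not our actual setup):

  # /etc/smartd.conf (an illustrative sketch, not our real configuration)
  # Scan for all disks and monitor everything; '-a' implicitly includes
  # '-C 197', which is what generates pending-sector reports.
  DEVICESCAN -a -m sysadmins@example.org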

Except on our new Linux fileservers. For a significant amount of time now, smartd has periodically been emailing us to say that various drives now have one 'currently unreadable (pending) sector' (SMART attribute 197). When we look at the affected drive with smartctl, even within 60 seconds of the report, the drive has always said that it has no unreadable pending sectors; the attribute is back to 0.
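
Checking by hand is just a matter of something like this (with a generic device name standing in for the real one):

  # Dump the drive's SMART attribute table; attribute 197 is
  # Current_Pending_Sector. By the time we run this, its raw value
  # has always gone back to 0.
  smartctl -A /dev/sdX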

These fileservers use both SATA and SAS for connecting the drives, and we have an assorted mixture of 2TB SSDs: some Crucial MX300s, some Crucial MX500s, and some Micron 1100s. The errors happen for drives connected through both SATA and SAS, but what we hadn't noticed until now is that all of the errors come from Crucial MX500s. All of these have the same firmware version, M3CR010, which appears to be the only one currently available (at one point Crucial apparently released version M3CR022, cf, but it appears to have been quietly pulled since then).

These reported errors are genuine in one sense: it's not just smartd being flaky. We also track all of the SMART attributes through our new Prometheus system, and it too periodically reports a temporary '1' value for various MX500s. However, as far as I can see, the Prometheus-noted errors always go away right afterward, just as the smartd ones do. In addition, no other SMART attribute on an affected drive shows any unexpected change (we do see increases in eg 'power on hours' and other things that always count up). We've also done mass reads, SMART self-tests, and other exercises on these drives, always without any problems being reported, and there are no actual read errors at the Linux kernel level.
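
As a sketch of the kind of query involved, assuming the SMART attributes are exported under the metric names the common smartmon.sh textfile collector uses (which may not match our actual setup), asking Prometheus for recent nonzero values looks something like:

  # Find any drive whose pending-sector count was nonzero at any point
  # in the past hour. The metric name comes from the smartmon.sh
  # textfile collector and is an assumption; the server is a placeholder.
  curl -s 'http://prometheus.example.org:9090/api/v1/query' \
    --data-urlencode 'query=max_over_time(smartmon_current_pending_sector_raw_value[1h]) > 0'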

(These drives are in use in ZFS pools, and we haven't seen any ZFS checksum errors. I'm pretty confident that if the drives were returning corrupted data, ZFS would catch it.)
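
Verifying that is simple; corrupted data returned by a drive would show up as nonzero counts in the CKSUM column of each pool's status:

  # Per-device READ, WRITE, and CKSUM error counts for every pool;
  # '-v' also lists any files affected by unrecoverable errors.
  zpool status -v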

Although I haven't done extensive hand checking, the reported errors do appear to correlate with read and write IO happening on the drive. In spot checks using Prometheus disk metrics, none of the drives appeared to be inactive at the times that smartd emailed us, and they may all have been seeing a combination of read and write IO at the time. Almost all of our MX500 SSDs are in the two in-production fileservers that have been reporting errors; we have one that's in a test machine that's now basically inactive, and while I believe it reported errors in the past (when we were testing things with it), it hasn't for a while.

(Update: It turns out that I was wrong; we've never had errors reported on the MX500 in our test machine.)
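
For those spot checks, per-drive IO rate queries along these lines are enough (assuming a recent node_exporter's metric names; 'sdX' and the server URL are placeholders):

  # Read IO rate for one drive over five minutes; substitute 'writes'
  # for 'reads' to see write IO. The metric name assumes node_exporter
  # 0.16 or later.
  curl -s 'http://prometheus.example.org:9090/api/v1/query' \
    --data-urlencode 'query=rate(node_disk_reads_completed_total{device="sdX"}[5m])'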

I see at least two overall possibilities, and neither of them is entirely reassuring. One possibility is that the MX500s have a small firmware bug where occasionally, under the right circumstances, they report an incorrect 'currently unreadable (pending) sectors' value for some internal reason (I can imagine various theories). The second is that our MX500s are detecting genuine unreadable sectors but then quietly curing them somehow. This is worrisome because it suggests that the drives are suffering real errors or already starting to wear out, despite a quite light IO load and less than a year in operation.

We don't have any solutions or answers, so we're just going to have to keep an eye on the situation. All in all, it's a useful reminder that modern SSDs are extremely complicated devices that are quite literally small computers (multi-core ones at that, these days), running complex software that's entirely opaque to us. All we can do is hope that they don't have too many bugs (whether in software, hardware, or both).

(I have lots of respect for the people who write drive firmware. It's a high-stakes environment and one that's probably not widely appreciated. If it all works people are all 'well, of course', and if any significant part of it doesn't work, there will be hell to pay.)
