A spate of somewhat alarming flaky SMART errors on Crucial MX500 SSDs

December 10, 2018

We've been running Linux's smartd on all of our Linux machines for a long time now, and over that time it's been solidly reliable (with a few issues here and there, like not always handling disk removals and (re)insertions properly). SMART attributes themselves may or may not be indicative of anything much, but smartd does reliably alert on the ones that it monitors.

Except on our new Linux fileservers. For a significant amount of time now, smartd has periodically been sending us email about various drives now having one 'currently unreadable (pending) sectors' (which is SMART attribute 197). When we go look at the affected drive with smartctl, even within 60 seconds of the event being reported, the drive has always reported that it now has no unreadable pending sectors; the attribute is once again 0.
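As a rough illustration of that kind of check, here is a minimal Python sketch that pulls the raw value of attribute 197 out of `smartctl -A` output. The function names are made up for this example, and the parsing assumes the usual ten-column attribute table that smartctl prints for ATA drives; it is not how smartd itself does it.

```python
import subprocess
from typing import Dict, Optional

PENDING_SECTORS_ID = 197  # 'currently unreadable (pending) sectors'


def parse_raw_values(smartctl_output: str) -> Dict[int, int]:
    """Map SMART attribute ID -> raw value from `smartctl -A` text.

    Assumes the standard attribute table, where the ID is the first
    column and the raw value starts in the tenth column.
    """
    values: Dict[int, int] = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID; skip headers and prose.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        raw = fields[9]
        # Some raw values carry trailing text (e.g. temperature ranges);
        # only the leading integer is kept here.
        if raw.isdigit():
            values[int(fields[0])] = int(raw)
    return values


def pending_sectors(device: str) -> Optional[int]:
    """Return the current raw value of attribute 197, or None if absent."""
    # smartctl's exit status is a bitmask and can be non-zero even when
    # the attribute table was printed, so it is not treated as fatal.
    result = subprocess.run(
        ["smartctl", "-A", device],
        capture_output=True, text=True, check=False,
    )
    return parse_raw_values(result.stdout).get(PENDING_SECTORS_ID)
```

Calling `pending_sectors("/dev/sda")` (the device name is only an example) right after an alert would, in the situation described above, be expected to return 0 again.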

These fileservers use both SATA and SAS for connecting the drives, and we have an assorted mixture of 2TB SSDs: some Crucial MX300s, some Crucial MX500s, and some Micron 1100s. The errors happen for drives connected through both SATA and SAS, but what we hadn't noticed until now is that all of the errors are from Crucial MX500s. All of these have the same firmware version, M3CR010, which appears to be the only currently available one (although at one point Crucial apparently released version M3CR022, it appears to have quietly pulled it since then).

These reported errors are genuine in one sense, in that it's not just smartd being flaky. We also track all the SMART attributes through our new Prometheus system, and it also periodically reports a temporary '1' value for various MX500s. However, as far as I can see the Prometheus-noted errors always go away right afterward, just as the smartd ones do. In addition, no other SMART attributes on an affected drive show any unexpected changes (we see increases in eg 'power on hours' and other things that always count up). We've also done mass reads, SMART self-tests, and other things on these drives, always without problems reported, and there are no actual reported read errors at the Linux kernel level.
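To make the 'temporary 1' pattern concrete, here is a small Python sketch that scans a series of scraped attribute-197 samples and reports each episode where the value rose above zero and then dropped back. The sample format, a time-sorted list of `(timestamp, value)` pairs, and the function name are assumptions for illustration; this is not how our Prometheus setup is actually queried.

```python
def find_transient_blips(samples):
    """Find episodes where the value went above 0 and then returned to 0.

    samples: list of (timestamp, value) pairs, sorted by timestamp.
    Returns a list of (first_nonzero_ts, first_zero_again_ts, peak_value).
    An episode that is still non-zero at the end of the data is not
    reported, since it has not (yet) cleared.
    """
    blips = []
    start = None
    peak = 0
    for ts, value in samples:
        if value > 0:
            if start is None:
                # A new episode begins at this sample.
                start = ts
                peak = value
            else:
                peak = max(peak, value)
        elif start is not None:
            # The value is back at zero: close the episode.
            blips.append((start, ts, peak))
            start = None
    return blips
```

With one sample a minute, an episode that spans a single scrape interval would correspond to the behavior described above, where the error is gone by the time anyone looks.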

(And these drives are in use in ZFS pools, and we haven't seen any ZFS checksum errors. I'm pretty confident that ZFS would catch any corrupted data the drives were returning, if they were.)

Although I haven't done extensive hand checking, the reported errors do appear to correlate with read and write IO happening on the drive. In spot checks using Prometheus disk metrics, none of the drives appeared to be inactive at the times that smartd emailed us, and they may all have been seeing a combination of read and write IO at the time. Almost all of our MX500 SSDs are in the two in-production fileservers that have been reporting errors; we have one that's in a test machine that's now basically inactive, and while I believe it reported errors in the past (when we were testing things with it), it hasn't for a while.

(Update: It turns out that I was wrong; we've never had errors reported on the MX500 in our test machine.)
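The spot checks against disk metrics could be automated along the following lines. This is only a sketch under assumptions: the IO samples are taken to be time-sorted `(timestamp, read_rate, write_rate)` tuples for one drive, the five-minute lookback window is arbitrary, and the function name is hypothetical.

```python
from bisect import bisect_left, bisect_right


def io_around_alerts(alert_times, io_samples, window=300):
    """For each alert time, report whether reads/writes were seen shortly before.

    alert_times: iterable of timestamps (seconds) when smartd alerted.
    io_samples: list of (timestamp, read_rate, write_rate), sorted by timestamp.
    window: how many seconds before the alert to look at (an arbitrary choice).
    """
    times = [sample[0] for sample in io_samples]
    report = {}
    for alert in alert_times:
        # Select samples with alert - window <= timestamp <= alert.
        lo = bisect_left(times, alert - window)
        hi = bisect_right(times, alert)
        nearby = io_samples[lo:hi]
        report[alert] = {
            "read": any(read > 0 for _, read, _ in nearby),
            "write": any(write > 0 for _, _, write in nearby),
        }
    return report
```

An alert whose entry shows neither reads nor writes would be a counterexample to the apparent correlation with IO activity.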

I see at least two overall possibilities and neither of them is entirely reassuring. One possibility is that the MX500s have a small firmware bug where occasionally, under the right circumstances, they report an incorrect 'currently unreadable (pending) sectors' value for some internal reason (I can imagine various theories). The second is that our MX500s are detecting genuine unreadable sectors, but then quietly curing them somehow. This is worrisome because it suggests that the drives are actually suffering real errors or already starting to wear out, despite a quite light IO load and being in operation for less than a year.

We don't have any solutions or answers, so we're just going to have to keep an eye on the situation. All in all it's a useful reminder that modern SSDs are extremely complicated things that are quite literally small computers (multi-core ones at that, these days), running complex software that's entirely opaque to us. All we can do is hope that they don't have too many bugs (either software or hardware or both).

(I have lots of respect for the people who write drive firmware. It's a high-stakes environment and one that's probably not widely appreciated. If it all works people are all 'well, of course', and if any significant part of it doesn't work, there will be hell to pay.)


Comments on this page:

By Reader McReader at 2018-12-10 02:27:56:

This is definitely normal. Drives constantly reallocate sectors; small defects shouldn't make the entire drive unusable. Typically RAID software is able to rebuild any data that might be lost.

The pending sector count will increase and then decrease as the reallocation occurs, and attribute 5 should increase. You can check the additional sense code qualifiers to see if reallocation is happening. The log pages may also have a reallocation event counter you can check (smartctl -x).
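A check along those lines might look like the following Python sketch, which compares two snapshots of raw attribute values taken before and after a pending-sector count drops. The snapshot format (a dict of attribute ID to raw value), the function name, and the result labels are all assumptions for illustration; attribute 196 is the reallocation event count in the commonly documented ATA attribute list, and a given drive may not report it.

```python
PENDING = 197          # current pending sector count
REALLOCATED = 5        # reallocated sector (or block) count
REALLOC_EVENTS = 196   # reallocation event count (0xC4)


def classify_pending_clear(before, after):
    """Classify what happened between two snapshots of raw SMART values.

    before, after: dicts mapping attribute ID -> raw value.
    Returns "no-clear" if the pending count was not above zero or did not
    drop, "reallocated" if it dropped and a reallocation counter rose,
    and "cleared-without-reallocation" otherwise.
    """
    pending_before = before.get(PENDING, 0)
    pending_after = after.get(PENDING, 0)
    if pending_before == 0 or pending_after >= pending_before:
        return "no-clear"
    # Missing attributes are treated as 0 on both sides.
    grew = any(
        after.get(attr, 0) > before.get(attr, 0)
        for attr in (REALLOCATED, REALLOC_EVENTS)
    )
    return "reallocated" if grew else "cleared-without-reallocation"
```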

By rephlex at 2018-12-11 22:22:49:

M3CR023 firmware is available now. I have a 1 TB MX500 which has been running firmware M3CR022 since mid-July on my Windows 10 laptop, and I haven't had any problems.

We currently have roughly 300 SSDs deployed across various machines. Over time we've used Crucial MX100, MX200, MX300, MX500, BX100, and BX200; Micron M600; SanDisk Extreme Pro; Intel 300/500 series; and Intel 750 NVMe.

Until the Crucial MX500, we had never seen problems with "Currently unreadable (pending) sectors". Now, within 4-8 weeks of use, at least 10 (probably more) of our 96 MX500 drives are reporting this SMART error.

---

FYI: per Red Hat's official documentation, “What does the message "smartd[<pid>]: Device: /dev/sdd [SAT], 6 Currently unreadable (pending) sectors" mean and what to do about it?” (https://access.redhat.com/solutions/3919531):

"However, some drives will not immediately remap such sectors when written; instead the drive will first attempt to write to the problem sector and if the write operation is successful then the sector will be marked good (in this case, the "Reallocation Event Count" (0xC4) will not be increased).

This is a serious shortcoming, for if such a drive contains marginal sectors that consistently fail only after some time has passed following a successful write operation, then the drive will never remap these problem sectors.

The drive firmware should perform this remapping and reduce the counter after each sector is remapped."

---

I have been in contact with Crucial/Micron support and they have stated:

"I consulted with our Engineering Department and based off of the SMART data provided the drive still appears to be healthy. While the "Current Pending Sector Count" you referenced is serious for HDDs it is not usually a direct indicator of failure for SSDs. Sectors in a pending state are currently being evaluated by our error correction algorithms and are likely to be resolved. We noted that there are no reallocated blocks and no uncorrectable errors. I would recommend viewing our article below that covers SMART data and how it can often be misread by third party programs. We also offer a Linux version of our SMART reporting tool which I will link as well."

"SSDs and SMART Data: https://www.crucial.com/csrusa/en/ssds-and-smart-data" - (not really helpful and IMO parts sound like a vendor excuse for firmware bugs)

"Storage Executive Software: https://www.micron.com/products/solid-state-drives/storage-executive-software" - (yet to try, but I don't like using third-party tools like this, especially when they're not open source or even packaged into an RPM or available in a yum repo)

---

Links / Citations:

- https://en.wikipedia.org/wiki/S.M.A.R.T.#Known_ATA_S.M.A.R.T._attributes

- https://access.redhat.com/solutions/3919531

- https://www.thomas-krenn.com/en/wiki/SMART_tests_with_smartctl

- https://www.reddit.com/r/freenas/comments/akpjfm/smart_unreadable_pending_sectors_error/ - (Micron MX500 Pending Sectors discussion)

- https://utcc.utoronto.ca/~cks/space/blog/tech/SMARTAlarmingFlakyErrors - This discussion - (“A spate of somewhat alarming flaky SMART errors on Crucial MX500 SSDs”)

- https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/storage_administration_guide/ch-ssd

Written on 10 December 2018.

Last modified: Mon Dec 10 00:18:24 2018