A spate of somewhat alarming flaky SMART errors on Crucial MX500 SSDs

December 10, 2018

We've been running Linux's smartd on all of our Linux machines for a long time now, and over that time it's been solidly reliable (with a few issues here and there, like not always handling disk removals and (re)insertions properly). SMART attributes themselves may or may not be indicative of anything much, but smartd does reliably alert on the ones that it monitors.

Except on our new Linux fileservers. For a significant amount of time now, smartd has periodically been sending us email about various drives having one 'currently unreadable (pending) sector' (SMART attribute 197). When we go look at the affected drive with smartctl, even within 60 seconds of the alert, the drive always reports that it has no unreadable pending sectors; the attribute is once again 0.
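For concreteness, the relevant smartd configuration looks roughly like the following; this is a minimal sketch, the device path and email address are placeholders, and '-a' already implies monitoring attribute 197:

```
# /etc/smartd.conf (sketch; device and address are placeholders)
# -a      track all SMART attributes and overall health
# -C 197  report nonzero 'currently unreadable (pending) sectors'
# -U 198  report nonzero 'offline uncorrectable sectors'
# -m      where to email alerts
/dev/sda -a -C 197 -U 198 -m root@example.com
```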

These fileservers use both SATA and SAS for connecting the drives, and we have an assorted mixture of 2TB SSDs: some Crucial MX300s, some Crucial MX500s, and some Micron 1100s. The errors happen for drives connected through both SATA and SAS, but what we hadn't noticed until now is that all of the errors are from Crucial MX500s. All of these have the same firmware version, M3CR010, which appears to be the only currently available one (although Crucial apparently released version M3CR022 at one point, it appears to have quietly pulled it since then).

These reported errors are genuine in one sense: it's not just smartd being flaky. We also track all the SMART attributes through our new Prometheus system, and it too periodically reports a temporary '1' value for various MX500s. However, as far as I can see, the Prometheus-noted errors always go away right afterward, just as the smartd ones do. In addition, no other SMART attributes on an affected drive show any unexpected changes (we see increases in eg 'power on hours' and other things that always count up). We've also done mass reads, SMART self-tests, and other things on these drives, always without problems reported, and there are no actual read errors reported at the Linux kernel level.
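As an aside, the sort of Prometheus query that surfaces these blips looks something like this; the metric name is an assumption based on a typical smartmon textfile collector and may well differ in any given setup:

```
# Did any drive report a nonzero pending-sector count in the last day?
# (hypothetical metric name from a smartmon textfile collector)
max_over_time(smartmon_current_pending_sector_raw_value[24h]) > 0
```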

(And these drives are in use in ZFS pools, and we haven't seen any ZFS checksum errors. I'm pretty confident that ZFS would catch any corrupted data the drives were returning, if they were.)

Although I haven't done extensive hand checking, the reported errors do appear to correlate with read and write IO happening on the drive. In spot checks using Prometheus disk metrics, none of the drives appeared to be inactive at the times that smartd emailed us, and they may all have been seeing a combination of read and write IO at the time. Almost all of our MX500 SSDs are in the two in-production fileservers that have been reporting errors; we have one that's in a test machine that's now basically inactive, and while I believe it reported errors in the past (when we were testing things with it), it hasn't for a while.

(Update: It turns out that I was wrong; we've never had errors reported on the MX500 in our test machine.)

I see at least two overall possibilities, and neither of them is entirely reassuring. One is that the MX500s have a small firmware bug where occasionally, under the right circumstances, they report an incorrect 'currently unreadable (pending) sectors' value for some internal reason (I can imagine various theories). The second is that our MX500s are detecting genuine unreadable sectors but then quietly curing them somehow. The latter is worrisome because it suggests that the drives are suffering real errors or already starting to wear out, despite a quite light IO load and less than a year in operation.

We don't have any solutions or answers, so we're just going to have to keep an eye on the situation. All in all it's a useful reminder that modern SSDs are extremely complicated things that are quite literally small computers (multi-core ones at that, these days), running complex software that's entirely opaque to us. All we can do is hope that they don't have too many bugs (either software or hardware or both).

(I have lots of respect for the people who write drive firmware. It's a high-stakes environment and one that's probably not widely appreciated. If it all works people are all 'well, of course', and if any significant part of it doesn't work, there will be hell to pay.)

Comments on this page:

By Reader McReader at 2018-12-10 02:27:56:

This is definitely normal. Drives constantly reallocate sectors; small defects shouldn't make the entire drive unusable. Typically RAID software can rebuild any data that might be lost.

The pending sector count will increase and then decrease as the reallocation occurs, and attribute 5 (Reallocated Sector Count) should increase. You can check the additional sense code qualifiers to see if reallocation is happening. The log pages may also have a reallocation event counter you can check (smartctl -x).
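As a sketch of that kind of check, here is one way to pull attributes 5 (Reallocated Sector Count) and 197 (Current Pending Sector Count) out of 'smartctl -A' output; the table parsing is a best-effort assumption about smartctl's usual layout, not a robust parser:

```python
import subprocess

def parse_smart_attrs(text):
    """Parse 'smartctl -A' table output into {attribute id: raw value}.

    Assumes the usual ten-column layout (ID# ATTRIBUTE_NAME FLAG VALUE
    WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE), with the raw
    value occupying the tenth column onward.
    """
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have ten columns.
        if len(fields) >= 10 and fields[0].isdigit():
            attrs[int(fields[0])] = " ".join(fields[9:])
    return attrs

def check_drive(device):
    """Return the (reallocated, pending) raw values for a drive."""
    out = subprocess.run(["smartctl", "-A", device],
                         capture_output=True, text=True).stdout
    attrs = parse_smart_attrs(out)
    return attrs.get(5), attrs.get(197)
```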

By rephlex at 2018-12-11 22:22:49:

M3CR023 firmware is available now. I have a 1 TB MX500 that has been running firmware M3CR022 since mid-July in my Windows 10 laptop, and I haven't had any problems.

We currently have ~300 SSDs deployed across various machines; over time we've used Crucial MX100/MX200/MX300/MX500, Crucial BX100/BX200, Micron M600, SanDisk Extreme Pro, Intel 300/500 series, and Intel 750 NVMe.

Until the Crucial MX500, we had never seen problems with "Currently unreadable (pending) sectors". Now, within 4-8 weeks of use, at least 10 (probably more) of our 96 MX500 drives are reporting this SMART error.


FYI – As per Red Hat’s official documentation “What does the message "smartd[<pid>]: Device: /dev/sdd [SAT], 6 Currently unreadable (pending) sectors" mean and what to do about it?” - https://access.redhat.com/solutions/3919531

"However, some drives will not immediately remap such sectors when written; instead the drive will first attempt to write to the problem sector and if the write operation is successful then the sector will be marked good (in this case, the "Reallocation Event Count" (0xC4) will not be increased).

This is a serious shortcoming, for if such a drive contains marginal sectors that consistently fail only after some time has passed following a successful write operation, then the drive will never remap these problem sectors.

The drive firmware should perform this remapping and reduce the counter after each sector is remapped."


I have been in contact with Crucial/Micron support and they have stated:

"I consulted with our Engineering Department and based off of the SMART data provided the drive still appears to be healthy. While the "Current Pending Sector Count" you referenced is serious for HDDs it is not usually a direct indicator of failure for SSDs. Sectors in a pending state are currently being evaluated by our error correction algorithms and are likely to be resolved. We noted that there are no reallocated blocks and no uncorrectable errors. I would recommend viewing our article below that covers SMART data and how it can often be misread by third party programs. We also offer a Linux version of our SMART reporting tool which I will link as well."

"SSDs and SMART Data: https://www.crucial.com/csrusa/en/ssds-and-smart-data" - (not really helpful and IMO parts sound like a vendor excuse for firmware bugs)

"Storage Executive Software: https://www.micron.com/products/solid-state-drives/storage-executive-software" - (yet to try, but don't like using third-party tools like this, especially when they're not open source or even packaged into an RPM / available on a yum repo)


Links / Citations:

- https://en.wikipedia.org/wiki/S.M.A.R.T.#Known_ATA_S.M.A.R.T._attributes

- https://access.redhat.com/solutions/3919531

- https://www.thomas-krenn.com/en/wiki/SMART_tests_with_smartctl

- https://www.reddit.com/r/freenas/comments/akpjfm/smart_unreadable_pending_sectors_error/ - (Micron MX500 Pending Sectors discussion)

- https://utcc.utoronto.ca/~cks/space/blog/tech/SMARTAlarmingFlakyErrors - This discussion - (“A spate of somewhat alarming flaky SMART errors on Crucial MX500 SSDs”)

- https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/storage_administration_guide/ch-ssd

By Lucretia S at 2020-03-07 17:37:13:

The brief changes of Pending Count from 0 to 1 to 0 correlate perfectly with the Crucial MX500 ssd's FTL controller occasionally writing thousands of pages in a brief burst. Here's how I learned that:

I've been running a .bat file on my Windows 10 pc that reads the ssd's SMART data every few seconds and logs the changes of the two SMART attributes (Host Pages Written and FTL Pages Written) that can be used to calculate the Write Amplification Factor (WAF). My logs show there are occasional bursts of FTL page writes, each burst is approximately a multiple of 37,000 pages, and the burst write speed is about 230 GBytes/second. (I presume the FTL reads as much as it writes during those bursts, and the combined read & write rate probably exceeds 460 GB/second.) I assume the cause of the FTL write bursts is a background process such as garbage collection, or static wear leveling, or copying from TLC that had been written in Fast SLC mode (Crucial's Dynamic Write Acceleration feature) to TLC in normal (3 bits per cell) mode. This morning I modified the .bat file so it also logs the Pending Count and set it to log every 2 seconds. It's clear from my log that the change of Pending Count from 0 to 1 correlates with the start of each FTL write burst, and the change from 1 to 0 correlates with the completion of the FTL write burst.
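The WAF arithmetic here is straightforward once you have the deltas of those two attributes over an interval; as a sketch (the attribute names are the commenter's, and the actual polling of SMART data is left out):

```python
def waf(host_pages, ftl_pages):
    """Write amplification factor over an interval, given the deltas of
    the 'Host Pages Written' and 'FTL Pages Written' SMART attributes.

    The NAND absorbs both kinds of writes, so
    WAF = (host + ftl) / host.
    """
    if host_pages == 0:
        # Pure background activity; WAF is effectively unbounded.
        return float("inf")
    return (host_pages + ftl_pages) / host_pages

# For instance, an interval with 1,000 host pages and a 37,000-page
# FTL burst works out to a WAF of 38.
```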

The reason I began logging SMART data and short-term WAF values is that I noticed the decrease of Remaining Life (RL) of my ssd accelerated after I reduced the pc's writing to the ssd (from about 1 MByte/second average to about 100 kBytes/second average). Of course I had expected the opposite effect on RL: the less the pc writes, the longer the ssd's life ought to be. The decrease of RL from 100% to 95% corresponded to the pc writing 5,782 GB to the ssd, an average of 1,156 GB per percent. Then I reduced the writing to the ssd. The decrease from 95% to 94% corresponded with the pc writing only 390 GB, and the decrease from 94% to 93% corresponded with the pc writing only 138 GB. That seems paradoxical, and I believe it exposed another bug in the Crucial firmware. WAF was averaging about 40 (with large deviations) after I reduced the pc's writing to the ssd. That's excessively high.

I then theorized that the excess background writing by the FTL might be reduced if the ssd were kept busy with a higher priority task. I found a great task for this purpose: the ssd extended selftest, which doesn't require any work by the cpu, and which apparently suspends whenever the pc reads or writes, so it presumably doesn't hurt performance. I wrote another .bat file that runs extended selftests in an infinite loop using the 'smartctl.exe -t long' command, with a duty cycle that I can choose (by using the 'smartctl.exe -X' command to abort the selftest). My results show that selftests with a high duty cycle are very effective at managing WAF, and reduce the frequency of the occasional FTL write bursts. With selftests running with a duty cycle of 19.5 minutes of every 20, WAF has been averaging around 3.
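A sketch of such a duty-cycle loop, using the commenter's 19.5-out-of-every-20-minutes figure; the smartctl invocations and device path are assumptions, and nothing here has been tried against real hardware:

```python
import subprocess
import time

def cycle_times(run_minutes, cycle_minutes):
    """Seconds to let the selftest run, and seconds to leave the drive
    idle, per duty cycle."""
    return run_minutes * 60.0, (cycle_minutes - run_minutes) * 60.0

def selftest_loop(device, run_minutes=19.5, cycle_minutes=20.0):
    """Keep an extended SMART selftest running for most of each cycle."""
    run_s, idle_s = cycle_times(run_minutes, cycle_minutes)
    while True:
        # Start (or restart) the extended selftest; it runs entirely
        # inside the drive, so the host CPU does no work for it.
        subprocess.run(["smartctl", "-t", "long", device], check=False)
        time.sleep(run_s)
        # Abort it for the rest of the cycle.
        subprocess.run(["smartctl", "-X", device], check=False)
        time.sleep(idle_s)
```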

Obvious side effects of near-nonstop ssd selftests are: (1) increased power consumption (which I estimate is about one watt), (2) average temperature higher by about 5 degrees C (but ssd temperature has been a very acceptable 40C), and (3) the ssd rarely enters a low power mode. I haven't yet benchmarked speed to see whether it's hurt or helped... it's possible that speed is helped a little because the ssd no longer needs to transition from low power mode to normal power mode. I don't know whether the higher temperature hurts or helps the ssd lifespan; although average temperature might be bad all else being equal, all else is not equal because a constant temperature and constant power might be better than temperature fluctuations and power spikes.

Regarding the effect of the ssd selftests on the Pending Count bug, the selftests greatly reduce the frequency of the Pending Count event since the selftests greatly reduce the frequency of the FTL write bursts. But the bursts aren't entirely eliminated. Perhaps a 100% duty cycle of selftests would eliminate the bursts and the Pending Count events, but that seems risky... it's possible that nonstop selftests would prevent necessary background processes from getting enough runtime and would lead to premature death of the ssd.



Last modified: Mon Dec 10 00:18:24 2018