Watch out for quietly degrading SATA disks

May 25, 2010

In an ideal world, disks would either be working completely normally or obviously broken (either producing errors or being completely dead); if a drive wasn't actively reporting problems, you could assume it was working fine. I am here to tell you that sadly we do not live in that world, at least not with SATA drives.

What we've now seen several times is SATA drives that degraded quietly; they didn't particularly report errors, they just started performing terribly (by their usual standards). The most recent case was a 1 TB SATA drive whose sequential read rate off the raw disk dropped from 100 Mbytes/sec to 39 Mbytes/sec, but we've had others (and from multiple vendors), and I've seen similar reports from other people.
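As a rough illustration of the sort of measurement involved, here is a minimal sketch of timing a sequential read in Python. The function name, the default sizes, and the choice of reporting decimal Mbytes/sec are all assumptions made for this example; it is not the tool we actually used. Reading a raw device such as /dev/sdX generally requires root, and reading a regular file instead will be flattered by the operating system's page cache unless you take steps to avoid that.

```python
import time


def sequential_read_rate(path, total_bytes=256 * 1024 * 1024,
                         chunk_size=1024 * 1024):
    """Read up to total_bytes sequentially from path and return Mbytes/sec.

    Hypothetical helper for illustration; 'Mbytes' here means 10**6 bytes.
    """
    read_so_far = 0
    start = time.perf_counter()
    # buffering=0 avoids Python-level buffering; it does NOT bypass the
    # kernel page cache, so results for regular files can be optimistic.
    with open(path, "rb", buffering=0) as dev:
        while read_so_far < total_bytes:
            chunk = dev.read(min(chunk_size, total_bytes - read_so_far))
            if not chunk:
                break  # hit end of file or device early
            read_so_far += len(chunk)
    elapsed = time.perf_counter() - start
    if read_so_far == 0 or elapsed <= 0:
        raise ValueError("no data read from %r" % (path,))
    return read_so_far / elapsed / 1e6
```

Something like `sequential_read_rate("/dev/sdb")` would then give a number to compare against what the same model of drive normally manages.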

(At least in our case there were no warning signs from SMART reports, although the disk did report a read failure recently (not during the speed tests, I'll note). Possibly that counts as a very bad sign these days; I'm certainly aware that write errors are, as they mean that the disk has exhausted its ability to spare out bad sectors.)

Clearly, sometimes modern disks either fail quietly or just go bonkers. Equally clearly we can no longer count on status and error monitoring to turn up disks with problems; we're going to need to put together some sort of positive health check, where we periodically test disk performance and start raising alarms if any disk comes in below where it should. Making this reliable in the face of regular production IO to the disks will be interesting.
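To make the idea concrete, here is one possible shape for the alarm side of such a check, sketched in Python. Everything in it is an assumption for illustration: the function name, the 60% threshold, and especially the choice to judge each disk by the best of several samples, on the theory that competing production IO can only make a sample look slower, never faster. It is a sketch of one approach, not something we have deployed.

```python
def find_slow_disks(samples, baselines, min_fraction=0.6):
    """Return (disk, best_rate, baseline) for disks that look degraded.

    samples:   dict mapping disk name -> list of measured Mbytes/sec
    baselines: dict mapping disk name -> expected Mbytes/sec when healthy
    min_fraction is an illustrative threshold, not a tuned value.
    """
    slow = []
    for disk in sorted(samples):
        rates = samples[disk]
        baseline = baselines.get(disk)
        if not rates or baseline is None:
            continue  # nothing to compare against; skip rather than guess
        # Use the best sample: production IO drags individual samples
        # down, so only a disk that is slow every time gets flagged.
        best = max(rates)
        if best < min_fraction * baseline:
            slow.append((disk, best, baseline))
    return slow
```

With the numbers from our most recent case, a disk whose best sample is 39 Mbytes/sec against a baseline of 100 would be flagged, while a disk that managed 101 on one of its samples would not.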

(It's possible that some of our apparently bad disks would be fine after being power-cycled and cooling down and so on. Re-testing the most recent failed disk is on my list of things to do sometime, to see if this issue is persistent. As a transient issue there are all sorts of possible explanations ranging from firmware bugs latching the drive into some peculiar state to excessive vibrations (we're now learning that these can visibly degrade drive performance). As a permanent issue, well, it could be something like too much bad sector sparing in action; I'm not certain if our current SMART monitoring software notices that.)
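One way to check whether sparing is showing up in SMART data is to look at the raw value of the reallocated sector count attribute. The sketch below parses the text of the attribute table that smartmontools' `smartctl -A` prints, assuming the usual ten-column layout with the raw value last and assuming the drive uses the common attribute name `Reallocated_Sector_Ct`; vendors are not consistent about either, so treat this as illustrative. The function name is made up for this example.

```python
def reallocated_sectors(smartctl_attr_text):
    """Return the raw Reallocated_Sector_Ct value, or None if not found.

    Expects the text of a SMART attribute table (as from 'smartctl -A').
    Assumes columns: ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED
    WHEN_FAILED RAW_VALUE; other layouts simply yield None.
    """
    for line in smartctl_attr_text.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[1] == "Reallocated_Sector_Ct":
            try:
                return int(fields[9])
            except ValueError:
                return None  # raw value in a format we do not understand
    return None
```

A monitoring script could record this number over time and complain when it grows, which is a different signal from the pass/fail health status.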


Comments on this page:

From 201.95.160.158 at 2010-05-26 07:13:05:

With a population of over 8k SATA disks, we've noticed that quite regularly. However, we monitor the number of media errors reported by the RAID controller and replace the disks as soon as possible. That specific threshold is not easy to calculate, but we've come to the conclusion that over 100 media errors is usually bad (if they are scattered around, not localized).

The biggest problem is when a disk is working very badly but won't fail, so operations are retried on it for a long time. NFS clients will feel that.

Giovanni

From 69.113.211.148 at 2010-05-26 08:42:39:

When I first started dating my fiancee, her desktop computer was dragging ass like you wouldn't believe. Some of her friends had run all kinds of cleanup on it, pruning out all the traces of spyware, malware and pre-installed Dell crap and stopping just short of blowing away and reinstalling the whole OS.

I took a look at it, did some of the more obvious checks, and after a day or so, I finally decided to run HD Tune's benchmark on it.

It peaked at THREE megabytes/sec throughput.

Just like in your situation, there were no SMART errors, no device resets, nothing super off-kilter with the drive counters, and the only thing apparently wrong was that the disk was inexplicably running like crap but ostensibly chugging along just fine.

This is something that the vendors will need to begin tracking -- having it abstracted away from us by RAID controllers and SAN storage processors doesn't make the process easy for people trying to roll their own baseline monitor.

--Jeff

Written on 25 May 2010.

Last modified: Tue May 25 23:22:22 2010