Watch out for quietly degrading SATA disks
In an ideal world, disks would either be working completely normally or obviously broken (either producing errors or being completely dead); if a drive wasn't actively reporting problems, you could assume it was working fine. I am here to tell you that sadly we do not live in that world, at least not with SATA drives.
What we've now seen several times is SATA drives that degraded quietly; they didn't particularly report errors, they just started performing terribly (by their usual standards). The most recent case was a 1 TB SATA drive whose sequential read rate off the raw disk dropped from 100 Mbytes/sec to 39 Mbytes/sec, but we've had others (and from multiple vendors), and I've seen similar reports from other people.
(At least in our case there were no warning signs in the SMART reports, although the disk did recently report a read failure; I'll note that this was not during the speed tests. Possibly that counts as a very bad sign these days. I'm certainly aware that write errors are, since they mean that the disk has exhausted its ability to spare out bad sectors.)
Clearly, sometimes modern disks either fail quietly or just go bonkers. Equally clearly, we can no longer count on status and error monitoring to turn up disks with problems; we're going to need to put together some sort of positive health check, where we periodically test disk performance and start raising alarms if any disk comes in below where it should. Making this reliable in the face of regular production IO to the disks will be interesting.
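To make the idea of a positive health check a bit more concrete, here is a minimal sketch in Python. It is not what we actually run; the function names, the 256 Mbyte sample size, the 70% tolerance, and the idea of taking the best of several samples (on the theory that competing production IO can only make a sample slower, not faster) are all my assumptions for illustration. It also reads through the operating system's normal buffering, so repeated reads of the same region may be served from cache and look faster than the disk really is; a real check would need to deal with that somehow.

```python
import os
import time


def sequential_read_rate(path, total_bytes=256 * 1024 * 1024,
                         block_size=1024 * 1024):
    """Sequentially read up to total_bytes from path; return Mbytes/sec.

    path can be a raw disk device or an ordinary file. The sizes are
    illustrative assumptions, not tuned values.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        done = 0
        start = time.monotonic()
        while done < total_bytes:
            chunk = os.read(fd, min(block_size, total_bytes - done))
            if not chunk:
                break  # hit end of file or device early
            done += len(chunk)
        elapsed = time.monotonic() - start
    finally:
        os.close(fd)
    # Guard against a zero interval on very small or very fast reads.
    return (done / 1e6) / max(elapsed, 1e-9)


def best_rate(path, samples=3, **kwargs):
    """Take several samples and keep the fastest one.

    Assumption: other IO to the disk only slows a sample down, so the
    best sample is the closest to what the disk can really do.
    """
    return max(sequential_read_rate(path, **kwargs) for _ in range(samples))


def is_degraded(measured, baseline, tolerance=0.7):
    """True if measured falls below tolerance * baseline (both Mbytes/sec).

    The 0.7 tolerance is a made-up threshold for illustration.
    """
    return measured < baseline * tolerance
```

With this, a periodic job could call something like is_degraded(best_rate("/dev/sdb"), 100.0) and raise an alarm when it comes back true; the device name and the 100 Mbytes/sec baseline are placeholders, and the baseline would have to come from measuring each disk when it was known to be healthy. Our drive that dropped to 39 Mbytes/sec would have tripped this check easily.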
(It's possible that some of our apparently bad disks would be fine after being power-cycled and cooling down and so on. Re-testing the most recent failed disk is on my list of things to do sometime, to see if this issue is persistent. As a transient issue there are all sorts of possible explanations ranging from firmware bugs latching the drive into some peculiar state to excessive vibrations (we're now learning that these can visibly degrade drive performance). As a permanent issue, well, it could be something like too much bad sector sparing in action; I'm not certain if our current SMART monitoring software notices that.)
Written on 25 May 2010.