Sometimes you get lucky and apparently dead disks come back to life

February 2, 2017

You might wonder what happened after we lost both sides of a Linux software RAID mirror with no advance warning to us. At the time I said that we'd merely probably lost the mirror, because maybe one of the disks would come back to life after we power-cycled the machine. I was being pretty hopeful in that 'probably'; given that one disk had strongly failed and the other one had been giving read errors for some time, I thought it was unlikely that either would be so nice as to come back on us.

Well, apparently sometimes the world is nice and you get lucky. When my co-worker power cycled the server the next day, the disk which had been throwing read errors for some time before it died did in fact come back to life. In an even more startling development, it reported no read errors when it was used as the source to resync the array on to a new disk; at least in theory, this might mean that all data on it actually was intact and was recovered successfully.

(Since we're using ext4 on that server, not ZFS, we have no real way of knowing if something got quietly corrupted. Maybe there's garbage in the middle of some log file.)
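(As a rough sketch of what it would take to know on ext4: you would need per-file checksums recorded before the trouble started, so you could compare them afterward. We had no such manifest, and this is not something we actually ran. The function names below are hypothetical, and the sketch assumes the files of interest are ones that nothing legitimately rewrites, since a changed hash can't tell corruption apart from a normal write.)

```python
import hashlib
import os


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(root):
    """Map each regular file under root (by relative path) to its digest."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.isfile(full) and not os.path.islink(full):
                manifest[os.path.relpath(full, root)] = file_sha256(full)
    return manifest


def changed_files(old, new):
    """List paths present in both manifests whose digests differ."""
    return sorted(p for p in old if p in new and old[p] != new[p])
```

The idea would be to save the result of `build_manifest()` somewhere off the suspect disks ahead of time, rebuild it after the resync, and look at what `changed_files()` reports. For things like actively written log files this tells you very little, which is part of why filesystem-level checksums are more attractive.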

This does raise some obvious questions, though. This server is a Dell R310 II (yes, really), and while these theoretically take four HDs this is one of the few of them where we're actually trying to do that. In general we haven't had the best of luck with our four-disk R310s; although I haven't entirely kept track, I believe we have lost a number of disks in them, more than we have in R310 IIs with only one or two disks. And this particular server has definitely eaten disks before, although these particular disks were some of our unreliable 1 TB Seagates. Perhaps at least some of those progressive read errors were due to some environmental problem around the disk; maybe heat, maybe power, maybe some other glitch. In this theory, when we power cycled things and pulled the other drive, we relieved the stresses on that disk enough that it could return good data again (at least for a while).

(We replaced that disk too, of course; even if we could read from it once, we didn't trust it after it had given ongoing read errors. Better to be safe than sorry, especially with disks that are known to be prone to dying.)

What I take away from this is yet another reminder that modern disks are unpredictable and tricky. The days when things either worked or the disk was dead are long over; these days there are all sorts of failure modes and ways for disks to get into trouble. All you can say when read errors start happening is that something certainly isn't going the way it should, but exactly what is not necessarily clear or easily figured out.

(And it could perfectly well be a combination of factors, where the R310 IIs are putting extra stress on disks that are weak to start with. The R310 IIs might be okay with more robust disks and these disks might do okay in another environment; put them together and it's a bad time.)

Comments on this page:

These things are weird in all sorts of novel and fascinating ways. Recently, one of the two hard drives in my RAID array started failing. Badly. Threw read errors all the time. I shrugged, bought a new hard drive and carried on (I regularly back this up to an external hard drive. It's not too much data, I just really like my music collection).

Two days later, poof, the other old hard drive starts throwing off errors. Fortunately, the new drive was fine. This was also early December, I was busy with a lot of stuff and figured this was a great time to re-evaluate whether I actually needed redundancy in my workstation.

I figured out a single hard drive + regular backups are actually all I need now (the system is five years old, and things were different five years ago), so I just destroyed the RAID array and restored the backup on the working hard drive.

However, I was too busy to actually yank the old drive out of the machine. I left it there, despite being mildly annoyed by the fact that dmesg included a lot of complaints from the SATA module and by the occasional screeching.

I eventually forgot about it. Why? For some reason, it started working again.

I don't know how or when; I just realized at one point a few weeks ago that I no longer see the SATA module complaining about it in dmesg.

Of course, I don't actually trust the damn disk at this point, but I did schedule a script that rsyncs it with the "main" hard drive, just to see how long it will last.

It's been working like a charm and happily syncing 1 TB of data twice a week for more than six weeks now.

By Colin McDermott at 2017-02-02 23:30:36:

It's happened to me in the past, but largely with older Quantum Bigfoot drives (yes, HORRID drives that probably sank Quantum). Once the caps cooled, you could turn the HDD on again.

Written on 02 February 2017.

Last modified: Thu Feb 2 00:16:02 2017