Sometimes you get lucky and apparently dead disks come back to life
You might wonder what happened after we lost both sides of a Linux software RAID mirror with no advance warning. At the time I said that we'd merely probably lost the mirror, because maybe one of the disks would come back to life after we power-cycled the machine. I was being pretty hopeful with that 'probably'; given that one disk had failed hard and the other had been giving read errors for some time, I thought it was unlikely that either would be so nice as to come back for us.
Well, apparently sometimes the world is nice and you get lucky. When my co-worker power cycled the server the next day, the disk which had been throwing read errors for some time before it died did in fact come back to life. In an even more startling development, it reported no read errors when it was used as the source to resync the array on to a new disk; at least in theory, this might mean that all data on it actually was intact and was recovered successfully.
(Since we're using ext4 on that server, not ZFS, we have no real way of knowing if something got quietly corrupted. Maybe there's garbage in the middle of some log file.)
This does raise some obvious questions, though. This server is a Dell R310 II (yes, really), and while these theoretically take four HDs, this is one of the few where we're actually trying to do that. In general we haven't had the best of luck with our four-disk R310 IIs; although I haven't entirely kept track, I believe we have lost more disks in them than we have in R310 IIs with only one or two disks. And this particular server has definitely eaten disks before, although these particular disks were some of our unreliable 1 TB Seagates. Perhaps at least some of those progressive read errors were due to some environmental problem around the disk; maybe heat, maybe power, maybe some other glitch. In this theory, when we power cycled things and pulled the other drive, we relieved the stresses on that disk enough that it could return good data again (at least for a while).
(We replaced that disk too, of course; even if we could read from it once, we didn't trust it after it had given ongoing read errors. Better to be safe than sorry, especially with disks that are known to be prone to dying.)
What I take away from this is yet another reminder that modern disks are unpredictable and tricky. The days when things either worked or the disk was dead are long over; these days there are all sorts of failure modes and ways for disks to get into trouble. All you can say when read errors start happening is that something certainly isn't going the way it should, but exactly what is not necessarily clear or easy to figure out.
(And it could perfectly well be a combination of factors, where the R310 IIs are putting extra stress on disks that are weak to start with. The R310 IIs might be okay with more robust disks and these disks might do okay in another environment; put them together and it's a bad time.)