How you can abruptly lose your filesystem on a software RAID mirror

January 30, 2017

We almost certainly just completely lost a software RAID mirror with no advance warning (we'll know for sure when we get a chance tomorrow to power-cycle the machine in the hopes that this revives a drive). This comes as very much of a surprise to us, as we thought that this was not supposed to be possible short of simultaneous two drive failure out of the blue, which should be an extremely rare event. So here is what happened, as best we can reconstruct right now.

In December, both sides of the software RAID mirror were operating normally (at least as far as we know; unfortunately the filesystem we've lost here is /var). Starting around January 4th, one of the two disks began sporadically returning read errors to the software RAID code, which caused the software RAID to redirect reads to the other side of the mirror but not otherwise complain to us about the read errors beyond logging some kernel messages. Since nothing showed up about these read errors in /proc/mdstat, mdadm's monitoring never sent us email about it.

(It's possible that SMART errors were also reported on the drive, but we don't know; smartd monitoring turns out not to be installed by default on CentOS 7 and we never noticed that it was missing until it was too late.)

On the morning of January 27th, the other disk failed outright in a way that caused Linux to mark it as dead. The kernel software RAID code noticed this, of course, and duly marked it as failed. This transferred all IO load to the first disk, the one that had been seeing periodic errors since January 4th. It immediately fell over too; although the kernel has not marked it as explicitly dead, it now fails all IO. Our mirrored filesystem is dead unless we can somehow get one or the other of the drives to talk to us.

The fatal failure here is that nothing told us about the software RAID code having to redirect reads from one side of the mirror to the other due to IO errors. Sure, this information shows up in kernel messages, but so does a ton of other unstructured crap; the kernel message log is the unstructured dumping ground for all sorts of things and as a result, almost nothing attempts to parse it for information (at least not in a standard, regular installation).

Well, let me amend that. It appears that this information is actually available through sysfs, but nothing actually monitors it (in particular mdadm doesn't). There is an errors file in /sys/block/mdNN/md/dev-sdXX/ that contains a persistent counter of corrected read errors (this information is apparently stored in the device's software RAID superblock), so things like mdadm's monitoring could track it and tell you when there were problems. It just doesn't.

(So if you have software RAID arrays, I suggest that you put together something that monitors all of your errors files for increases and alerts you prominently.)
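As a rough illustration of that suggestion, here is a minimal Python sketch that walks the per-device errors files and reports any counter that has gone up since the last run. The state-file location and the plain print() "alert" are my assumptions, not anything from a standard tool; a real deployment would send email or feed a monitoring system instead.

```python
#!/usr/bin/env python3
"""Sketch: watch md corrected-read-error counters for increases.

The /sys/block/md*/md/dev-*/errors paths hold persistent counters of
corrected read errors. The state file location below is a hypothetical
choice; adapt it (and the alerting) to your environment.
"""
import glob
import json
import os

STATE_FILE = "/var/tmp/md-errors.json"  # hypothetical state location


def read_error_counts():
    """Return {path: count} for every per-device md 'errors' file."""
    counts = {}
    for path in glob.glob("/sys/block/md*/md/dev-*/errors"):
        with open(path) as f:
            counts[path] = int(f.read().strip())
    return counts


def check_for_increases():
    """Compare current counters to the previously saved ones and report."""
    current = read_error_counts()
    previous = {}
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            previous = json.load(f)
    for path, count in current.items():
        if count > previous.get(path, 0):
            # Replace this print with real alerting (email, etc.).
            print(f"WARNING: {path} went from "
                  f"{previous.get(path, 0)} to {count}")
    with open(STATE_FILE, "w") as f:
        json.dump(current, f)
```

Something like this could be run periodically from cron; the important part is simply that increases in those counters become loud instead of silent.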

Comments on this page:

By Anon at 2017-01-30 04:21:57:

Hmm I thought md failfast was only in the newest kernels - ...

Either way it's punishing that there are no events signifying it is happening...

By cks at 2017-01-30 08:06:39:

I don't think this is failfast for either drive. One drive was so dead that the kernel dropped it, and the other drive was dead enough that dd reading from the /dev/sdX device got immediate errors (although the /dev entry was still there). So I think we had one out of the blue drive failure (on the second drive) and one progressive failure (on the first drive).

By Miksa at 2017-02-02 10:38:54:

Something to consider is how sensitive software RAID should be to disk errors. I like that md doesn't seem to be as trigger-happy as hardware RAID.

One nasty experience I had a few years back was with an HP server that had a large RAID-5 array. One drive had failed and had been replaced. During the resync another drive suffered some form of glitch and the RAID card threw the whole array in the dumpster without a second thought. Afterwards, when testing, the glitching hard drive didn't show any problems. If the card had just tried a bit harder at resync, maybe all of that data wouldn't have been lost.

Another case was a RAID-1 where the resync failed midway because of a single bad sector, but luckily that could be saved with ddrescue. It might even have been an empty spot on the drive. Couldn't the RAID just mark that spot of the array as a bad sector, so it would at least be no worse than a single drive?

Admittedly mdadm should have put that array in some kind of fail state, even if it tries to use the failing drive as much as possible.
