Linux has at least two ways that disks can die

January 31, 2014

We lost a disk on one of our iSCSI backends last night. Normally when an iSCSI data disk dies on a backend, what happens at the observable system level is that the disk vanishes. If it used to be, say, sdk, then there is no sdk any more. I'm not quite sure what happens at the kernel level as far as our iSCSI target software goes, but the reference that the iSCSI target kernel module holds doesn't work any more. This is basically just the same as what happens when you physically pull a live disk and I assume that the same kernel and udev mechanisms are at work.

(When you swap out the dead disk and put a new one in, the new one shows up as a new disk under some name. Even if it winds up with the same sdX name, it's a sufficiently different device that our iSCSI target software still won't automatically talk to it; we have to carefully poke the software by hand.)

This is not what happened this time around. Instead the kernel seems to have basically thrown up its hands and declared the disk dead but not gone. The disk was still there in /dev et al and you could open the disk device, but any attempt to do IO to it produced IO errors. Physically removing the dead disk and inserting a new one did nothing to change this; there doesn't seem to have been any hotplug activity triggered or anything. All we got was a long run of errors like:

kernel: sd 4:0:0:0: [sdm] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
end_request: I/O error, dev sdm, sector 504081380

(Kernel log messages suggest that possibly this happened because the kernel was unable to successfully reset the channel, but that's reading tea leaves very closely.)
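For what it's worth, you don't need any special tools to see this 'present but dead' state from a shell. The following is just a sketch of the sort of check I mean, using sdm from our case; I'm assuming dd's iflag=direct here to force a real read from the disk instead of possibly getting an answer from cache:

# attempt an actual read; on a disk in this state it fails with an IO error
dd if=/dev/sdm of=/dev/null bs=512 count=1 iflag=direct
# what the kernel currently thinks the device's state is
cat /sys/block/sdm/device/state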

I was going to speculate about this sort of split making sense, but I don't actually know what level of the kernel this DID_BAD_TARGET error comes from. So this could be a general kernel feature to declare disks as 'present but bad' or this could be a low level driver reporting a hardware status up the stack (or it could be something in between, where a low-level driver knows the disk is not there but this news got lost at a higher level).

Regardless of what this error means and where it comes from, we were still left with a situation where the kernel thought a disk was present when we had already physically removed it. In the end we managed to fix it by forcing a rescan of that eSATA channel with:

echo - - - >/sys/class/scsi_host/hostN/scan

That woke the kernel up to the disk being gone, at which point a newly inserted replacement disk was also recognized and we could go on as we usually do when replacing dead disks.
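(Working out which hostN to poke isn't hard. As a general recipe rather than exactly what we did: the host number is right there in the kernel errors, since the first field of 'sd 4:0:0:0' is the host, and you can also read it out of the disk's sysfs path.)

# the resolved path runs through .../hostN/..., here host4
readlink -f /sys/block/sdm
echo - - - >/sys/class/scsi_host/host4/scan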

I'm going to have to remember these two different failure modes in the future. We clearly can't assume that all disk failures will be nice enough to cause the disk to disappear from the system, and thus we can't assume that all visible disks are actually working (so 'the system is showing N drives present as we expect' is not a full test).
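One consequence is that any 'are all of the disks healthy' check probably has to actually read from each disk instead of just counting devices. The following is a minimal sketch of what I mean, not something we actually run (and the /dev/sd? glob may need adjusting for your disk naming):

# try a small direct read from every disk and report the ones that fail
for d in /dev/sd?; do
    if ! dd if="$d" of=/dev/null bs=512 count=1 iflag=direct 2>/dev/null; then
        echo "$d: read failed"
    fi
done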

(This particular backend has now been up for 632 days, and as a result of this glitch we are considering rebooting it. But reboots of production iSCSI backends are a big hassle, as you might imagine.)
