Linux has at least two ways that disks can die
We lost a disk on one of our iSCSI backends last night. Normally when an iSCSI data disk dies on a backend, what happens at the observable system level is that the disk vanishes. If it used to be, say, sdk, then there is no sdk any more. I'm not quite sure what happens at the kernel level as far as our iSCSI target software goes, but the reference that the iSCSI target kernel module holds doesn't work any more. This is basically just the same as what happens when you physically pull a live disk, and I assume that the same kernel and udev mechanisms are at work.
(When you swap out the dead disk and put a new one in, the new one shows up as a new disk under some name. Even if it winds up with the same sdX name, it's enough of a different device that our iSCSI target software still won't automatically talk to it; we have to carefully poke the software by hand.)
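(As an aside, if you want to watch those kernel and udev mechanisms at work, something like the following sketch should show 'remove' events when a disk is pulled and 'add' events when a replacement is inserted; udevadm monitor is the stock tool for this.)

# Watch kernel and udev events for block devices; pulling a disk should
# produce 'remove' events and inserting one should produce 'add' events.
udevadm monitor --kernel --udev --subsystem-match=block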
That normal failure mode is not what happened this time around. Instead the kernel seems to have basically thrown up its hands and declared the disk dead but not gone. The disk was still there in /dev et al and you could open the disk device, but any attempt to do IO to it produced IO errors. Physically removing the dead disk and inserting a new one did nothing to change this; there doesn't seem to have been any hotplug activity triggered or anything. All we got was a long run of errors like:
kernel: sd 4:0:0:0: [sdm] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
end_request: I/O error, dev sdm, sector 504081380
(Kernel log messages suggest that possibly this happened because the kernel was unable to successfully reset the channel, but that's reading tea leaves very closely.)
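(For illustration, here is one way to poke at a disk in this 'present but dead' state; a sketch, not exactly what we ran at the time. A direct read should fail straight away with an IO error, and sysfs will tell you what state the SCSI layer thinks the device is in.)

# Try a direct, uncached read of the first 4 KB; on a dead-but-present
# disk this should come straight back with an IO error.
dd if=/dev/sdm of=/dev/null bs=4096 count=1 iflag=direct
# Ask the SCSI midlayer what state it has the device in; normally this
# is 'running', but a failed disk may show 'offline' or another state.
cat /sys/block/sdm/device/state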
I was going to speculate about this sort of split making sense, but I don't actually know what level of the kernel this DID_BAD_TARGET error comes from. So this could be a general kernel feature to declare disks as 'present but bad', or this could be a low-level driver reporting a hardware status up the stack (or it could be something in between, where a low-level driver knows the disk is not there but this news got lost at a higher level).
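(One way to find out would be to grep a kernel source tree for where DID_BAD_TARGET is defined and which drivers or midlayer code actually set it; a rough sketch:)

# Run in the top level of a kernel source tree matching the running
# kernel: find the definition, then which code sets the status.
grep -rn DID_BAD_TARGET include/scsi/
grep -rln DID_BAD_TARGET drivers/scsi/ drivers/ata/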
Regardless of what this error means and where it comes from, we were still left with a situation where the kernel thought a disk was present even though we had already physically removed it. In the end we managed to fix it by forcing a rescan of that eSATA channel with:
echo - - - >/sys/class/scsi_host/hostN/scan
That woke the kernel up to the disk being gone, at which point a newly inserted replacement disk was also recognized and we could go on as we usually do when replacing dead disks.
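(We knew which SCSI host the dead disk was on here, but in general you can work out the hostN to rescan from sysfs; once you have it, the echo above is what you run. A sketch, using the sdm name from the errors above:)

# The resolved sysfs path for a disk includes a '.../hostN/...'
# component; that hostN is the SCSI host whose scan file you poke.
readlink -f /sys/block/sdm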
I'm going to have to remember these two different failure modes in the future. We clearly can't assume that all disk failures will be nice enough to make the disk disappear from the system, and thus we can't assume that all visible disks are actually working (which means 'the system is showing N drives present, as we expect' is not a full test).
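(A sketch of what a fuller check might look like, assuming all the data disks show up as single-letter /dev/sdX devices: actually read a little from each disk instead of just counting them.)

# Read a small amount directly from every disk so that a present-but-dead
# disk shows up as a read failure rather than counting as 'working'.
for d in /dev/sd?; do
    if ! dd if="$d" of=/dev/null bs=4096 count=1 iflag=direct 2>/dev/null; then
        echo "$d: read failed"
    fi
done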
(This particular backend has now been up for 632 days, and as a result of this glitch we are considering perhaps rebooting it. But reboots of production iSCSI backends are a big hassle, as you might imagine.)