The case of the disappearing ESATA disk
This is a mystery (ie I have no answers yet), and also a story of what I think is the perversity of hardware (I can't be sure yet). I'm writing it up partly because I rarely see sysadmins writing up our problems, with the result that I think it's easy to underestimate how weird things sometimes get out there.
We have a server with an external SATA disk enclosure. The enclosure has three port multiplier based (E)SATA channels, each with five drive bays on them; we currently have ten disks in the enclosure, all identical, taking up the full capacity of two channels. The server is running 64-bit Ubuntu 12.04. We recently moved the server from our test area to our production machine room, which was when we discovered the mystery: under specific circumstances, exactly one disk is not seen by the server.
If you power off the external enclosure and the server, the first time the server boots it will not see one specific disk bay on the enclosure. This is not just that the disk in the disk bay doesn't respond fast enough; the disk remains invisible no matter how long you let it sit. Rebooting the server will make the disk reappear, as will hotplugging the disk (pulling out its disk sled just enough to cut power, then pushing it back in). This doesn't happen if just the server itself is powered down; as long as the disk enclosure stays powered on, all is fine. So far this could be a whole list of things. Unfortunately this is where it gets weird. First, it's not the disk itself; we've swapped disks between bays and the problem stays with the specific bay. Next, it's not a straightforward hardware failure in the enclosure or anything directly related to it; at this point we've swapped the disk enclosure itself (with a spare), the ESATA cables, and the ESATA controller card in the server.
(To cut a long story short, it's quite possible that the problem has been there all along. Nor do we have any other copies of this model of disk enclosure around where we can be sure that they don't have the problem (since we have two more of these enclosures in production, this is making me nervous).)
One of the many things that really puzzles me about this is trying to come up with an explanation for why this could be happening. For instance, why does the disk become visible if we merely reboot the server?
I don't usually run into problems like these, which I'm generally very thankful for. But every so often something really odd comes up and apparently this is one of those times.
(Also, I guess power-fail tests are going to have to become a standard thing that we do before we put machines into production. If this kind of fault can happen once it can happen more than once, and we'd really like not to find out about it after the first time we have to power cycle all of this stuff in production.)
PS: Now you may be able to guess why I have a sudden new interest in how modern Linux assembles RAID arrays. It certainly hasn't helped testing that the drives have a RAID-6 array on them that we'd rather not have explode, especially when resyncs take about 24 hours.
Sidebar: Tests we should do
Since I've been coming up with these ideas in the course of writing this entry, I'm going to put them down here:
- Reorder the ESATA cables (changing the mapping between ESATA controller
card ports and the enclosure's channels). If the faulted bay
changed to the other channel it would mean that the problem isn't in
the enclosure but is something upstream.
- 'Hotswap' another drive on the channel to see if the invisible disk then becomes visible due to the full channel reset et al.
I'm already planning to roll more recent kernels than the normal Ubuntu 12.04 one on to the machine to see what happens, but that's starting to grasp at straws.