Hardware is weird (disk enclosure edition)

I've written before about our disappearing ESATA disk problem, but since I wrote that the situation has become weirder. In fact I think it makes a good illustration of just how odd hardware can be (and why I prefer working on software to banging my head against hardware).

Here is what happens. We have a 15-bay ESATA-based external disk enclosure, with the 15 bays sensibly divided into three port multiplier based ESATA channels of five disks each. If the enclosure and the server connected to it were both powered off, the enclosure was then powered up (and left to sit), and then the server was powered up, one or more of the ten 4TB ESATA disks in the system would fail to be recognized. As initially set up, we had the ten disks in two channels and the third channel empty. Then we did some shuffling and got to the serious weirdness.

The failed recognition pattern was as follows: the five disks on the first channel probed by Linux were recognized correctly, regardless of which physical channel on the enclosure that was. On the second and possibly third channel probed by Linux, the second disk present was not recognized (regardless of which physical slot it was in); it would be probed briefly, but then Linux would be unable to get it to go and it would disappear (until the server was rebooted).
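(A pattern like this is easy to check for after a cold boot by counting how many disks Linux ended up with on each ATA port. What follows is a minimal sketch, not something we actually ran. It assumes that each disk's resolved /sys/block path contains an ataN component naming the port it sits behind, which is how libata lays things out on the kernels I'm familiar with; the function names and the expected count of five per channel are purely illustrative.)

```python
import os
import re
from collections import defaultdict

# Assumption: a resolved sysfs path for a SATA disk contains "/ataN/",
# where N is the libata port (one port per ESATA channel).
ATA_PORT = re.compile(r"/ata(\d+)/")

def disks_by_port(resolved_paths):
    """Group block device names by ATA port number.

    resolved_paths maps a device name ("sdb") to its resolved sysfs path.
    Devices whose path has no ataN component are skipped.
    """
    ports = defaultdict(list)
    for name, path in resolved_paths.items():
        match = ATA_PORT.search(path)
        if match:
            ports[int(match.group(1))].append(name)
    return {port: sorted(names) for port, names in ports.items()}

def short_ports(ports, expected):
    """Return {port: missing_count} for ports with fewer disks than expected."""
    return {port: expected - len(names)
            for port, names in sorted(ports.items())
            if len(names) < expected}

def scan_sysfs(root="/sys/block"):
    """Resolve each sd* entry under root to its real device path."""
    if not os.path.isdir(root):
        return {}
    return {name: os.path.realpath(os.path.join(root, name))
            for name in os.listdir(root) if name.startswith("sd")}

if __name__ == "__main__":
    found = disks_by_port(scan_sysfs())
    for port, names in sorted(found.items()):
        print(f"ata{port}: {len(names)} disks: {' '.join(names)}")
    # Five disks per channel is this enclosure's layout (hypothetical default).
    print("short ports:", short_ports(found, expected=5))
```

(Two caveats: a channel where no disks at all were recognized won't appear in the output, and ports with internal system disks will show up as 'short', so you need to know which ports belong to the enclosure.)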

We initially saw the problem with some 4TB Hitachis. We tried sticking a 4TB WD SE drive into the deadly spot on the second (probed) channel and it showed the same problem. However, an ancient WD 80GB drive in the same spot worked perfectly; unlike the two sorts of 4TB drives, it was recognized fine.

(At this point we gave up because we had a stable system with the ten 4TB Hitachi drives we were committed to using. We don't care that we've sacrificed an 80 GB drive as basically a spacer or that we can't really use the remaining drive bays.)

It bugs me that I can't come up with any relatively rational explanation for what's going on here. It's possible that something is going on with spinup power draws, but if so it's very unusual. It also bugs me that I have basically no diagnostic tools to see what's going on; a real investigation would probably require a bunch of equipment to, eg, monitor power draws during disk probing.

(It's clear that there is something different about probing the disks when the enclosure has just been powered on, as opposed to later on when the server just reboots. Even on the 'good' first probed channel, it takes significantly longer to probe all the disks on a cold power up. My vague theory is that this is because the disks aren't fully spinning up until the first time the host talks to them, but I have no idea if this is true or whether modern disks behave this way.)

PS: I have no idea if each channel gets its own power feed, separate from the power for the others. For all I know right now, all disks are powered off a single run from the power supply. I assume (but have not verified) that there are three port multiplier daughter cards in the case, one for each ESATA channel and set of drives, and that they are wired separately. The external ESATA cables are certainly physically separated enough to make that plausible.

