The case of the disappearing ESATA disk

November 30, 2013

This is a mystery (ie I have no answers yet), and also a story of what I think is the perversity of hardware (I can't be sure yet). I'm writing it up partly because I rarely see sysadmins writing up our problems, with the result that I think it's easy to underestimate how weird things sometimes get out there.

We have a server with an external SATA disk enclosure. The enclosure has three port-multiplier-based (E)SATA channels, each with five drive bays; we currently have ten disks in the enclosure, all identical, taking up the full capacity of two channels. The server is running 64-bit Ubuntu 12.04. We recently moved the server from our test area to our production machine room, which was when we discovered the mystery: under specific circumstances, exactly one disk is not seen by the server.

If you power off the external enclosure and the server, the first time the server boots, it will not see one specific disk bay on the enclosure. This is not just that the disk in the disk bay doesn't respond fast enough; the disk remains invisible no matter how long you let it sit. Rebooting the server will make the disk reappear, as will hotplugging the disk (pulling out its disk sled just enough to cut power, then pushing it back in). This doesn't happen if just the server itself is powered down; as long as the disk enclosure stays powered on, all is fine.

So far this could be a whole list of things. Unfortunately, this is where it gets weird. First, it's not the disk itself; we've swapped disks between bays and the problem stays with the specific bay. Next, it's not a straightforward hardware failure in the enclosure or anything directly related to it; at this point we've swapped the disk enclosure itself (with a spare), the ESATA cables, and the ESATA controller card in the server.

(To cut a long story short, it's quite possible that the problem has been there all along. Nor do we have any other copies of this model of disk enclosure around that we can be sure don't have the problem (and since we have two more of these enclosures in production, this is making me nervous).)

One of the many things that really puzzles me about this is trying to come up with an explanation for why this could be happening. For instance, why does the disk become visible if we merely reboot the server?

I don't usually run into problems like these, which I'm generally very thankful for. But every so often something really odd comes up and apparently this is one of those times.

(Also, I guess power-fail tests are going to have to become a standard thing that we do before we put machines into production. If this kind of fault can happen once, it can happen more than once, and we'd really rather not find out about it the first time we have to power cycle all of this stuff in production.)

PS: Now you may be able to guess why I have a sudden new interest in how modern Linux assembles RAID arrays. It certainly hasn't helped testing that the drives have a RAID-6 array on them that we'd rather not have explode, especially when resyncs take about 24 hours.
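
(For what it's worth, here is a rough sketch of the kind of sanity check I want to run after each of these power-fail tests before touching anything else: read /proc/mdstat and complain if an array is missing member devices or is in the middle of a resync or recovery. The array name and expected member count in it are made up for illustration; they're not our actual configuration.)

    #!/usr/bin/env python
    # Rough sketch: complain if any md array in /proc/mdstat is missing
    # devices or is in the middle of a resync/recovery.  The array name
    # and member count below are illustrative, not our real setup.

    EXPECTED = {"md0": 10}     # hypothetical: md0 should have ten members

    def check_mdstat(path="/proc/mdstat"):
        warnings = []
        with open(path) as f:
            lines = f.read().splitlines()
        for i, line in enumerate(lines):
            if not line.startswith("md"):
                continue
            name = line.split()[0]
            # The following line normally ends with e.g. '[10/10] [UUUUUUUUUU]';
            # an underscore in the second bracket marks a missing device.
            status = lines[i + 1] if i + 1 < len(lines) else ""
            if "_" in status:
                warnings.append("%s is degraded: %s" % (name, status.strip()))
            want = EXPECTED.get(name)
            if want and "[%d/%d]" % (want, want) not in status:
                warnings.append("%s does not have all %d members" % (name, want))
        for line in lines:
            if "resync" in line or "recovery" in line:
                warnings.append("rebuild in progress: %s" % line.strip())
        return warnings

    if __name__ == "__main__":
        for w in check_mdstat():
            print(w)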

Sidebar: Tests we should do

Since I've been coming up with these ideas in the course of writing this entry, I'm going to put them down here:

  • Reorder the ESATA cables (changing the mapping between ESATA controller card ports and the enclosure's channels). If the problem bay changed to the other channel, it would mean that the problem isn't in the enclosure but somewhere upstream.

  • 'Hotswap' another drive on the same channel to see if the invisible disk then becomes visible due to the resulting full channel reset and so on.

I'm already planning to roll more recent kernels than the normal Ubuntu 12.04 one onto the machine to see what happens, but that's starting to feel like grasping at straws.


Comments on this page:

By Mike O'Connor at 2013-11-30 03:36:13:

I've been using a Norco eSATA chassis for about 5 years (12 drives); it has always had a problem where I have to reboot the machine (not power cycle it) before all the drives will be correctly detected.

The issue started back with kernel 2.6.18, and I've just upgraded to 3.10.13 with no change.

I've set up a script which will auto-reboot after each power cycle.

It's the reason why my next system will be a SAS chassis.

Mike

By Patrick at 2013-11-30 11:21:25:

As funny as this is, it sounds like power issues to me. I ran into this before with internal drives; after replacing the RAID controller with one from another server and then moving the drives to a different machine, we noticed the original machine still had the same problems after a random period of time.

I would venture to guess that this issue will be better until it starts to warm up in May or so, then it will get worse.

By cks at 2013-11-30 14:39:27:

It's possible that it's a power issue, but I have several reasons for thinking not. We don't have the enclosure fully filled with disks, this happens even if we let the enclosure sit powered on for minutes before powering on the server, and everything works fine when the system is under full power load (with all disks active).

(Nor does this happen every time the disk itself is first powered on; the hotswap test effectively cuts power to the disk and then restores it.)

By Mike O'Connor at 2013-12-03 08:01:20:

Hi Patrick

I always turn on the drive chassis, wait, then turn on the main system. (This is done by a little timer box I found.)

The issue seems to be the reset of the drives. The reset fails the first time though.

I own about 3 of these Norco boxes and they all have the same issue in some way. I think it's the firmware in the port multipliers.

Cheers
