Sometimes the right thing to do is to stop (and even to give up)
I'm generally someone who is happy to keep chasing an oddity or a mystery, to keep plugging away at the problem to at least chart it out and perhaps figure out what is going on. I suspect that this is something that a lot of sysadmins feel; if there is something wrong, we itch to figure it out and put it right. And the satisfaction of finally succeeding is an excellent feeling. But sometimes this is absolutely the wrong thing to do. Sometimes the right thing to do is to stop with the mystery not understood, or even to give up entirely.
I've been continuing to work away on our disappearing ESATA disk problem since I wrote about it; I've tried more things, gotten more specific information, and the whole thing has gotten weirder. But at the end of this past week we decided to stop all of that. I managed to get the system to a precariously balanced point where it's stable and that's that. In fact we're going further than just stopping with a stabilized system; in the longer run we're giving up on it entirely and will be migrating the whole thing to different hardware. We'll write off the disk enclosure as a loss (the server is a generic one and can be reused for other things).
The direct reason that this makes sense is that we have gone far enough to establish that something very odd is going on. Even if we continue investigating and discover exactly what the problem is, we have no confidence that we'll be able to fix it, and in the meantime we have managed to stabilize the system as-is. Until we can at least identify the problem, we can't trust the enclosure in general. We could do a bunch of experiments to chart out what disks we can add to the enclosure where and still have an apparently stable system, but that wouldn't make us trust it, and if we can't trust it we don't want to use it.
But the bigger reason to stop is the cost/benefit ratio of continuing to investigate the problem. I could easily spend a bunch of time and effort conducting experiments to map out the precise contours of the problem (and maybe find some clues to its cause). But by far the most likely result of these experiments is a pile of data on a disk enclosure that we no longer trust. Even in the best case we have minimal room for expansion in this enclosure, and we're certainly not going to buy any more of them, so the smart choice is to say 'this is good enough, we've spent enough time on it'.
Or in short: sometimes you lose. When you are losing, the smart thing to do is to recognize that and lose fast. This is painful, since we don't like to lose, but it's also best. Try not to let it get to you.
(This would be more obvious if staff time was considered a cost on par with hardware, but universities almost never think about staff time that way.)
PS: yes, this entry is being written in part to make me feel better about throwing in the towel on this issue. We're all squishy humans with those awkward emotions.