Sometimes it can be hard to tell one cause of failure from another

February 22, 2017

I mentioned recently how a firmware update fixed a 3ware controller so that it worked. As it happens, my experience with this machine nicely illustrates the idea that sometimes it can be hard to tell one cause of failure from another; to put it another way, when you have a failure it can be hard to tell what the actual cause is. So let me tell the story of trying to install this machine.

Like many places within universities, we don't have a lot of money, but we do have a large collection of old, used hardware. Rather than throw, say, five-year-old hardware away because it's beyond its nominal service life, we instead keep around anything that's not actively broken (or at least that doesn't seem broken) and press it into use again in sufficiently low-priority situations. One of the things that we have as a result of this is an assorted collection of various sizes of SATA HDs. We've switched over to SSDs for most servers, but we don't really have enough money to use SSDs for everything, especially when we're reconditioning an inherited machine under unusual circumstances.

Or in other words, we have a big box of 250 GB Seagate SATA HDs that have been previously used somewhere (probably as SunFire X2x00 system disks), all of which had passed basic tests when they were put into the box some time ago. When I wanted a pair of system disks for this machine I turned to that box. Things did not go well from there.

One of the disks from the first pair had really slow IO, which of course manifested as a far too slow Ubuntu 16.04 install. After I replaced the slow drive, the second install attempt ended with the original 'good' drive dropping off the controller entirely, apparently dead. The replacement for that drive turned out to also be excessively slow, which took me up to four 250 GB SATA drives, of which only one might be good (and three slow, failed attempts to bring up one of our Ubuntu 16.04 installs). At that point I gave up and used some SSDs that we had relatively strong confidence in, because I wasn't sure if our 250 GB SATA drives were terrible or if the machine was eating disks. The SSDs worked.

Before we did the 3ware firmware upgrade and it made other things work great, I would have confidently told you that our 250 GB SATA disks had started rotting and could no longer be trusted. Now, well, I'm not so sure. I'm perfectly willing to believe bad things about those old drives, but were my problems because of the drives, the 3ware controller's issues, or some combination of both? My guess now is a combination of both, but I don't really know, and that shows the problem nicely.

(It's not really worth finding out, either, since testing disks for slow performance is kind of a pain and we've already spent enough time on this issue. I did try the 'dead' disk in a USB disk docking station and it worked in light testing.)
