An interesting hardware mystery

April 6, 2009

As I've written about, I have a problem machine. I have recently had the opportunity to stop using it as my office workstation and thus do some systematic testing on it, which has turned up some interesting yet mysterious results.

(It's remarkable but perhaps not surprising how much my mood has improved from getting a stable primary workstation and setting up these tests.)

The short summary is that the machine reliably crashes if I do significant disk activity while I am not running something that burns up CPU. (To be technical, the only CPU-burning program I have tested is the distributed.net client.)

The machine survives a whole bunch of tests; so far I have tried memtest86+, simply leaving it sitting idle, the distributed.net client ('dnetc'), dnetc plus continuous full speed bidirectional network traffic, dnetc plus lots of NFS activity, and dnetc plus repeatedly running bonnie++ and compiling the kernel. However, running just bonnie++ (with or without compiling the kernel) will kill the machine in short order. The most striking test I have done is to start dnetc, start bonnie++ and the kernel compile cycle, and then after a while kill the dnetc processes; the machine consistently panics within minutes.
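(For the curious, that last test could be scripted roughly as follows. This is only a sketch in Python, not what I actually ran: the function name, the timings, and the two commands in the closing comment are hypothetical stand-ins for dnetc and for the bonnie++ plus kernel compile loop. If the machine panics the script obviously never returns, so the useful part is the timestamp it prints when the CPU load goes away.)

```python
import subprocess
import sys
import time


def run_load_then_drop_cpu(cpu_cmd, disk_cmd, cpu_seconds, watch_seconds, poll=1.0):
    """Run a CPU burner and a disk load together, then drop the CPU burner.

    Returns the disk load's exit code if it finishes within the watch
    window, or None if it is still running when the window ends (in which
    case it is terminated).
    """
    cpu = subprocess.Popen(cpu_cmd)
    disk = subprocess.Popen(disk_cmd)
    try:
        # Phase 1: both loads run together (the combination that survives).
        time.sleep(cpu_seconds)
        cpu.terminate()
        cpu.wait()
        # Phase 2: disk load only. Note when it started so that a later
        # panic can be timed against it.
        print("CPU load stopped at", time.ctime(), flush=True)
        deadline = time.monotonic() + watch_seconds
        while time.monotonic() < deadline:
            if disk.poll() is not None:
                return disk.returncode
            time.sleep(poll)
        return None
    finally:
        # Clean up whatever is still running.
        for proc in (cpu, disk):
            if proc.poll() is None:
                proc.terminate()
                proc.wait()


# Hypothetical usage; both commands are placeholders, not real tools:
#   run_load_then_drop_cpu(["./cpu-burner"], ["./disk-load.sh"], 600, 3600)
```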

(Since I normally run the distributed.net client on my machine but stopped after the Fedora 10 upgrade, this means that I may have had the hardware problem for quite some time without realizing it.)

All of this adds up to a puzzle: what bit of hardware is broken and needs to be replaced? If the failure mode was simpler there would probably be a clear likely suspect, but as it is I'm left scratching my head.

Sidebar: hardware details

The machine has an Asus M2N4-SLI, 2 GB of RAM, and I believe an AMD X2 4600+. It currently has a single SATA drive, but had two earlier (and the two drives are fine; they are running in my current office workstation). The graphics card is likely to be irrelevant, since this has happened with both an ATI X300 and now an nVidia of some description (running the open source drivers).


Comments on this page:

From 122.57.80.8 at 2009-04-06 02:29:10:

A few random thoughts

Hypothesis: Thermal problems
Rationale: Heavy disk activity would, I imagine, heat up the HDD and probably some sort of IO controller. Perhaps high CPU activity causes sufficient extra air movement thanks to the CPU fan to counteract this?
Possible method: Pop the case, set up a big electric fan blowing lots of air through it all the time, and see if it recurs.

Hypothesis: Sketchy driver for hard disk controller.
Possible method: Repeat experiments in a variety of operating systems that talk to the controller in different ways, and see if there are any common elements.

-- Steve

From 130.217.250.13 at 2009-04-06 03:22:49:

Hypothesis: BIOS (ACPI) does something stupid when it puts the CPU into low power mode, which interacts badly with the hard drive doing a lot of activity. (IRQ from the hard drive arriving /as/ the BIOS is going into, or coming out of, low power mode?)

Possible Methods: Disable ACPI, investigate BIOS upgrade changelog.

-- PerryLorier

From 209.71.241.13 at 2009-04-07 08:47:43:

I was thinking interrupts as well, but in the APIC direction. However, it seems you disabled that in the kernel parameters. Is that still the case?

By cks at 2009-04-07 13:45:49:

The Fedora 10 kernels will actually run on that hardware without noapic and that's how I've been running them, but in a test now the crash still happened even with noapic specified. However, it may have taken longer to crash, so you may be on to something.

On the negative side: I have tested this machine with the Fedora 8 kernel (and thus noapic) but Fedora 10 everything else, and the same sort of crashes happened, although much more slowly.

From 206.125.167.45 at 2009-04-08 11:10:25:

Well, I had some strange problems on some systems with a PCI SATA card that would corrupt disk writes only when a lot of interrupts were generated. It's probably not your case, but worth checking nonetheless. And if you believe it's worth your time, there are plenty of kernel boot parameters (http://www.kernel.org/doc/Documentation/kernel-parameters.txt) affecting interrupt processing. I would start with APIC, but this is more a personal hunch than anything else. I would also consider what was suggested in the previous comments.

But hey, I know life is not always full of time for fun and troubleshooting, hhmmm...?

/Pruneau

Written on 06 April 2009.

Last modified: Mon Apr 6 01:12:29 2009