My sunk cost fallacy relationship with my home desktop

January 12, 2022

Over on Twitter I said something about my current home desktop:

Another mysterious lockup on my home desktop, not long after I brought my new NVME drives into a software RAID mirror, which could be coincidence.

(My relationship with this machine is rather sunk cost fallacy.)

This machine has a number of symptoms, including locking up when it's cold, some of which may be remediated by now but others definitely aren't. It's quite unclear where the problem or problems are. At this point I'm required to suspect all of the different parts of the machine except the CPU cooler (and I might take a chance on the RAM). Including the case, unfortunately, because the machine has a persistent issue where apparently pressing on the area with the front ports can reset the machine.

(For a long time I thought I was accidentally hitting the reset button despite it being a bit recessed, but recent events have convinced me that it's not just that. Also, in the past simply plugging a USB cable into the front ports has triggered resets. For obvious reasons I don't experiment with this much, and it seems tangled up in some USB Linux kernel software issues that also cause reboots.)

At this point I can't say that I trust this machine. It mostly runs fine as long as I'm careful with it, but maybe there are still issues (given the lockup I tweeted about). The logical thing to do is to write this machine off as a sunk cost and replace it almost entirely, especially as I want to migrate from my current SSDs to my new NVMe drives and I have no other machine that takes NVMe drives; if I migrate and the machine becomes more unreliable, I will have real problems.

However, this machine dates from early 2018 so it's only about four years old now. Four years is a pretty aggressive replacement cycle for desktop machines today, especially when I bought it as a relatively good machine that I was expecting to last me for at least five years. And more importantly, there's the sunk cost fallacy. I want this machine to work, and I want to persuade myself that magically it will work well enough for me not to do anything (or at least anything substantial). Just as I expected back in August of 2020, I've done nothing so far and just coasted along, and so far that has actually worked out in the sense that I've avoided both total failure and too many issues (although I had one alarming incident). It's easier to do nothing than to act.

(Two of my current excuses are that in general computer hardware seems to be in short supply, and there are a bunch of technology transitions going on where the new technology is expensive but the old technology has little future.)

PS: Since August of 2020 I have reseated some bits and pieces, which seemed to do some good and also was necessary because at one point the machine froze and refused to boot with an apparent memory issue. That was an alarming incident, especially since I discovered it at the start of a workday.

PPS: Another issue is that since I assembled my home desktop from parts and it doesn't work reliably, now I get to wonder if I screwed up the assembly somehow and if I'd do something wrong again if I built another desktop. In theory I should have confidence in my ability to do this, since I also built my work desktop from parts, using many of the same ones. In practice none of us are entirely rational beings, regardless of what we'd like.


Comments on this page:

By Jani at 2022-01-13 06:22:13:

I assume you've tried basic elimination, by testing with different parts & peripherals removed?

It's a hassle of course, especially after you've already settled in to using the system, not to mention how tedious it is to run memtest for days on end just to achieve sufficient confidence that you've proved a negative.

I'm not as hardcore about this as Jeff Atwood, but I've learned not to discount anything as potentially being the cause or at least a trigger for all kinds of issues. Besides obvious things like RAM modules, OTOH I've had at least one internal memory card reader, one USB bluetooth dongle and one (SATA) SSD turn out to be the troublemaker; usually something much easier to replace than motherboard+almost everything. (And unfortunately it's usually been easier to just replace or just get rid of the triggering component rather than hope for a software fix, even if the issue ultimately was with software.)

The cold lockups or the case being touch-sensitive don't sound like one of those easily solvable ones, but since there's a plethora of symptoms, perhaps some still might be, and are just hard to see from the mix.

By cks at 2022-01-13 12:04:41:

Since this is my home desktop and I have no replacement for it, I can't take it out of active use for troubleshooting for very long (especially these days). And there's almost nothing I can disconnect while still keeping the machine in active use.

(Technically speaking I guess I could take out half of the RAM and disconnect all the front port connectors except for the power switch. But that's unattractive and seems unlikely to do much.)

By sackerm at 2022-01-13 17:30:40:

I've experienced something similar with one of our 1P EPYC servers.

It's got an MDADM RAID of spinning disks, with an mdadm RAID of NVMe drives on top via bcache. The bcache drive is exposed as iSCSI via LIO.

We've run into a similar issue (across many kernel versions-- as far back as Ubuntu 5.4.x and as recent as Ubuntu 5.14-oem). Seems like every few weeks to a month, the iSCSI target will die, and the LIO service is so stuck that it even hangs up boot because it can't safely exit. Some older kernels would log errors to the console, and would seem to completely break (i.e. the block subsystem just seemed to... die).

The system itself boots off of an internal SSD which is a BOSS card, and it keeps running just fine, MDADM reports no issues, but LIO hard locks and sends a DataTimeOut back to the servers using it for iSCSI, then basically drops offline. Everything else seems to keep running just fine-- the system is responsive over SSH, apt works and can apply package updates...

The only fix, so far, is to reboot it. I wonder if this is actually a kernel bug with MDADM and NVMe drives that occasionally causes a lockup?

My guess is that one or more of the standoffs keeping the bottom of your motherboard from touching the case is missing or broken; you are getting intermittent shorts.

Thermal expansion differences (cold case, warm motherboard) and mechanical movement (pressing near the front ports) would be the likely mechanism of action.

If your case is largely made of plastic, that would rule out this hypothesis. If it's steel, you could try:

  • Changing the orientation of the case

  • Replacing the motherboard standoffs

  • Putting a sheet of nonconductive material between the motherboard and the case

By Greg Marshall at 2022-01-19 12:49:16:

This may sound strange, but over the past year I have slowly started to wonder whether keeping up to date is the best policy for personal laptops/PCs. I learned it from my Mac-friendly boss, who always used to buy a new laptop/iPhone/iPad every year (just as the warranty expires), even if everything was fine, because the old one still fetches a good price.

For the last 3 years I have sold my personal Latitude USFF PC+monitor (on local eBay) every Christmas time, after picking up discounted (openbox/refurbished) devices of an older generation around the same time. That way I found one can get a reasonable price for the old hardware. After 4 years it is very difficult to sell anything older for a good price, for example SATA drives or even M.2 SATA.

Written on 12 January 2022.

Last modified: Wed Jan 12 23:05:02 2022