It's probably not the hardware, a sysadmin lesson

September 22, 2021

We deployed a new OpenBSD 6.9 machine the other day, and afterward we discovered that it seemed to have serious problems keeping time properly. The OpenBSD NTP daemon would periodically declare that the clock was unsynchronized. When it adjusted the clock, it was frequently adjusting it by what seemed to be very large amounts (by NTP standards), reporting numbers like '-0.244090s'. Most seriously, every so often the time would wind up completely off by tens of minutes or more. Nothing like this has happened on any of our other OpenBSD machines, especially the drastic clock jumps.

Once we noticed this, we flailed around looking at various things and wound up reforming the machine's NTP setup to be more standard (it was different for historical reasons). But nothing cured the problem, and last night its clock wound up seriously off again. After all of this, we started suspecting that there was something wrong with the machine's hardware, or perhaps with its BIOS settings (I theorized wildly that the BIOS was putting the machine into a low power mode that OpenBSD's timekeeping didn't cope with).

Well, here's a spoiler: it wasn't the hardware, or at least the drastic time jumps weren't the hardware. Although we'll only know for sure in a few days, we're pretty sure we've identified their cause, and it's some of our management scripts (which are doing things well outside the scope of this entry).
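(As an illustration of how one might narrow down this sort of thing, and not what we actually did: if something in software is stepping the clock, the wall clock will jump relative to the monotonic clock, which is not supposed to be affected by system clock updates. The following is a minimal Python sketch of that comparison. The names detect_steps and watch, the five second interval, and the one second threshold are all made up for the example.)

```python
import time

def detect_steps(samples, threshold=1.0):
    """Find wall-clock steps in a list of (wall, monotonic) sample pairs.

    Returns (index, step_seconds) for each sample where the wall clock
    moved by more than `threshold` seconds beyond what the monotonic
    clock says actually elapsed. The threshold is an arbitrary choice.
    """
    steps = []
    for i in range(1, len(samples)):
        wall0, mono0 = samples[i - 1]
        wall1, mono1 = samples[i]
        step = (wall1 - wall0) - (mono1 - mono0)
        if abs(step) > threshold:
            steps.append((i, step))
    return steps

def watch(interval=5.0, threshold=1.0):
    """Hypothetical watcher: poll both clocks forever and report steps."""
    prev = (time.time(), time.monotonic())
    while True:
        time.sleep(interval)
        cur = (time.time(), time.monotonic())
        for _, step in detect_steps([prev, cur], threshold):
            stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(cur[0]))
            print(f"{stamp}: wall clock stepped by {step:+.3f}s", flush=True)
        prev = cur
```

(Running watch() in the background and lining up its output with cron and script logs is one way to see whether a jump coincides with something you ran. It only catches steps; ordinary gradual NTP adjustments should stay well under the threshold.)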

When we have a mysterious problem and we just can't understand it despite all our attempts to investigate things, it's tempting to decide that it's a hardware problem. And sometimes it actually is. But a lot of the time it's software, just as a lot of the time what you think has to be a compiler bug is a bug in your code.

(If it's a hardware problem it's not something you can fix, so you can stop spending your time digging and digging into software while getting nowhere and frustrating yourself. This is also the appeal of it being a compiler bug, instead of your bug; if it's your bug, you need to keep on with that frustrating digging to find it.)

Written on 22 September 2021.

Last modified: Wed Sep 22 21:38:29 2021