Our BMCs are not great at keeping accurate time

August 7, 2022

A BMC is an extra embedded computer on your server motherboard (as opposed to the extra CPU that's embedded in your CPU or chipset). As separate computers running a separate operating system, BMCs keep time independently from your server's time (the host time, reusing terminology from virtualization). This BMC time is visible in places like the IPMI event log, and you can also simply ask the BMC what time it has via IPMI. I've been tracking the reported BMC time against the server's own view of time on our servers for a while, and recently I summarized the results of that on Twitter:

Having tracked how much an IPMI's idea of time drifts across our modest fleet for a while, well, we're now resetting the IPMI time once a day.

The BMC time drift we're seeing isn't terrible, but it's enough to add up significantly if left uncorrected for very long, especially on some of our servers. It's possible that some servers see comparable clock drift numbers (as estimated by summing ntpdate time adjustments), but server clocks are much more easily kept correct through NTP and other mechanisms.

(It also looks like many of our servers have clocks that clearly drift less than their BMCs.)
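To make the measurement concrete, here is a minimal sketch of comparing a BMC's reported time against host time and turning two such readings into a drift rate. It isn't necessarily how our own tracking works. The timestamp format (which I believe matches what `ipmitool sel time get` prints on at least some versions), the assumption that the BMC clock is kept in UTC, and all of the function names are assumptions made for illustration.

```python
from datetime import datetime, timezone

# Assumed text format of the BMC's reported time; check what your tool prints.
BMC_TIME_FORMAT = "%m/%d/%Y %H:%M:%S"


def parse_bmc_time(text: str) -> datetime:
    """Parse a BMC timestamp string, assuming the BMC clock is kept in UTC."""
    parsed = datetime.strptime(text.strip(), BMC_TIME_FORMAT)
    return parsed.replace(tzinfo=timezone.utc)


def bmc_offset_seconds(bmc_text: str, host_now: datetime) -> float:
    """Return BMC time minus host time in seconds (positive: BMC is ahead)."""
    return (parse_bmc_time(bmc_text) - host_now).total_seconds()


def drift_per_day(offset_then: float, offset_now: float,
                  elapsed_seconds: float) -> float:
    """Turn two offset readings taken elapsed_seconds apart into seconds/day."""
    return (offset_now - offset_then) / elapsed_seconds * 86400.0
```

A BMC that goes from 0 seconds off to 7 seconds ahead over a week is drifting by about a second a day, which is the sort of thing that adds up if you never reset it. Since the BMC only reports whole seconds here, a single reading is only good to about a second; the drift shows up clearly only over longer intervals.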

In an ideal world, we would have all of our BMCs synchronize their clocks via NTP. In this world, not all of our servers have BMCs with dedicated network ports (only our most recent Dell 1U servers do), and on top of that not all vendors include NTP as a free feature on all of their BMCs. Charging licensing costs for some features is a good way to ensure that we don't use them, unless the licensing cost is trivial (which, spoiler, it hasn't been when we asked, although we usually ask for the KVM over IP licensing cost).

In theory our BMC clock drift probably doesn't matter very much. Our servers almost never put anything into their IPMI event logs and we almost never look at them. But having started looking at this (and putting the data in our Prometheus metrics server), I couldn't just let it go. There's probably a lesson here, one related to turning over rocks.

(Server clock drift does matter for all sorts of reasons, including that we want all of our servers to agree on the time. I'm not sure that estimating server clock drift from the sum of ntpdate time adjustments is all that accurate, but it's the data source I have available.)
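For illustration, summing ntpdate adjustments might look something like the following sketch. It assumes log lines of the form 'adjust time server ADDRESS offset N sec' (or 'step time server ...'), which is what I believe the classic ntpdate logs; the regular expression and the function names are made up for this example, and other ntpdate versions may format things differently.

```python
import re

# Assumed shape of an ntpdate log message; other versions may differ.
OFFSET_RE = re.compile(
    r"(?:adjust|step) time server \S+ offset ([+-]?\d+(?:\.\d+)?) sec"
)


def total_adjustment(lines):
    """Sum the offsets from every ntpdate adjustment line, in seconds."""
    total = 0.0
    for line in lines:
        match = OFFSET_RE.search(line)
        if match:  # silently skip lines that aren't ntpdate adjustments
            total += float(match.group(1))
    return total


def estimated_drift_per_day(lines, days):
    """Rough drift estimate: net adjustment divided by the days covered."""
    return total_adjustment(lines) / days
```

This is only a rough estimate. Adjustments in opposite directions cancel out in the sum, and each reported offset includes network jitter as well as actual clock drift.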

Written on 07 August 2022.

Last modified: Sun Aug 7 22:32:18 2022