Alerting on our NTP servers having a high NTP stratum hasn't been useful
One piece of the jargon of NTP (the Network Time Protocol) is an NTP server's stratum, which is roughly how far away (in NTP servers) you are from an external source of time. External sources of time are stratum 0, NTP servers that are directly connected to them are stratum 1, servers that talk to those are stratum 2, and so on. For load reasons, organizations with stratum 0 time sources often put an extra level of NTP server in between the public and those time sources; an internal NTP server talks directly to the clocks (and is at stratum 1), while the organization's public NTP servers that you can use are at stratum 2 (or higher, depending on the architecture). The National Research Council Canada's public NTP servers are at stratum 2, for example.
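As a rough illustration of how stratum numbers stack up, here is a minimal Python sketch. The function names and the example chain are hypothetical; the only NTP facts it leans on are that a server reports a stratum one higher than the source it synchronizes to, and that stratum 16 is used to mean 'unsynchronized'.

```python
# Hypothetical sketch: how stratum propagates down a chain of NTP servers.
UNSYNCHRONIZED = 16  # NTP uses stratum 16 to mean "not synchronized"


def stratum_below(source_stratum: int) -> int:
    """Return the stratum of a server synchronized to a source at source_stratum."""
    if source_stratum < 0:
        raise ValueError("stratum cannot be negative")
    # A server is one stratum higher than whatever it synchronizes to,
    # capped at the 'unsynchronized' value.
    return min(source_stratum + 1, UNSYNCHRONIZED)


def chain_strata(hops: int) -> list:
    """Strata of each server in a chain that starts at a stratum 0 clock."""
    strata = []
    current = 0  # the reference clock itself
    for _ in range(hops):
        current = stratum_below(current)
        strata.append(current)
    return strata


# Reference clock -> internal server -> public server -> your own server
print(chain_strata(3))  # prints [1, 2, 3]
```

The point of the cap is that a long enough chain of servers eventually stops being a usable time source at all, which is one reason a high stratum looks alarming at first glance.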
We have long had a set of internal NTP servers that are the local time source for our servers (for various reasons). These synchronize to various off-network NTP servers, both inside and outside the university. As part of monitoring our environment, I wrote some tools to provide Prometheus metrics for NTP state and get these into our Prometheus environment. Once I had metrics, I wrote some alerts, including an alert for the NTP stratum of our internal NTP servers being 'too high', which I believe started out at stratum 6 (that's foreshadowing). For a long time things were quiet, and then recently the alert started going off every so often; one particular internal server was winding up at various high stratum numbers. Every time I investigated, this was at one level a legitimate occurrence; our NTP server (that we were alerting about) was one stratum higher than the off-network server it was synchronized to (as it's supposed to be), but that off-network server was inexplicably at a relatively high stratum.
I tried adjusting the NTP stratum level that would trigger the alert a few times, but by now I've come around to the idea that this alert isn't useful in general and I'm probably going to remove it. We already have other alerts that will trigger if our local NTP server can't synchronize its time to anything or has time that's clearly off from everything else, so the NTP stratum alert is really telling us either that something weird is going on with our upstream NTP servers or our NTP server daemon has a bug in its stratum calculation (which isn't very likely).
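To make the distinction concrete, here is an illustrative Python sketch of the kind of checks that remain once the stratum alert is gone. This is not our real Prometheus rule set; the `NtpStatus` fields, the alert names, and the 0.1 second threshold are all made up for the example.

```python
from dataclasses import dataclass

MAX_OFFSET_SECONDS = 0.1  # hypothetical threshold, not a real setting


@dataclass
class NtpStatus:
    """Hypothetical snapshot of one NTP server's state."""
    synchronized: bool     # does the daemon have a sync source at all?
    offset_seconds: float  # measured difference from other time sources
    stratum: int           # reported stratum (deliberately unused below)


def alerts_for(status: NtpStatus) -> list:
    """Return the names of alerts worth raising for this server."""
    alerts = []
    if not status.synchronized:
        alerts.append("not-synchronized")
    elif abs(status.offset_seconds) > MAX_OFFSET_SECONDS:
        alerts.append("time-offset-too-large")
    # Stratum is intentionally not checked: a high stratum alone does not
    # mean the server's time is bad.
    return alerts


high_stratum_but_fine = NtpStatus(synchronized=True, offset_seconds=0.002, stratum=9)
print(alerts_for(high_stratum_but_fine))  # prints []
```

A server that is synchronized and close to everyone else's time raises nothing here, whatever stratum it reports.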
Both of these are problems and maybe we should investigate the one that we know is happening, but neither of them is really a problem that we can do anything about. If there's a bug in the NTP server daemon we're using, our options are limited, and the off-network servers that are behaving oddly aren't under our control at all, so our only option is to maybe stop using them. However, it's not clear if we should do so. The NTP system has direct indications of the quality of remote NTP servers and is carefully designed to reject 'false tickers', sources of bad time. The server's stratum is not one of these markers of good or bad time; all it tells you is how many hops away from a true clock the server thinks it is. An NTP server can be a perfectly good source of time despite a high stratum, and in our case the affected off-network servers were; despite the high stratum, our local NTP server was using the off-network server with the high stratum as its time source (otherwise our local server wouldn't have had its high stratum).
If we kept this alert I'd want to try to dig into the off-network time servers that have this elevated stratum, because there's something mysterious going on there and that means there are potential problems. But there are mysteries and potential problems in many places and I have to choose my quests. I can't try to chase down every anomaly, even if there's a fixable problem at the root of every one of them. I need to pick the ones that matter to us, and someone else's NTP server having a high stratum is not one of those.