2020-06-06
Why sysadmins don't like changing things, illustrated
System administrators are famously reluctant to change anything unless they have to; once a system works, they like to freeze it that way and not touch it. This is sometimes written off as irrational over-concern, and to be honest sometimes it is; you can make a fetish out of anything. However, it isn't just superstition and fetish. We can say general things like 'on good systems, you control stability by controlling changes' and note that harmless changes aren't always actually harmless, but surely, if you take appropriate care, you can monitor your systems while applying controlled changes, promptly detect and understand any problems, and either fix them or roll back.
Well, let me tell you a story about that, and about spooky subtle action at a distance. (A story that I mentioned in passing recently.)
We have a Prometheus-based monitoring and alerting system that, among other things, sends out alert notifications; these come from a Prometheus component called the Alertmanager. Those alert notifications include the start and end times of the alerts (for good reasons), and since we generally deal in local time, these are in local time. Or at least they're supposed to be. Quite recently a co-worker noticed that these times were wrong; after a short investigation, it was obvious that they were in UTC. Further investigation showed that they hadn't always been in UTC; ever since we started with Prometheus in late 2018 they'd been in local time, as we expected, and then early in May they'd changed to UTC.
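To make the mechanics a bit more concrete: Alertmanager builds notification messages by expanding a Go template over the alerts it's notifying about, and each alert's .StartsAt and .EndsAt times end up in the message text printed in whatever time zone the time values themselves carry. As a rough sketch (this is not our actual template, just an illustration of the fields involved):

    {{ range .Alerts }}
    Alert {{ .Labels.alertname }} was firing
    from {{ .StartsAt }} until {{ .EndsAt }}.
    {{ end }}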
We have reasonably good records of what we've changed on our systems, so I could go back to see what we'd changed on the day when the alert times switched from local time to UTC, and I could also look at the current state of the system. What I expected to find was one of four changes: the system switching timezones for some reason, an Ubuntu package update of a time-related package, an update to Alertmanager itself (with a change related to this behavior), or the systemd service for Alertmanager putting it into UTC time. I found none of them. Instead, the cause of the timezone shift in our alert messages was an update to the Prometheus daemon, and the actual change in Prometheus was not even in its release notes (I found it only by searching the Git commit logs, which led me to the relevant commit).
Here is an undesirable change in overall system behavior that we didn't notice for some time and that was ultimately caused by us upgrading something that wasn't obviously related to the issue. The actual cause of the behavior change was considered so minor that it didn't appear in the release notes, so even reading them (both before and after the upgrade) didn't give us any sign of problems.
This shows us, once again, that in practice you can't notice all changes in behavior immediately, you can't predict them in advance through due diligence like reading release notes and trying things out on test systems, and they don't always come from the things you expect; a change in one place can cause spooky action at a distance. Our alert time stamps are formatted by Alertmanager when it generates notification messages, but it turned out that, through a long chain of events, a minor detail of how they were created inside Prometheus made a difference in our setup.
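To illustrate the mechanism in a hedged way (this is a small Go sketch of the general pattern, not the actual Prometheus or Alertmanager code): a Go time.Time carries its location with it through JSON and formatting, so if the sending side starts constructing its timestamps with .UTC(), the receiving side's unchanged formatting code quietly starts displaying UTC.

    package main

    import (
        "encoding/json"
        "fmt"
        "time"
    )

    // A stripped-down stand-in for the alert payload one daemon sends to
    // another; only the start time matters for this illustration.
    type alert struct {
        StartsAt time.Time `json:"startsAt"`
    }

    // render plays the part of the receiving side: it formats whatever
    // time it was handed, in the location that time carries.
    func render(payload []byte) string {
        var a alert
        if err := json.Unmarshal(payload, &a); err != nil {
            panic(err)
        }
        return a.StartsAt.Format("2006-01-02 15:04:05 -07:00")
    }

    func main() {
        // The sending side before the change: timestamps in local time.
        before, _ := json.Marshal(alert{StartsAt: time.Now()})
        // The sending side after a change like the one we ran into: .UTC().
        after, _ := json.Marshal(alert{StartsAt: time.Now().UTC()})

        // The receiver's code is identical in both cases, but the time
        // zone in the output silently follows the sender's choice.
        fmt.Println("sender used local time:", render(before))
        fmt.Println("sender used UTC:       ", render(after))
    }

Run on a machine whose local time zone isn't UTC, this prints two different offsets for what is conceptually the same kind of timestamp, even though the formatting code on the receiving side never changed.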