Systemd auto-restarts of units can hide problems from you

July 31, 2023

Today, more or less by coincidence, I discovered that the Prometheus host agent on our Linux machines was periodically crashing with an internal Go runtime error (which had already been noticed by other people and filed as issue #2705). You might wonder how we could fail to notice the host agent for our monitoring, metrics, and alerting system doing this. Part of the answer is that its systemd service unit is set to 'Restart=always', so every crash was followed by an automatic restart.

(We inherited this setting from the Ubuntu package's .service unit, which got it from the Debian package. We don't use the Ubuntu package any more, but we used its .service file as the starting point for ours, and it's broadly sensible to automatically restart the host agent if something goes wrong.)

There are a surprisingly large number of things that you probably won't notice going away briefly. If you don't look into the situation, it might seem like a short connectivity blip, or even be hidden from you by programs automatically retrying connections or operations. Telling systemd to auto-restart these things will thus tend to hide their crashes from you, which may be surprising. Still, auto-restarting and hiding crashes is likely better than having the service be down until you can restart it by hand. We certainly would rather have intermittent, crash-interrupted monitoring of our machines than not have monitoring for (potentially) some time.

Whether you want to monitor for this sort of thing (and how) is an open question. It's certainly possible that this is one of the times where your monitoring isn't going to be comprehensive, because the problem is infrequent enough, low impact enough, and hard enough to craft a specific alert for.

(I'm not certain if I'm going to bother trying to craft an alert for this, partly because there's not quite enough information exposed in the Prometheus host agent's systemd metrics to make it easy, or at least for me to be confident that it's easy. You do get the node_systemd_service_restart_total metric, which counts how many times a Restart= is triggered, but that doesn't necessarily say why the restart happened, and some things are restarted as part of normal operation, such as 'getty' services.)
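As a rough illustration of what a check built on node_systemd_service_restart_total might look like, here is a minimal Python sketch that compares two scrapes of the host agent's text metrics and reports units whose restart counter went up, skipping getty units. This is not something we run. The helper names, the assumption that the unit is carried in a 'name' label, and the list of ignored unit prefixes are all my own illustrative choices, and like the metric itself it tells you that a restart happened but not why.

```python
import re

# Assumed sample line shape (the 'name' label is an assumption about the metric):
#   node_systemd_service_restart_total{name="foo.service"} 3
_SAMPLE = re.compile(
    r'^node_systemd_service_restart_total\{[^}]*\bname="([^"]+)"[^}]*\}\s+(\S+)'
)


def restart_counts(exposition_text):
    """Map unit name -> restart counter value from one scrape's text output."""
    counts = {}
    for line in exposition_text.splitlines():
        m = _SAMPLE.match(line.strip())
        if m:
            counts[m.group(1)] = float(m.group(2))
    return counts


def newly_restarted(before, after, ignore_prefixes=("getty@", "serial-getty@")):
    """Return {unit: increase} for units whose counter rose between two scrapes.

    Units matching ignore_prefixes are skipped because they restart as part of
    normal operation. A counter that went down (a reset) is ignored.
    """
    changed = {}
    for unit, now in after.items():
        if unit.startswith(ignore_prefixes):
            continue
        delta = now - before.get(unit, 0.0)
        if delta > 0:
            changed[unit] = int(delta)
    return changed
```

You would feed restart_counts() the body of two fetches of the agent's metrics page taken some time apart and pass the results to newly_restarted(). In practice an alert rule inside Prometheus itself would be the more natural home for this logic; the sketch is only meant to show how little the counter gives you to work with.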

Even if we don't add a specific alert, in the future I'm going to want to remember to check for this when we're doing things like rolling out a new version of a program (such as the Prometheus host agent). It wouldn't hurt to look at the logs or the metrics, just in case. Of course there's a near endless number of things you can look at just in case, but having stubbed my toe on this once I may be more twitchy here for a while.
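For the 'just in case' look after a rollout, one low-effort option is to ask systemd directly how often it has auto-restarted a unit. The following is a minimal sketch, assuming a systemd recent enough to expose the NRestarts property on service units; the function names are hypothetical and the unit name in the usage note is only an example.

```python
import subprocess


def parse_show_output(text):
    """Parse 'Key=Value' lines as printed by 'systemctl show' into a dict."""
    props = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            props[key] = value
    return props


def unit_restart_info(unit):
    """Return (auto-restart count, time the unit last became active) for a unit.

    Assumes systemctl is available and the unit is a service unit that
    reports the NRestarts property.
    """
    out = subprocess.run(
        ["systemctl", "show", unit, "--property=NRestarts,ActiveEnterTimestamp"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    props = parse_show_output(out)
    return int(props.get("NRestarts", "0")), props.get("ActiveEnterTimestamp", "")
```

Calling unit_restart_info("prometheus-node-exporter.service") a day or so after an upgrade and seeing a non-zero count would be the cue to go read the unit's logs.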
