2023-08-02
The Prometheus host agent's metrics for systemd unit restarts
I recently wrote about how systemd's auto-restart of units can hide problems, where we discovered this was hiding failures of the Prometheus host agent itself. This raises the question of whether and how we can monitor for this sort of thing with our Prometheus setup. The answer turns out to be more or less yes.
The host agent has a systemd collector, which as of 1.6.1 isn't enabled by default (you enable it with '--collector.systemd'). This collector has several additional pieces of information it can collect from systemd; with '--collector.systemd.enable-restarts-metrics' it will collect metrics on 'restarts', and with '--collector.systemd.enable-start-time-metrics' it will collect metrics on the start times of units. The first option enables a node_systemd_service_restart_total metric and the second enables a node_systemd_unit_start_time_seconds metric.
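Putting those flags together, a minimal sketch of starting the host agent with all of this turned on might look like the following (the binary name and any other options are whatever your local setup uses):

# enable the systemd collector plus the extra restart and start-time metrics
node_exporter --collector.systemd \
    --collector.systemd.enable-restarts-metrics \
    --collector.systemd.enable-start-time-metrics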
The unit start time metric is pretty straightforward; it's the Unix timestamp of when the unit was last started, or '0' if the unit has never been started. This includes units that have started but exited, so you'll see the start time of a whole bunch of boot-time units. For units that aren't supposed to restart, you can detect persistent restarts with an alert rule like this, although you'll definitely want to be selective about what units you alert on (which I've omitted from this example):
- alert: AlwaysRecentRestarts
  expr: (time() - node_systemd_unit_start_time_seconds) < (60*2)
  for: 10m
(This is a similar idea to detecting recent reboots; I'm using time()
instead of node_time_seconds so I don't have to wrestle with
label issues.)
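If you want to restrict this to particular units, the systemd collector labels these metrics with the unit name (in a 'name' label), so you can add a matcher to the expression. A sketch, where the unit names are made-up placeholders for whatever you actually care about:

- alert: AlwaysRecentRestarts
  # 'name' is the systemd collector's unit name label; substitute your own units
  expr: (time() - node_systemd_unit_start_time_seconds{name=~"cron.service|sshd.service"}) < (60*2)
  for: 10m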
The node_systemd_service_restart_total metric counts the number of times a systemd unit has been restarted by a 'Restart=' trigger since the last time the unit was started or restarted normally. In the terms of george's comment on my entry, these are 'involuntary' restarts as opposed to 'voluntary' ones, and the information comes from the systemd 'NRestarts' unit property.
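If you want to cross-check a particular unit by hand, you can ask systemd for this property directly; the unit name here is just an illustration:

# shows how many times systemd has automatically restarted the unit
systemctl show -p NRestarts prometheus-node-exporter.service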
Because this metric is reset to zero if you manually restart a unit, in Prometheus terms you may want to consider this a gauge, not a counter. However, for many purposes using rate() instead of delta() probably makes for an alert that's more likely to trigger if things keep restarting. You might want to write a PromQL alert expression like this:
rate( node_systemd_service_restart_total[10m] ) > 3
  and
( node_systemd_service_restart_total > 0 )
The second clause avoids triggering the alert if you've manually restarted the service since the last automatic restart.
Looking at metrics for our Ubuntu machines, I see a small number of services that appear to auto-restart as an expected thing, particularly 'getty@' and 'serial-getty@' services. Your environment may have others, so you'll probably want to check your own systems to see which of your services do this.
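One quick way to make that check is to ask Prometheus which units currently have a non-zero count of automatic restarts:

node_systemd_service_restart_total > 0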
Whether you want to alert on too many automatic restarts (whatever 'too many' is for you), frequent restarts, or the inability of a service to stay up for long is something that you'll have to decide yourself. Our particular case wouldn't have triggered either of the example rules I've given here, because the Prometheus host agent wasn't crashing all that often (probably less than once a day, although I didn't really check). Only an alert on 'there have been too many automatic restarts of this' would have picked up the problem.
(Our case is tricky because the host agent can die and be restarted in situations that are more or less expected, like the host being out of memory. We don't really want to get a cascade of alerts about that.)
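If you do decide you want an alert on cumulative automatic restarts despite that, a minimal sketch might look like this (the alert name, threshold, and 'for' duration are all made up; pick whatever fits your tolerance):

- alert: TooManyAutoRestarts
  # '5' total automatic restarts since the last manual (re)start is an arbitrary threshold
  expr: node_systemd_service_restart_total > 5
  for: 15m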