Resetting the backoff restart delay for a systemd service
Suppose, not hypothetically, that your Linux machine is your DSL
PPPoE gateway, and you run the PPPoE software through a simple
script to invoke pppd that's run as a systemd .service unit. Pppd itself will exit if the link
fails for some reason,
but generally you want to automatically try to establish it again.
One way to do this (the simple way) is to set the systemd unit to
'Restart=always', with a restart delay.
Things like pppd generally benefit from a certain amount of backoff
in their restart attempts, rather than restarting either slowly or
rapidly all of the time. If your PPP(oE) link just dropped out
briefly because of a hiccup, you want it back right away, not in
five or ten minutes, but if there's a significant problem with the
link, retrying every second doesn't help (and it may trigger things
in your service provider's systems). Systemd supports this sort of
backoff if you set 'RestartSteps' and 'RestartMaxDelaySec'
to appropriate values. So you could wind up with, for example:
    Restart=always
    RestartSec=1s
    RestartSteps=10
    RestartMaxDelaySec=10m
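
To get a rough feel for what settings like these might produce, here is a small sketch that prints a possible delay schedule. It assumes the delay grows geometrically from 'RestartSec' to 'RestartMaxDelaySec' over 'RestartSteps' steps (each step multiplying the delay by the same factor). That growth curve is an assumption made for illustration, not something stated above, and the function name 'backoff_schedule' is made up. Check the real numbers on your system against 'RestartUSecNext'.

```shell
#!/bin/sh
# Hypothetical helper: print "step delay-in-seconds" for each restart step,
# ASSUMING geometric growth from the initial delay to the maximum delay.
# Arguments: initial delay (seconds), maximum delay (seconds), number of steps.
backoff_schedule() {
    awk -v r0="$1" -v rmax="$2" -v n="$3" 'BEGIN {
        for (i = 0; i <= n; i++) {
            # delay_i = r0 * (rmax / r0) ^ (i / n), written with exp() and log()
            printf "%d %.1f\n", i, r0 * exp(log(rmax / r0) * i / n)
        }
    }'
}
```

With the example settings you would call it as 'backoff_schedule 1 600 10' (10m being 600 seconds). Under this assumption the early restarts come within a few seconds of each other and only the last few steps wait for minutes.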
This works fine in general, but there is a problem lurking. Suppose that one day you have a long outage in your service but it comes back, and then a few stable days later you have a brief service blip. To your surprise, your PPPoE session is not immediately restarted the way you expect. What's happened is that systemd doesn't reset its backoff timing just because your service has been up for a while.
To see the current state of your unit's backoff, you want to look
at its properties, specifically 'NRestarts' and especially
'RestartUSecNext', which is the delay systemd will put on for the
next restart. You see these with 'systemctl show <unit>', or
perhaps 'systemctl show -p NRestarts,RestartUSecNext <unit>'.
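
If you check this often, a tiny filter can turn the 'Key=Value' output of 'systemctl show' into a one-line summary. This is only a sketch: the function name 'summarize_backoff' and the unit name 'pppoe.service' in the usage comment are made up, and it simply echoes back whatever value text systemctl prints.

```shell
#!/bin/sh
# Hypothetical helper: read "Key=Value" lines (as printed by 'systemctl show')
# on standard input and print a one-line summary of the backoff state.
summarize_backoff() {
    awk -F= '
        $1 == "NRestarts"       { restarts = $2 }
        $1 == "RestartUSecNext" { next_delay = $2 }
        END {
            printf "restarts so far: %s, next restart delay: %s\n", restarts, next_delay
        }
    '
}

# Usage (unit name is an assumption):
#   systemctl show -p NRestarts,RestartUSecNext pppoe.service | summarize_backoff
```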
To reset your unit's dynamic backoff time, you run 'systemctl
reset-failed <unit>'; this is the same thing you may need to do
if you restart a unit too fast and the start stalls.
(I don't know if manually restarting your service with 'systemctl restart <unit>' bumps up the restart count and the backoff time, the way it can cause you to run into (re)start limits.)
At the moment, simply doing 'systemctl reset-failed' doesn't seem to be enough to immediately re-activate a unit that is slumbering in a long restart delay. So the full scale, completely reliable version is probably 'systemctl stop <unit>; systemctl reset-failed <unit>; systemctl start <unit>'. I don't know how you see that a unit is currently in a 'RestartUSecNext' delay, or how much time is left on the delay (such a delay doesn't seem to be a 'job' that appears in 'systemctl list-jobs', and it's not a timer unit so it doesn't show up in 'systemctl list-timers').
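
The three-step sequence is easy to get out of order when you're typing it by hand during an outage, so it may be worth wrapping in a small function. This is a minimal sketch with a made-up function name ('restart_now'). Like the one-liner above, it runs every step regardless of whether the previous one succeeded, and it needs to be run as root.

```shell
#!/bin/sh
# Hypothetical helper: stop the unit, clear its restart counter and backoff,
# then start it again right away. Argument: the unit name.
restart_now() {
    unit=$1
    systemctl stop "$unit"          # cancel any pending automatic restart
    systemctl reset-failed "$unit"  # reset the backoff state
    systemctl start "$unit"         # start immediately
}

# Usage (unit name is an assumption):
#   restart_now pppoe.service
```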
If you feel like making your start script more complicated (and it runs as root), I believe that you could keep track of how long this invocation of the service has been running, and if it's long enough, run a 'systemctl reset-failed <unit>' before the script exits. This would (manually) reset the backoff counter if the service has been up for long enough, which is often what you really want.
(If systemd has a unit setting that will already do this, I was unable to spot it.)
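
The script-based approach could look something like the following sketch. Everything named here is an assumption for illustration: the function name 'run_and_maybe_reset', the default unit name 'pppoe.service', and the one hour threshold. It also assumes that running 'systemctl reset-failed' from inside the still-running service does reset the backoff, which is a belief rather than something tested here.

```shell
#!/bin/sh
# Hypothetical settings; override them in the environment if needed.
UNIT=${UNIT:-pppoe.service}       # the unit this script runs as (assumed name)
MIN_UPTIME=${MIN_UPTIME:-3600}    # seconds of uptime that count as "stable"

# Run the given command, and if it stayed up for at least MIN_UPTIME seconds,
# reset the unit's restart counter and backoff before returning.
# The command's own exit status is passed back to the caller.
run_and_maybe_reset() {
    start=$(date +%s)
    "$@"
    status=$?
    elapsed=$(( $(date +%s) - start ))
    if [ "$elapsed" -ge "$MIN_UPTIME" ]; then
        systemctl reset-failed "$UNIT"
    fi
    return "$status"
}

# Usage, as the last lines of the start script (pppd arguments elided):
#   run_and_maybe_reset pppd ...
#   exit $?
```

Because the function hands back pppd's exit status, systemd still sees the same success or failure it would have seen without the wrapper.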