Wandering Thoughts archives

2024-09-30

Resetting the backoff restart delay for a systemd service

Suppose, not hypothetically, that your Linux machine is your DSL PPPoE gateway, and you run the PPPoE software through a simple script to invoke pppd that's run as a systemd .service unit. Pppd itself will exit if the link fails for some reason, but generally you want to automatically try to establish it again. One way to do this (the simple way) is to set the systemd unit to 'Restart=always', with a restart delay.

Things like pppd generally benefit from a certain amount of backoff in their restart attempts, rather than always restarting slowly or always restarting rapidly. If your PPP(oE) link just dropped out briefly because of a hiccup, you want it back right away, not in five or ten minutes, but if there's a significant problem with the link, retrying every second doesn't help (and it may trigger things in your service provider's systems). Systemd (as of version 254) supports this sort of backoff if you set 'RestartSteps' and 'RestartMaxDelaySec' to appropriate values. So you could wind up with, for example:

[Service]
Restart=always
RestartSec=1s
RestartSteps=10
RestartMaxDelaySec=10m
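
To get a feel for what these settings produce, here is a rough sketch that prints a restart delay schedule. It assumes that the delay grows geometrically from 'RestartSec' to 'RestartMaxDelaySec' over 'RestartSteps' restarts and then stays at the maximum; that growth rule and the 'backoff_schedule' function are assumptions made for illustration, so treat the intermediate values as approximate.

```shell
#!/bin/sh
# Sketch only: print an assumed restart delay schedule.
# Assumption: the delay grows geometrically from RestartSec to
# RestartMaxDelaySec over RestartSteps restarts, then stays at the maximum.

backoff_schedule() {
    # $1 = RestartSec (seconds), $2 = RestartSteps, $3 = RestartMaxDelaySec (seconds)
    awk -v base="$1" -v steps="$2" -v max="$3" 'BEGIN {
        for (n = 1; n <= steps + 1; n++) {
            if (n > steps)
                delay = max
            else
                delay = base * exp(log(max / base) * (n - 1) / steps)
            printf "restart %d: %.1fs\n", n, delay
        }
    }'
}

# The example settings above: 1s, 10 steps, 10m (600 seconds).
backoff_schedule 1 10 600
```

Under that assumption the first restart waits one second and the eleventh and later restarts wait the full ten minutes, which is why a blip after a long outage can take so long to recover from.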

This works fine in general, but there is a problem lurking. Suppose that one day you have a long outage in your service but it comes back, and then a few stable days later you have a brief service blip. To your surprise, your PPPoE session is not immediately restarted the way you expect. What's happened is that systemd doesn't reset its backoff timing just because your service has been up for a while.

To see the current state of your unit's backoff, you want to look at its properties, specifically 'NRestarts' and especially 'RestartUSecNext', which is the delay systemd will apply to the next restart. You can see these with 'systemctl show <unit>', or more selectively with 'systemctl show -p NRestarts,RestartUSecNext <unit>'. To reset your unit's dynamic backoff time, you run 'systemctl reset-failed <unit>'; this is the same thing you may need to do if you restart a unit too fast and the start stalls.
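
As a small convenience, the two commands above can be wrapped in shell functions. This is only a sketch; the function names 'show_backoff' and 'reset_backoff' are made up here, and the unit name in the usage note is a placeholder.

```shell
#!/bin/sh
# Sketch: helpers around the commands discussed above.
# The function names are invented for this example.

# Print the restart count and the delay systemd plans to use next.
show_backoff() {
    systemctl show -p NRestarts,RestartUSecNext "$1"
}

# Clear the accumulated backoff state, then show what is left.
reset_backoff() {
    systemctl reset-failed "$1" && show_backoff "$1"
}
```

You would use this as, say, 'reset_backoff pppoe.service', with whatever your unit is actually called.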

(I don't know if manually restarting your service with 'systemctl restart <unit>' bumps up the restart count and the backoff time, the way it can cause you to run into (re)start limits.)

At the moment, simply doing 'systemctl reset-failed' doesn't seem to be enough to immediately re-activate a unit that is slumbering in a long restart delay. So the full-scale, completely reliable version is probably 'systemctl stop <unit>; systemctl reset-failed <unit>; systemctl start <unit>'. I don't know how to see that a unit is currently in a 'RestartUSecNext' delay, or how much time is left on the delay (such a delay doesn't seem to be a 'job' that appears in 'systemctl list-jobs', and it's not a timer unit so it doesn't show up in 'systemctl list-timers').
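
The stop, reset, and start sequence can be sketched as a shell function. The second function is a guess at one way to spot a unit that is waiting out a restart delay: it assumes that such a unit reports a 'SubState' of 'auto-restart', which is an assumption that should be checked on your systemd version, and it says nothing about how much of the delay is left. Both function names are invented for this example.

```shell
#!/bin/sh
# Sketch: force an immediate restart with the backoff state cleared.
# Function names are invented for this example.

# Run the three steps in order; the return status is that of the final start.
restart_without_backoff() {
    systemctl stop "$1"
    systemctl reset-failed "$1"
    systemctl start "$1"
}

# Succeeds if the unit appears to be waiting out a restart delay.
# Assumption: such a unit reports SubState=auto-restart.
in_restart_delay() {
    [ "$(systemctl show -p SubState --value "$1")" = "auto-restart" ]
}
```

For example, 'in_restart_delay pppoe.service && restart_without_backoff pppoe.service' (with a placeholder unit name) would only intervene when the unit looks like it is sleeping before its next restart.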

If you feel like making your start script more complicated (and it runs as root), I believe that you could keep track of how long this invocation of the service has been running, and if it's long enough, run a 'systemctl reset-failed <unit>' before the script exits. This would (manually) reset the backoff counter if the service has been up for long enough, which is often what you really want.
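
A minimal sketch of that idea follows. It assumes the start script runs as root, runs the real program in the foreground, and that running 'systemctl reset-failed' from inside the service behaves as hoped, which is not verified here. The function name 'run_and_maybe_reset', the threshold, and the unit and peer names in the usage note are all hypothetical.

```shell
#!/bin/sh
# Sketch: run a command, and if it stayed up long enough,
# reset the unit's backoff state before returning.
# The function name and its arguments are invented for this example.

run_and_maybe_reset() {
    unit=$1        # unit whose backoff state should be reset
    min_uptime=$2  # seconds the command must run to count as stable
    shift 2

    started=$(date +%s)
    "$@"           # the real service command, run in the foreground
    status=$?
    ended=$(date +%s)

    # Only a long-lived run resets the backoff; quick failures keep it.
    if [ $((ended - started)) -ge "$min_uptime" ]; then
        systemctl reset-failed "$unit"
    fi
    return "$status"
}
```

A start script might end with something like 'run_and_maybe_reset pppoe.service 3600 pppd nodetach call dsl-provider', so that a session that lasted at least an hour starts over from the short restart delay. The function passes the command's exit status through, so systemd still sees how pppd exited.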

(If systemd has a unit setting that will already do this, I was unable to spot it.)

linux/SystemdResettingUnitBackoff written at 22:48:53

