Resetting the backoff restart delay for a systemd service

September 30, 2024

Suppose, not hypothetically, that your Linux machine is your DSL PPPoE gateway, and you run the PPPoE software through a simple script to invoke pppd that's run as a systemd .service unit. Pppd itself will exit if the link fails for some reason, but generally you want to automatically try to establish it again. One way to do this (the simple way) is to set the systemd unit to 'Restart=always', with a restart delay.

Things like pppd generally benefit from a certain amount of backoff in their restart attempts, rather than restarting either slowly or rapidly all of the time. If your PPP(oE) link just dropped out briefly because of a hiccup, you want it back right away, not in five or ten minutes, but if there's a significant problem with the link, retrying every second doesn't help (and it may trigger things in your service provider's systems). Systemd supports this sort of backoff if you set 'RestartSteps' and 'RestartMaxDelaySec' to appropriate values. So you could wind up with, for example:

Restart=always
RestartSec=1s
RestartSteps=10
RestartMaxDelaySec=10m
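
To get a feel for what these settings produce, here is a minimal shell sketch that prints an approximate delay schedule. It assumes the delay grows geometrically from RestartSec to RestartMaxDelaySec over RestartSteps restarts; that is an assumption about the shape of the curve, and systemd's exact values and rounding may differ. The function name 'backoff_schedule' is made up for illustration.

```shell
# Print "restart-number delay-in-seconds" for an ASSUMED geometric backoff
# from the first delay to the cap, reached after the given number of steps.
# Arguments: first delay (seconds), maximum delay (seconds), number of steps.
backoff_schedule() {
    awk -v first="$1" -v cap="$2" -v steps="$3" 'BEGIN {
        for (n = 1; n <= steps + 1; n++) {
            if (n > steps)
                d = cap    # past the last step, stay at the cap
            else
                d = first * exp(log(cap / first) * (n - 1) / steps)
            printf "%d %.1f\n", n, d
        }
    }'
}
```

Running 'backoff_schedule 1 600 10' models the settings above (1 second growing to 10 minutes in 10 steps). Under this assumption the delay is still under half a minute at the sixth restart, and only the last few steps are measured in minutes.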

This configuration works fine in general, but there is a problem lurking. Suppose that one day you have a long outage in your service but it comes back, and then a few stable days later you have a brief service blip. To your surprise, your PPPoE session is not immediately restarted the way you expect. What's happened is that systemd doesn't reset its backoff timing just because your service has been up for a while.

To see the current state of your unit's backoff, you want to look at its properties, specifically 'NRestarts' and especially 'RestartUSecNext', which is the delay systemd will apply to the next restart. You see these with 'systemctl show <unit>', or perhaps 'systemctl show -p NRestarts,RestartUSecNext <unit>'. To reset your unit's dynamic backoff time, you run 'systemctl reset-failed <unit>'; this is the same thing you may need to do if you restart a unit too many times too quickly and systemd refuses to start it again.
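
As a small convenience, the inspection step could be wrapped up like this. It is only a sketch: the function name 'backoff_state' is hypothetical, and it assumes 'systemctl show -p' prints one 'Name=Value' line per property, which is how it normally behaves.

```shell
# Summarize the restart backoff state of a unit.
# Argument: the unit name (for example a hypothetical pppoe.service).
backoff_state() {
    unit=$1
    systemctl show -p NRestarts,RestartUSecNext "$unit" |
    while IFS='=' read -r key value; do
        case $key in
            NRestarts)       echo "restarts so far: $value" ;;
            RestartUSecNext) echo "next restart delay: $value" ;;
        esac
    done
}
```

You would run it as 'backoff_state <unit>' and follow up with 'systemctl reset-failed <unit>' if the next delay is larger than you want.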

(I don't know if manually restarting your service with 'systemctl restart <unit>' bumps up the restart count and the backoff time, the way it can cause you to run into (re)start limits.)

At the moment, simply doing 'systemctl reset-failed' doesn't seem to be enough to immediately re-activate a unit that is slumbering in a long restart delay. So the full scale, completely reliable version is probably 'systemctl stop <unit>; systemctl reset-failed <unit>; systemctl start <unit>'. I don't know how you see that a unit is currently in a 'RestartUSecNext' delay, or how much time is left on the delay (such a delay doesn't seem to be a 'job' that appears in 'systemctl list-jobs', and it's not a timer unit so it doesn't show up in 'systemctl list-timers').
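
Written out as a shell function, the full scale version might look like the following sketch. The name 'restart_now' is made up, and like the one-liner it runs all three steps regardless of whether an earlier one failed.

```shell
# Force an immediate restart with a clean backoff state.
# Argument: the unit name.
restart_now() {
    unit=$1
    systemctl stop "$unit"           # cancels any pending delayed restart
    systemctl reset-failed "$unit"   # clears the restart count and backoff
    systemctl start "$unit"          # start right away; this status is returned
}
```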

If you feel like making your start script more complicated (and it runs as root), I believe that you could keep track of how long this invocation of the service has been running, and if it's long enough, run a 'systemctl reset-failed <unit>' before the script exits. This would (manually) reset the backoff counter if the service has been up for long enough, which is often what you really want.
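
A minimal sketch of that idea follows. It is untested against a real outage; the function name 'run_with_backoff_reset' and the uptime threshold are hypothetical, and it relies on the belief above that 'systemctl reset-failed' run from inside the still-running service does reset the restart count.

```shell
# Run a command, and if it stayed up for at least a minimum number of
# seconds, reset the unit's restart backoff before returning.
# Arguments: unit name, minimum uptime in seconds, then the command to run.
run_with_backoff_reset() {
    unit=$1
    min_uptime=$2
    shift 2

    start=$(date +%s)
    status=0
    "$@" || status=$?        # run the real program in the foreground
    end=$(date +%s)

    # A long enough run counts as "the link was stable": forget old failures.
    if [ $((end - start)) -ge "$min_uptime" ]; then
        systemctl reset-failed "$unit"
    fi
    return "$status"
}
```

In the start script you would call something like 'run_with_backoff_reset <unit> 3600 pppd nodetach ...' as the last step. The script can't 'exec' pppd here, because it has to still be around after pppd exits in order to run the reset.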

(If systemd has a unit setting that will already do this, I was unable to spot it.)


Comments on this page:

Have you checked whether there are already open issues about this in systemd? I think it's reasonable to ask for this to be done automatically, even with a fourth setting, RestartResetBackoffTime or something like that...

By dcortez at 2024-10-01 13:04:47:

"if there's a significant problem with the [PPPoE] link, retrying every second doesn't help (and it may trigger things in your service provider's systems)."

I think Bell Canada's DSL authentication systems do (or used to) have some kind of "back-off time" of their own, such that they'll refuse authentication for a while after too many attempts. Have you noticed this too?

I haven't been paying attention lately, but OpenWRT has a tendency to "hammer" the login server, and doesn't seem to have any back-off setting. When I was initially setting it up, I occasionally had some login delays/failures and suspected this as the reason. It does eventually succeed.

By Ben Hutchings at 2024-10-02 13:20:25:

For the case of pppd you can alternatively include the configuration options:

persist
maxfail 0

but I don't know how sensible pppd's retry behaviour is.

By cks at 2024-10-02 14:20:11:

Bell Canada's DSL authentication systems do have a back-off requirement of their own; if you're trying to reconnect too fast, they'll stall you for a minute or two. I'm not sure if they lengthen this stall time if you keep trying (and failing), but if they do I don't think it gets too long.

You could in theory combine pppd's retries (for fast restarts after brief blips) and systemd's restarts (to try again periodically in larger outages), but that might get a bit tricky, especially if you combine it with Bell Canada's "whoa there, take a break" stuff. It's something for me to think about, at least.

(Another brute force option is just to have a cron job that does 'systemctl reset-failed <unit>' every day or every twelve hours or something. I may do this as a very simple workaround.)

Written on 30 September 2024.


Last modified: Mon Sep 30 22:48:53 2024