Daemons and the pragmatics of unexpected error values from system calls

January 7, 2019

Years ago, when I wrote about the danger of being overly specific in the errno values you look for, I used the example of an SMTP server daemon that died when it got an unexpected error from accept(). Recently, John Wiersba asked in a comment:

I'm not clear what you're suggesting here. Isn't logging the error code and aborting the right thing to do with unexpected errors? [...]

In practice, there are two situations in Unix programs, especially in daemons. The first situation is where a system call is more or less done once, is not expected to fail at all, and cannot really be fixed if it does fail. Here you generally want to fail out on any error. The second situation is where the system call may fail for transient reasons. One case is certainly accept(), since accept() is trying to return two sorts of errors, but there are plenty of other cases where a system call may fail temporarily and then work later (as dozzie mentioned in comments to yesterday's entry on accept()).
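As a rough illustration of the first situation, here is a minimal Python sketch of one-time setup that simply fails out on any error. The function name open_listener() and the wording of the message are my own inventions for the example; socket.create_server() is the standard library call (Python 3.8 and later).

```python
import socket
import sys


def open_listener(host, port):
    """Create the daemon's listening socket once, at startup.

    This is done a single time and is not expected to fail. If it does,
    nothing inside the program can fix it, so any OSError at all is fatal.
    """
    try:
        return socket.create_server((host, port))
    except OSError as exc:
        # No attempt to classify the errno: report it and give up.
        sys.exit(f"cannot listen on {host}:{port}: {exc}")
```

There is no list of errno values here at all, because every failure gets the same treatment.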

In the second situation, you cannot in general tell transient errors from persistent ones, because Unixes add both transient and persistent errno values to system calls over time. In a program run by hand you can often punt: you assume that every errno value you don't specifically recognize means a persistent error, exit on it, and leave it up to the user to run you again and hope that it works this time around. In a daemon you don't have this luxury, so the pragmatic question is whether it's more likely that your daemon has hit a new transient errno value or a new persistent one.

My view is that in most environments, the more likely, better, and safer answer for a daemon is that the unrecognized new errno value is a transient error. You already know that transient errors are possible for this system call and you're handling some of them, and you know that over sufficiently large amounts of time your list of transient errno values will be incomplete. Often you don't really expect the system call to ever fail with a persistent error, because your program is not supposed to do things like close the wrong file descriptor. In the unlikely event that you hit an unrecognized persistent error and keep retrying futilely, you'll burn extra CPU and perhaps spam logs. If you exit instead, in the much more likely event that you hit an unrecognized transient error, you'll take down the daemon (as happened for our SMTP server).
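To make this concrete, here is a minimal Python sketch of an accept() loop that takes this approach. It is an illustration under assumptions, not the code of any real daemon: the names PERSISTENT_ERRNOS and accept_forever(), the particular errno values in the set, and the fixed retry delay are all choices I made for the example. The idea is to list the few errors you are sure are persistent and to treat everything else, including errno values you have never heard of, as transient.

```python
import errno
import logging
import time

log = logging.getLogger("daemon")

# Errors that mean the listening socket itself is unusable, so retrying
# cannot help. This short list is an assumption made for the example.
PERSISTENT_ERRNOS = frozenset(
    {errno.EBADF, errno.EINVAL, errno.ENOTSOCK, errno.EOPNOTSUPP}
)


def accept_forever(listener, handle, retry_delay=0.1, sleep=time.sleep):
    """Accept connections until a known-persistent error occurs.

    Any OSError whose errno is not in PERSISTENT_ERRNOS is logged and
    retried after a short pause, whether or not we recognize it.
    """
    while True:
        try:
            conn, addr = listener.accept()
        except OSError as exc:
            if exc.errno in PERSISTENT_ERRNOS:
                log.error("accept() failed persistently: %s", exc)
                raise
            # Unrecognized errors land here and are assumed transient.
            log.warning("accept() failed, will retry: %s", exc)
            sleep(retry_delay)
            continue
        handle(conn, addr)
```

The pause before retrying is there to limit the CPU and log cost if the error turns out to be persistent after all.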

(If you do expect a certain amount of persistent errors even in the normal operation of your daemon, you may want a different answer.)

PS: Even for non-daemon programs, 'exit and let the user try again' is not necessarily the best or the most usable answer. As a hypothetical example, if your program first tries to make an IPv6 connection and then falls back to trying an IPv4 one if it gets one of a limited set of errnos, a new or just unexpected 'this IPv6 connection will never work' errno will probably make your program unusable.
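One way to sidestep this in the hypothetical example is to fall back on any failure at all instead of on a list of errnos. The following Python sketch assumes that approach; connect_any() and its timeout default are names and values invented for illustration.

```python
import socket


def connect_any(host, port, timeout=5.0):
    """Try IPv6 first, then IPv4, falling back on *any* OSError.

    No errno is inspected, so a new or unexpected IPv6 failure still
    leads to the IPv4 attempt instead of an early exit.
    """
    last_error = None
    for family in (socket.AF_INET6, socket.AF_INET):
        try:
            infos = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            last_error = exc
            continue
        for af, socktype, proto, _canonname, sockaddr in infos:
            sock = None
            try:
                sock = socket.socket(af, socktype, proto)
                sock.settimeout(timeout)
                sock.connect(sockaddr)
                return sock
            except OSError as exc:
                last_error = exc
                if sock is not None:
                    sock.close()
    # Everything failed; report the most recent error we saw.
    raise last_error or OSError("no addresses found for host")
```

The cost is that a truly hopeless host takes two attempts to fail instead of one, which seems a fair trade for a program that keeps working when the errno values change.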

(For instance, you might be running on one of the uncommon Linux machines that has IPv6 dual binding turned off, giving you some new errno values you hadn't seen before.)

Comments on this page:

By jhi@iki.fi at 2019-01-08 00:32:36:

To avoid log spam, especially with a daemon, one should also consider aggregating the errors, so that duplicate errors are squashed (but counted), and the entries with counts output every X seconds.

By Jenny D at 2019-01-08 02:44:57:

I like the Erlang way - let it crash, and then return to a known good state and keep working.

By Arthur Axel fREW Schmidt:

I'd add that in these cases it's usually a good idea to back off exponentially. At the very least this tends to keep crashloops from saturating shared resources (like disks full of logs, CPU, etc.)

By John Wiersba at 2019-01-09 00:48:35:

@Arthur Axel fREW Schmidt: I agree that exponential backoff can sometimes be appropriate and useful, but it's hard to imagine that a daemon should implement exponential backoff for every unexpected system call failure. Besides much more complicated code (and therefore more bugs), that would likely result in a kind of self-inflicted DOS, when a daemon doesn't abort and get immediately restarted, but instead stays alive for several seconds/minutes/hours hoping that an unanticipated error can be recovered by retrying.

Written on 07 January 2019.

Last modified: Mon Jan 7 21:21:30 2019