Daemons and the pragmatics of unexpected error values from system calls

January 7, 2019

When I wrote, years ago, about the danger of being overly specific in the errno values you look for, I used the example of an SMTP server daemon that died when it got an unexpected error from accept(). Recently, John Wiersba asked in a comment:

I'm not clear what you're suggesting here. Isn't logging the error code and aborting the right thing to do with unexpected errors? [...]

In practice, there are two situations in Unix programs, especially in daemons. The first situation is where a system call is more or less done once, is not expected to fail at all, and cannot really be fixed if it does fail. Here you generally want to fail out on any error. The second situation is where the system call may fail for transient reasons. One case is certainly accept(), since accept() is trying to return two sorts of errors, but there are plenty of other cases where a system call may fail temporarily and then work later (as dozzie mentioned in comments to yesterday's entry on accept()).
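As a minimal sketch of the first situation, here is how 'fail out on any error' might look in Python. The helper name must() is hypothetical, invented for this illustration; the point is only that every OSError from a one-time setup step is treated as fatal, without trying to interpret the errno.

```python
import logging
import sys


def must(step, what):
    """Run a one-time setup step; log and exit on any OSError.

    'step' is a zero-argument callable and 'what' is a label for the log.
    Both names are hypothetical, chosen for this sketch.
    """
    try:
        return step()
    except OSError as exc:
        # No attempt to classify the errno: the step is done once, is not
        # expected to fail, and there is nothing useful to retry.
        logging.critical("%s failed: %s (errno %s)", what, exc.strerror, exc.errno)
        sys.exit(1)
```

A daemon might wrap its startup bind this way, for example must(lambda: socket.create_server(("", 25)), "binding the SMTP port"), with the port here being purely illustrative.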

In the second situation, you cannot in general tell transient errors from persistent ones, because Unixes add both transient and persistent errno values to system calls over time. In a program run by hand you can often punt: you assume that every errno value you don't specifically recognize means a persistent error, exit on it, and leave it up to the user to run you again and hope that it works this time. In a daemon you don't have this luxury, so the pragmatic question is whether it's more likely that your daemon has hit a new transient errno value or a new persistent one.

My view is that in most environments, the more likely, better, and safer answer for a daemon is that the unrecognized new errno value is a transient error. You already know that transient errors are possible for this system call and you're handling some of them, and you know that over sufficiently large amounts of time your list of transient errno values will be incomplete. Often you don't really expect the system call to ever fail with a persistent error, because your program is not supposed to do things like close the wrong file descriptor. In the unlikely event that you hit an unrecognized persistent error and keep retrying futilely, you'll burn extra CPU and perhaps spam logs. If you exit instead, in the much more likely event that you hit an unrecognized transient error, you'll take down the daemon (as happened for our SMTP server).
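To make this concrete, here is a rough sketch of an accept() loop that inverts the usual logic: it lists the few errno values it believes are persistent and treats everything else, including values it has never heard of, as transient. The set of persistent errnos, the backoff numbers, and the function names are all assumptions made for this example, not a recommendation for any particular daemon. The backoff and the logging are there to limit the CPU burn and log spam if the guess turns out to be wrong.

```python
import errno
import logging
import time

# Errnos that suggest a broken listening socket or a program bug, where
# retrying cannot help. This short list is an illustrative assumption.
PERSISTENT_ACCEPT_ERRNOS = frozenset({errno.EBADF, errno.EINVAL, errno.ENOTSOCK})


def is_persistent(exc):
    """True only for errnos we specifically recognize as persistent."""
    return exc.errno in PERSISTENT_ACCEPT_ERRNOS


def next_delay(delay, cap=5.0):
    """Double the retry delay, starting at 0.05s and capping at 'cap' seconds."""
    return min(cap, max(0.05, delay * 2))


def accept_loop(listener, handle, sleep=time.sleep):
    """Accept connections forever, retrying on any unrecognized error."""
    delay = 0.0
    while True:
        try:
            conn, addr = listener.accept()
        except OSError as exc:
            if is_persistent(exc):
                logging.error("accept() failed persistently: %s", exc)
                raise
            # Unrecognized errno: assume it is transient, but back off so
            # that a wrong guess costs little CPU and few log lines.
            delay = next_delay(delay)
            logging.warning("accept() failed with errno %s (%s); retrying in %.2fs",
                            exc.errno, exc.strerror, delay)
            sleep(delay)
            continue
        delay = 0.0  # a success resets the backoff
        handle(conn, addr)
```

The sleep parameter exists only so the loop can be exercised without real delays; a real daemon would probably also want to rate-limit the warning itself.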

(If you do expect a certain amount of persistent errors even in the normal operation of your daemon, you may want a different answer.)

PS: Even for non-daemon programs, 'exit and let the user try again' is not necessarily the best or the most usable answer. As a hypothetical example, if your program first tries to make an IPv6 connection and then falls back to trying an IPv4 one if it gets one of a limited set of errnos, a new or just unexpected 'this IPv6 connection will never work' errno will probably make your program unusable.

(For instance, you might be running on one of the uncommon Linux machines that have IPv6 dual binding turned off, giving you some new errno values you hadn't seen before.)
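One way to avoid the trap in the hypothetical IPv6-then-IPv4 program is to fall back on any OSError instead of on a limited set of errnos. The following is a sketch under that assumption; the function name and the injectable getaddrinfo and make_socket parameters are made up for the example (they default to the real socket module functions and exist so the logic can be tested without a network).

```python
import socket


def connect_prefer_ipv6(host, port, timeout=10.0,
                        getaddrinfo=socket.getaddrinfo,
                        make_socket=socket.socket):
    """Try IPv6 addresses first, then IPv4, moving on after ANY OSError."""
    infos = getaddrinfo(host, port, type=socket.SOCK_STREAM)
    # Stable sort that puts AF_INET6 entries ahead of everything else.
    infos = sorted(infos, key=lambda info: info[0] != socket.AF_INET6)
    last_error = None
    for family, socktype, proto, _canonname, sockaddr in infos:
        sock = None
        try:
            sock = make_socket(family, socktype, proto)
            sock.settimeout(timeout)
            sock.connect(sockaddr)
            return sock
        except OSError as exc:
            # Deliberately no errno check: a new or unexpected 'this will
            # never work' error should not stop us trying the next address.
            last_error = exc
            if sock is not None:
                sock.close()
    if last_error is None:
        raise OSError("no addresses found for %s" % host)
    raise last_error
```

If every address fails, the caller still gets the last error, so nothing is hidden; the program just doesn't give up early because of an errno it didn't anticipate.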

Written on 07 January 2019.
