Daemons and the pragmatics of unexpected error values from system calls
Years ago, when I wrote about the danger of being overly specific in
the errno values you look for, I used the example of an SMTP server
daemon that died when it got an unexpected error from accept().
Recently, John Wiersba asked in a comment:
I'm not clear what you're suggesting here. Isn't logging the error code and aborting the right thing to do with unexpected errors? [...]
In practice, there are two situations in Unix programs, especially
in daemons. The first situation is where a system call is more or
less done once, is not expected to fail at all, and cannot really
be fixed if it does fail. Here you generally want to fail out on
any error. The second situation is where the system call may fail
for transient reasons. One case is certainly accept(), since accept()
is trying to return two sorts of errors, but there are plenty of other
cases where a system call may fail temporarily and then work later (as
dozzie mentioned in comments to yesterday's entry on accept()).
In the second situation, you cannot tell transient errors from persistent ones, not in general, because Unixes add both transient and persistent errno values to system calls over time. In a program run by hand you can often punt; you assume that all errno values you don't specifically recognize mean persistent errors, exit on them, and leave it up to the user to run you again and hope that this time around it will work. In a daemon you don't have this luxury, so the pragmatic question is whether it's more likely that your daemon has hit a new transient errno value or a new persistent one.
My view is that in most environments, the more likely, better, and safer answer for a daemon is that the unrecognized new errno value is a transient error. You already know that transient errors are possible for this system call and you're handling some of them, and you know that over sufficiently large amounts of time your list of transient errno values will be incomplete. Often you don't really expect the system call to ever fail with a persistent error, because your program is not supposed to do things like close the wrong file descriptor. In the unlikely event that you hit an unrecognized persistent error and keep retrying futilely, you'll burn extra CPU and perhaps spam logs. If you exit instead, in the much more likely event that you hit an unrecognized transient error, you'll take down the daemon (as happened for our SMTP server).
(If you do expect a certain amount of persistent errors even in the normal operation of your daemon, you may want a different answer.)
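As a rough illustration of this policy, here is a minimal Python sketch of an accept loop that treats any errno it doesn't recognize as transient. The names `FATAL_ERRNOS`, `should_retry` and `accept_loop` are hypothetical, and the set of 'persistent' errno values is only an assumption about what indicates a bug in the daemon itself; a real daemon would pick its own list and its own backoff policy.

```python
import errno
import logging
import time

# Assumption: errno values that point at a bug in the daemon itself
# (bad or non-socket file descriptor, bad arguments), not at a transient
# condition. This list is illustrative, not authoritative.
FATAL_ERRNOS = {errno.EBADF, errno.EINVAL, errno.ENOTSOCK, errno.EFAULT}


def should_retry(err):
    """Treat every errno we don't positively know to be persistent as transient."""
    return err.errno not in FATAL_ERRNOS


def accept_loop(listener, handle, max_backoff=1.0, sleep=time.sleep):
    """Accept connections until accept() fails with a known-persistent error."""
    delay = 0.0
    while True:
        try:
            conn, addr = listener.accept()
        except OSError as err:
            if not should_retry(err):
                raise  # known-persistent error: log-and-abort is reasonable here
            # Unrecognized or known-transient error: log it, back off, try again.
            logging.warning("accept() failed with errno %s, retrying", err.errno)
            delay = min(max(delay * 2, 0.01), max_backoff)
            sleep(delay)
            continue
        delay = 0.0  # a successful accept() resets the backoff
        handle(conn, addr)
```

The capped backoff is there to limit the cost of guessing wrong: if the unrecognized error turns out to be persistent, the daemon retries at most once per `max_backoff` seconds instead of spinning and flooding the logs.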
PS: Even for non-daemon programs, 'exit and let the user try again'
is not necessarily the best or the most usable answer. As a
hypothetical example, if your program first tries to make an IPv6
connection and then falls back to trying an IPv4 one if it gets one
of a limited set of errnos, a new or just unexpected 'this IPv6
connection will never work' errno will probably make your program
unusable.
(For instance, you might be running on one of the uncommon Linux
machines that has IPv6 dual binding turned off, giving you some new errno
values you hadn't seen before.)
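One way to avoid that trap, sketched in Python under some assumptions: fall back on any OSError from the IPv6 attempt instead of checking against a fixed set of errno values. The function name `connect_with_fallback` and its list of (family, address) candidates are made up for illustration; in practice the candidates would usually come from something like socket.getaddrinfo().

```python
import socket


def connect_with_fallback(candidates, timeout=5.0):
    """Try each (family, sockaddr) pair in order; return the first connected socket.

    Hypothetical helper: callers list IPv6 candidates first, then IPv4 ones.
    """
    last_error = None
    for family, sockaddr in candidates:
        sock = None
        try:
            sock = socket.socket(family, socket.SOCK_STREAM)
            sock.settimeout(timeout)
            sock.connect(sockaddr)
            return sock
        except OSError as err:
            # Deliberately no errno allowlist: any failure, including an
            # errno we have never seen, just moves us on to the next address.
            last_error = err
            if sock is not None:
                sock.close()
    if last_error is None:
        raise OSError("no candidate addresses were given")
    raise last_error  # every candidate failed; report the most recent error
```

The program only gives up once every candidate has failed, so a surprising errno on the IPv6 side costs one failed attempt and not the whole connection.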