accept(2)'s problem of trying to return two different sorts of errors

January 6, 2019

A long time ago, I wrote about the dangers of being overly specific in the errno values you looked for, with the specific case being a daemon that exited because an accept() system call got an ECONNRESET that it didn't expect. Recently, John Wiersba left a comment on that entry asking what else the original programmer should have done, given an unexpected error from accept(). In thinking about the issues, I realized that part of the problem is that accept() is actually returning two different sorts of errors and the Unix API doesn't provide it any good way to let people tell the two different sorts apart.

(These days accept() is standardized to return ECONNABORTED instead of ECONNRESET in these circumstances, although this may not be universal.)

The two sorts of errors that accept() is trying to return are errors in the accept() call itself, such as a bad file descriptor (EBADF, ENOTSOCK) or a bad parameter (EFAULT), and errors in the new connection that accept() may or may not be returning (EAGAIN, ECONNABORTED, etc.). One difference between the two is that errors of the first sort are probably permanent unless the program fixes them somehow, and they generally indicate an internal program error, while errors of the second sort will go away if you correctly loop through your accept() sequence again.
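As a rough illustration of the split, here is a minimal Python sketch that sorts an errno from accept() into one of the two groups. The function name and the "call" / "connection" / "unknown" labels are made up for this example, and the sets contain only the errno values named in this entry, so they are certainly not exhaustive.

```python
import errno

# Errors in the accept() call itself (only the ones named in the text).
CALL_ERRORS = frozenset({errno.EBADF, errno.ENOTSOCK, errno.EFAULT})

# Errors in the pending connection (again, only the ones named in the text).
CONNECTION_ERRORS = frozenset(
    {errno.EAGAIN, errno.EWOULDBLOCK, errno.ECONNABORTED, errno.ECONNRESET}
)


def accept_error_kind(err):
    """Hypothetical helper: return 'call', 'connection', or 'unknown'."""
    if err in CALL_ERRORS:
        return "call"
    if err in CONNECTION_ERRORS:
        return "connection"
    # Anything not listed above cannot be placed in either group from
    # the errno value alone.
    return "unknown"
```

The "unknown" result is the interesting one: any errno that is in neither set leaves the caller without a principled way to decide.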

A sensibly behaving network daemon should definitely not exit when it gets the second sort of error; it should instead just continue on with its processing loop. However, it's perfectly sensible and probably broadly correct to exit if you get the first sort of error, especially if it's an unknown error and you have no idea how to correct it in your code. If someone has closed a file descriptor on you or it's become a non-socket somehow, continuing will generally just get you an unending stream of the same error over and over (and burn CPU, and perhaps flood logs). Exiting is a perfectly sensible way out and often really the only thing you can do.

However, you can't reliably distinguish between these two types of errors unless you believe you can know all of the possible errnos for one or the other of them. Given the general habit of Unixes of adding more errno returns for system calls over time, the practical reality is that you can't. This unfortunately leaves authors of Unix network daemons sort of up in the air; they have to pick one way or the other, and either way might give the wrong answer in some circumstances.
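To make the forced choice concrete, here is a hedged sketch of an accept loop in Python where the decision about unrecognized errors is an explicit parameter. The names accept_loop, handle, and unknown_is_fatal are hypothetical, the errno sets are just the ones mentioned in this entry, and the listener is assumed to be a blocking socket (or anything with a compatible accept() method).

```python
import errno

CALL_ERRORS = frozenset({errno.EBADF, errno.ENOTSOCK, errno.EFAULT})
CONNECTION_ERRORS = frozenset(
    {errno.EAGAIN, errno.EWOULDBLOCK, errno.ECONNABORTED, errno.ECONNRESET}
)


def accept_loop(listener, handle, unknown_is_fatal=True):
    """Accept connections until an error we treat as fatal.

    Returns the OSError that ended the loop so the caller can log it
    and exit.
    """
    while True:
        try:
            conn, addr = listener.accept()
        except OSError as exc:
            if exc.errno in CONNECTION_ERRORS:
                # Problem with one pending connection: just go around
                # again. (A non-blocking listener would normally wait
                # in select/poll first; this sketch assumes blocking.)
                continue
            if exc.errno in CALL_ERRORS or unknown_is_fatal:
                # Either a known call-level error, or an unknown errno
                # that this policy has chosen to treat as fatal.
                return exc
            # Unknown errno and the policy says to keep going.
            continue
        handle(conn, addr)
```

Neither setting of unknown_is_fatal is right in general. True can kill the daemon over a harmless new connection-level errno, while False can leave it spinning on a permanently broken listening socket.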

(Perhaps accept() should never have returned the second sort of errors, leaving them all to be discovered on a subsequent use of the file descriptor it returned. But that ship sailed a very long time ago; accept() returning these sorts of errors is even in the Single UNIX Specification for accept().)

I suspect that accept() is not the only system call with this sort of split in types of errors (although I can't think of any others off the top of my head). But thankfully I don't think there are too many others, because accept()'s pattern of operation is an unusual one.

PS: The Linux accept() manpage actually has a warning about Linux's behavior here, in the RETURN VALUE section. Linux opts to immediately return a lot of errors detected on the new socket, while other Unixes generally postpone some of them. But note that any Unix can return ECONNABORTED.


Comments on this page:

By dozzie at 2019-01-07 05:48:55:

@cks:

I suspect that accept() is not the only system call with this sort of split in types of errors (although I can't think of any others off the top of my head).

read()/write() in non-blocking mode, and similarly send*()/recv*(). SysV IPC is another place. setuid(), execve(), and fork() are also hairy in this regard, and can return a transient EAGAIN on hitting process limits, among other things.

I think it's a quite common pattern in unix API that there are two types of errors, transient and permanent.

By loreb at 2019-01-08 14:59:50:

djb used a function called error_temp(errno) to tell if an errno value corresponds to a temporary/permanent failure; it survives in skalibs, libstddjb & similar projects.

I believe Windows has or used to have something similar, but I'm unable to find it right now (it was on The Old New Thing, in case anyone with better google-fu wants to try); nowadays they suggest checking GetLastError without specifying all the possible errors, which is a funny way to say "this list is not exhaustive, we may add new error cases in the future".

By John Wiersba at 2019-01-09 00:32:45:

DJB says: A hard error is persistent: file not found, read-only file system, symbolic link loop, etc. A soft error is usually transient: out of memory, out of disk space, I/O error, disk quota exceeded, connection refused, host unreachable, etc.

It's not clear to me that the distinction between transient and permanent is all that useful. Clearly EAGAIN/EWOULDBLOCK, and maybe EBUSY or ENOSPC or EDQUOT, indicate some kind of retry is in order. Perhaps EINTR can be safely followed by a retry under many (all?) conditions. Some errors, like ECONNABORTED or ECONNRESET can indicate that simply abandoning a specific connection is appropriate, hopefully with no resource leakage.

But it seems that most other errors are non-recoverable and should be handled by aborting with an appropriate log message. Certainly, I would think that any error code that pops up unexpectedly and without advance consideration should be handled that way. If it hasn't been planned for in advance, how can it be handled with any assurance of reliability or correctness?

Written on 06 January 2019.


Last modified: Sun Jan 6 23:11:46 2019