Learning from Unicorn: the accept() thundering herd non-problem

December 4, 2009

For a long time, one of my acquired articles of faith has been that you needed to worry about and avoid the thundering herd accept() problem. (This is the problem where if you have a bunch of processes all waiting in accept() and a single connection comes in, many kernels will wake all of the processes up only to have most of them immediately go back to sleep.)
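To make this concrete, here is a minimal Python sketch of the sort of preforking accept() setup I mean; the address and worker count are made up for illustration, and a real preforking server is obviously more elaborate than this:

  import os, socket

  LISTEN_ADDR = ("127.0.0.1", 8000)   # illustrative address, not anything real
  NUM_WORKERS = 4                     # a real server would size this to the machine

  # The listening socket is created once, before forking, so every worker
  # inherits the same socket and blocks in accept() on it.
  listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
  listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
  listener.bind(LISTEN_ADDR)
  listener.listen(128)

  for _ in range(NUM_WORKERS):
      if os.fork() == 0:
          # Child: sit in accept() along with every other worker.  This is
          # where a thundering herd could happen when a connection comes in.
          while True:
              conn, addr = listener.accept()
              conn.close()    # a real worker would handle the request here

  # Parent: just wait around for the workers (kill the whole thing to stop it).
  for _ in range(NUM_WORKERS):
      os.wait()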

One of the things that Unicorn (via Ryan Tomayko and others) has taught me about preforking servers is that the thundering herd problem doesn't matter (under sane circumstances), because it is only a problem when the system is idle, and when the system is idle you generally don't care about the extra overhead. When the system is busy, almost all of your worker processes are busy working, not sitting in accept(); the busier your system, the more you care about the extra potential overhead but the fewer workers are sitting in accept(), and so the less overhead there actually is. At the limit, your system is so loaded that there is always a connection waiting for accept() and you have no herd at all, no matter how many worker processes you have.

(And this assumes that your kernel is susceptible to the thundering herd problem at all. The kernel is perfectly capable of only waking one process per pending connection, instead of waking them all and letting the scheduler pick a winner.)

Now, this does depend on how many workers you have and how many of them wind up idle. However, there are two things that mitigate this. First, you generally don't want to have many more workers than you have processors, so on most systems you're not going to have many worker processes in the first place. Second, modern systems already have relatively low overheads for a full thundering herd situation.

Sidebar: some numbers on the overhead

You might ask how low is relatively low. I did a quick test, and on a reasonably modern but not spectacular 64-bit dual-core Linux machine, my Python test harness ran at most only 26 microseconds slower per accept() call when there were 256 waiting processes. On older, slower FreeBSD hardware I also have access to, the overhead for 256 waiting processes rose to a whole 62 microseconds.

The test harness has a server program that forks N copies, each of which calls accept() and then immediately closes the resulting socket, and a client program that repeatedly creates a socket, connect()s to the server, and then does a recv() on the socket (which returns once the server closes its end). I timed how long the client program took for various numbers of server processes and worked from there.
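In rough outline (this is a sketch of the idea, not the exact harness I ran), the server side looks much like the earlier preforking sketch, since each worker just closes connections immediately, and the client side is a timing loop along these lines, with a made-up address and iteration count:

  import socket, time

  SERVER_ADDR = ("127.0.0.1", 8000)   # must match wherever the server is listening
  ITERATIONS = 10000                  # illustrative; more iterations give steadier numbers

  start = time.time()
  for _ in range(ITERATIONS):
      s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
      s.connect(SERVER_ADDR)
      # recv() returns an empty result (EOF) once the server closes its end.
      s.recv(1024)
      s.close()
  elapsed = time.time() - start

  print("%d connections, %.1f microseconds each" % (ITERATIONS, elapsed * 1e6 / ITERATIONS))

Comparing the per-connection time with one waiting server process against the time with 256 of them gives the extra overhead per accept().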


Comments on this page:

From 65.172.155.230 at 2009-12-04 14:56:32:

There are a lot of things you are missing here, IMO.

  1. Thundering herd in accept() has been fixed in Linux/FreeBSD for years now. If you have 256 processes waiting in accept() and one connection, only one will be woken up.

  2. IIRC (it's been a while) the big problem with accept was that some crappy Unixes would do really weird stuff if N processes hit accept() at once, so there were really horrible workarounds like using poll() and then taking locks (for things that cared, like Apache-httpd). Also Apache-httpd may have "needed" poll()+locks so it could wait on more than just the listening socket (maybe if you had SSL too, as I said it's been a while).

  3. Apache-httpd used lots more processes than cores.

  4. Apache-httpd was big, esp. with things like mod_perl loaded.

  5. Even without the locking overhead, due to #4 the biggest problem was all the memory those 256 procs. touched ... a lot of which wasn't shared, stuff might be swapped in and your CPU cache was definitely blown.

...also, obviously, HW 5-10 years ago sucked compared to today :). So I wouldn't guarantee your conclusion is wrong now anyway.

After accept() got fixed, the only time I remember hearing about it much is wrt. threading+locks where someone used a broadcast lock (and the lock was in the fast path).

By cks at 2009-12-05 23:36:42:

My impression is that Apache needed (and needs) a lot of processes because it has a one-process-per-client model and clients may be more or less arbitrarily slow (on a dialup modem, at the end of a slow satellite link, behind a very congested line, etc). So you had to allow as many Apache processes as you'd ever have simultaneous slow clients, more or less, and this could be a large number even for a moderate-sized site.

Modern things like Unicorn assume that they're being run behind a lightweight reverse proxy that handles all of that, and so their workers will clear requests basically at full native speed without having to worry about slow clients. This makes it possible to run many fewer of them, basically N per core where N is some small number.
