2009-12-20
Some things about getting useful output from time
No version of time has ever had what you could call a genuinely useful
default output format, at least not for benchmarking and testing. The
POSIX (and System V) standard format has all of the information you want
in the right format (the times in simple seconds), but spreads it across
multiple lines; the BSD format used by csh puts the times in 'human
readable' format of hours, minutes, and seconds, and includes a bunch
of additional information (much of which is inaccurate, useless, or
outright made up by the kernel).
Fortunately, GNU time lets you set a different format. Since I keep looking this up, here is what I want to use:
time -f '%e real %U user %S kernel' ....
I want the times in pure seconds, not in 'human readable' converted form, because I am almost invariably performing math on them in order to find out the information that I'm really interested in.
(Perhaps somewhere there is a convenient Python module for doing this
with human readable numbers, but frankly bc works fine with seconds so
my motivation to go looking is low.)
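That said, if you ever do want the math done programmatically rather than with bc, here is a minimal sketch of the idea in Python; it assumes GNU time is installed as /usr/bin/time and relies on the fact that it writes its one-line report to standard error:

import subprocess

FORMAT = "%e real %U user %S kernel"

def timed_run(cmd):
    # Run cmd under GNU time with our single-line format; GNU time
    # writes its report to standard error.
    proc = subprocess.run(["/usr/bin/time", "-f", FORMAT] + cmd,
                          stderr=subprocess.PIPE, text=True)
    # The report is the last line of stderr (the command's own stderr
    # output, if any, comes before it).
    fields = proc.stderr.strip().splitlines()[-1].split()
    real, user, kernel = float(fields[0]), float(fields[2]), float(fields[4])
    return real, user, kernel

real, user, kernel = timed_run(["sleep", "1"])
print("CPU seconds:", user + kernel, "of", real, "real seconds")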
FreeBSD's time program has no format specifier option, but defaults
to essentially this format. Solaris's time program also has no format
specifier option and defaults to a 'human readable' multi-line format;
the best you can do is make it use the POSIX format with 'time -p'.
On basically all systems, csh and tcsh have their own time builtin
which uses the BSD csh format or some variant of it (and can be
customized to some extent).
Now, let's talk about why I say that time makes up some of its
numbers.
time generally gets its numbers from getrusage() (okay,
technically it gets them from wait4(), which returns the same
information). However, various versions of getrusage() explicitly
document that some information is plain not returned or is not what you
think it is. The Linux manpage for getrusage() says, for example:
[...] Not all fields are meaningful under Linux. In Linux 2.4 only the fields ru_utime, ru_stime, ru_minflt, and ru_majflt are maintained. Since Linux 2.6, ru_nvcsw and ru_nivcsw are also maintained.
(Note that time determines the wall clock time itself.)
This corresponds to time's user and system time, major and minor page
faults, and voluntary and involuntary context switches (%w and %c, not
printed by default), and it assumes that the Linux kernel does a good
job of keeping track of this information, which I would be dubious of.
The kernel has good intentions, but doing completely accurate accounting
can be hard and is not necessarily a high priority.
(Current Linux 2.6 kernels seem to also produce numbers for IO input and output, provided that you have the right kernel configuration options turned on. Again, the trustworthiness of those numbers is unknown.)
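To make the plumbing concrete, here is a minimal Python sketch of what time is doing; it uses os.wait4() to get the same rusage information, and which of the fields beyond the times are trustworthy is entirely up to your kernel:

import os, sys, time

def run_and_measure(argv):
    start = time.time()
    pid = os.fork()
    if pid == 0:
        os.execvp(argv[0], argv)      # child: run the command
    # Parent: wait4() returns the child's resource usage, which is
    # where time gets its numbers.
    _, status, ru = os.wait4(pid, 0)
    real = time.time() - start
    print("%.2f real %.2f user %.2f kernel" % (real, ru.ru_utime, ru.ru_stime))
    print("minor faults:", ru.ru_minflt, "major faults:", ru.ru_majflt)
    print("voluntary ctx switches:", ru.ru_nvcsw,
          "involuntary:", ru.ru_nivcsw)

run_and_measure(sys.argv[1:] or ["sleep", "1"])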
Currently, FreeBSD 8.0 says:
The numbers ru_inblock and ru_oublock account only for real I/O; data supplied by the caching mechanism is charged only to the first process to read or write the data.
I'm not going to quote from the Solaris 10 getrusage() manpage, but it has a
similar long list of caveats and cautions.
Now, it's possible that time will return real and useful
information for things you are interested in (beyond the basic time
information). But before you rely on its numbers for performance
measurement, you certainly need to check to make sure that they
are accurate and useful. This should include not just reading the
getrusage() manpage for your OS, but actually constructing synthetic
programs that perform known operations and making sure that time
reports accurate information for what they do.
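As an illustration, here is one minimal sketch of such a synthetic check (this one for minor page faults; the counts will never match exactly because the process faults in pages of its own, but they should be in the right ballpark):

import mmap, resource

# Touch a known number of fresh pages and see whether ru_minflt moves
# by roughly that amount; if it doesn't, don't trust the minor fault
# numbers that time reports either.
PAGES = 10000
pagesize = resource.getpagesize()

before = resource.getrusage(resource.RUSAGE_SELF).ru_minflt
m = mmap.mmap(-1, PAGES * pagesize)
for i in range(PAGES):
    m[i * pagesize] = 1          # first write to each page faults it in
after = resource.getrusage(resource.RUSAGE_SELF).ru_minflt

print("expected about", PAGES, "minor faults, saw", after - before)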
2009-12-04
Learning from Unicorn: the accept() thundering herd non-problem
For a long time, one of my acquired articles of faith has been that
you needed to worry about and avoid
the thundering herd accept() problem. (This is the problem where if
you have a bunch of processes all waiting in accept() and a single
connection comes in, many kernels will wake all of the processes up only
to have most of them immediately go back to sleep.)
One of the things that Unicorn (via
Ryan Tomayko and others)
has taught me about preforking servers is that the thundering herd
problem doesn't matter (under sane circumstances), because it is only
a problem when the system is idle and when the system is idle you
generally don't care about the extra overhead. When the system is busy,
almost all of your worker processes are busy working, not sitting in
accept(); the busier your system, the more you care about the extra
potential overhead but the fewer workers are sitting in accept() and
so the less overhead there actually is. At the limit your system is so
loaded that there is always a connection waiting for accept() and you
have no herd at all, no matter how many worker processes you have.
(And this assumes that your kernel is susceptible to the thundering herd problem at all. The kernel is perfectly capable of only waking one process per pending connection, instead of waking them all and letting the scheduler pick a winner.)
Now, this does depend on how many workers you have and how many of them wind up idle. However, there are two things that mitigate this. First, you generally don't want many more workers than you have processors, so on most systems you're not going to have many worker processes in the first place. Second, modern systems already have relatively low overheads for a full thundering herd situation.
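For concreteness, here is a stripped-down sketch of a preforking accept() loop (in Python, with an arbitrary port; this is not Unicorn's actual code). Every worker blocks in accept() on the same inherited listening socket, and how many of them the kernel wakes per new connection is up to it:

import os, socket

listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
listener.bind(("", 8000))
listener.listen(128)

NWORKERS = 4
for _ in range(NWORKERS):
    if os.fork() == 0:
        # Each worker inherits the listening socket and blocks in
        # accept(); on a busy server most workers are down here
        # handling connections, not waiting in accept().
        while True:
            conn, addr = listener.accept()
            conn.sendall(b"hello\n")
            conn.close()

# Parent just waits for the workers.
for _ in range(NWORKERS):
    os.wait()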
Sidebar: some numbers on the overhead
You might ask how low is relatively low. I did a quick test, and on a
reasonably modern but not spectacular 64-bit dual core Linux machine,
my Python test harness ran at most only 26 microseconds slower per
accept() call when there were 256 waiting processes. On older,
slower FreeBSD hardware I also have access to, the overhead for 256
waiting processes rose to a whole 62 microseconds.
The test harness has a server program that forks N copies, each of which
accepts() and then immediately closes the resulting socket, and a
client program that repeatedly creates a socket, connect()s to the
server, and then does a recv() on the socket (which will exit when the
server closes it). I timed how long the client program took for various
numbers of server processes and worked from there.
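For the curious, here is a rough sketch of what such a harness looks like (a reconstruction of the general structure, not the exact code I ran; the port number is arbitrary):

import os, socket, sys, time

PORT = 8001                       # arbitrary port for the test

def server(nprocs):
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind(("127.0.0.1", PORT))
    listener.listen(128)
    for _ in range(nprocs):
        if os.fork() == 0:
            # Worker: sit in accept() and close each connection at
            # once, so any per-accept wakeup overhead dominates.
            while True:
                conn, _ = listener.accept()
                conn.close()
    for _ in range(nprocs):
        os.wait()

def client(count):
    start = time.time()
    for _ in range(count):
        s = socket.create_connection(("127.0.0.1", PORT))
        s.recv(16)                # returns b"" when the server closes
        s.close()
    elapsed = time.time() - start
    print("%d connections in %.3fs, %.1f usec each"
          % (count, elapsed, elapsed * 1e6 / count))

if sys.argv[1] == "server":
    server(int(sys.argv[2]))
else:
    client(int(sys.argv[2]))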