2012-10-11
Controlling Linux TCP socket send buffer sizes
Suppose that you are dealing with some piece of code that uses TCP sockets and does not set an explicit send buffer size. You vaguely remember that the default send buffer size can be tuned, but when you go looking, nothing immediately obvious turns up. This is the situation I found myself in recently.
It turns out there are two answers. The first one is
/proc/sys/net/ipv4/tcp_wmem, which is mostly written up in
Documentation/networking/ip-sysctl.txt in
the kernel source. You want to change the middle number (which is
in bytes). Note that this is not documented where you might expect
to find it, in Documentation/sysctl/net.txt (although
Documentation/sysctl/README sort of has a pointer to ip-sysctl.txt).
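For concreteness, here's a sketch of inspecting and changing it with sysctl; the specific numbers below are purely illustrative, not a recommendation:

    # min, default, and max send buffer sizes, in bytes
    cat /proc/sys/net/ipv4/tcp_wmem
    # change only the middle number; keep your system's existing min and max
    sysctl -w net.ipv4.tcp_wmem='4096 262144 4194304'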
The second answer, the one I found out the hard way, is that you probably don't want to change this even if you think you do. You see, the low default number doesn't matter in practice because the Linux kernel auto-tunes the TCP send buffer size on the fly (unless your code explicitly sets the send buffer size). Assuming that your network is working right, a TCP socket's send buffer size will normally open right up to the amount of data that the connection can handle and your code should never notice any cases of insufficient send buffer size.
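(If you want to watch this auto-tuning happen, one way, assuming a reasonably modern ss, is to look at live sockets' memory details; the tb figure in the skmem output is a socket's current send buffer size in bytes:

    # -t TCP sockets, -m socket memory usage, -i internal TCP information
    ss -tmi

)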
The corollary is that unless you are doing something unusual with your networking, noticing an insufficient send buffer size is actually a sign of underlying network problems. The reason your send buffer is too small is that something is going on in the network that makes the kernel think it can't tune the buffer size up. You can increase the send buffer size anyways by increasing the default (even increasing it drastically), but all this will do is push the problem down a layer where you can't see it as easily any more. You'll still have a network problem. What you really want to do is find the network problem.
(If you think that you don't have a network problem, you should be able to come up with a convincing explanation of why and how normal TCP send buffer size scaling doesn't work in your situation. Of course, as always experimental results trump my meanderings so feel free to try it even if you don't have such an explanation; just don't be surprised if nothing improves.)
2012-10-05
Notes on Linux's blktrace
Blktrace is a tool that captures detailed traces of what's happening
in the kernel's block layer. In other words, think of it as a
tcpdump for disk IO. This is just the kind of thing you need to
deal with the fact that averages are misleading; since blktrace can tell you about the
timing of every request, you can use it as the starting point for
all sorts of detailed analysis. Or you can simply use it to verify
that you don't have any timing outliers in your disk IO. Blktrace
requires that your kernel be vaguely modern and have support for
block tracing enabled, but both of these are routine. I think that
most distributions have prepackaged versions of it that are just a
package manager command away; if not, see here.
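If you're not sure whether your kernel has the support, one quick check (assuming your distribution leaves kernel config files in /boot) is:

    # blktrace needs CONFIG_BLK_DEV_IO_TRACE=y
    grep BLK_DEV_IO_TRACE /boot/config-$(uname -r)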
Blktrace needs no special support or configuration apart from having
debugfs mounted on /sys/kernel/debug, and if you don't have that
blktrace will tell you so you can fix it with a 'mount -t debugfs
debugfs /sys/kernel/debug'.
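As a minimal sketch of basic usage (the device name and output prefix here are just examples):

    # trace /dev/sdb for 30 seconds; writes sdb-trace.blktrace.<cpu> files
    blktrace -d /dev/sdb -w 30 -o sdb-trace
    # then turn the binary per-CPU trace files into readable output
    blkparse -i sdb-trace.blktrace.0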
My brief recent
experiences with blktrace have all been positive. It just works
and it doesn't seem to demand anything much from the system. I felt
comfortable applying it to a production machine and I didn't observe
any adverse effects from doing so.
Here are some scattered notes on it:
- the manpage you really want to read for information on what you can
get out of blktrace is the blkparse manpage. The blktrace manpage is less interesting.
Unfortunately the blkparse manpage doesn't completely document
the output you'll see. When in doubt, you'll have to read the
source.
- 'blktrace -k' apparently doesn't always work and sometimes leaves you unable to do more block tracing until you reboot. Avoid it; just ^C your blktrace commands.
- if you have a multi-CPU machine and blktrace has written multiple output files per disk, you should only specify one CPU's file for each disk in 'blkparse -i <file> ...'. blkparse will derive the other CPUs' files from your single one and load them all. If you innocently give all of the files in -i arguments, blkparse will process each of them repeatedly and print duplicate records in the trace data. Speaking from personal experience, this can lead to a fair degree of head-scratching.
(The easy way to do this is to just run it as 'blkparse [options] *.0'.)
- blkparse also produces a summary report at the end of the trace records. This is easy to miss if, like me, you make no attempt to read the entire trace output but instead start applying filtering via grep and awk and so on.
- The 'read depth' and 'write depth' numbers are the maximum queue depth ever observed. They say nothing about the average (or the distribution of values).
- the IO completion time deltas produced by 'blkparse -t' are relative to when the request was issued to the hardware driver. If you care enough about this, it's easy enough to change the blkparse code to do whatever you need. (In an ideal world there would be a command line option for this.)
- if you just use 'blkparse -t' without specifying any formats, the time deltas are in nanoseconds, not the microseconds that you'll get with a custom format and the %u formatting option (there's a sketch of this after these notes).
- in general, the default output for the various sorts of request records contains stuff that you currently can't duplicate with format strings.
- the 'sequence numbers' are not unique, and I've seen the clock at least seem to go backwards (I was using a custom format, so maybe that had something to do with it). However, the order that blkparse prints things in seems to be the right one.
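Here is a rough sketch of using -t with a custom format; the particular format string is just an example, and the full list of format specifiers is in the blkparse manpage:

    # %T.%t timestamp, %p PID, %a action, %d RWBS field,
    # %S starting sector, %n number of blocks,
    # %u the -t time delta in microseconds
    blkparse -t -f "%T.%t %p %a %d %S + %n (%u)\n" -i sdb-trace.blktrace.0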
See also this blog entry on it from 2009 (via @spyced).
I may have more notes and commentary later, when the dust around here has settled and I've explored blktrace more. But on the positive side, it was very easy to use blktrace and blkparse to verify that our problem was not with the disks and being masked by an average.
(Never let it be said that I don't take requests. In fact I love requests since they mean I don't have to come up with entry ideas myself.)