Wandering Thoughts archives

2012-10-11

Controlling Linux TCP socket send buffer sizes

Suppose that you are dealing with some piece of code that uses TCP sockets and does not set an explicit send buffer size. You vaguely remember that the default send buffer size can be tuned, but nothing immediately obvious turns up. This is the situation I found myself in recently.

It turns out there are two answers. The first one is /proc/sys/net/ipv4/tcp_wmem, which is mostly written up in Documentation/networking/ip-sysctl.txt in the kernel source. You want to change the middle number (which is in bytes). Note that this is not documented where you might expect to find it, in Documentation/sysctl/net.txt (although Documentation/sysctl/README sort of has a pointer to ip-sysctl.txt).
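
As a quick sketch of looking at the current settings (the specific numbers mentioned in the comments are purely illustrative, not recommendations), something like this Python works:

    # Sketch: report Linux's TCP send buffer autotuning limits. The file
    # holds three byte counts: the minimum, the default (initial) size,
    # and the maximum the kernel will grow a send buffer to.
    with open("/proc/sys/net/ipv4/tcp_wmem") as f:
        tmin, tdefault, tmax = (int(x) for x in f.read().split())
    print("tcp_wmem: min=%d default=%d max=%d bytes" % (tmin, tdefault, tmax))

To change the default you write all three numbers back, for example with something like 'sysctl -w net.ipv4.tcp_wmem="4096 131072 4194304"' as root (again, those particular numbers are just an illustration).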

The second answer, the one I found out the hard way, is that you probably don't want to change this even if you think you do. You see, the low default number doesn't matter in practice because the Linux kernel auto-tunes the TCP send buffer size on the fly (unless your code explicitly sets the send buffer size). Assuming that your network is working right, a TCP socket's send buffer size will normally open right up to the amount of data that the connection can handle and your code should never notice any cases of insufficient send buffer size.
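
To illustrate the difference, here's a small Python sketch; the 256 KB figure is an arbitrary example, not a recommendation:

    import socket

    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Left alone, this socket's send buffer starts at tcp_wmem's default
    # value and the kernel grows it as the connection needs it.
    # Explicitly setting SO_SNDBUF pins the size and turns autotuning off
    # for this socket.
    s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 256 * 1024)
    # Linux reports back roughly double what you asked for, because it
    # reserves extra space for kernel bookkeeping.
    print(s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF))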

The corollary is that unless you are doing something unusual with your networking, noticing an insufficient send buffer size is actually a sign of underlying network problems. The reason your send buffer is too small is that something is going on in the network that makes the kernel think it can't tune the buffer size up. You can increase the send buffer size anyway by increasing the default (even increasing it drastically), but all this will do is push the problem down a layer where you can't see it as easily any more. You'll still have a network problem. What you really want to do is find the network problem.

(If you think that you don't have a network problem, you should be able to come up with a convincing explanation of why and how normal TCP send buffer size scaling doesn't work in your situation. Of course, as always experimental results trump my meanderings so feel free to try it even if you don't have such an explanation; just don't be surprised if nothing improves.)

TCPSendbufferDefaultSize written at 00:28:31

2012-10-05

Notes on Linux's blktrace

Blktrace is a tool that captures detailed traces of what's happening in the kernel's block layer. In other words, think of it as a tcpdump for disk IO. This is just the kind of thing you need to deal with the fact that averages are misleading; since blktrace can tell you about the timing of every request, you can use it as the starting point for all sorts of detailed analysis. Or you can simply use it to verify that you don't have any timing outliers in your disk IO.

Blktrace requires that your kernel be vaguely modern and have support for block tracing enabled, but both of these are routine. I think that most distributions have prepackaged versions of it that are just a package manager command away; if not, see here. Blktrace needs no special support or configuration apart from having debugfs mounted on /sys/kernel/debug, and if you don't have that blktrace will tell you so you can fix it with a 'mount -t debugfs debugfs /sys/kernel/debug'.
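
If you want to drive a capture from a script, here's a rough Python sketch (it assumes root, and the device name and capture duration are arbitrary examples):

    # Sketch: mount debugfs if needed, then capture 30 seconds of block IO
    # traces for one device into ./sda.blktrace.<cpu> files, one per CPU.
    import os, subprocess

    DEV = "/dev/sda"     # example device; pick your own
    if not os.path.ismount("/sys/kernel/debug"):
        subprocess.check_call(["mount", "-t", "debugfs", "debugfs",
                               "/sys/kernel/debug"])
    # '-w 30' makes blktrace stop on its own after 30 seconds instead of
    # needing a ^C.
    subprocess.check_call(["blktrace", "-w", "30", "-d", DEV])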

My brief recent experiences with blktrace have all been positive. It just works and it doesn't seem to demand anything much from the system. I felt comfortable applying it to a production machine and I didn't observe any effects of doing so.

Here are some scattered notes on it:

  • the manpage you really want to read for information on what you can get out of blktrace is the blkparse manpage. The blktrace manpage is less interesting. Unfortunately the blkparse manpage doesn't completely document the output you'll see. When in doubt, you'll have to read the source.

  • 'blktrace -k' apparently doesn't always work and sometimes leaves you unable to do more block tracing until you reboot. Avoid it; just ^C your blktrace commands.

  • if you have a multi-CPU machine and blktrace has written multiple output files per disk, you should only specify one CPU's file for each disk in 'blkparse -i <file> ...'. blkparse will derive the other CPUs' files from your single one and load them all. If you innocently give all of the files as -i arguments, blkparse will process each of them repeatedly and print duplicate records in the trace data. Speaking from personal experience, this can lead to a fair degree of head-scratching.

    (The easy way to do this is to just run it as 'blkparse [options] *.0'.)

  • blkparse also produces a summary report at the end of the trace records. This is easy to miss if, like me, you make no attempt to read the entire trace output but instead start applying filtering via grep and awk and so on.

  • The 'read depth' and 'write depth' numbers are the maximum queue depth ever observed. They say nothing about the average (or the distribution of values).

  • the IO completion time deltas produced by blkparse -t are relative to when the request was issued to the hardware driver. If you want deltas relative to some other point, it's easy enough to change the blkparse code to do whatever you need, or to compute the deltas yourself from the trace output (see the sketch after this list). (In an ideal world there would be a command line option for this.)

  • if you just use 'blkparse -t' without specifying any formats, the time deltas are in nanoseconds, not the microseconds that you'll get with a custom format and the %u formatting option.

  • in general, the default output for the various sorts of request records contains stuff that you currently can't duplicate with format strings.

  • the 'sequence numbers' are not unique, and I've seen the clock at least seem to go backwards (I was using a custom format, so maybe that had something to do with it). However, the order that blkparse prints things in seems to be the right one.
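
Since the raw traces are verbose, it can be simpler to compute the timing you care about yourself rather than patching blkparse. Here is a small Python sketch that reads default-format blkparse output and reports the issue-to-completion (D to C) delta for each IO; the field positions are an assumption about the default output layout, so check them against your own traces, and 'd2c.py' is just whatever you name the script. Changing the 'D' to a 'Q' would give you queue-to-completion times instead.

    # Sketch: read default-format blkparse output on stdin and print the
    # issue-to-completion (D -> C) time per IO, matched on device plus
    # starting sector. Assumes lines shaped like
    #   maj,min cpu seq timestamp pid action rwbs sector + blocks ...
    # and skips anything else (PDU lines, the end-of-run summary, etc).
    # Usage: blkparse [options] sda.blktrace.0 | python3 d2c.py
    import sys

    issued = {}
    for line in sys.stdin:
        fields = line.split()
        if len(fields) < 8 or fields[5] not in ("D", "C"):
            continue
        dev, tstamp, action, sector = fields[0], fields[3], fields[5], fields[7]
        if not sector.isdigit():
            continue
        key = (dev, sector)
        if action == "D":
            issued[key] = float(tstamp)
        elif key in issued:
            print("%s sector %s: %.6f secs" % (dev, sector,
                                               float(tstamp) - issued.pop(key)))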

See also this blog entry on it from 2009 (via @spyced).

I may have more notes and commentary later, when the dust around here has settled and I've explored blktrace more. But on the positive side, it was very easy to use blktrace and blkparse to verify that our problem was not with the disks and being masked by an average.

(Never let it be said that I don't take requests. In fact I love requests since they mean I don't have to come up with entry ideas myself.)

BlktraceNotes written at 02:07:45

