10G Ethernet and network buffer sizes (at least on Linux)
I spent a chunk of today checking out the performance of 10G networking on what we hope will be our new iSCSI backend hardware, and in the process I made an interesting discovery: at 10G speeds, the size of your program's network buffers can matter. In fact it can matter a lot.
I'm used to a network environment where it doesn't really matter how
much data you write() to the network at once as long as it's not stupidly
small. Going much smaller than the 1500 byte MTU on 1G Ethernet will
cost you TCP performance but that's about it. Certainly once you hit the
MTU you're in the clear and any competent modern machine and network
will deliver wire bandwidth. With 10G Ethernet this doesn't seem to
be true at all. Getting close to theoretical wire speeds even in a
relatively tight sending loop seems to require network buffer sizes that
are much bigger than the MTU. I was seeing performance increases all the
way out to 128 KB.
(128 KB is the interesting size for me because 128 KB is the common ZFS block size and thus the usual size for ZFS disk IOs. iSCSI overhead makes that a bit more than 128 KB on the wire but not that much.)
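The test loop itself doesn't need to be anything fancy. A minimal sketch of this kind of sender, in Python, looks something like the following; the host, port, and byte counts are all hypothetical placeholders, not my actual test setup:

```python
# Sketch of a bulk-send throughput test: connect to a sink that just
# drains data, send a fixed total with a given write buffer size, and
# report the achieved bandwidth. HOST/PORT are hypothetical.
import socket
import time

def measure(host, port, bufsize, total=1 << 30):
    buf = b'\0' * bufsize
    sent = 0
    with socket.create_connection((host, port)) as s:
        start = time.monotonic()
        while sent < total:
            # Each sendall() is one "write buffer" worth of data; varying
            # bufsize is what changes how often we re-enter the kernel.
            s.sendall(buf)
            sent += len(buf)
        elapsed = time.monotonic() - start
    return sent / elapsed  # bytes per second

# Typical sweep over buffer sizes, from one MTU up to a ZFS block:
# for size in (1500, 4096, 16384, 65536, 131072):
#     print(size, measure("10.0.0.2", 9999, size))
```

On the receiving end all you need is something that accepts the connection and reads as fast as it can, so the sender is the bottleneck being measured.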
I don't know exactly why this is so, but one obvious theory is that 10G Ethernet transmit speeds are so ferociously fast that you need relatively large buffer sizes to keep the operating system's transmit buffers filled. If I'm doing the math right, it takes about 12 microseconds to transmit a 1500 byte Ethernet packet on 1G Ethernet; in order to keep the network saturated, your program needs to push that much write data into the kernel every 12 microseconds to avoid stalling the network. By contrast, 10G will write about 15 KB to the network over that same 12 microseconds, so you need to push ten times more data into the kernel to keep the network fed and avoid stalls.
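The arithmetic here is simple enough to check directly (this is just the back-of-the-envelope math from above, not anything measured):

```python
# Wire time for a frame, ignoring preamble/interframe-gap overhead.
def wire_time(nbytes, bits_per_sec):
    return nbytes * 8 / bits_per_sec  # seconds

t_1g = wire_time(1500, 1e9)       # one 1500-byte frame on 1G: ~12 us
data_10g = 10e9 / 8 * t_1g        # bytes 10G moves in that same window

print(t_1g * 1e6)    # ~12 (microseconds)
print(data_10g)      # 15000 bytes, i.e. ten 1500-byte frames
```

So in the time 1G takes to emit one MTU-sized packet, 10G has emitted ten of them, which is why the kernel needs roughly ten times the data on hand.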
(This assumes that the transmit paths for 10G are no more efficient than for 1G. This may be a bad assumption; I wouldn't be surprised if there is more hardware acceleration and thus less kernel overhead for 10G.)
If my theory is right what matters here is not your program's total bandwidth into the kernel but latency. If you write with small buffers you must provide more data to the kernel almost immediately in order to avoid a stall. If you write with large buffers you buy yourself latency insurance and can afford more of a delay before your program gives the kernel more data. I'm not entirely satisfied with this explanation but it's the best I have right now.
(It may be lucky that my test sending program is not necessarily the absolutely fastest thing that it could be and has some degree of extra overhead because of how I'm using it. If I'm right I wouldn't necessarily have seen this effect with a really fast and minimal test program.)