10G Ethernet and network buffer sizes (at least on Linux)

October 26, 2013

I spent a chunk of today checking out the performance of 10G networking on what we hope will be our new iSCSI backend hardware and in the process I made an interesting discovery: at 10G speeds, the size of your program's network buffers can matter. In fact it can matter a lot.

I'm used to a network environment where it doesn't really matter how much you write() to the network at once as long as it's not stupidly small. Going much smaller than the 1500 byte MTU on 1G Ethernet will cost you TCP performance but that's about it. Certainly once you hit the MTU you're in the clear and any competent modern machine and network will deliver wire bandwidth. With 10G Ethernet this doesn't seem to be true at all. Getting close to theoretical wire speeds even in a relatively tight sending loop seems to require network buffer sizes that are much bigger than the MTU. I was seeing performance increases all the way out to 128 KB write() buffers.

(128 KB is the interesting size for me because 128 KB is the common ZFS block size and thus the usual size for ZFS disk IOs. iSCSI overhead makes that a bit more than 128 KB on the wire but not that much.)
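For anyone who wants to poke at the effect of write size themselves, here is a minimal sketch of a sending loop in Python. It is not the test program used for the numbers above; the names `drain` and `measure` are made up for illustration, and it runs over loopback TCP, which has no NIC and a much larger MTU than Ethernet. Treat its results as a sanity check of the method, not as a stand-in for real 10G hardware.

```python
import socket
import threading
import time


def drain(conn):
    """Read and discard everything from conn until the sender closes its side."""
    while conn.recv(1 << 20):
        pass
    conn.close()


def measure(write_size, total_bytes=32 * 1024 * 1024):
    """Send about total_bytes over loopback TCP in write_size chunks.

    Returns the observed rate in bytes per second.
    """
    # Assumption: loopback stands in for the real network path.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    sender = socket.create_connection(listener.getsockname())
    receiver, _ = listener.accept()
    listener.close()

    reader = threading.Thread(target=drain, args=(receiver,))
    reader.start()

    buf = bytes(write_size)  # one reusable buffer of zeros
    sent = 0
    start = time.perf_counter()
    while sent < total_bytes:
        # sendall() returns once the kernel has accepted the whole chunk.
        sender.sendall(buf)
        sent += write_size
    sender.shutdown(socket.SHUT_WR)  # signal EOF so the reader thread stops
    reader.join()
    elapsed = time.perf_counter() - start
    sender.close()
    return sent / elapsed
```

Calling `measure(size)` for sizes such as 1500, 16 * 1024 and 128 * 1024 and comparing the returned rates gives a rough picture of how much per-write overhead costs on a given machine. To test an actual 10G link you would point the sender at a receiver on another host instead of 127.0.0.1.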

I don't know exactly why this is so, but one obvious theory is that 10G Ethernet transmit speeds are so ferociously fast that you need relatively large buffer sizes to keep the operating system's transmit buffers filled. If I'm doing the math right, it takes about 12 microseconds to transmit a 1500 byte Ethernet packet on 1G Ethernet; in order to keep the network saturated your program needs to push around that much write data into the kernel every 12 microseconds to avoid stalling the network. By contrast 10G will write about 15 KB to the network over that same 12 microseconds, so you need to push about ten times more data into the kernel to keep the network fed and avoid stalls.

(This assumes that the transmit paths for 10G are no more efficient than for 1G. This may be a bad assumption; I wouldn't be surprised if there is more hardware acceleration and thus less kernel overhead for 10G.)
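The back-of-the-envelope arithmetic can be checked with a few lines of Python. This is only a sketch: the function names are invented here, and it ignores Ethernet framing overhead (preamble, headers, checksum and inter-frame gap), just as the rough numbers above do.

```python
def transmit_time_us(nbytes, link_bps):
    """Microseconds needed to put nbytes on a link running at link_bps bits/second."""
    return nbytes * 8 / link_bps * 1e6


def bytes_in_interval(interval_us, link_bps):
    """Bytes a link running at link_bps bits/second transmits in interval_us microseconds."""
    return link_bps * interval_us / 1e6 / 8


ONE_GIG = 1e9   # 1G Ethernet, bits per second
TEN_GIG = 10e9  # 10G Ethernet, bits per second

frame_time = transmit_time_us(1500, ONE_GIG)          # about 12 microseconds
ten_gig_bytes = bytes_in_interval(frame_time, TEN_GIG)  # about 15,000 bytes
print(f"1500 bytes at 1G: {frame_time:.1f} us")
print(f"10G sends {ten_gig_bytes:.0f} bytes in that time")
```

A 1500 byte packet is 12,000 bits, which is where the 12 microseconds at 1G comes from; 10G moves ten times as many bits in the same interval.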

If my theory is right what matters here is not your program's total bandwidth into the kernel but latency. If you write with small buffers you must provide more data to the kernel almost immediately in order to avoid a stall. If you write with large buffers you buy yourself latency insurance and can afford more of a delay before your program gives the kernel more data. I'm not entirely satisfied with this explanation but it's the best I have right now.

(It may be lucky that my test sending program is not necessarily the absolutely fastest thing that it could be and has some degree of extra overhead because of how I'm using it. If I'm right I wouldn't necessarily have seen this effect with a really fast and minimal test program.)
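One way to make the latency theory concrete is a toy model. Assume, purely hypothetically, that the sending program can hand the kernel one write every fixed number of microseconds no matter how large the write is. Then the fraction of the link it can keep busy is the wire time of one write divided by that per-write delay, capped at 100%. The function name and the 12 microsecond delay below are illustrative assumptions, not measurements.

```python
def utilization(write_bytes, per_write_delay_us, link_bps=10e9):
    """Fraction of link capacity a sender can fill in this toy model.

    Assumes the program delivers one write of write_bytes every
    per_write_delay_us microseconds, regardless of the write's size,
    and ignores Ethernet framing overhead.
    """
    wire_time_us = write_bytes * 8 / link_bps * 1e6
    return min(1.0, wire_time_us / per_write_delay_us)


# Hypothetical per-write delay of 12 microseconds on a 10G link.
for size in (1500, 16 * 1024, 128 * 1024):
    print(f"{size:>7} byte writes: {utilization(size, 12):.0%} of the link")
```

In this model a program with a 12 microsecond per-write delay fills only about a tenth of a 10G link with 1500 byte writes, while 128 KB writes leave it plenty of slack. A faster, more minimal sender corresponds to a smaller delay, which is consistent with the guess that such a program might not show the effect.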


Comments on this page:

The "netmap" project has shown that kernels could get a lot more performance out of high-speed NICs by cleaning up the code and data structures:

http://info.iet.unipi.it/~luigi/netmap/

By James (trs80) at 2013-10-26 23:41:51:

Speaking of buffers, have a look at the buffer size of your switches. The recommendation I've heard is to not have two switches with small buffers connected. This may be more of an issue for network designs like leaf/spine, and blade chassis where the internal switches have small buffers.

Written on 26 October 2013.

Last modified: Sat Oct 26 02:01:55 2013