A basic step in measuring and improving network performance

May 13, 2012

There is a mistake that I have seen people make over and over again when they attempt to improve, tune, or even check network performance under unusual circumstances. Although what set me off now is this well-intentioned article, I've seen the same mistake in people setting off to improve their iSCSI performance, NFS performance, and probably any number of other things that I've forgotten by now.

The mistake is skipping the most important basic step of network performance testing: the first thing you have to do is make sure that your network is working right. Before you can start tuning to improve your particular case or start measuring the effects of different circumstances, you need to know that your base case is not suffering from performance issues of its own. If you skip this step, you are building all future results on a foundation of sand, and none of them are terribly meaningful.

(They may be very meaningful for you in that they improve your system's performance right now, but if your baseline performance is not up to what it should be it's quite possible that you could do better by addressing that.)

In the very old days, the correct base performance level you could expect was somewhat uncertain and variable; getting networks to run fast was challenging for various reasons. Fortunately those days have long since passed. Today we have a very simple performance measure, one valid for any hardware and OS from at least the past half decade if not longer:

Any system can saturate a gigabit link with TCP traffic.

As I've written before in passing, if you have two machines with gigabit Ethernet talking directly to each other on a single subnet you should be able to get gigabit wire rates between them (approximately 110 MBytes/sec) with simple testing tools like ttcp. If you cannot get this rate between your two test machines, something is wrong somewhere and you need to fix it before there's any point in going further.

(There are any number of places where the problem could be, but one definitely exists.)
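As a sanity check on that 110 MBytes/sec figure, the theoretical maximum TCP payload rate on gigabit Ethernet can be worked out from the per-frame overheads. Here's a sketch of the arithmetic (assuming a standard 1500-byte MTU and plain 20-byte IPv4 and TCP headers with no options):

```python
# Theoretical TCP goodput on gigabit Ethernet with a 1500-byte MTU.
LINE_RATE = 1_000_000_000           # bits/sec on the wire
MTU = 1500                          # IP packet size per Ethernet frame
IP_TCP_HEADERS = 20 + 20            # IPv4 + TCP headers (no TCP options)
# Per-frame wire overhead: preamble (8) + Ethernet header (14) +
# FCS (4) + inter-frame gap (12) = 38 bytes.
WIRE_OVERHEAD = 8 + 14 + 4 + 12

payload = MTU - IP_TCP_HEADERS                  # 1460 data bytes per frame
wire_bytes = MTU + WIRE_OVERHEAD                # 1538 bytes on the wire per frame
goodput = LINE_RATE / 8 * payload / wire_bytes  # data bytes per second

print(f"{goodput / 1e6:.1f} MB/sec")    # ~118.7 MB/sec (decimal megabytes)
print(f"{goodput / 2**20:.1f} MiB/sec") # ~113.2 MiB/sec
```

TCP timestamps (12 more bytes of options, common on modern stacks) shave this down slightly, so a measured rate around 110 MBytes/sec really is effectively wire rate.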

I don't have an answer for what the expected latency should be (as measured either by ping or by some user-level testing tool), beyond that it should be negligible. Our servers range from around 150 microseconds down to 10 microseconds, but there's other traffic going on, multiple switch hops, and so on. Bulk TCP tends to smooth all of that out, which is part of why I like it for this sort of basic test.
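For what it's worth, a user-level RTT measurement is easy to improvise. Here's a minimal Python sketch that bounces one byte off an echo server; it runs over loopback, so it measures stack and scheduling overhead rather than real network latency, but pointed at a remote echo service the same code measures the real thing:

```python
import socket
import threading
import time

def echo_server(listener):
    # Accept one connection and echo everything back until it closes.
    conn, _ = listener.accept()
    with conn:
        while data := conn.recv(1024):
            conn.sendall(data)

srv = socket.socket()
srv.bind(("127.0.0.1", 0))          # pick any free port
srv.listen(1)
threading.Thread(target=echo_server, args=(srv,), daemon=True).start()

cli = socket.create_connection(srv.getsockname())
# Disable Nagle so our tiny probe packets go out immediately.
cli.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

samples = []
for _ in range(100):
    t0 = time.perf_counter()
    cli.sendall(b"x")
    cli.recv(1)
    samples.append(time.perf_counter() - t0)
cli.close()

rtt_us = sorted(samples)[len(samples) // 2] * 1e6
print(f"median RTT: {rtt_us:.0f} microseconds")
```

Taking the median rather than the mean keeps one scheduler hiccup from distorting the result.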

As a side note, a properly functioning local network has basically no packet loss whatsoever. If you see any more than a trace amount, you have a problem (which may be that your network, switches, or switch uplinks are oversaturated).

The one area today where there's real uncertainty in the proper base performance is 10G networking; we have not yet mastered the art of casually saturating 10G networks and may not for a while. If you have 10G networks you are going to have to do your own tuning and measurements of basic network performance before you start with higher level issues, and you may have to deliberately tune for your specific protocol and situation in a way that makes other performance worse.

Comments on this page:

From at 2012-05-14 21:46:46:

As someone who's tried to saturate a gig network recently, it isn't always as easy as you make it sound. If you're using a pair of machines for send/receive, they should have their buffer sizes increased to make sure they can source/sink the volume of traffic (especially if you have a large RTT). Additionally, you should set the packet size to something large so you're sending as few (big) packets as possible.

A modern machine with big buffers and a large packet size can indeed flood a gig network without breaking a sweat. But if you use small packets (64 bytes, for example), you'd need to send 2 million per second to saturate a gig link, and most software network stacks can't generate this kind of traffic.
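The packet-rate arithmetic here is easy to check (a sketch; the round 2 million figure comes from ignoring per-frame wire overhead):

```python
LINE_RATE = 1_000_000_000   # bits/sec
FRAME = 64                  # minimum Ethernet frame size, bytes
# The preamble (8 bytes) and inter-frame gap (12 bytes) also occupy the wire.
WIRE_OVERHEAD = 8 + 12

naive_pps = LINE_RATE / (FRAME * 8)                   # ~1.95M packets/sec
true_pps = LINE_RATE / ((FRAME + WIRE_OVERHEAD) * 8)  # ~1.49M packets/sec

print(f"{naive_pps:,.0f} pps naive, {true_pps:,.0f} pps at true line rate")
```

Either way it's on the order of 1.5 to 2 million packets per second, far beyond what most software network stacks of this era can generate.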

Using your recommendation of TTCP, you'd want to say:

 ttcp -t -s -l 1440 -b 2097152 -n 2000000 targethost.example.com

That sets the packet size to 1440 (default is 8192, which is fine for TCP but too big for UDP on standard 1500-byte Ethernet), and increases the buffer size to 2MB instead of the system default (which might be only 32 kB!). We spent a while one time blaming our network gear when it turned out the endpoints weren't tuned properly to send/receive the volume of traffic we wanted to test...
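In code terms, the buffer tuning being described here is the SO_SNDBUF/SO_RCVBUF socket options, which is what ttcp's -b flag sets under the hood. A minimal Python sketch of the idea (the 256 kB request is just an illustrative size):

```python
import socket

REQUEST = 256 * 1024   # ask for a 256 kB send buffer (illustrative size)

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, REQUEST)
# Always read the value back: on Linux the request is silently capped at
# net.core.wmem_max, and the kernel reports double the granted value
# because it counts bookkeeping overhead in the buffer size.
granted = s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
print(f"asked for {REQUEST} bytes, kernel granted {granted}")
s.close()
```

The silent capping is exactly the kind of trap that leads to blaming the network gear: you think you asked for a big buffer, but the kernel quietly gave you something much smaller.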


By cks at 2012-05-14 23:05:37:

My experience is that plain TCP on ordinary machines (especially Linux) can saturate gigabit Ethernet with no special tuning needed. Ttcp itself generally needs an increased buffer count (-n parameter) in order for a test to run long enough to give you a real feel for the bandwidth, but that's it.

(My qualification is 'on the same subnet'; once you get into multi-hop setups things get much more tangled. This also assumes no other traffic going on at the same time.)

My personal view is that any modern OS that is not autotuned for gigabit networking in terms of OS level buffering and so on is broken, but I suppose that this is a bias (based on not using such systems).
