A basic step in measuring and improving network performance
There is a mistake that I have seen people make over and over again when they attempt to improve, tune, or even check network performance under unusual circumstances. Although what set me off now is this well-intentioned article, I've seen the same mistake in people setting off to improve their iSCSI performance, NFS performance, and probably any number of other things that I've forgotten by now.
The mistake is skipping the most important basic step of network performance testing: the first thing you have to do is make sure that your network is working right. Before you can start tuning to improve your particular case or start measuring the effects of different circumstances, you need to know that your base case is not suffering from performance issues of its own. If you skip this step, you are building all future results on a foundation of sand, and none of them will be terribly meaningful.
(They may be very meaningful for you in that they improve your system's performance right now, but if your baseline performance is not up to what it should be it's quite possible that you could do better by addressing that.)
In the very old days, the correct base performance level you could expect was somewhat uncertain and variable; getting networks to run fast was challenging for various reasons. Fortunately, those days have long since passed. Today we have a very simple performance measure, one valid for any hardware and OS from at least the past half decade if not longer:
Any system can saturate a gigabit link with TCP traffic.
As I've written before in passing, if you have two machines with gigabit Ethernet talking directly to each other on a single subnet you should be able to get gigabit wire rates between them (approximately 110 MBytes/sec) with simple testing tools like ttcp. If you cannot get this rate between your two test machines, something is wrong somewhere and you need to fix it before there's any point in going further.
(There are any number of places where the problem could be, but one definitely exists.)
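To make this concrete, here is a minimal sketch of the kind of ttcp-style bulk TCP test I mean, written in Python; the port number, the one gigabyte transfer size, and the 'receive'/'send' command interface are all illustrative assumptions of mine, not any particular tool's behavior. You run it in receive mode on one machine, point send mode at that machine from the other, and both ends report what they saw.

    #!/usr/bin/env python3
    # Minimal ttcp-style bulk TCP throughput check (a sketch, not a real tool).
    # Run it with "receive" on one machine, then "send <host>" on the other;
    # both ends report decimal MBytes/sec when the transfer finishes.
    import socket, sys, time

    PORT = 5001            # arbitrary test port (assumption)
    TOTAL = 1 << 30        # move 1 GiB of data per test
    CHUNK = 64 * 1024      # 64 KiB reads and writes

    def receive():
        with socket.create_server(("", PORT)) as srv:
            conn, _ = srv.accept()
            with conn:
                got, start = 0, time.monotonic()
                while True:
                    data = conn.recv(CHUNK)
                    if not data:
                        break
                    got += len(data)
                secs = time.monotonic() - start
        print(f"received {got} bytes in {secs:.2f}s: {got / secs / 1e6:.1f} MBytes/sec")

    def send(host):
        buf = b"\0" * CHUNK
        with socket.create_connection((host, PORT)) as conn:
            sent, start = 0, time.monotonic()
            while sent < TOTAL:
                conn.sendall(buf)
                sent += len(buf)
            secs = time.monotonic() - start
        print(f"sent {sent} bytes in {secs:.2f}s: {sent / secs / 1e6:.1f} MBytes/sec")

    if __name__ == "__main__":
        receive() if sys.argv[1] == "receive" else send(sys.argv[2])

The number you care about most is the receiver's; on a healthy gigabit path between two otherwise idle machines it should come out at or very near wire rate.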
I don't have an answer for what the expected latency should be (as measured either by ping or by some user-level testing tool), beyond that it should be negligible. Our servers range from around 150 microseconds down to 10 microseconds, but there's other traffic going on, multiple switch hops, and so on. Bulk TCP tends to smooth all of that out, which is part of why I like it for this sort of basic test.
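If you want a crude user-level latency number of your own anyway, a sketch along these lines will give you TCP round trip times; again the port and the plain echo loop are my assumptions, not how any real latency tool is built.

    #!/usr/bin/env python3
    # Crude TCP round trip time check (a sketch). Run it with "echo" on one
    # machine and "probe <host>" on the other; it prints the median of many
    # small one-byte round trips.
    import socket, statistics, sys, time

    PORT = 5002            # arbitrary test port (assumption)

    def echo():
        with socket.create_server(("", PORT)) as srv:
            conn, _ = srv.accept()
            with conn:
                while (data := conn.recv(64)):
                    conn.sendall(data)

    def probe(host, count=1000):
        rtts = []
        with socket.create_connection((host, PORT)) as conn:
            # NODELAY so Nagle's algorithm can't interfere with the tiny writes.
            conn.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
            for _ in range(count):
                start = time.monotonic()
                conn.sendall(b"x")
                conn.recv(64)
                rtts.append(time.monotonic() - start)
        print(f"median RTT over {count} probes: "
              f"{statistics.median(rtts) * 1e6:.0f} microseconds")

    if __name__ == "__main__":
        echo() if sys.argv[1] == "echo" else probe(sys.argv[2])

It reports the median instead of the average because a few stray slow round trips can otherwise dominate the number.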
As a side note, a properly functioning local network has basically no packet loss whatsoever. If you see any more than a trace amount, you have a problem (which may be that your network, switches, or switch uplinks are oversaturated).
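A quick way to sanity-check that is to run a long ping and look at the loss percentage in its summary. Here's a small sketch that wraps the ordinary Linux ping; the exact summary wording it parses ("...% packet loss") is an assumption about iputils ping output.

    #!/usr/bin/env python3
    # Quick packet loss check: run a burst of pings and report the loss figure
    # from ping's own summary line. At the default one second interval this
    # takes roughly "count" seconds to run.
    import re, subprocess, sys

    def loss_percent(host, count=100):
        out = subprocess.run(["ping", "-q", "-c", str(count), host],
                             capture_output=True, text=True).stdout
        m = re.search(r"([\d.]+)% packet loss", out)
        return float(m.group(1)) if m else None

    if __name__ == "__main__":
        print(f"packet loss: {loss_percent(sys.argv[1])}%")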
The one area today where there's real uncertainty in the proper base performance is 10G networking; we have not yet mastered the art of casually saturating 10G networks and may not for a while. If you have 10G networks you are going to have to do your own tuning and measurements of basic network performance before you start with higher level issues, and you may have to deliberately tune for your specific protocol and situation in a way that makes other performance worse.