Today I (re-)learned that top's output can be quietly system dependent

November 30, 2018

I'll start with the background story. A few days ago I tweeted:

Current status: zfs send | zfs recv at 33 Mbytes/sec. This will take a while, and the server with SSDs and 10G networking is rather bored.

(It's not CPU-limited at either end and I don't think it's disk-limited. Maybe too many synchronous reads or something.)

I was wrong about this not being CPU-limited, as it turned out, and then Allan Jude had the winning suggestion:

Try adding '-c' to your SSH invocation.

See also: <pdf link>

(If you care about 10G+ SSH, you want to read that PDF.)

This made a huge difference, giving me basically 1G wire speeds for my ZFS transfers. But that difference made me scratch my head, because why was switching SSH ciphers making a difference when ssh wasn't CPU-limited in the first place? I came up with various theories and guesses, until today I had a sudden terrible suspicion. The result of testing and confirming that suspicion was another tweet:

Today I learned or re-learned a valuable lesson: in practice, top output is system dependent, in ways that are not necessarily obvious. For instance, CPU % on multi-CPU systems.

(On some systems, CPU % is the percent of a single CPU; on some it's a % of all CPUs.)

You see, the reason that I had confidently known that SSH wasn't CPU-limited on the sending machine, which was one of our OmniOS fileservers, is that I had run top and seen that the ssh process was only using 25% of the CPU. Case closed.

Except that OmniOS top and Linux's top report CPU usage percentages differently. On Linux, CPU percentage is relative to a single CPU, so 25% is a quarter of one CPU, 100% is all of it, and over 100% is a multi-threaded program that is using up more than one CPU's worth of CPU time. On OmniOS, the version of top we're using comes from pkgsrc (in what is by now a very old version), and that version reports CPU percentage relative to all CPUs in the machine. Our OmniOS fileservers are 4-CPU machines, so that '25% CPU' was actually 'all of a single CPU'. In other words, I was completely wrong about the sending ssh not being CPU-limited. Since ssh was CPU-limited after all, it's suddenly no surprise that switching ciphers sped things up to basically wire speed.
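To make the arithmetic concrete, here is a minimal Python sketch of converting between the two conventions. The function names are my own invention for illustration, not anything top provides, and you have to already know which convention your top uses and how many CPUs the machine has.

```python
def to_single_cpu_percent(pct_of_all_cpus: float, ncpus: int) -> float:
    """Convert a 'percent of all CPUs' figure (the convention our OmniOS
    top uses) into a 'percent of one CPU' figure (the Linux convention)."""
    if ncpus < 1:
        raise ValueError("ncpus must be at least 1")
    return pct_of_all_cpus * ncpus


def to_all_cpus_percent(pct_of_one_cpu: float, ncpus: int) -> float:
    """Convert a 'percent of one CPU' figure into 'percent of all CPUs'."""
    if ncpus < 1:
        raise ValueError("ncpus must be at least 1")
    return pct_of_one_cpu / ncpus


if __name__ == "__main__":
    # The case from this entry: ssh shown at 25% on a 4-CPU machine.
    print(to_single_cpu_percent(25.0, 4))
```

For the 4-CPU fileserver this turns the '25%' I saw into 100% of one CPU, which for an effectively single-threaded ssh means it has nothing left to give.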

(Years ago I established that the old SunSSH that OmniOS was using back then was rather slow, but then later we upgraded to OpenSSH and I sort of thought that I could not worry about SSH speeds any more. Well, I was wrong. Of course, nothing can beat not doing SSH at all but instead using, say, mbuffer. Using mbuffer also means that you can deliberately limit your transfer bandwidth to leave some room for things like NFS fileservice.)

PS: There are apparently more versions of top out there than you might think. On the FreeBSD 10.4 machine I have access to, top reports CPU percentage in the same way Linux does (100% is a single-threaded process using all of one CPU). Although both the FreeBSD version and our OmniOS version say they're the William LeFebvre implementation and have similar version numbers, apparently they diverged significantly at some point, probably when people had to start figuring out how to make the original version of top deal with multi-CPU machines.

Written on 30 November 2018.