Today I (re-)learned that top's output can be quietly system dependent
I'll start with a story that is the background. A few days ago I tweeted:
Current status: zfs send | zfs recv at 33 Mbytes/sec. This will take a while, and the server with SSDs and 10G networking is rather bored.
(It's not CPU-limited at either end and I don't think it's disk-limited. Maybe too many synchronous reads or something.)
I was wrong about this being disk-limited, as it turned out, and then Allan Jude had the winning suggestion:
Try adding '-c aes128-gcm@openssh.com' to your SSH invocation.
See also: <pdf link>
(If you care about 10G+ SSH, you want to read that PDF.)
This made a huge difference, giving me basically 1G wire speeds for
my ZFS transfers. But that difference made me scratch my head, because
why was switching SSH ciphers making a difference when ssh wasn't
CPU-limited in the first place? I came up with various theories and
guesses, until today I had a sudden terrible suspicion. The result of
testing and confirming that suspicion was another tweet:
Today I learned or re-learned a valuable lesson: in practice, top output is system dependent, in ways that are not necessarily obvious. For instance, CPU % on multi-CPU systems.
(On some systems, CPU % is the percent of a single CPU; on some it's a % of all CPUs.)
You see, the reason that I had confidently known that SSH wasn't
CPU-limited on the sending machine, which was one of our OmniOS
fileservers, is that I had run top and seen that the ssh process
was only using 25% of the CPU. Case closed.
Except that OmniOS top and Linux's top report CPU usage percentages
differently. On Linux, CPU percentage is relative to a single CPU,
so 25% is a quarter of one CPU, 100% is all of it, and over 100%
is a multi-threaded program that is using up more than one CPU's
worth of CPU time. On OmniOS, the version of top we're using comes
from pkgsrc (in what is by now a very old version), and that version
reports CPU percentage relative to all CPUs in the machine. Our
OmniOS fileservers are 4-CPU machines, so that '25% CPU' was actually
'all of a single CPU'. In other words, I was completely wrong about
the sending ssh not being CPU-limited.
Since ssh was CPU-limited after all, it's suddenly no surprise that
switching ciphers sped things up to basically wire speed.
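The difference between the two conventions is just a factor of the CPU count, which a few lines of Python can make concrete. This is only a sketch of the arithmetic; the function names are made up for illustration and the 2% tolerance is an arbitrary assumption, not anything top itself does.

```python
# Two conventions for a process's CPU%:
#   "per-CPU"  : 100 means all of one CPU (what Linux top shows)
#   "machine"  : 100 means all CPUs in the machine (what our old OmniOS top shows)
# All names here are hypothetical, for illustration only.

def machine_to_per_cpu(percent_of_machine, ncpus):
    """Convert a whole-machine CPU% into a percent of a single CPU."""
    return percent_of_machine * ncpus

def per_cpu_to_machine(percent_of_one_cpu, ncpus):
    """Convert a percent of a single CPU into a whole-machine CPU%."""
    return percent_of_one_cpu / ncpus

def uses_a_full_cpu(percent_of_machine, ncpus, tolerance=2.0):
    """True if a whole-machine CPU% is at least about one full CPU.

    The tolerance is an arbitrary allowance for sampling noise.
    """
    return machine_to_per_cpu(percent_of_machine, ncpus) >= 100.0 - tolerance

if __name__ == "__main__":
    # The ssh process from the story: 25% on a 4-CPU machine.
    print(machine_to_per_cpu(25.0, 4))   # 100.0
    print(uses_a_full_cpu(25.0, 4))      # True
```

Seen this way, '25%' on a 4-CPU machine is exactly what a single-threaded process pegged on one CPU looks like under the whole-machine convention.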
(Years ago I established that the old SunSSH that OmniOS was using
back then was rather slow, but then later we upgraded to OpenSSH and
I sort of thought that I could stop worrying about SSH speeds. Well,
I was wrong. Of course, nothing can beat not doing SSH at all but
instead using, say, mbuffer. Using mbuffer also means that you can
deliberately limit your transfer bandwidth to leave some room for
things like NFS fileservice.)
PS: There are apparently more versions of top than you might think.
On the FreeBSD 10.4 machine I have access to, top reports CPU
percentage in the same way Linux does (100% is a single-threaded
process using all of one CPU). Although both the FreeBSD version and
our OmniOS version say they're the William LeFebvre implementation
and have similar version numbers, apparently they diverged
significantly at some point, probably when people had to start
figuring out how to make the original version of top deal with
multi-CPU machines.