An interesting picky difference between Bourne shells
Today we ran into an interesting bug in one of our internal shell scripts. The script had worked for years on our Solaris 10 machines, but on a new OmniOS fileserver it suddenly reported an error:
script: [: 232G: arithmetic syntax error
Cognoscenti of ksh error messages have probably already recognized this one and can tell me the exact problem. To show it to everyone else, here is line 77:
if [ "$qsize" -eq "none" ]; then ....
In a strict POSIX shell, this is an error because the -eq
operator is specifically for comparing numbers, not strings. What
we wanted is the string equality operator, '='.
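As a minimal sketch of the fix (the variable and its value are taken from the script above), the comparison should use '=':

```shell
# String comparison in POSIX test uses '=', not -eq.
qsize="none"
if [ "$qsize" = "none" ]; then
    echo "no quota size set"     # prints "no quota size set"
fi
```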
What makes this error more interesting is that the script had been
running for some time on the OmniOS fileserver without this error.
However, until now the $qsize variable had always had the value
'none'. So why hadn't it failed earlier? After all, 'none' (on
either side of the expression) is just as much of not-a-number as
'232G' is.
The answer is that this is a picky difference between shells in
terms of how they actually behave. Bash, for example, always complains
about such misuse of -eq; if either side is not a number you get an
error saying 'integer expression expected' (as does Dash, with a
slightly different error). But on our OmniOS, /bin/sh is actually
ksh93, and ksh93 has a slightly different behavior. Here:
$ [ "none" -eq "none" ] && echo yes
yes
$ [ "bogus" -eq "none" ] && echo yes
yes
$ [ "none" -eq 0 ] && echo yes
yes
$ [ "none" -eq "232G" ] && echo yes
/bin/sh: [: 232G: arithmetic syntax error
The OmniOS version of ksh93 clearly has some sort of heuristic about
number conversions such that strings with no numbers are silently
interpreted as '0'. Only invalid numbers (as opposed to things that
aren't numbers at all) produce the 'arithmetic syntax error' message.
Bash and dash are both more straightforward about things (as are
/bin/sh implementations that derive from ash).
Update: my description isn't actually what ksh93 is doing here; per
opk's comment, it's actually interpreting the strings as variable
names and giving them a value of 0 when unset.
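You can see the same mechanism in any POSIX shell's arithmetic expansion, which is roughly what ksh93 is applying to -eq's operands: a bare word is looked up as a variable name, and an unset variable evaluates to 0.

```shell
# Arithmetic expansion treats a bare word as a variable name;
# an unset name silently evaluates to 0.
unset none
echo $(( none ))    # prints 0
none=5
echo $(( none ))    # prints 5
```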
Interestingly, the old Solaris 10 /bin/sh seems to basically run
atoi() on the arguments of -eq; the first three examples
work the same, the fourth is silently false, and '[ 232 -eq 232G ]'
is true. This matches the 'let's just do it' simple philosophy
of the original Bourne shell and test program and may be authentic
original V7 behavior.
(Technically this is a difference in test behavior, but test
is a builtin in basically all Bourne shells these days. Sometimes
the test program in /usr/bin is actually
a shell script that invokes the builtin.)
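A quick way to check which test you're getting is the shell's own type builtin (the exact output wording varies from shell to shell):

```shell
# 'type' reports whether a command resolves to a builtin;
# in most current Bourne shells both test and [ are builtins.
type test
type [
# command -v shows what 'test' resolves to as well.
command -v test
```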
The OmniOS version of SSH is kind of slow for bulk transfers
If you look at the manpage and so on, it's sort of obvious that the Illumos and thus OmniOS version of SSH is rather behind the times; Sun branched from OpenSSH years ago to add some features they felt were important and it has not really been resynchronized since then. It (and before it the Solaris version) also has transfer speeds that are kind of slow due to the SSH cipher et al overhead. I tested this years ago (I believe close to the beginning of our ZFS fileservers), but today I wound up retesting it to see if anything had changed from the relatively early days of Solaris 10.
My simple tests today were on essentially identical hardware (our new fileserver hardware) running OmniOS r151010j and CentOS 7. Because I was doing loopback tests with the server itself for simplicity, I had to restrict my OmniOS tests to the ciphers that the OmniOS SSH server is configured to accept by default; at the moment that is aes128-ctr, aes192-ctr, aes256-ctr, arcfour128, arcfour256, and arcfour. Out of this list, the AES ciphers run from 42 MBytes/sec down to 32 MBytes/sec while the arcfour ciphers mostly run around 126 MBytes/sec (with hmac-md5) to 130 MBytes/sec (with hmac-sha1).
(OmniOS unfortunately doesn't have any of the umac-* MACs that I found to be significantly faster.)
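A loopback test of this sort amounts to timing a fixed amount of data pushed through ssh to localhost. A hedged sketch of how such a test could look (the function name and data size are my own; it assumes an OpenSSH-style client with -c and -m flags, passwordless ssh to localhost, and a shell where time can prefix a pipeline, as in bash or ksh):

```shell
# Sketch only: push 256 MBytes of zeros through a loopback ssh
# connection with a given cipher and MAC, and time it. Divide 256
# by the elapsed seconds to get MBytes/sec.
bench_cipher() {
    cipher=$1
    mac=$2
    time dd if=/dev/zero bs=1024k count=256 2>/dev/null |
        ssh -c "$cipher" -m "$mac" localhost 'cat >/dev/null'
}
# Usage (not run here): bench_cipher arcfour hmac-md5
```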
This is actually an important result because aes128-ctr is the default cipher for clients on OmniOS. In other words, the default SSH setup on OmniOS runs at about a third of the speed that it could. This could be very important if you're planning to do bulk data transfers over SSH (perhaps to migrate ZFS filesystems from old fileservers to new ones).
The good news is that this is faster than 1G Ethernet; the bad news is that it is not very impressive compared to what Linux can get on the same hardware. We can make two comparisons here to show how slow OmniOS is compared to Linux. First, on Linux the best result with the OmniOS ciphers and MACs is aes128-ctr with hmac-sha1 at 180 MBytes/sec (aes128-ctr with hmac-md5 is around 175 MBytes/sec), and even the arcfour ciphers run about 5 MBytes/sec faster than on OmniOS. If we open this up to the more extensive set of Linux ciphers and MACs, the champion is aes128-ctr with umac-64-etm at around 335 MBytes/sec, and all of the AES GCM variants come in with impressive performances of 250 MBytes/sec and up (umac-64-etm improves things a bit here but not as much as it does for aes128-ctr).
(I believe that one reason Linux is much faster on the AES ciphers is that the version of OpenSSH that Linux uses has tuned assembly for AES and possibly uses Intel's AES instructions.)
In summary, through a combination of missing optimizations and missing ciphers and MACs, OmniOS's normal version of OpenSSH is leaving more than half the performance it could be getting on the table.
(The 'good' news for us is that we are doing all transfers from our old fileservers over 1G Ethernet, so OmniOS's SSH speeds are not going to be the limiting factor. The bad news is that our old fileservers have significantly slower CPUs and as a result max out at about 55 MBytes/sec with arcfour (and interestingly, hmac-md5 is better than hmac-sha1 on them).)
PS: If I thought that network performance was more of a limit than
disk performance for our ZFS transfers from old fileservers to the
new ones, I would investigate shuffling the data across the network
without using SSH. I currently haven't seen any sign that this is
the case; our 'zfs send | zfs recv' runs have all been slower
than this. Still, it's an option that I may experiment with (and
who knows, a slow network transfer may have been having knock-on
effects).
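If I do experiment with it, the obvious approach is a zfs send piped over a raw TCP connection. A hypothetical sketch, wrapped in functions for illustration (the port, host, and dataset names are all made up, and nc's listening flag syntax varies between versions):

```shell
# Sketch only: unencrypted zfs send/recv over a raw nc connection.
# Run on the receiving (new) fileserver first:
recv_side() {
    nc -l 9000 | zfs recv tank/somefs
}
# Then on the sending (old) fileserver:
send_side() {
    zfs send oldtank/somefs@snap | nc newserver 9000
}
```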