
2014-07-25

An interesting picky difference between Bourne shells

Today we ran into an interesting bug in one of our internal shell scripts. The script had worked for years on our Solaris 10 machines, but on a new OmniOS fileserver it suddenly reported an error:

script[77]: [: 232G: arithmetic syntax error

Cognoscenti of ksh error messages have probably already recognized this one and can tell me the exact problem. To show it to everyone else, here is line 77:

if [ "$qsize" -eq "none" ]; then
   ....

In a strict POSIX shell, this is an error because test's -eq operator is specifically for comparing numbers, not strings. What we wanted was the = operator.
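
The fixed version looks like this; a minimal sketch, with a made-up $qsize value and action for illustration:

# hypothetical value, the sort that triggered the error
qsize="232G"
# '=' is a string comparison, so non-numeric values are fine
if [ "$qsize" = "none" ]; then
    echo "no quota set"
fi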

What makes this more interesting is that the script had been running on the OmniOS fileserver for some time without reporting this error. However, until now the $qsize variable had always had the value 'none'. So why hadn't it failed earlier? After all, 'none' (on either side of the expression) is just as much not a number as '232G' is.

The answer is a picky difference in how shells actually behave here. Bash, for example, always complains about such misuse of -eq; if either side is not a number you get an error saying 'integer expression expected' (Dash does too, with a slightly different error message). But on our OmniOS, /bin/sh is actually ksh93, and ksh93 behaves slightly differently. Here:

$ [ "none" -eq "none" ] && echo yes
yes
$ [ "bogus" -eq "none" ] && echo yes
yes
$ [ "none" -eq 0 ] && echo yes
yes
$ [ "none" -eq "232G" ] && echo yes
/bin/sh: [: 232G: arithmetic syntax error

The OmniOS version of ksh93 clearly has some sort of heuristic about number conversions such that strings with no digits at all are silently interpreted as '0'. Only invalid numbers (as opposed to things that aren't numbers at all) produce the 'arithmetic syntax error' message. Bash and Dash are both more straightforward about things (as is the FreeBSD /bin/sh, which is derived from ash).
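
For comparison, here is roughly what the same tests produce in Bash (the exact wording may vary between versions):

$ [ "none" -eq "none" ] && echo yes
bash: [: none: integer expression expected
$ [ "none" -eq "232G" ] && echo yes
bash: [: none: integer expression expected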

Update: my description isn't actually what ksh93 is doing here; per opk's comment, it's actually interpreting 'none' and 'bogus' as variable names and giving them a value of 0 when they're unset.
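
If that's right, giving one of those names a value should change the result. A quick sketch of what I'd expect from ksh93 (I haven't verified this exact sequence on OmniOS):

$ none=5
$ [ "none" -eq 5 ] && echo yes
yes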

Interestingly, the old Solaris 10 /bin/sh seems to basically call atoi() on the arguments to -eq; the first three examples work the same, the fourth is silently false, and '[ 232 -eq 232G ]' is true. This matches the 'let's just do it' simple philosophy of the original Bourne shell and test program and may be authentic original V7 behavior.

(Technically this is a difference in test behavior, but test is a builtin in basically all Bourne shells these days. Sometimes the standalone test program in /bin or /usr/bin is actually a shell script to invoke the builtin.)
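You can check which version you're getting with the shell's type builtin; the exact wording varies from shell to shell, but it looks something like this:

$ type test
test is a shell builtin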

unix/ShTestDifference written at 23:34:18

The OmniOS version of SSH is kind of slow for bulk transfers

If you look at the manpage and so on, it's sort of obvious that the Illumos (and thus OmniOS) version of SSH is rather behind the times; Sun branched from OpenSSH years ago to add some features they felt were important, and it has not really been resynchronized since then. It (and before it the Solaris version) also has transfer speeds that are kind of slow due to the SSH cipher et al. overhead. I tested this years ago (I believe close to the beginning of our ZFS fileservers), but today I wound up retesting it to see if anything had changed from the relatively early days of Solaris 10.

My simple tests today were on essentially identical hardware (our new fileserver hardware) running OmniOS r151010j and CentOS 7. Because I was doing loopback tests with the server itself for simplicity, I had to restrict my OmniOS tests to the ciphers that the OmniOS SSH server is configured to accept by default; at the moment that is aes128-ctr, aes192-ctr, aes256-ctr, arcfour128, arcfour256, and arcfour. Out of this list, the AES ciphers run from 42 MBytes/sec down to 32 MBytes/sec while the arcfour ciphers mostly run around 126 MBytes/sec (with hmac-md5) to 130 MBytes/sec (with hmac-sha1).
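
The basic form of such a loopback test is something like the following sketch (the cipher and MAC here are just for illustration; on Linux dd reports a transfer rate itself, and elsewhere you can wrap the whole pipeline in time):

$ dd if=/dev/zero bs=1024k count=1024 | ssh -c arcfour -m hmac-md5 localhost 'cat > /dev/null'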

(OmniOS unfortunately doesn't have any of the umac-* MACs that I found to be significantly faster.)

This is actually an important result because aes128-ctr is the default cipher for clients on OmniOS. In other words, the default SSH setup on OmniOS is about a third of the speed that it could be. This could be very important if you're planning to do bulk data transfers over SSH (perhaps to migrate ZFS filesystems from old fileservers to new ones).
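
One way to avoid the slow default for bulk transfers is to pin the faster ciphers in your ssh_config for the hosts involved; a hypothetical sketch (the host pattern is invented):

Host oldfs* newfs*
    # prefer the much faster arcfour variants for bulk copies
    Ciphers arcfour128,arcfour256,arcfour
    MACs hmac-sha1,hmac-md5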

The good news is that this is faster than 1G Ethernet; the bad news is that this is not very impressive compared to what Linux can get on the same hardware. We can make two comparisons here to show how slow OmniOS is compared to Linux. First, on Linux the best result on the OmniOS ciphers and MACs is aes128-ctr with hmac-sha1 at 180 MBytes/sec (aes128-ctr with hmac-md5 is around 175 MBytes/sec), and even the arcfour ciphers run about 5 MBytes/sec faster than on OmniOS. If we open this up to the more extensive set of Linux ciphers and MACs, the champion is aes128-ctr with umac-64-etm at around 335 MBytes/sec and all of the aes GCM variants come in with impressive performances of 250 MBytes/sec and up (umac-64-etm improves things a bit here but not as much as it does for aes128-ctr).

(I believe that one reason Linux is much faster on the AES ciphers is that the version of OpenSSH that Linux uses has tuned assembly for AES and possibly uses Intel's AES instructions.)

In summary, through a combination of missing optimizations and missing ciphers and MACs, OmniOS's normal version of OpenSSH is leaving more than half the performance it could be getting on the table.

(The 'good' news for us is that we are doing all transfers from our old fileservers over 1G Ethernet, so OmniOS's ssh speeds are not going to be the limiting factor. The bad news is that our old fileservers have significantly slower CPUs and as a result max out at about 55 MBytes/sec with arcfour (and interestingly, hmac-md5 is better than hmac-sha1 on them).)

PS: If I thought that network performance was more of a limit than disk performance for our ZFS transfers from old fileservers to the new ones, I would investigate shuffling the data across the network without using SSH. I currently haven't seen any sign that this is the case; our 'zfs send | zfs recv' runs have all been slower than this. Still, it's an option that I may experiment with (and who knows, a slow network transfer may have been having knock-on effects).
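
If I do experiment with this, the classic approach is to push the zfs send stream over a raw TCP connection with something like netcat instead of ssh. A sketch with invented hostnames, dataset names, and port number (and note that netcat's listening syntax varies between versions):

newfs# nc -l 9999 | zfs recv tank/migrated
oldfs# zfs send tank/fs@snap | nc newfs 9999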

solaris/OmniOSSshIsSlow written at 01:36:48

