Our problem with iSCSI connections at boot on OmniOS

February 26, 2016

You might wonder why I recently needed to run a script when our OmniOS machines booted. As it happens, we sometimes have a little problem with our iSCSI networking when we reboot a system, and we would like to know about it right away. First, the high speed summary of iSCSI on our ZFS fileservers is that fileservers connect to their iSCSI backends over two separate and thus redundant networks. At a mechanical level this is done by statically configuring each iSCSI target disk twice, once over each network, joining the two paths together with standard OmniOS multipathing (set to round-robin), and then telling the OmniOS iSCSI initiator that it should make two connections to each target with 'iscsiadm modify initiator-node -c 2' (here's a longer writeup).

What we want and expect is that those two connections to each target should be made over different networks. And most of the time this works. However, some of the time a system will boot up with all of its connections to some or even all of the targets going over only a single network. Usually there will still be two connections but both will be over the same network, which costs us both redundancy and bandwidth.

(It's possible that OmniOS would make a new connection over the other network if the first one died, but this isn't something we exactly want to bet on.)

Because nothing actually breaks when the system is like this (at least when both iSCSI networks are working), it's possible for fileservers to quietly stay in this state for some time. Once we got disturbed enough by this fact, we wrote a script on the backends that checks for this, but only once a day. We decided that we'd like to know faster than that for the most common case, where this unbalanced iSCSI usage happens at boot time and can be detected right after boot. That led to needing a boot time service to run the script and wound up with me deep in SMF for the first and hopefully last time.
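As an illustration of the kind of check involved (this is a sketch, not our actual script), here is a small Python function that looks at the text of 'iscsiadm list target -v' and reports targets that don't have a connection on every iSCSI network. The network names and subnets are made up, and it assumes that the verbose output has 'Target:' lines followed by 'IP address (Peer):' lines, with a target that has several sessions listed once per session.

```python
import ipaddress
import re

# Hypothetical iSCSI networks; these names and subnets are placeholders.
NETWORKS = {
    "iscsi1": ipaddress.ip_network("192.168.101.0/24"),
    "iscsi2": ipaddress.ip_network("192.168.102.0/24"),
}

# Assumed line formats in 'iscsiadm list target -v' output.
TARGET_RE = re.compile(r"^\s*Target:\s*(\S+)")
PEER_RE = re.compile(r"^\s*IP address \(Peer\):\s*([0-9.]+):\d+")


def network_of(addr):
    """Return the name of the iSCSI network an IPv4 address is on."""
    ip = ipaddress.ip_address(addr)
    for name, net in NETWORKS.items():
        if ip in net:
            return name
    return "unknown"


def connections_by_target(text):
    """Map each target name to the list of networks its connections use."""
    seen = {}
    current = None
    for line in text.splitlines():
        m = TARGET_RE.match(line)
        if m:
            current = m.group(1)
            # Record the target even if it turns out to have no connections.
            seen.setdefault(current, [])
            continue
        m = PEER_RE.match(line)
        if m and current is not None:
            seen[current].append(network_of(m.group(1)))
    return seen


def unbalanced_targets(text):
    """Return the sorted targets that lack a connection on some network."""
    expected = set(NETWORKS)
    return sorted(
        target
        for target, nets in connections_by_target(text).items()
        if not expected <= set(nets)
    )
```

A boot time service could feed the command's output to unbalanced_targets() and send email if the result is not empty. A target with no connections at all is reported too, which seems like the right thing to err on.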

By the way, this is not directly OmniOS's fault; it's something that's been happening in Solaris for some time. My assumption is that this problem has at least something to do with the tangled way that Solaris has always brought up iSCSI disks at boot time, such that the OmniOS iSCSI initiator is attempting to bring up the two connections we told it to make at a time when only one network is available.

(Perhaps I should file this as an OmniOS and/or Illumos bug, but somehow I doubt it would get much attention.)

Sidebar: How we fix this

In an ideal world, you could fix this simply by telling OmniOS to switch to having only one connection per target, then go back to two connections per target; OmniOS would notice that it had two networks available and that it would be smart to make that second connection over the other network. Sometimes this even works. Often it doesn't, though.

When it fails to work, what has worked for us is to remove entirely the static target configuration for the network that is not being used, drop to one connection per target, re-add all of those removed static target configurations, and go back to two connections per target. Fortunately we have scripts that generate most of the necessary commands.
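The sequence above can be sketched as a small command generator. This is a minimal illustration and not our real scripts; the function name and the example target names and addresses are hypothetical, and it only produces the command lines, leaving it to you to look them over before running them. It assumes the usual 'target-name,address:port' form for static-config entries.

```python
def rebalance_commands(idle_configs, port=3260):
    """Build the iscsiadm commands for the remove and re-add procedure.

    idle_configs is a list of (target_iqn, backend_ip) pairs describing the
    static target configurations on the network that is NOT being used.
    """
    specs = [f"{iqn},{ip}:{port}" for iqn, ip in idle_configs]
    # 1. Remove the static configurations for the unused network.
    cmds = [f"iscsiadm remove static-config {spec}" for spec in specs]
    # 2. Drop to one connection per target.
    cmds.append("iscsiadm modify initiator-node -c 1")
    # 3. Re-add everything that was removed.
    cmds.extend(f"iscsiadm add static-config {spec}" for spec in specs)
    # 4. Go back to two connections per target.
    cmds.append("iscsiadm modify initiator-node -c 2")
    return cmds
```

For example, printing each element of rebalance_commands([("iqn.2016-02.example:t2", "192.168.102.20")]) gives four commands: one remove, the switch to one connection, one add, and the switch back to two connections.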

Written on 26 February 2016.
Last modified: Fri Feb 26 01:39:38 2016