Switch flow control and buffering: what we think was wrong in our iSCSI network
We have a theory about what was wrong with our problematic iSCSI switch. To set the scene: the problematic switch is a higher-end switch, of the sort that is generally intended as a core switch in a network backplane; this is in fact what we mostly use this model of switch for (and in that role we've been quite happy with them). The switch that works well is a lower-end switch from the same company, with all of the basic functionality but fewer bells and whistles. During troubleshooting we noticed that the problem switch did not have flow control turned on while the good one did; this is in fact the default configuration for each model. Turning on flow control on the problem switch didn't solve the problem, but we've had issues before with flow control on this model of switch.
Now for the theory. Our ZFS fileservers generally issue 128 Kbyte reads; this is the default ZFS blocksize, and ZFS always reads whole blocks regardless of how much you asked for. On a gigabit network, 128 Kbytes takes a bit over a millisecond to transmit (how much over depends on the iSCSI, TCP, and Ethernet overhead), and it's possible for an iSCSI backend to have several reads worth of data to send to the fileserver at the same time.
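To put a rough number on that "a bit over a millisecond", here is the back-of-envelope arithmetic. The overhead figures are textbook assumptions (TCP/IPv4 with no options, a standard 1500-byte MTU), not measurements from our network, and this ignores iSCSI's own protocol headers:

```python
MTU = 1500
TCP_IP_HDR = 40                  # TCP + IPv4 headers, no options
ETH_OVERHEAD = 14 + 4 + 8 + 12   # header + FCS + preamble + inter-frame gap
LINK_BPS = 1_000_000_000         # gigabit Ethernet

payload = 128 * 1024             # one ZFS block
mss = MTU - TCP_IP_HDR           # 1460 data bytes per full frame
full_frames, last = divmod(payload, mss)
wire_bytes = full_frames * (MTU + ETH_OVERHEAD)
if last:
    wire_bytes += last + TCP_IP_HDR + ETH_OVERHEAD
wire_ms = wire_bytes * 8 / LINK_BPS * 1000
print(f"{full_frames + (1 if last else 0)} frames, {wire_ms:.2f} ms on the wire")
```

This comes out to 90 frames and roughly 1.1 milliseconds of wire time for a single 128 Kbyte read reply.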
Suppose that a fileserver happens to issue 128 Kbyte iSCSI reads to two backends over the problematic network, and the backends get the data from the disks at about the same time and thus both start trying to transmit to the fileserver at the same time. For the duration that both are trying to dump data on the fileserver, they are each transmitting at a gigabit to the switch, for an aggregate burst bandwidth of 2 Gbits/s; however, the fileserver only has a single gigabit link from the switch. For the few milliseconds that both backends want to transmit at once, things simply don't fit and something has to give. The switch can buffer one backend's Ethernet frames, rapidly flow control one backend, or simply drop the frames it can't transmit down the fileserver's link.
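A toy model makes it clear how fast the backlog builds up: with two senders each at 1 Gbit/s feeding one 1 Gbit/s output link, the switch's queue for that port grows at one full link's worth of excess. The millisecond figures here are illustrative, not measured:

```python
# Two 1 Gbit/s inputs, one 1 Gbit/s output: excess arrival rate is
# one full link, so the queue grows at ~125 Mbytes/sec.
LINK_BYTES_PER_SEC = 1_000_000_000 / 8
for ms in (1, 2, 5):
    backlog_kb = LINK_BYTES_PER_SEC * (ms / 1000) / 1024
    print(f"after {ms} ms of simultaneous sending: ~{backlog_kb:.0f} KB queued")
```

Even one millisecond of overlap queues up on the order of 122 KB; a few milliseconds queues several hundred KB.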
At this point I was going to insert our speculation about how lower-end networking gear often has bigger buffers than higher-end gear, but it turns out I don't have to. The company that made both switches has their data sheets online and they cover switch buffer memory, so I can just tell you that the higher-end switch has 1 megabit of buffer memory, ie 128 Kbytes, while the lower-end switch has 2 megabytes of it. Given iSCSI, TCP, and Ethernet overheads, the higher-end switch can't even buffer one full iSCSI read reply; the lower-end switch can buffer several.
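You can check that claim with the same kind of back-of-envelope arithmetic (again using assumed textbook per-frame overheads, not measured figures): the on-wire size of a single 128 Kbyte read reply versus each switch's packet buffer.

```python
MSS = 1460                 # TCP payload per standard 1500-byte frame
PER_FRAME = 1500 + 38      # MTU plus Ethernet header/FCS/preamble/gap
full, last = divmod(128 * 1024, MSS)
on_wire = full * PER_FRAME + ((last + 40 + 38) if last else 0)
print(f"one reply: ~{on_wire // 1024} KB on the wire")
print("fits in the 128 KB buffer:", on_wire <= 128 * 1024)
print("fits in the 2 MB buffer:", on_wire <= 2 * 1024 * 1024)
```

One reply works out to roughly 134 KB on the wire, which overflows a 128 KB buffer all by itself, while a 2 MB buffer has room for a dozen or so.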
This explains the symptoms we saw. The problem appeared under load and got worse as the load went higher because the more IO load a fileserver was under (especially random IO from multiple sources), the higher the chance that it would send reads to more than one backend at the same time over the same network path (the fileserver used both network paths to each backend on a round-robin basis). The problem was worse on the mail spool because we put the mail spool in a highly replicated ZFS pool, which raises the chance that more than one backend would be trying to send to the fileserver at once (the disk-based pool was a four-way mirror and the SSD pool is a three-way mirror). And the relatively long network stalls were because TCP transmission on the backends was stalling out under conditions of random packet loss, which both shrank the socket send buffer size and slowed down transmission.
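The "more load means more collisions" part can be illustrated with a toy simulation. This is my own model, not data from our network: k outstanding reads are spread over n mirror backends, each reply ties up the fileserver's link for about 1.1 ms, and reply times are scattered at random through a 50 ms window. We then count how often replies from two different backends overlap:

```python
import random

def overlap_rate(k, n, trials=5_000, window=50.0, xmit=1.1):
    """Fraction of trials where replies from two different backends
    start within one transmit time (xmit ms) of each other."""
    hits = 0
    for _ in range(trials):
        replies = sorted((random.uniform(0, window), random.randrange(n))
                         for _ in range(k))
        if any(b1 != b2 and t2 - t1 < xmit
               for (t1, b1), (t2, b2) in zip(replies, replies[1:])):
            hits += 1
    return hits / trials

random.seed(1)
print(f"light load (2 reads over 4 backends): {overlap_rate(2, 4):.0%}")
print(f"heavy load (8 reads over 4 backends): {overlap_rate(8, 4):.0%}")
```

Under this model the collision rate climbs from a few percent at light load to well over half the time at heavier load, which matches the "worse as load goes up" behaviour we saw.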
(And now that I've written this, I suspect that we'd have seen significant TCP error counts for things like retransmissions if we'd looked.)
Sidebar: why our problematic iSCSI switch wasn't broken
The short version is that the switch we had iSCSI problems with wasn't broken (or wasn't broken much); instead, we were using it wrong. Although we didn't know it, we needed a switch that prioritized buffering over absolute flat-out switching speed. My strong impression is that this is exactly backwards from the priorities of higher-end core backbone switches. To make a bad analogy, we were asking a Ferrari to haul a big load of groceries.
One thing I take away from this is that switches are not necessarily one size fits all, not in practice. Just because a switch works great in one role doesn't mean that it's going to drop into another one without problems.