Understanding the iSCSI protocol for performance tuning
We'd like to improve the performance of our iSCSI SAN (who doesn't, really?). iSCSI has a bunch of tuning parameters with names like 'InitialR2T', but in order to sensibly touch those you need at least enough knowledge of the iSCSI protocol to get yourself into trouble. So I have been digging into iSCSI and thus into SCSI, and now I feel like writing down what I've learned and worked out before it all falls out of my head again.
(You would think that somewhere on the Internet there would be a good overview of this stuff. So far I haven't been able to find one.)
To understand iSCSI we need to first understand SCSI, because iSCSI is SCSI transported over the network. The important thing (for performance tuning) is that SCSI is not what I'll call a 'streaming' protocol in the way that TCP is. In particular, when you issue a SCSI WRITE command you do not immediately send the data off to the disk drive with it; instead, you send the command (as a CDB) and then you get to wait for the drive to ask you for pieces of the write data.
(In theory the drive can ask for chunks in non-sequential order, although I don't know if any modern ones actually do.)
Since iSCSI is SCSI over IP, it inherits this behavior. Because everyone involved in creating iSCSI understood that lockstep back-and-forth protocols don't do so well over a network, they promptly added workarounds for this. Many of the iSCSI tuning parameters are concerned with enabling (or disabling) and adjusting various bits of the workarounds.
The iSCSI protocol transports Protocol Data Units (PDUs) back and forth between initiator and target over a TCP stream. Each PDU has an iSCSI protocol header and some contents, such as a SCSI CDB or some data that's been read from the disk. The default situation in the iSCSI protocol is that the initiator must send a separate PDU for everything that would be a separate operation or transfer phase in a real SCSI operation, and just as with SCSI disks the initiator can only send write data (in one or more PDUs) when the target tells it to.
iSCSI overhead comes in two forms: things that affect bandwidth and things that affect latency. Raw theoretical bandwidth is reduced by the protocol overhead of both iSCSI and TCP; every PDU of real data costs you the iSCSI protocol header and every TCP packet that the whole iSCSI stream is broken up into costs you the TCP headers. The way to maximize theoretical bandwidth is to reduce both costs by making the protocol unit sizes as large as possible.
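To put rough numbers on the bandwidth side, here is a minimal sketch in Python. It assumes a 48-byte iSCSI header per PDU (the basic header segment, with no digests or extra header segments), 40 bytes of TCP and IPv4 headers per packet with no options, and it ignores Ethernet framing entirely. The function name and the sample sizes are mine, purely for illustration.

```python
ISCSI_HEADER = 48       # iSCSI basic header segment, in bytes
TCP_IP_HEADERS = 40     # assumption: 20 bytes TCP + 20 bytes IPv4, no options


def payload_efficiency(data_segment_len, mtu=1500):
    """Approximate fraction of IP-level bytes that are actual SCSI data.

    Treats the iSCSI stream as a long run of full-sized PDUs carried in
    full-sized TCP packets, so the two overheads simply multiply.
    """
    iscsi_fraction = data_segment_len / (data_segment_len + ISCSI_HEADER)
    tcp_fraction = (mtu - TCP_IP_HEADERS) / mtu
    return iscsi_fraction * tcp_fraction


for seg_len in (8192, 65536, 262144):
    for mtu in (1500, 9000):
        eff = payload_efficiency(seg_len, mtu)
        print(f"PDU data {seg_len:>6} bytes, MTU {mtu}: {eff:.2%}")
```

Under these assumptions the per-packet TCP cost dominates at a 1500-byte MTU, and making PDUs larger only claws back a fraction of a percent.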
So we get to the meaning of the first iSCSI tuning parameter: the size of an iSCSI PDU in any particular connection is the Maximum Receive (or Send) Data Segment Length negotiated between the initiator and the target. Raising these numbers reduces iSCSI protocol overhead but may require your systems to allocate more kernel memory for reserved buffers.
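As an illustration of what this limit does, the following Python sketch splits a transfer into the data PDUs needed to carry it. The helper name is hypothetical; 8192 bytes is the protocol's default for the maximum receive data segment length, and the 48-byte figure is the iSCSI basic header with no digests.

```python
ISCSI_HEADER = 48  # iSCSI basic header segment, in bytes


def data_pdus(transfer_len, max_data_segment_len=8192):
    """Return (offset, length) pairs, one per data PDU needed for the transfer."""
    return [
        (offset, min(max_data_segment_len, transfer_len - offset))
        for offset in range(0, transfer_len, max_data_segment_len)
    ]


# A 64 KiB transfer at the default limit versus a 64 KiB limit.
for limit in (8192, 65536):
    pdus = data_pdus(65536, limit)
    print(f"limit {limit}: {len(pdus)} PDUs, "
          f"{len(pdus) * ISCSI_HEADER} bytes of iSCSI headers")
```

Fewer, larger PDUs mean fewer headers, but each side has to be prepared to buffer a PDU of the size it agreed to.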
(As with all iSCSI tuning parameters, generally you have to change them on both the initiator and the target. For extra fun, not all software supports changing all parameters, so you may be unable to adjust some of them because one side is stubborn.)
Latency is where we start running into the limits of my iSCSI knowledge. My impression is that an iSCSI target sending data in response to a SCSI read operation is free to throw it at the initiator as fast as possible; it is only initiator write operations that have to stop for the target to say 'go ahead, send me some data now'. If this is correct, you can only really tune iSCSI write latency, and the latency of write operations may not be all that important for various reasons.
If you want to tune this, there are a number of options in the iSCSI protocol to reduce the pauses for the target to say 'go ahead, send me more'. First off, there are two ways to send 'unsolicited data', which is data that the target did not send you a 'Ready To Transfer' (R2T) notice for:
- if ImmediateData is set to yes, the initiator can attach write data to the initial PDU that carries the SCSI WRITE command. Because all of this is in a single PDU, the size of the SCSI WRITE command and the associated data is limited by the maximum PDU size.
- if InitialR2T is set to no, the initiator can send write data in separate Data-Out PDUs immediately after it sends the SCSI WRITE PDU.
If ImmediateData is yes and InitialR2T is no, the initiator can do both; it can send some write data as part of the SCSI WRITE PDU and then as much more as possible as separate Data-Out PDUs. Regardless of how it is sent, you can never send more total unsolicited data than the First Burst Length setting (which itself must be no larger than the Maximum Burst Length setting, as you might expect).
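My understanding of how these settings interact can be sketched in Python as follows. This is a model of the rules as described above, not code from any real initiator; the function and argument names are made up, and the sizes in the example are only illustrative.

```python
def unsolicited_plan(write_len, immediate_data, initial_r2t,
                     first_burst_len, max_data_segment_len):
    """Return (immediate, unsolicited_data_out, left_for_r2t) in bytes.

    immediate_data and initial_r2t are booleans for the ImmediateData
    and InitialR2T settings (True means 'yes').
    """
    # Total unsolicited data can never exceed the First Burst Length.
    budget = min(write_len, first_burst_len)
    # Immediate data rides in the SCSI WRITE PDU, so it is capped by PDU size.
    immediate = min(budget, max_data_segment_len) if immediate_data else 0
    # Unsolicited Data-Out PDUs are only allowed when InitialR2T is no.
    data_out = 0 if initial_r2t else budget - immediate
    return immediate, data_out, write_len - immediate - data_out


# A 256 KiB write with a 64 KiB first burst and 8 KiB PDUs.
print(unsolicited_plan(262144, True, False, 65536, 8192))
print(unsolicited_plan(262144, True, True, 65536, 8192))
print(unsolicited_plan(262144, False, True, 65536, 8192))
```

The third number is what still has to wait for the target to ask for it, which is the part that costs round trips.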
Once you have exhausted the allowed unsolicited data (if any), write data is transferred in response to R2T PDUs that the target sends to the initiator. I believe that each R2T can ask for and get up to the Maximum Burst Length of write data. If both the initiator and target opt to relax the normal data ordering requirements (which makes error recovery harder), you can have multiple outstanding R2Ts, up to the Maximum Outstanding R2T setting.
(See section 12.19 of the iSCSI RFC for the gory details of when this isn't possible.)
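To get a feel for how many pauses the solicited part of a write costs, here is a deliberately crude Python model. It assumes every R2T asks for a full Maximum Burst Length, and that a target with several outstanding R2Ts allowed issues them in batches and waits for each batch to finish; a real target is free to ask for less and to pipeline more cleverly. All of the names are hypothetical.

```python
import math


def r2t_waits(solicited_len, max_burst_len, max_outstanding_r2t=1):
    """Return (r2t_count, waits) for the solicited part of a write.

    'waits' counts how many times the initiator sits idle until the
    next batch of R2Ts arrives, under the batching assumption above.
    """
    r2t_count = math.ceil(solicited_len / max_burst_len)
    waits = math.ceil(r2t_count / max_outstanding_r2t)
    return r2t_count, waits


# 192 KiB left to send after unsolicited data, with a 64 KiB burst length.
for outstanding in (1, 2, 4):
    count, waits = r2t_waits(196608, 65536, outstanding)
    print(f"outstanding R2Ts {outstanding}: {count} R2Ts, {waits} waits")
```

Multiplying the number of waits by your network round-trip time gives a rough lower bound on the latency that the R2T back-and-forth adds to the write.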
Now, there's another effect of the Maximum Burst Length: it limits how large a SCSI READ operation can be. The reply to a read has to be transferred in a single sequence of Data-In PDUs, and the data length of this sequence can be no more than the Maximum Burst Length. To read more, the initiator needs to send another SCSI READ command. My feeling is that this is not likely to be a real concern in most situations.