2011-04-30
Our likely iSCSI parameter tuning
Now that I have some idea of what the various iSCSI parameters do and control, I can think about how we might want to change them. The necessary disclaimer is that at the moment all of this is theoretical; it has not been tested, validated, or shown to be useful.
My general belief is that our IO load is mostly reads and is somewhere between truly random and short sequential reads (ie, sequential reads of short files). I expect that most of our writes are asynchronous, but some of them are synchronous as ZFS commits transaction groups.
(I have not actually verified this belief, partly because measuring this stuff is hard. Please do not suggest DTrace as the obvious easy answer unless you also have a DTrace script that gives good answers to these questions, at scale.)
Given this, my first overall conclusion is that tuning iSCSI parameters probably isn't that important for us, provided that they are sane to start with. Bandwidth is not really an issue in general and the major latency tuning you can do is for writes, which are not that important for us. Dedicated tuning could modestly lower the protocol overhead for reads and theoretically this slightly reduces the latency, but.
This doesn't mean that there's nothing to tune, though. Here's what I think we'll want:
- InitialR2T set to No, so that the initiator does not have to wait
for the target before sending write data. But then everyone wants
InitialR2T set to no.
(It should really be the iSCSI default and then targets that have weird requirements and limitations would require it be Yes. But this grump really calls for a separate entry.)
- a maximum PDU size that lets the target return all of the data
for a typical small read in a single PDU. The default of 8Kbytes
is probably not enough, but I wouldn't go all that large either
(at least not without a lot of testing).
I don't think that ZFS does any small synchronous writes, so it's not worth worrying about fitting writes into a single PDU.
- a maximum burst size that's at least large enough to allow ZFS to
read its typical large read size in a single request. On many
pools, I believe that this will be the record size, normally 128
Kbytes. A larger maximum burst size is not a problem and may
sometimes be an advantage.
- a first burst length that is large enough to allow ZFS to do uberblock updates without waiting for an R2T; this assumes that InitialR2T is no. I believe that this is 128 Kbytes.
Since we can set InitialR2T to no, I don't think we really care about ImmediateData. If it naturally defaults to 'yes' there's no reason to change it, but if it doesn't then there's not much reason to bother changing it. The exception would be if it turns out to be harmless to set the maximum PDU size very large (because neither side does any fixed buffer allocations based on it), large enough that typical writes easily fit into a single PDU.
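To make the parameter choices above concrete, here is a minimal sanity-check sketch in Python. The specific values are illustrative assumptions that I picked for this example, not tested settings; the checks are just the constraints discussed above (the first burst length can't exceed the maximum burst length, a full ZFS record should fit in one burst, and a typical small read should fit in one PDU).

    # Sanity-check sketch for a candidate iSCSI parameter set; the specific
    # values here are illustrative assumptions, not tested recommendations.
    KB = 1024

    params = {
        "InitialR2T": "No",                    # don't wait for an R2T on writes
        "ImmediateData": "Yes",                # fine if it defaults this way
        "MaxRecvDataSegmentLength": 64 * KB,   # PDU size: covers small reads
        "MaxBurstLength": 256 * KB,            # a full ZFS record per request
        "FirstBurstLength": 128 * KB,          # unsolicited write data allowance
    }

    ZFS_RECORD_SIZE = 128 * KB      # default ZFS recordsize
    TYPICAL_SMALL_READ = 32 * KB    # assumed 'typical small read' size

    # The protocol requires FirstBurstLength <= MaxBurstLength.
    assert params["FirstBurstLength"] <= params["MaxBurstLength"]
    # A full record read should need only one SCSI READ (one burst).
    assert ZFS_RECORD_SIZE <= params["MaxBurstLength"]
    # A typical small read should come back in a single Data-In PDU.
    assert TYPICAL_SMALL_READ <= params["MaxRecvDataSegmentLength"]
    print("candidate parameters pass the basic checks")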
2011-04-29
Understanding the iSCSI protocol for performance tuning
We'd like to improve the performance of our iSCSI SAN (who doesn't, really?). iSCSI has a bunch of tuning parameters with names like 'InitialR2T', but in order to sensibly touch those you need at least enough knowledge of the iSCSI protocol to get yourself into trouble. So I have been digging into iSCSI and thus into SCSI, and now I feel like writing down what I've learned and worked out before it all falls out of my head again.
(You would think that somewhere on the Internet there would be a good overview of this stuff. So far I haven't been able to find one.)
To understand iSCSI we need to first understand SCSI, because iSCSI is SCSI transported over the network. The important thing (for performance tuning) is that SCSI is not what I'll call a 'streaming' protocol, like TCP. In particular, when you issue a SCSI WRITE command you do not immediately send the data off to the disk drive with it; instead, you send the command (as a CDB) and then you get to wait for the drive to ask you for pieces of the write data.
(In theory the drive can ask for chunks in non-sequential order, although I don't know if any modern ones actually do.)
Since iSCSI is SCSI over IP, it inherits this behavior. Because everyone involved in creating iSCSI understood that lockstep back and forth protocols don't do so well over TCP, they promptly added workarounds for this. Many of the iSCSI tuning parameters are concerned with enabling (or disabling) and adjusting various bits of the workarounds.
The iSCSI protocol transports Protocol Data Units (PDUs) back and forth between initiator and target over a TCP stream. Each PDU has an iSCSI protocol header and some contents, such as a SCSI CDB or some data that's been read from the disk. The default situation in the iSCSI protocol is that the initiator must send a separate PDU for everything that would be a separate operation or transfer phase in a real SCSI operation, and just as with SCSI disks the initiator can only send write data (in one or more PDUs) when the target tells it to.
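To make the lockstep nature concrete, here is a small Python sketch of the PDU exchange for a plain write when no unsolicited data is allowed. It is a simplified model for illustration (real iSCSI has headers, digests, tags, and so on), not an actual implementation.

    # Simplified model of the default iSCSI write exchange (no unsolicited
    # data); purely illustrative, not a real iSCSI implementation.
    def lockstep_write(write_size, max_burst, pdu_data_size):
        pdus = [("SCSI WRITE command", 0)]    # initiator -> target: the CDB only
        sent = 0
        while sent < write_size:
            burst = min(max_burst, write_size - sent)
            pdus.append(("R2T", 0))           # target: 'send me this much now'
            offset = 0
            while offset < burst:             # initiator: the solicited data
                chunk = min(pdu_data_size, burst - offset)
                pdus.append(("Data-Out", chunk))
                offset += chunk
            sent += burst
        pdus.append(("SCSI Response", 0))     # target: final status
        return pdus

    # A 128 KB write with a 256 KB burst limit and 8 KB data PDUs:
    for kind, size in lockstep_write(128 * 1024, 256 * 1024, 8 * 1024):
        print(kind, size)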
iSCSI overhead comes in two forms: things that affect bandwidth and things that affect latency. Raw theoretical bandwidth is reduced by the protocol overhead of both iSCSI and TCP; every PDU of real data costs you the iSCSI protocol header and every TCP packet that the whole iSCSI stream is broken up into costs you the TCP headers. The way to maximize theoretical bandwidth is to reduce both costs by making the protocol unit sizes as large as possible.
So we get to the meaning of the first iSCSI tuning parameter: the size of an iSCSI PDU in any particular connection is the Maximum Receive (or Send) Data Segment Length negotiated between the initiator and the target. Raising these numbers reduces iSCSI protocol overhead but may require your systems to allocate more kernel memory for reserved buffers.
(As with all iSCSI tuning parameters, generally you have to change them on both the initiator and the target. For extra fun, not all software supports changing all parameters, so you may be unable to adjust some of them because one side is stubborn.)
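To put rough numbers on the iSCSI side of this: every PDU carries a 48-byte basic header segment (ignoring optional header and data digests), so the header cost of a read depends on how many Data-In PDUs the data gets chopped into. A quick back-of-the-envelope sketch, ignoring TCP/IP overhead:

    # Rough iSCSI header overhead for a single read, assuming a 48-byte basic
    # header segment per PDU and no digests; TCP/IP overhead is not counted.
    BHS = 48

    def read_overhead(read_size, max_data_segment):
        pdus = -(-read_size // max_data_segment)    # ceiling division
        return pdus, pdus * BHS

    for seg_kb in (8, 64, 256):
        pdus, overhead = read_overhead(128 * 1024, seg_kb * 1024)
        print(f"{seg_kb:3d} KB data segments: {pdus:2d} Data-In PDUs, "
              f"{overhead} bytes of iSCSI headers")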
Latency is where we start running into the limits of my iSCSI knowledge. My impression is that an iSCSI target sending data in response to a SCSI read operation is free to throw it at the initiator as fast as possible; it is only initiator write operations that have to stop for the target to say 'go ahead, send me some data now'. If this is correct, you can only really tune iSCSI write latency, and the latency of write operations may not be all that important for various reasons.
If you want to tune this, there are a number of options in the iSCSI protocol to reduce the pauses for the target to say 'go ahead, send me more'. First off, there are two ways to send 'unsolicited data', data that the target did not send you a 'Ready To Transfer' (R2T) notice for:
- if ImmediateData is set to yes, the initiator can attach write data
to the initial PDU that carries the SCSI WRITE command. Because all
of this is in a single PDU, the size of the SCSI WRITE command and
the associated data is limited by the maximum PDU size.
- if InitialR2T is set to no, the initiator can send write data in separate Data-Out PDUs immediately after it sends the SCSI WRITE PDU.
If ImmediateData is yes and InitialR2T is no, the initiator can do both; it can send some write data as part of the SCSI WRITE PDU and then as much more as possible as separate Data-Out PDUs. Regardless of how it is sent, you can never send more total unsolicited data than the First Burst Length setting (which itself must be no larger than the Maximum Burst Length setting, as you might expect).
Once you have exhausted the allowed unsolicited data (if any), write data is transferred in response to R2T PDUs from the target to the initiator. I believe that each R2T can ask for and get up to the Maximum Burst Length of write data. If both the initiator and target opt to relax the normal data ordering requirements (which makes error recovery harder), you can have multiple outstanding R2Ts, up to the Maximum Outstanding R2T setting.
(See section 12.19 of the iSCSI RFC for the gory details of when this isn't possible.)
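Putting the write-side parameters together, here is a sketch of how a single write gets divided into immediate data, unsolicited Data-Out PDUs, and R2T-solicited bursts. As before this is a simplified illustration with made-up parameter values, not a real implementation; in particular it ignores digests, padding, and error recovery.

    # How write data is divided up under the unsolicited-data rules; a
    # simplified illustration with made-up parameter values.
    def plan_write(size, immediate_data, initial_r2t,
                   first_burst, max_burst, max_data_segment):
        unsolicited = 0
        if immediate_data:
            # Data carried in the SCSI WRITE PDU itself, limited by the PDU
            # size and counted against FirstBurstLength.
            unsolicited = min(size, first_burst, max_data_segment)
            print("immediate data:", unsolicited)
        if not initial_r2t:
            # Further unsolicited Data-Out PDUs, up to FirstBurstLength total.
            extra = min(size, first_burst) - unsolicited
            print("unsolicited Data-Out:", extra)
            unsolicited += extra
        remaining = size - unsolicited
        r2t = 0
        while remaining > 0:
            # Each R2T may solicit at most MaxBurstLength bytes of data.
            burst = min(max_burst, remaining)
            r2t += 1
            print(f"R2T #{r2t}: solicit {burst} bytes")
            remaining -= burst

    # A 1 MB write with ImmediateData=Yes, InitialR2T=No, a 128 KB first
    # burst, a 256 KB maximum burst, and 64 KB PDUs:
    plan_write(1024 * 1024, True, False, 128 * 1024, 256 * 1024, 64 * 1024)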
Now, there's another effect of the Maximum Burst Length: it limits how large a SCSI READ operation can be. The reply to a read has to be transferred in a single sequence of Data-In PDUs, and the data length of this sequence can be no more than the maximum burst length. To read more, the initiator needs to send another SCSI READ command. My feeling is that this is not likely to be a real concern in most situations.
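The arithmetic here is just a ceiling division; for instance, with an illustrative 256 KB maximum burst length, a 1 MB read takes four separate READ commands:

    # SCSI READ commands needed for a large read under MaxBurstLength
    # (illustrative values only).
    def reads_needed(total_size, max_burst):
        return -(-total_size // max_burst)      # ceiling division

    print(reads_needed(1024 * 1024, 256 * 1024))    # -> 4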
2011-04-11
The importance of test suites for standards
A while back I noted that very few of the web's standards have a test suite and that this could be a problem. You might reasonably ask why this matters, especially since so few standards in general have a test suite.
My answer is that having an official test suite for the standard does a lot of things:
- it lets you know if the implementation you've created actually
conforms. Without this you're left with various sorts of ad-hoc
tests that may be hard to set up and run (eg, do you interoperate
with as many of the other implementations as possible in as many
situations as possible).
- it means that everyone has the same idea of conformance and what
the correct behavior is. Ideally this includes odd and unconventional
behaviors, because the people who created the test suite looked for
areas in the standard that could be missed or misunderstood and added
tests to cover them.
- a test suite forces standard writers to be unambiguous about what
the standard says. When people write tests, they also have to come
up with what the results of the tests should be.
- the process of creating a test suite exercises the standard and
  thus helps to ensure that it doesn't have subtle contradictions
and that it is complete. These issues will also be discovered by
attempting to implement the standard, but the advantage of a test
suite is that it discovers these issues before the standard is
frozen.
- the process of creating the test suite also makes sure that the standard's authors understand at least some of the implications of the standard's requirements before the standard is finalized. This is not as good as requiring an implementation, but it will at least find some of the problems.
All of these are good and praiseworthy things, but there's another way to look at the situation. The reality is that every standard needs a test suite and is going to get at least one. The only question is whether the 'test suite' will be written independently by every sane implementor, using whatever potentially mistaken ideas of the proper standard-compliant behavior that the implementor gathered from reading the standard a few times, or if the test suite will be created by people who know exactly what the standard requires because they wrote it.
(Every sane implementor needs a test suite because they need to test whether their implementation actually works right.)
(Yes, all of this is obvious and well known. I just feel like writing it down, partly to fix it in my mind.)
2011-04-10
The evolution of the git tree format
This is a second-hand war story. Second hand because it's not my war story; I was just lucky enough to be reading the right mailing lists during the early days of Linus Torvalds developing git.
(One of the occasional privileges of hanging around the mailing lists for various open source projects is getting to see people evolve designs on the fly as they learn more about the problem that they're tackling. To be clear, I mean this non-sarcastically; it's not often that you get to see highly skilled developers refine something before your eyes, and the experience is very useful if you pay attention.)
A git commit captures, among other things, the state of the directory tree at that point. Conceptually, the state of the tree is just a list of all of the filenames in the tree with the SHA1 hashes of their current contents (sorted into some canonical order). Git calls this a 'tree'.
In the early versions of git, the internal representation of a tree was literally this simple; it was a file with all of the filenames and their SHA1 hashes (and their permissions and so on). This worked fine for git itself, but when Linus started trying to apply this to larger things (like say the Linux kernel) it rapidly became obvious that there was a drawback to this simple approach. A list of all of the files in the Linux kernel is a pretty big thing, and most commits change only a very few files, so commits were creating big new tree objects where almost all of the contents were the same as the previous version of the tree. This was a very inefficient representation.
(It's not just an issue of disk space. Unnecessarily large tree objects take longer to generate and process and compare and so on; the code spends a lot of time doing pointless work.)
Linus's solution was to make tree objects hierarchical, containing both filenames and pointers to (sub-)tree objects to represent subdirectories. This means that a commit's tree can reuse the sub-tree objects for all of the subdirectories that haven't changed from the last commit (well, from any commit, really); for typical commits, almost all of the tree is the same as the last time around so almost all of the objects are reused as-is. And the bits of the tree representation that do change are relatively small, since individual directories tend to not be very large.
(When a file changes, every tree object between it and the root of the repository also has to change. The file's subdirectory tree object changes because the file's SHA1 has changed, the tree object of the subdirectory's parent has to change because now the subdirectory tree object has a different SHA1 itself, and so on up to the root.)
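Here's a toy model of the hierarchical scheme in Python. It is not git's actual object format or implementation, just an illustration of how content-addressed trees let unchanged subdirectories be reused as-is while a one-file change only produces new tree objects along the path up to the root.

    # Toy model of hierarchical, content-addressed tree objects (not git's
    # real on-disk format): a tree's id is a hash over its entries, so an
    # unchanged subdirectory hashes to the same object and is simply reused.
    import hashlib

    def build(tree, store):
        # tree maps names to either bytes (file contents) or a nested dict.
        entries = []
        for name, value in sorted(tree.items()):
            if isinstance(value, dict):
                entries.append(("tree", name, build(value, store)))
            else:
                entries.append(("blob", name, hashlib.sha1(value).hexdigest()))
        payload = "".join(f"{kind} {oid} {name}\n" for kind, name, oid in entries)
        oid = hashlib.sha1(payload.encode()).hexdigest()
        store[oid] = payload        # content-addressed: same entries, same object
        return oid

    v1 = {"Makefile": b"obj-y += fs/ mm/\n",
          "fs": {"ext4": {"inode.c": b"int x;\n"}},
          "mm": {"slab.c": b"int y;\n"}}
    v2 = {"Makefile": b"obj-y += fs/ mm/\n",
          "fs": {"ext4": {"inode.c": b"int x;\n"}},
          "mm": {"slab.c": b"int y, z;\n"}}   # one file changed under mm/

    old, new = {}, {}
    build(v1, old)
    build(v2, new)
    # Only the trees on the changed path (mm/ and the root) get new ids;
    # fs/ and fs/ext4/ are unchanged and would be reused as-is.
    print(len(set(new) - set(old)), "new tree objects out of", len(new))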
A disclaimer: this is how I remember things going. I have to admit that I haven't gone back to the appropriate mailing list archives to double-check that my memory is completely correct here.
Sidebar: some numbers on tree sizes
A null-separated list of all of the files in a current Linux kernel tree is over a megabyte; a git tree version of this would be larger, since it also needs to encode SHA1s and file permissions. Even compressed with gzip this file list is over 180 KB.
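If you want to reproduce rough numbers like these, something along these lines will do; it assumes a kernel checkout in ./linux and approximates each entry as '<mode> <name>\0' plus a 20-byte binary SHA1, roughly how git tree objects encode entries.

    # Rough size estimate for one big flat tree of a kernel checkout; assumes
    # a ./linux directory and approximates each entry as "<mode> <name>\0"
    # plus a 20-byte binary SHA1.
    import os

    total_bytes = 0
    files = 0
    for dirpath, dirnames, filenames in os.walk("linux"):
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            path = os.path.relpath(os.path.join(dirpath, name), "linux")
            total_bytes += len("100644 ") + len(path.encode()) + 1 + 20
            files += 1

    print(f"{files} files, roughly {total_bytes / 1024:.0f} KB as one flat tree")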