The problem with the CFQ IO scheduler and our iSCSI targets
Current versions of Linux have several different ways to schedule IO activity; one writeup of this is here. The developers of the iSCSI target software that we use have for some time recommended that people switch away from CFQ (the default one) to either the 'deadline' or the 'noop' scheduler to get better performance. Recently I got around to testing this to see if it made a difference in our environment, and it turns out that it does.
(Whether the difference is significant is an open question.)
The specifics of the difference are less interesting than why it happens. According to what I've gathered from the developers, this is what is going on:
CFQ is the 'completely fair queuing' scheduler. Its goal is to fairly schedule IO between multiple contending processes, so that each gets its fair share of disk bandwidth and IO and that one highly active process doesn't lead to long delays for IO from other processes. (This problem has a long history in Unixes, especially once the unified buffer cache appeared.)
The problem between CFQ and our iSCSI target software is the target driver uses multiple threads and randomly assigns incoming IO requests to them. Each thread is seen as a separate context by CFQ, and as part of being fair CFQ won't merge contiguous requests together if they're in different CFQ contexts. So the frontend splits up one big IO operating into multiple iSCSI requests (because of, for example, size limits on a single iSCSI request), these iSCSI requests are dispatched to different threads on the backend, and then they are scheduled and completed separately (and thus slower).
(The other case that I can imagine is several different streams of sequential IO requests, such as readaheads on multiple files at once. The target software may well split the IO requests for a given stream across threads, thereby killing any chance of merging them.)
Because the deadline scheduler doesn't attempt fair scheduling this way, it will merge such requests together, which is what we want. My understanding is that it will also try to make sure that no individual request waits too long, which has various potential benefits in our environment.
(We effectively have three or four different logical disks on each physical disk, so we don't want IO to one logical disk to starve the others. This implies that iSCSI cooperating with CFQ with one CFQ context per logical disk/LUN would probably be ideal, since that would fairly divide the disk bandwidth between the LUNs. The IET developers are talking about doing that at some point in the future.)
Having written all this (partly to get it straight in my head), I now wonder if this issue also affects other sorts of disk and fileservers that use threads. The big one would be NFS servers, since I believe that they have a similar thread pool setup.
(Unfortunately I have no Linux NFS servers to do experiments with.)