Counterintuitive RAID read performance
While doing performance tests on an iSCSI RAID controller, we recently turned up some unexpected results: the controller could write significantly faster than it could read, to both RAID 5 and RAID 0 targets. In one case, a six-disk RAID 0 target could do streaming writes 20 megabytes/second faster than it could do streaming reads (75 MB/s write versus 55 MB/s read). This surprised me a lot, because I usually expect reads to run faster than writes. (It's certainly the case on single SATA disks, and this was a SATA-based iSCSI controller.)
(We saw this behavior with both Solaris 10 and Linux, just to rule out one variable.)
Someone I talked with online suggested that what's happening is that writes are being implicitly parallelized across the disks by the writeback caches on the controller and the disks (and the operating system delaying write-out), whereas the reads aren't. It's easy to see how the writes can be parallelized and done in bulk this way, but why aren't reads also being parallelized?
There are two places that I can see where the whole system can parallelize reads:
- if the operating system issues large read requests to the array, the array could immediately issue requests to multiple disks.
(The operating system can also break the single large read up into multiple SCSI commands and use command tag queuing (CTQ) to issue several of them to the array at once; the array can then distribute them across the disks involved.)
- if the operating system does aggressive enough readahead we'd get at least two simultaneously active requests, which would hopefully hit at least two different disks.
We want the OS to do large readaheads and issue single IO requests that are several times the stripe size of the target (ideally the stripe size times the number of disks, since that means one request can keep all of the disks busy). However, many operating systems have relatively low limits on both readahead and maximum IO request size, and for iSCSI you have to get the RAID controller at the other end to agree on the big numbers too.
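To make the arithmetic concrete, here is a rough Python sketch of how many disks a single read request keeps busy on a RAID 0 target. It assumes that 'stripe size' means the per-disk chunk, and that chunks are laid out round-robin across the disks; the 64 KB chunk size and the helper name `disks_touched` are made up for illustration and are not taken from the controller we tested.

```python
def disks_touched(offset, length, chunk_size, num_disks):
    """Return the set of disk indexes that one RAID 0 read would touch.

    Assumes a simple round-robin layout: chunk N lives on disk N % num_disks.
    """
    if length <= 0:
        return set()
    first_chunk = offset // chunk_size
    last_chunk = (offset + length - 1) // chunk_size
    # A request covering num_disks or more chunks involves every disk.
    if last_chunk - first_chunk + 1 >= num_disks:
        return set(range(num_disks))
    return {chunk % num_disks for chunk in range(first_chunk, last_chunk + 1)}


KB = 1024
CHUNK = 64 * KB   # hypothetical per-disk chunk size
DISKS = 6         # the six-disk RAID 0 case

for request in (64 * KB, 128 * KB, 384 * KB):
    busy = disks_touched(0, request, CHUNK, DISKS)
    print(f"{request // KB} KB read: {len(busy)} of {DISKS} disks busy")
```

With these made-up numbers, a request has to reach 384 KB (the chunk size times six) before it occupies every disk; a lone 64 KB read leaves five of the six disks idle unless readahead or queued commands keep other requests in flight.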
I suppose this is why many vendors ship things with small default stripe sizes; it maximizes the chance that streaming IO from even modestly configured systems (or just programs, for local RAID devices) will span multiple drives. And streaming IO performance is something that people can easily measure, whereas the effects of small stripe size on random IO are less obvious.
(iSCSI performance tuning seems to be one of those somewhat underdocumented areas, which is a bit surprising for something with as many knobs and options as iSCSI seems to have. Raising the 'maximum burst size' on both the iSCSI controller and the Solaris 10 machine got me up to 60 MB/s on streaming bulk reads, but this is still not very impressive, and it may have made writes worse.)