Counterintuitive RAID read performance

March 23, 2007

While doing performance tests on an iSCSI RAID controller, we recently turned up some unexpected results: the controller could write significantly faster than it could read, to both RAID 5 and RAID 0 targets. In one case, a six-disk RAID 0 target could do streaming writes 20 megabytes/second faster than it could do streaming reads (75 MB/s write versus 55 MB/s read). This surprised me a lot, because I usually expect reads to run faster than writes. (It's certainly the case on single SATA disks, and this was a SATA-based iSCSI controller.)

(We saw this behavior with both Solaris 10 and Linux, just to rule out one variable.)

Someone I talked with online suggested that what's happening is that writes are being implicitly parallelized across the disks by the writeback caches on the controller and the disks (and the operating system delaying write-out), whereas the reads aren't. It's easy to see how the writes can be parallelized and done in bulk this way, but why aren't reads also being parallelized?

There are two places where the whole system can parallelize reads, as far as I can see (a rough sketch of the stripe arithmetic follows the list):

  • if the operating system issues large read requests to the array, the array could immediately issue requests to multiple disks.

    (The operating system can also break the single large read up into multiple SCSI commands and use CTQ to issue several of them at once to the array, which can then distribute them around the disks involved.)

  • if the operating system does aggressive enough readahead we'd get at least two simultaneously active requests, which would hopefully hit at least two different disks.
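
To make the stripe arithmetic concrete, here's a rough Python sketch of the layout I'm assuming: plain RAID 0, with 512 KB chunks handed out round-robin across six disks, matching this array's configuration. The chunks_for_read helper is purely illustrative; real controllers may lay data out differently.

    # Simplified RAID 0 model: 512 KB chunks assigned round-robin across
    # six disks (the configuration of the array discussed above).
    STRIPE_SIZE = 512 * 1024
    NUM_DISKS = 6

    def chunks_for_read(offset, length):
        """Yield (disk, offset_on_disk, chunk_length) for one read request."""
        end = offset + length
        while offset < end:
            stripe = offset // STRIPE_SIZE       # which stripe-sized chunk
            disk = stripe % NUM_DISKS            # round-robin disk choice
            within = offset % STRIPE_SIZE        # position inside the chunk
            chunk_len = min(STRIPE_SIZE - within, end - offset)
            yield disk, (stripe // NUM_DISKS) * STRIPE_SIZE + within, chunk_len
            offset += chunk_len

    # A 512 KB read lands on a single disk (two if it's unaligned), while a
    # 3 MB read splits into six chunks, one per disk, so all six spindles
    # can be working on it at once.
    for disk, disk_offset, size in chunks_for_read(0, 3 * 1024 * 1024):
        print(disk, disk_offset, size)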

We want the OS to do large readaheads and issue single IO requests that are several times the stripe size of the target (ideally the stripe size times the number of disks, since that means one request can busy all of the disks). However, many operating systems have relatively low limits on these, and for iSCSI you have to get the RAID controller at the other end to agree on the big numbers too.
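
As a back-of-the-envelope check (again assuming the six-disk, 512 KB stripe configuration of this array), the arithmetic looks like this; the 128 KB figure is Linux's usual default readahead, and other systems have their own limits:

    stripe_size = 512 * 1024      # bytes per stripe chunk on this array
    num_disks = 6

    # A single read has to cover a full stripe width before it can keep
    # every disk busy at once.
    ideal_read = stripe_size * num_disks
    print("ideal single read: %d KB" % (ideal_read // 1024))   # 3072 KB

    # By comparison, a typical 128 KB readahead doesn't even cover one
    # stripe chunk on one disk here.
    default_readahead = 128 * 1024
    print("fraction of one chunk: %.2f" % (default_readahead / stripe_size))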

I suppose this is why many vendors ship things with small default stripe sizes; it maximizes the chance that streaming IO from even modestly configured systems (or just programs, for local RAID devices) will span multiple drives. And streaming IO performance is something that people can easily measure, whereas the effects of small stripe size on random IO are less obvious.
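
As a quick illustration of that, here's the same sort of arithmetic comparing this array's 512 KB stripe size with a much smaller 64 KB one; the 256 KB request size is just an arbitrary stand-in for what a modestly configured system might issue, and disks_spanned is my own made-up helper.

    NUM_DISKS = 6

    def disks_spanned(request_size, stripe_size):
        # An aligned request spans ceil(request / stripe) stripe chunks
        # (an unaligned one can touch one more), capped at the disk count.
        chunks = -(-request_size // stripe_size)
        return min(chunks, NUM_DISKS)

    for stripe_kb in (512, 64):
        print("%3d KB stripe: a 256 KB read spans %d disk(s)"
              % (stripe_kb, disks_spanned(256 * 1024, stripe_kb * 1024)))
    # 512 KB stripe -> 1 disk; 64 KB stripe -> 4 disks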

(iSCSI performance tuning seems to be one of those somewhat underdocumented areas, which is a bit surprising for something with as many knobs and options as iSCSI seems to have. Tuning up the 'maximum burst size' on the iSCSI controller and the Solaris 10 machine got me up to 60 MB/s on streaming bulk reads, but this is still not very impressive, and it may have made writes worse.)


Comments on this page:

By Dan.Astoorian at 2007-03-26 11:18:11:

I don't know what your testing methodology is, but below a certain threshold it seems quite natural for writes to be faster than reads to a hardware RAID controller: the controller can signal successful completion of a write operation as soon as the data has been written to the battery-backed cache (provided the cache has room to hold it); successful completion of a read operation obviously requires real I/O if the data is not already cached.

If your write tests are too brief, the boost you get until the cache fills up may throw off your numbers significantly.

I believe most controllers have a setting to control whether a write operation may return upon write to cache (write-back) or whether it must commit the data to disk before returning (write-through). It would be interesting to know whether your writes outperform your reads by a similar margin if the controller is set to use a write-through cache.

--Dan

By cks at 2007-03-26 13:50:31:

I did the tests with a 50 gigabyte file (write 50 GB to it, read the 50 GB back, and so on), partly to eliminate caching effects, and the data rates reported were consistent over the entire run.

Cranking down the stripe size to 64 KB (from 512 KB) resulted in reads getting the same performance as writes. I'm going to try it with an 8 KB stripe size as soon as I finish some other tests and the array stabilizes.
