2008-10-26
Why RAID-1 is the right choice for our new fileservers
Our old SAN was set up in the traditional way: the SAN backend units did RAID-5 internally, and this RAID-5 space was carved up into LUNs and used by the frontend fileservers. For natural reasons all of our fileservers wound up using LUNs from all of our SAN backends. This setup has low overhead, decent resilience, and decent performance. Our new fileservers are set up in an entirely different way, and among other things they use RAID-1 instead of RAID-5. Although we were driven to adopt a RAID-1 approach by other issues, it has turned out to be entirely the right choice (despite the space overhead).
The problem with our old environment was what I will call 'IO contamination'. In practice, any substantial IO to any LUN on any of the RAID-5 arrays touches all of the disks in the array, which means that it contends with and affects any other IO happening to any other LUN on the array. This is especially important because multiple streams of IO are quite likely to force seeks, and the weak point of all current disks is how many separate IO operations per second they can sustain. Thus, since all of our fileservers used each backend, significant IO load on one LUN on one array could slow down many filesystems on all fileservers.
(The most glaring place where this showed up was in parallelizing backups. Balancing the IO load was basically impossible, so we had to just hope for the best by telling Amanda to run a few backups per fileserver.)
Did I mention that the different filesystems were owned and used by all sorts of different research groups and professors?
The great advantage of RAID-1 for us is that it makes IO traffic for different things genuinely separate; with only a few exceptions (such as if we max out a network port's bandwidth), IO to one group's space doesn't affect IO to another group's space. In practice, this independence gives everyone significantly better performance (and it has certainly sped up our backups a lot). And if there ever are performance problems, figuring out the cause is going to be much easier, because it will actually be possible to work backwards from 'hot' disks to find whatever is creating the load.
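To make the contrast between shared arrays and separate mirrors concrete, here is a toy model. It is only a sketch: the per-disk IOPS figure, the group names, the demand numbers, and the assumption that an oversubscribed array slows everyone down proportionally are all made up for illustration, and it ignores seeks, caching, and write overhead entirely.

```python
# Toy model only: every number here is hypothetical, and real arrays do not
# share IOPS this neatly.

DISK_IOPS = 100  # assumed random IOs per second that one disk can sustain


def shared_array(demands, n_disks):
    """All groups' LUNs sit on the same disks, so they draw on one IOPS pool."""
    pool = DISK_IOPS * n_disks
    total = sum(demands.values())
    if total <= pool:
        return dict(demands)
    # Oversubscribed: assume every group is slowed down proportionally.
    return {group: want * pool / total for group, want in demands.items()}


def separate_mirrors(demands):
    """Each group has its own two-disk mirror and can only saturate that."""
    # Optimistic read-only case: either side of the mirror can serve a read.
    cap = DISK_IOPS * 2
    return {group: min(want, cap) for group, want in demands.items()}


# Three hypothetical groups on six disks in total, one of them running
# something IO-heavy.
demands = {"backup-heavy": 1000, "group-b": 100, "group-c": 100}
shared = shared_array(demands, n_disks=6)
isolated = separate_mirrors(demands)
```

In this model the two quiet groups get only half of the IO they ask for on the shared array, because the heavy group drags everyone down with it. On separate mirrors the heavy group is limited to what its own two disks can do, and the quiet groups get everything they ask for.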
The drawback of RAID-1 is that it does cost more. Fortunately the cost of disks is dropping all the time, especially if you build your SAN out of commodity hardware.
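How much more it costs is simple arithmetic. As a rough sketch, the following counts the disks needed for a given amount of usable space; the disk size, the target capacity, and the six-disk RAID-5 array width are assumptions picked for illustration and are not our actual configuration.

```python
import math


def raid5_disks_needed(usable_tb, disk_tb, array_width):
    """Disks needed when each RAID-5 array gives up one disk's worth to parity."""
    data_per_array = (array_width - 1) * disk_tb
    arrays = math.ceil(usable_tb / data_per_array)
    return arrays * array_width


def raid1_disks_needed(usable_tb, disk_tb):
    """Disks needed when every data disk has one mirror."""
    return 2 * math.ceil(usable_tb / disk_tb)


# Hypothetical example: 10 TB usable, built from 1 TB disks.
raid5_count = raid5_disks_needed(10, 1, array_width=6)  # two 6-disk arrays
raid1_count = raid1_disks_needed(10, 1)                 # ten mirrored pairs
```

With these made-up numbers, RAID-5 needs 12 disks and RAID-1 needs 20. The wider the RAID-5 arrays, the closer the ratio gets to two to one.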
(Since we are basically doing random IO, RAID-1 also has a straightforward performance advantage; it significantly increases how many spindles we have, and the rule for random IO is that the more spindles you have, the better.)
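The spindle argument can be put into rough numbers with the usual rule-of-thumb model, where a RAID-1 write costs two disk IOs and a small RAID-5 write costs four (read old data, read old parity, write new data, write new parity). This is a back-of-the-envelope sketch and not a measurement: the per-disk IOPS figure, the disk counts, and the read/write mix are all assumed values.

```python
def effective_iops(n_disks, disk_iops, read_fraction, write_penalty):
    """Rule-of-thumb random IOPS an array can deliver to its clients.

    Each client read costs one disk IO; each client write costs
    `write_penalty` disk IOs (2 for RAID-1, 4 for small RAID-5 writes).
    """
    raw = n_disks * disk_iops
    cost_per_client_io = read_fraction + (1 - read_fraction) * write_penalty
    return raw / cost_per_client_io


# Hypothetical setup: the same usable space as 12 disks of RAID-5 or
# 20 disks of RAID-1, 100 IOPS per disk, half reads and half writes.
raid5_iops = effective_iops(12, 100, read_fraction=0.5, write_penalty=4)
raid1_iops = effective_iops(20, 100, read_fraction=0.5, write_penalty=2)
```

Under these assumptions the RAID-5 setup comes out at 480 client IOs per second and the RAID-1 setup at roughly 1333. Part of the gap comes from having more spindles and part from the cheaper writes.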