The problem with big RAID-5 arrays
July 13, 2008
Let's start by talking about drive failure modes and how they're measured. Hard drives have both an MTTF (mean time to failure), which describes how often they fail completely, and a UER (unrecoverable error rate), the rate at which they report an unreadable sector. Drive MTTF is expressed in hours, but drive UER is expressed as a function of how much data is read (technically as errors per bit read).
(Typical consumer drive UER is apparently 1 per 10^14 bits read; 'enterprise' disks improve this to 1 per 10^15 bits read. This is of course an average figure, just like the MTTF; you can be lucky, and you can be unlucky.)
The looming problem with big RAID-5 sets is that UER has stayed constant as drive sizes have increased, which means the odds of an unrecoverable read error when you read the entire drive keep rising. Or to put it another way, the odds of an error depend only on how much data you read; the more data you read, the higher the odds.
This matters when a drive in your big RAID-5 set fails. Now you need to reconstruct the array onto a spare drive, which means that you must read all of the data on all of the remaining drives. As you have more and more, and larger and larger, drives, the chance of an unrecoverable read error during reconstruction becomes significant. If you are lucky, your RAID-5 array will report an unreadable sector or stripe when this happens; if you are unlucky, the software will declare the entire array dead.
(To put some actual numbers on this, a UER of 1e-14 errors per bit read means that you can expect on average one error for every 12 or so terabytes read; 10^14 bits is 12.5 decimal terabytes, or about 11.4 binary ones. This is uncomfortably close to the usable size of modern RAID-5 arrays.)
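To make the arithmetic concrete, here is a rough sketch in Python. It assumes that read errors are independent and equally likely for every bit read, which is a simplification of how real drives behave, and the function names and the example array size are mine, not anything from the drive vendors.

```python
import math

UER = 1e-14  # assumed consumer-drive figure, in errors per bit read


def terabytes_per_error(uer):
    """Average number of decimal terabytes read per unrecoverable error."""
    bits_per_error = 1.0 / uer
    return bits_per_error / 8 / 1e12


def rebuild_error_probability(surviving_drives, drive_tb, uer=UER):
    """Chance of at least one unreadable sector while reading every
    surviving drive in full, assuming independent per-bit errors."""
    bits_read = surviving_drives * drive_tb * 1e12 * 8
    # 1 - (1 - uer) ** bits_read, computed without losing precision
    return -math.expm1(bits_read * math.log1p(-uer))


# Hypothetical example: an 8-drive RAID-5 set of 1 TB drives loses one
# drive, so the rebuild has to read the 7 survivors end to end.
print(terabytes_per_error(UER))
print(rebuild_error_probability(7, 1.0))
```

Under this model the rebuild in the example has roughly a 43% chance of hitting at least one unreadable sector. Note that 'one error per 12.5 terabytes on average' is not the same as a guaranteed error at 12.5 terabytes; reading that much gives you about a 63% chance of at least one error.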
The easiest way to deal with this issue is to go to RAID-6, because RAID-6 can recover from a read failure even after you lose a single disk. To lose data, you would need to either lose two disks and have a read failure during the subsequent reconstruction, or lose one disk and have two read failures in the same stripe, which is pretty unlikely. If you stay with RAID-5, you need to keep your arrays small enough that the chance of an unrecoverable read error during reconstruction is sufficiently low. Unfortunately, as raw disk sizes grow larger and larger this means using fewer and fewer disks, which raises the RAID overhead.
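The tradeoff between array size and overhead can be sketched the same way. This is an illustration under the same independent-errors assumption as before; the 10% risk target and the drive sizes are numbers I picked for the example, not recommendations.

```python
import math


def max_raid5_drives(drive_tb, max_risk, uer=1e-14):
    """Largest RAID-5 set (counting the failed drive) whose rebuild has
    at most max_risk chance of an unrecoverable read error.

    A result below 3 means no valid RAID-5 set meets the target."""
    # Solve 1 - (1 - uer) ** bits <= max_risk for the bits we may read.
    max_bits = math.log1p(-max_risk) / math.log1p(-uer)
    bits_per_drive = drive_tb * 1e12 * 8
    surviving = int(max_bits // bits_per_drive)
    return surviving + 1  # the surviving drives plus the one that failed


def parity_overhead(drives):
    """Fraction of raw RAID-5 capacity that goes to parity."""
    return 1.0 / drives


for uer in (1e-14, 1e-15):
    n = max_raid5_drives(1.0, 0.10, uer)
    print(uer, n, parity_overhead(n))
```

With 1 TB drives and a 10% target, the consumer-grade UER does not leave room for even a three-drive RAID-5 set, while the enterprise figure allows fourteen drives at about 7% parity overhead. Halving the drive size to 0.5 TB gets the consumer case up to three drives, at 33% overhead.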
(Disclaimer: I learned all of this from discussions on the ZFS mailing list and associated readings; I am writing it down here partly to make sure that I have it all straight in my head. See e.g. here and here for some interesting reading.)