How ZFS helps out with the big RAID-5 problem
It's time for me to say something nice about ZFS for a change, because ZFS can make the big RAID-5 problem significantly less of a problem for many people. ZFS offers two significant advantages:
- because it knows what parts of the array actually contain live data, it doesn't need to read all of the disks. Less data read means less chance of an unrecoverable read error.
(How much of an improvement this is depends on how full your pools are; if you routinely run with very full pools, you are reading most of your disks anyways.)
- ZFS has mechanisms for identifying and tracking damaged files, so even if you hit an unrecoverable read error you will not lose the entire pool, just the affected file(s). Since ZFS defaults to making multiple copies of filesystem metadata (even in raidz pools), you may not even lose anything if you are lucky enough to have the UER hit a directory or the like, instead of an actual file.
(One reason that many RAID-5 implementations give up and declare the entire array dead if they hit a UER during array reconstruction is that they have no mechanisms for recording that part of the array is damaged; either they pretend that the array is entirely healthy or they kill it entirely, and they opt for the latter for 'safety'. As the chance for a UER during reconstruction rises, this may change.)
I think that the ZFS people would still strongly suggest that you limit your raidz pool sizes, use raidz2, or both, but at least ZFS gives you better odds if you have to run with raidz instead of raidz2.
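To put the first advantage in rough numbers, here is a small sketch of how the chance of hitting a UER during a resilver scales with pool fullness. The UER figure (1 per 10^14 bits read, the typical consumer drive number from the next entry) and the array geometry (four surviving 1 TB drives) are illustrative assumptions:

```python
# Sketch: chance of at least one unrecoverable read error (UER) during
# a resilver, as a function of how full the pool is.  The UER figure and
# the array geometry here are illustrative assumptions, not measurements.
UER = 1e-14  # errors per bit read (typical consumer drive figure)

def p_uer(bytes_read):
    """Probability of hitting at least one UER while reading bytes_read bytes."""
    bits = bytes_read * 8
    return 1 - (1 - UER) ** bits

# A raidz resilver that reads four surviving 1 TB drives; ZFS only has
# to read the fraction of them that actually holds live data.
full_read = 4 * 1e12
for fullness in (0.25, 0.50, 1.00):
    print(f"{fullness:4.0%} full: P(UER) = {p_uer(full_read * fullness):.1%}")
```

Since the error probability is (almost) linear in the amount of data read at these scales, a pool that is only a quarter full cuts your resilver risk by roughly a factor of four.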
(As an aside, it is worth noting that this is one place where RAID-6 is clearly better than RAID-5 plus a hot spare for the same number of disks, as covered in the last entry.)
The problem with big RAID-5 arrays
Let's start by talking about drive failure modes and how they're measured. Hard drives have both MTTF, the rate at which they fail completely, and UER, the rate at which they report an unreadable sector. Drive MTTF is expressed in hours, but drive UER is expressed as a function of how much data is read (technically as 'errors/bits read').
(Typical consumer drive UER is apparently 1 per 10^14 bits read; 'enterprise' disks improve this to 1 per 10^15 bits read. This is of course an average figure, just like the MTTF; you can be lucky, and you can be unlucky.)
The looming problem with big RAID-5 sets is that UER has stayed constant as drive sizes have increased, which means the odds of an unrecoverable read error when you read the entire drive keep rising. Or to put it another way, the odds of an error depend only on how much data you read; the more data you read, the higher the odds.
When this matters is when a drive in your big RAID-5 set fails. Now you need to reconstruct the array onto a spare drive, which means that you must read all of the data on all of the remaining drives. As you have more and more and larger and larger drives, the chance of an unrecoverable read error during reconstruction becomes significant. If you are lucky, your RAID-5 array will report an unreadable sector or stripe when this happens; if you are unlucky, the software will declare the entire array dead.
(To put some actual numbers on this, a UER of 1e-14 errors/bits read means that you can expect on average one error for every 12 or so terabytes read (assuming that I am doing the math right, and I am rounding down a bit). This is uncomfortably close to the usable size of modern RAID-5 arrays.)
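The arithmetic behind that figure is simple enough to check; the ten-drive rebuild at the end is a made-up example:

```python
# Check the '12 or so terabytes per expected error' figure.
UER = 1e-14                      # errors per bit read
bytes_per_error = (1 / UER) / 8  # 10^14 bits between errors, in bytes
print(bytes_per_error / 1e12)    # → 12.5 (terabytes read per expected error)

# So a rebuild that must read, say, ten surviving 1 TB drives expects
# 10 TB / 12.5 TB = 0.8 errors: uncomfortably close to one per rebuild.
expected_errors = (10 * 1e12) / bytes_per_error
print(expected_errors)           # → 0.8
```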
The easiest way to deal with this issue is to go to RAID-6, because RAID-6 can recover from a read failure even after you lose a single disk. To lose data, you would need to either lose two disks and have a read failure during the subsequent reconstruction, or lose one disk and have two read failures in the same stripe, which is pretty unlikely. Otherwise, you need to keep your RAID-5 arrays small enough that the chance of a UER during reconstruction is sufficiently low. Unfortunately, as raw disk sizes grow larger and larger this means using fewer and fewer disks, which raises the RAID overhead.
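One way to see the 'fewer and fewer disks' squeeze is to work out the largest RAID-5 set that keeps the chance of a UER during reconstruction under some target. The 20% target and the drive sizes below are arbitrary assumptions, picked just to show the trend:

```python
# Sketch: largest RAID-5 width that keeps the chance of a UER during
# reconstruction under an (arbitrary) 20% target, and the resulting
# parity overhead.  Drive sizes and the target are assumptions.
UER = 1e-14  # errors per bit read

def p_rebuild_uer(n_drives, drive_bytes):
    """Chance of >= 1 UER while reading the n_drives - 1 surviving members."""
    bits = (n_drives - 1) * drive_bytes * 8
    return 1 - (1 - UER) ** bits

def max_width(drive_bytes, target=0.20):
    """Widest array (at least 2 drives) whose rebuild stays under target."""
    n = 2
    while p_rebuild_uer(n + 1, drive_bytes) <= target:
        n += 1
    return n

for tb in (0.25, 0.5, 1.0):
    n = max_width(tb * 1e12)
    print(f"{tb} TB drives: at most {n} drives, {1 / n:.0%} of them parity")
```

Roughly speaking, doubling the drive size halves the affordable array width, so the fraction of your disks spent on parity keeps climbing; RAID-6 sidesteps the whole squeeze by being able to tolerate the UER itself.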
(Disclaimer: I learned all of this from discussions on the ZFS mailing list and associated readings; I am writing it down here partly to make sure that I have it all straight in my head. See eg here and here for some interesting reading.)