Different ways you can initialize a RAID-[567+] array

February 11, 2017

I was installing a machine today where we're using Linux software RAID to build a RAID-6 array of SATA HDs, and naturally one of the parts of the installation is creating and thus initializing the RAID-6 array. This is not something that goes very fast, and when I wandered past the server itself I noticed that the drive activity lights were generally blinking, not on solid. This got me thinking about various different ways that you might initialize a newly created RAID-N array.

It's obvious, but the reason newly created RAID-N arrays need to be initialized is to make the parity blocks consistent with the data blocks. The array generally starts with drives whose blocks are in some random, unknown state, which means that the parity blocks of a RAID stripe are extremely unlikely to match the data blocks. Initializing the array fixes this in one way or another, so that afterward you know any parity mismatch is due to data corruption somewhere.
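To make 'consistent' concrete, here is a minimal sketch of a single-parity (RAID-5 style) stripe check using XOR; RAID-6 adds a second, Reed-Solomon parity block that I'm omitting for brevity:

```python
from functools import reduce
import os

def xor_blocks(blocks):
    """XOR equal-sized byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

def stripe_consistent(data_blocks, parity_block):
    """A stripe is consistent when the parity equals the XOR of its data."""
    return xor_blocks(data_blocks) == parity_block

# A freshly created array: data and parity blocks are both random junk,
# so the stripe is almost certainly inconsistent.
data = [os.urandom(16) for _ in range(4)]
print(stripe_consistent(data, os.urandom(16)))  # almost certainly False

# Initialization computes the real parity and writes it out.
parity = xor_blocks(data)
print(stripe_consistent(data, parity))          # True
```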

The straightforward way to initialize a RAID-N array is to read the current state of all of the data blocks for each stripe, compute the parity blocks, and write them out. This approach does the minimum possible write IO to any particular drive, but it has the drawback that it sends an interleaved mixture of read and write IO to all drives, which may slow them down and force seeking. This happens because the parity blocks are normally distributed over all of the drives, rotating from drive to drive with each stripe; every drive will have parity blocks written to it, so no drive sees purely sequential read or write IO.
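The read-then-write pass above can be sketched for single parity, with each drive modeled as a list of blocks. The parity rotation here is illustrative; real implementations (Linux md included) support several layouts, such as left-symmetric:

```python
from functools import reduce

def init_by_reading(drives):
    """Read each stripe's data blocks, compute parity, write it back.
    drives: a list of drives, each a list of equal-sized byte blocks."""
    n = len(drives)
    stripes = len(drives[0])
    for s in range(stripes):
        pdrive = (n - 1 - s) % n            # parity rotates drive to drive
        data = [drives[d][s] for d in range(n) if d != pdrive]
        parity = bytes(reduce(lambda a, b: a ^ b, col)
                       for col in zip(*data))
        drives[pdrive][s] = parity          # the only write IO per stripe
```

Note that each drive takes exactly one (parity) write per n stripes, which is what makes this the minimum-write scheme, at the price of mixed read/write traffic everywhere.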

A clever way to initialize the array is to create it as a degraded array and then add new disks. If you have an M disk array with N-way parity, create the array with M-N disks active. This has no redundancy and thus no need to resynchronize the redundancy to be correct. Now add N more disks, and let your normal RAID resynchronization code go into effect. You'll read whatever random stuff is on those first M-N disks, assume it's completely correct, reconstruct the 'missing' data and parity from it, and write it to the N disks. The result is random garbage, but so what; it was always going to be random garbage. The advantage here is that you should be sequentially reading from the M-N disks and sequentially writing to the N disks, and disks like simple sequential read and write IO. You do however write over all of the N disks, and you still spend the CPU to do the parity computation for every RAID stripe.
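For the single-parity case, the resynchronization step amounts to an ordinary rebuild: read the present drives sequentially and reconstruct the added drive as their XOR, exactly as recovery after a disk failure would. A hedged sketch (the block-list model of a drive is mine, not md's):

```python
from functools import reduce

def rebuild_added_drive(present_drives):
    """present_drives: equal-length block lists, one per present drive.
    Returns the block list to write to the newly added drive; the
    contents are garbage, but garbage consistent with the parity."""
    return [bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*stripe))
            for stripe in zip(*present_drives)]
```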

The final way I can think of is to explicitly blank all the drives. You can pre-calculate the parity blocks for a stripe whose data blocks are all zeros, then build appropriately large write IOs for each drive that interleave zeroed data blocks and the rotating parity blocks, and finally blast these out to all of the drives as fast as each one can write. There's no need to do any per-stripe computation or any read IO. The cost of this is that you overwrite all of every disk in the array.
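As it happens, the pre-calculated parity of an all-zero stripe is itself all zeros (both the XOR parity and RAID-6's Reed-Solomon parity of zeros are zero), so in practice this collapses to streaming zeros over every drive. A minimal sketch, where the device path and chunk size are arbitrary illustrative choices:

```python
CHUNK = 1024 * 1024  # write in 1 MiB chunks; the size is an arbitrary choice

def blank_drive(path, size_bytes):
    """Overwrite a drive (or image file) with zeros in large sequential
    writes; no per-stripe computation or read IO is needed."""
    zeros = bytes(CHUNK)
    with open(path, "wb") as f:     # path would be a device node in practice
        remaining = size_bytes
        while remaining > 0:
            n = min(CHUNK, remaining)
            f.write(zeros[:n])
            remaining -= n
```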

(If you allow people to do regular IO to a RAID array being initialized, each scheme also needs a way to preempt itself and handle writes to a random place in the array.)

In a world with both HDs and SSDs, I don't think it's possible to say that one approach is right and the other approaches are wrong. On SSDs seeks and reads are cheap, writes are sometimes expensive, and holding total writes down will keep their lifetimes up. On HDs, seeks are expensive, reads are moderately cheap but not free, writes may or may not be expensive (depending in part on how big they are), and we usually assume that we can write as much data to them as we want with no lifetime concerns.

PS: There are probably other clever ways to initialize RAID-N arrays; these are just the three I can think of now.

(I'm deliberately excluding schemes where you don't actually initialize the RAID array but instead keep track of which parts have been written to and so have had their parity updated to be correct. I have various reactions to them that do not fit in the margins of this entry.)

PPS: The Linux software RAID people have a discussion of this issue from 2008. Back then, RAID-5 used the 'create as degraded' trick, but RAID-6 didn't; I'm not sure why. There may be some reason it's not a good idea.
