There are two different scenarios for replacing disks in a RAID

September 16, 2015

One possible reply to my annoyance at btrfs being limited to two-way mirrors is to note that btrfs, like many RAID systems, allows you to explicitly replace disks. While btrfs is in the process of doing this, it maintains three-way redundancy (as do some but not all RAID systems); only at the end, with the new disk fully up and running, does it drop the old disk. If something goes wrong with the new disk during this process you are presumably no worse off than before. This is certainly better than the alternative, but it's not great because it misses one use case. You see, there are two different scenarios for replacing disks here.

In the first scenario you are replacing a dying disk (or at least what you think is one). You don't trust it and the sooner you get data off it the better. As a result, a new unproven disk is strictly better than the old disk because at least the new disk (probably) isn't about to die. Discarding the old disk the moment all the data is fully copied to the new disk is perfectly fine; you trust the new disk at least as much as the old disk.

In the second scenario you are replacing a (currently) good disk with what you think is a better one; it has more space, it is more modern, it is an SSD instead of a HD, whatever. However, this new disk is unproven. It could have infant mortality, bad performance, or just some sort of compatibility problem in your environment. You trust it less than the old proven disk (which is known to work fine) and so you really don't want to just discard the old disk once the data is fully copied to the new disk. You want the new disk to prove itself for a while before you fully trust it and you want to preserve your redundancy while that trust is being built (or lost, if there are problems).

It is generally the second disk replacement scenario where people want persistent N-way redundancy. Certainly it's where I want it. N-way redundancy during the data copy from the old drive to the new drive is not good enough, because the new drive doesn't really get proven enough during the copy alone.
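To make the difference concrete, here is a minimal sketch that counts how many proven disks still hold your data if the new disk fails. Everything in it is hypothetical and for illustration only: the function name `proven_disks`, the phase names `"copying"` and `"burn-in"`, and the `keep_old_disk` flag are my own labels, not anything btrfs or another RAID system exposes.

```python
def proven_disks(mirror_width: int, phase: str, keep_old_disk: bool) -> int:
    """Disks still holding the data if the unproven new disk fails in this phase.

    mirror_width is the normal number of mirror members (2 for a two-way mirror).
    The new disk is never counted, since we are asking what is left without it.
    """
    if mirror_width < 1:
        raise ValueError("a mirror needs at least one disk")
    if phase == "copying":
        # The old disk is still a member while data copies to the new disk.
        return mirror_width
    if phase == "burn-in":
        # After the copy, the old disk only helps if we kept it attached.
        return mirror_width if keep_old_disk else mirror_width - 1
    raise ValueError(f"unknown phase: {phase!r}")


# Two-way mirror, replace-style swap: the old disk is dropped after the copy.
print(proven_disks(2, "copying", keep_old_disk=False))
print(proven_disks(2, "burn-in", keep_old_disk=False))

# Two-way mirror, persistent three-way mirror while the new disk proves itself.
print(proven_disks(2, "burn-in", keep_old_disk=True))
```

With a two-way mirror, a replace-style swap leaves you with a single proven disk for the whole burn-in period, which is exactly the window where you trust the new disk the least.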

Unfortunately the second scenario probably works best with mirroring. It's my view that good RAID-[56] systems should have some way to have a component device that's actually two devices paired together, but people are unlikely to want to run this way in routine operation for long.

(A RAID-[56] system that supports a true 'replace' operation needs the ability to run two disks in parallel for a while as data copies over. Ideally it would be doing reads from the new disk as well as from the old disk just in case the new disk writes fine but has problems on reads.)
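As a rough illustration of what such a paired component could look like, here is a toy in-memory model. It is a sketch under my own assumptions, not how any real RAID implementation works: the class `PairedDevice`, its methods, and the `read_mismatches` counter are all hypothetical names. Writes go to both disks, and every read is done from both so that a new disk with read problems gets noticed while the old disk is still there to supply the data.

```python
class PairedDevice:
    """Hypothetical RAID component made of an old and a new disk run in parallel."""

    def __init__(self, nblocks: int) -> None:
        # Each "disk" is just a list of blocks in this toy model.
        self.old = [b""] * nblocks
        self.new = [b""] * nblocks
        self.read_mismatches = 0

    def write(self, blockno: int, data: bytes) -> None:
        # Every write goes to both disks so they stay in sync.
        self.old[blockno] = data
        self.new[blockno] = data

    def read(self, blockno: int) -> bytes:
        # Read from both disks; the old, proven disk is treated as the
        # authority, and any disagreement is counted against the new disk.
        old_data = self.old[blockno]
        new_data = self.new[blockno]
        if new_data != old_data:
            self.read_mismatches += 1
        return old_data


dev = PairedDevice(nblocks=4)
dev.write(1, b"hello")
assert dev.read(1) == b"hello"

# Simulate the new disk returning bad data for a block it accepted fine.
dev.new[1] = b"garbage"
assert dev.read(1) == b"hello"
assert dev.read_mismatches == 1
```

Treating the old disk as authoritative is a choice made for this sketch; a real system would more likely lean on checksums or parity to decide which side is right.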
