2012-02-28
In place migration to Linux software RAID
Suppose that you have an existing system that is operating without mirrored disks and you want to change that; you want to add a second physical disk to the system and then wind up with software RAID mirroring of appropriate filesystems. This generally goes by the name of 'in-place migration'. Knowledge of how to do this used to be more common because back in the old days, distribution installers couldn't set up mirrored system disks during installation; these days installers can and so needing to do this by hand is much rarer.
(In place migration is easy with software RAID implementations that store the metadata 'out of band', outside of the disk space being mirrored. Solaris DiskSuite can more or less trivially do in-place migration to mirrors, with only a minor pause to remount most filesystems. Unfortunately for us, Linux software RAID is not such a thing; it stores its metadata 'in-band', at either the start or the end of the partition being mirrored.)
When this question came up here recently, I said that there are two ways to do in place migration: the traditional, well tested approach that everyone used to use and a theoretically possible approach that I at least have never tested. The well tested approach is not quite literally in-place; the theoretical one is, but is trickier and untested.
The traditional approach goes like this:
- arrange to have an identical partition on your second disk. The traditional way to do this is to use identical disks and copy the partition table from the first disk to the second with sfdisk.
- create a mirror using only the second disk's partition and a missing (aka failed) device, for example 'mdadm -C /dev/md0 -l raid1 -n 2 /dev/sdb3 missing'. Make very, very sure that you are using the second disk's partition for this, not the first disk. The first disk should not be mentioned anywhere in the command line.
- mkfs the new mirror and mount it somewhere; we usually used /mnt.
- copy the existing filesystem to the mirror using the tool of your choice. I prefer to use dump and restore, but tastes differ.
- edit /etc/fstab to mount the filesystem from the mirror.
- unmount the current filesystem and immediately remount the mirror in its place. (Doing this for the root filesystem requires a reboot, among other things, and is outside the scope of this entry.)
- hot-add the old filesystem's partition on the first disk to the mirror, for example 'mdadm -a /dev/md0 /dev/sda3'. Since you're adding a new device to a mirror, the mirror resyncs onto the new device. You can watch the progress of the resync in /proc/mdstat, and on modern systems you may get email from mdadm when it finishes. (A consolidated command sketch of the whole sequence follows this list.)
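Pulled together in one place, the sequence might look roughly like the sketch below. The device names (/dev/sda3, /dev/sdb3, /dev/md0), the /data mount point, and the ext4 filesystem are all illustrative assumptions, not anything from a real system.

    # assumed layout: /dev/sda3 holds the live filesystem (mounted on /data),
    # /dev/sdb3 is the matching partition on the new second disk.
    sfdisk -d /dev/sda | sfdisk /dev/sdb                 # copy the partition table
    mdadm -C /dev/md0 -l raid1 -n 2 /dev/sdb3 missing    # degraded mirror, second disk only
    mkfs -t ext4 /dev/md0                                # filesystem type is up to you
    mount /dev/md0 /mnt
    # copy the data with your tool of choice, eg dump and restore:
    dump -0 -f - /data | (cd /mnt && restore -rf -)
    # edit /etc/fstab so /data is mounted from /dev/md0, then swap the mounts:
    umount /mnt
    umount /data && mount /data
    # finally, hot-add the old partition and let the mirror resync:
    mdadm -a /dev/md0 /dev/sda3
    cat /proc/mdstat                                     # watch the resync progress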
Back in the days of Ubuntu 6.06 and similar systems we did this a lot and we never had problems (at least if we weren't shuffling filesystems around at the same time). This is not quite in-place because it involves copying the filesystem, and on a sufficiently busy filesystem it may be troublesome to get a complete and accurate copy (eg, you may need an extended downtime to halt all other activity on the filesystem).
The theoretical way that is fully in-place is to set up your new partition on your new disk and then do something like this:
- shrink the filesystem so that there is enough space for software RAID metadata at the end of the partition. You will need to experiment to find out exactly how much space the metadata needs.
- unmount the filesystem.
- create a software RAID mirror with a single mirror, using a format
with metadata at the end of the partition.
In theory this doesn't write to anything except the metadata area. Note that I have neither tested nor verified that this is true in practice; that's why this is a theoretical way. You will want to test the heck out of this (probably in a virtual machine).
- change /etc/fstab to mount the filesystem from the mirror and remount it.
- hot-add the partition on the second disk and let the mirror resync onto it.
Depending on how long it takes to shrink the filesystem and whether or not it can be done live, this may require less downtime than the other approach.
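For concreteness, here is a very hedged, untested sketch of what this might look like. Every device name, size, and mount point is a made-up illustration; the 1.0 metadata format is used because it is the modern format that puts the superblock at the end of the partition.

    # /dev/sda3 currently holds the live filesystem (mounted on /data);
    # /dev/sdb3 is its twin partition on the new disk.
    umount /data
    resize2fs /dev/sda3 99G        # shrink enough to leave room for the RAID metadata
                                   # at the end; the exact amount needs experimenting
    mdadm -C /dev/md0 -l raid1 -n 2 -e 1.0 /dev/sda3 missing   # 1.0 = metadata at the end
    # edit /etc/fstab so /data is mounted from /dev/md0, then:
    mount /data
    mdadm -a /dev/md0 /dev/sdb3    # hot-add the new disk's partition and let it resync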
I doubt I'll ever use the theoretical way. While it's vaguely neat,
it'll clearly take a bunch of work and testing to develop into something
that can be used for real, and it has only marginal advantages over the
tried and true way (especially on extN filesystems, where resize2fs
can only shrink unmounted filesystems).
2012-02-15
The temptation of LVM mirroring
One of the somewhat obscure things that LVM can do is mirroring. If you mention this, most people will probably ask why on earth you'd want to use it; mirroring is what software RAID is for, and then you can stack LVM on top if you want to. Well, yes (and I agree with them in general). But I have an unusual situation that makes LVM mirroring very tempting right now.
The background is that I'm in the process of migrating my office workstation from a pair of old 320 GB drives to a pair of somewhat newer 750 GB drives, and it's reached the time to move my LVM setup to the 750s (it's currently on a RAID-1 array on the two 320s). There are at least three convenient ways of doing this:
- add the appropriate partitions from the 750s as two more mirrors to the existing RAID-1 array. There are at least two drawbacks to this: I can't change the RAID superblock format, and growing the LVM volume afterwards so that I can actually use the new space will be somewhat of a pain. (I suppose that a single pvresize is not that much of a pain, provided that it works as advertised.)
- create a new RAID-1 on the 750s, add it as a new physical volume, and pvmove from the old RAID-1 physical volume to the new RAID-1 PV. (I did pilot trials of pvmove in a virtual machine and it worked fine even with a significant IO load on the LVM group being moved, which gives me the confidence to think about this even after my bad experience many years ago.)
- as above, but set up LVM mirroring between the old and the new disks instead of immediately pvmove'ing to the new disks and using them alone. (Done with the right magic this might leave an intact, usable copy of all of the logical volumes behind on the 320 GB drives when I finally break the mirror.) A rough command sketch of the second and third options follows this list.
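To make the second and third options concrete, here is a rough sketch. The volume group name ('main'), the logical volume name ('homes'), and the md device names are all assumptions for illustration, not my actual setup.

    # second option: add the new RAID-1 as a PV and pvmove everything over
    pvcreate /dev/md20
    vgextend main /dev/md20
    pvmove /dev/md10 /dev/md20      # old PV -> new PV; can run with the LVs in use
    vgreduce main /dev/md10         # drop the old PV afterwards

    # third option: mirror each logical volume onto the new PV instead
    lvconvert -m 1 --mirrorlog core main/homes /dev/md20
    # ...later, once the 750s have proven themselves:
    lvconvert -m 0 main/homes /dev/md10    # drop the mirror leg on the old PV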
The drawback of the second approach is that if the 750 GB drives turn
out to be flaky or have errors, I don't have a quick and easy way to go
back to the current situation; I would have to re-pvmove back in the
face of disk problems. And, to make me nervous, I already had one 750
become flaky after it was just sitting in my machine for a bit of time.
(I've already changed to having the root filesystem on the new drives, but I have an acceptable fallback for that and anyways it's less important than my actual data.)
The drawback of the third approach is that I would have to trust LVM mirroring, which is undoubtedly far less widely used than software RAID-1. But it's temptingly easier (and better) than just adding two more mirrors to the current RAID-1 array. If it worked without problems, it would clearly be the best answer; it has the best virtues of both of the other two solutions.
(This would be a terrible choice for a production server unless we really needed to change the RAID superblock format and couldn't afford any downtime. But this is my office workstation, so the stakes are lower.)
I suppose the right answer is to do a trial run of LVM mirroring in a
virtual machine, just as I did a pilot run of pvmove. The drawback of
that is having to wait longer to migrate to the 750s; ironically, a
significant reason for the migration is so that I can have more space
for virtual machine images.
2012-02-08
Choosing the superblock format for Linux's software RAID
Linux's software RAID implementation stores metadata about the RAID
device in each physical device involved in the RAID, in what mdadm
calls 'RAID superblocks' by analogy to the filesystem superblocks that
describe filesystems. In modern versions of software RAID there are a
number of different formats for these RAID superblocks with different
tradeoffs involved in each one, and one of the decisions you need to
make when you create a software RAID array is what format you want to
use.
(Even if you don't actively make a decision, mdadm will pick a format
for you. Sometimes it will whine irritatingly at you about the situation,
which is how I discovered the whole issue.)
In my opinion, at the moment there are three sensible options to choose from: the 0.90 format and then two variants of the 'version-1' metadata format.
- 0.90 is the original metadata format, which is widely understood
and used. For most people, the most potentially important
limitation of 0.90 metadata is that component devices can't be
larger than 2 TB.
The 0.90 superblock goes at the end of the underlying partition.
- 1.0 puts the superblock at the end of the underlying partition.
- 1.2 puts the superblock 4 KB from the start of the underlying partition. It's the sort of default for modern versions of mdadm.
(You can see what format your current RAID arrays are using by looking
at /proc/mdstat. If an array doesn't say 'super <something>' it's
using 0.90 format metadata; otherwise, it's using whatever version it
says it is. Many relatively modern systems, such as Ubuntu 10.04, either
don't support anything past 0.90 or default to 0.90 in system setup.)
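For instance, a quick way to check (with /dev/md0 as an illustrative array name):

    cat /proc/mdstat                         # 'super 1.2' and so on means version-1;
                                             # no 'super' notation means 0.90 metadata
    mdadm --detail /dev/md0 | grep Version   # also reports the metadata version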
Where the superblock goes is potentially important for RAID-1 arrays. A RAID-1 array with the superblock at the end can relatively easily have whatever filesystem it contains mounted read-only without the RAID running, because the filesystem will start at the start of the underlying raw partitions; this can be important sometimes. A RAID-1 array with the superblock at or near the start of the underlying partitions can't have the raw partitions used this way, because you have to look somewhat beyond the start of the raw partition to see the filesystem.
(Some versions of mdadm will explicitly warn you about this or even
quiz you about it if you don't specify a format explicitly.)
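As a small illustration of what this buys you (the device name and mount point are assumptions), with end-of-partition metadata and the array not assembled you can usually get at the data directly:

    # works for 0.90 and 1.0 metadata because the filesystem starts at the
    # very start of the partition; do this read-only, with the array stopped
    mount -o ro /dev/sdb3 /mnt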
If you want to use a modern format and are going to directly use the
RAID-1 array for a filesystem, I would use 1.0 format (this is what
I've done for my new / and /boot). For swap areas you might as well
use 1.2 format; if you ever need to use swap without software RAID, you
can just destroy the 1.2 superblocks with mkswap. For LVM physical
volumes you can argue back and forth either way; right now I've chosen
1.2 format because I really don't want to think about what it would take
to safely bring up an LVM physical volume without software RAID running.
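In concrete terms that works out to something like the following (all device names are illustrative):

    # RAID-1 holding a filesystem directly: metadata at the end
    mdadm -C /dev/md1 -l raid1 -n 2 -e 1.0 /dev/sda5 /dev/sdb5
    # RAID-1 for swap or an LVM physical volume: 1.2 metadata is fine
    mdadm -C /dev/md2 -l raid1 -n 2 -e 1.2 /dev/sda6 /dev/sdb6
    # if you ever need the swap partition without software RAID, just re-mkswap it:
    mkswap /dev/sda6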
(LVM physical volumes have their own metadata, which normally goes at
the start of the 'raw' partition that LVM is using but which can be
replicated to the end as well. See pvcreate's manpage.)
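If you want that replication, pvcreate can do it at creation time; a one-line example (the device name is illustrative):

    pvcreate --pvmetadatacopies 2 /dev/md2    # keep a second metadata copy at the end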
As far as I know you can't change the superblock format of an array after it has been created, at least not without destroying it and recreating it. You can sort of do this without an extra disk with sufficient work, but really you want to get it right at creation time.
PS: note that in theory you can use dmsetup to gain access to
filesystems or other sorts of data that doesn't begin at the start of
a raw partition, so you can get at a filesystem embedded inside the
raw partition of a RAID-1 array with 1.2 format metadata. However this
requires user level intervention, which means that you're going to need
a rescue environment or rescue disk of some sort.
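Here is a rough, untested sketch of that dmsetup approach. The component device name and the 2048-sector data offset are assumptions; check the real offset with mdadm -E before trusting any of the numbers.

    # /dev/sdb3 is a component of a RAID-1 with 1.2 format metadata (illustrative name)
    mdadm -E /dev/sdb3 | grep 'Data Offset'       # find the real data offset, in sectors
    OFFSET=2048                                   # assumed here
    SIZE=$(blockdev --getsz /dev/sdb3)            # partition size in 512-byte sectors
    echo "0 $((SIZE - OFFSET)) linear /dev/sdb3 $OFFSET" | dmsetup create raid-inner
    mount -o ro /dev/mapper/raid-inner /mnt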