Wandering Thoughts archives

2012-02-28

In place migration to Linux software RAID

Suppose that you have an existing system that is operating without mirrored disks and you want to change that; you want to add a second physical disk to the system and then wind up with software RAID mirroring of appropriate filesystems. This generally goes by the name of 'in-place migration'. Knowledge of how to do this used to be more common because back in the old days, distribution installers couldn't set up mirrored system disks during installation; these days installers can and so needing to do this by hand is much rarer.

(In place migration is easy with software RAID implementations that store the metadata 'out of band', outside of the disk space being mirrored. Solaris DiskSuite can more or less trivially do in-place migration to mirrors, with only a minor pause to remount most filesystems. Unfortunately for us, Linux software RAID is not such a thing; it stores its metadata 'in-band', at either the start or the end of the partition being mirrored.)

When this question came up here recently, I said that there are two ways to do in place migration: the traditional, well tested approach that everyone used to use and a theoretically possible approach that I at least have never tested. The well tested approach is not quite literally in-place; the theoretical one is, but is trickier and untested.

The traditional approach goes like this (a consolidated command sketch follows the list):

  1. arrange to have an identical partition on your second disk. The traditional way to do this is to use identical disks and copy the partition table from the first disk to the second with sfdisk.
  2. create a mirror using only the second disk's partition and a missing (aka failed) device, for example:
    mdadm -C /dev/md0 -l raid1 -n 2 /dev/sdb3 missing

    Make very, very sure that you are using the second disk's partition for this, not the first disk. The first disk should not be mentioned anywhere in the command line.

  3. mkfs the new mirror and mount it somewhere; we usually used /mnt.
  4. copy the existing filesystem to the mirror using the tool of your choice. I prefer to use dump and restore, but tastes differ.
  5. edit /etc/fstab to mount the filesystem from the mirror.
  6. unmount the current filesystem and immediately remount the mirror in its place. (Doing this for the root filesystem requires a reboot, among other things, and is outside the scope of this entry.)
  7. hot-add the old filesystem's partition on the first disk to the mirror, for example:
    mdadm -a /dev/md0 /dev/sda3

    Since you're adding a new device to a mirror, the mirror resyncs onto the new device. You can watch the progress of the resync in /proc/mdstat, and on modern systems you may get email from mdadm when it finishes.
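
Pulled together, the whole sequence looks something like the following for a hypothetical /var that lives on /dev/sda3 and is being mirrored onto /dev/sdb3 (the device names, mount point, and filesystem type here are made up for illustration; adjust them to your system):

    # step 1: copy the partition table from the first disk to the second
    sfdisk -d /dev/sda | sfdisk /dev/sdb

    # step 2: create a degraded mirror using only the second disk
    mdadm -C /dev/md0 -l raid1 -n 2 /dev/sdb3 missing

    # steps 3 and 4: make a filesystem, mount it, and copy the data over
    mkfs -t ext4 /dev/md0
    mount /dev/md0 /mnt
    cd /mnt && dump -0 -f - /var | restore -rf -

    # steps 5 and 6: edit /etc/fstab to point /var at /dev/md0, then
    umount /var && mount /var

    # step 7: add the first disk's partition and watch the resync
    mdadm -a /dev/md0 /dev/sda3
    cat /proc/mdstat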

Back in the days of Ubuntu 6.06 and similar systems we did this a lot and we never had problems (at least if we weren't shuffling filesystems around at the same time). This is not quite in-place because it involves copying the filesystem, and on a sufficiently busy filesystem it may be troublesome to get a complete and accurate copy (eg, you may need an extended downtime to halt all other activity on the filesystem).

The theoretical way that is fully in-place is to set up your new partition on your new disk and then do something like this (see the sketch after the list):

  • shrink the filesystem so that there is enough space for software RAID metadata at the end of the partition. You will need to experiment to find out exactly how much space the metadata needs.
  • unmount the filesystem.
  • create a software RAID-1 array from just this partition plus a missing second device, using a format with metadata at the end of the partition.

    In theory this doesn't write to anything except the metadata area. Note that I have neither tested nor verified that this is true in practice; that's why this is a theoretical way. You will want to test the heck out of this (probably in a virtual machine).

  • change /etc/fstab to mount the filesystem from the mirror and remount it.
  • hot-add the partition on the second disk and let the mirror resync on to it.
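
For what it's worth, here is an untested sketch of what I think this would look like for a hypothetical ext3 /var on /dev/sda3, with the new disk's partition being /dev/sdb3 (again, all names are made up); how much to shrink the filesystem by is one of the things you'd have to determine by experiment:

    # shrink the filesystem to leave room at the end for the metadata
    umount /var
    e2fsck -f /dev/sda3
    resize2fs /dev/sda3 SIZE    # SIZE = a bit less than the partition;
                                # how much less is the experimental part

    # create a one-sided mirror in place, with metadata at the end
    # (mdadm will likely warn that sda3 seems to contain a filesystem)
    mdadm -C /dev/md1 --metadata=1.0 -l raid1 -n 2 /dev/sda3 missing

    # edit /etc/fstab to use /dev/md1 for /var, then
    mount /var

    # add the new disk's partition and let the mirror resync onto it
    mdadm -a /dev/md1 /dev/sdb3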

Depending on how long it takes to shrink the filesystem and whether or not it can be done live, this may require less downtime than the other approach.

I doubt I'll ever use the theoretical way. While it's vaguely neat, it'll clearly take a bunch of work and testing to develop into something that can be used for real, and it has only marginal advantages over the tried and true way (especially on extN filesystems, where resize2fs can only shrink unmounted filesystems).

InplaceSoftwareRaid written at 00:11:50

2012-02-15

The temptation of LVM mirroring

One of the somewhat obscure things that LVM can do is mirroring. If you mention this, most people will probably ask why on earth you'd want to use it; mirroring is what software RAID is for, and then you can stack LVM on top if you want to. Well, yes (and I agree with them in general). But I have an unusual situation that makes LVM mirroring very tempting right now.

The background is that I'm in the process of migrating my office workstation from a pair of old 320 GB drives to a pair of somewhat newer 750 GB drives, and it's reached the time to move my LVM setup to the 750s (it's currently on a RAID-1 array on the two 320s). There are at least three convenient ways of doing this:

  1. add the appropriate partitions from the 750s as two more mirrors to the existing RAID-1 array. There are at least two drawbacks to this; I can't change the raid superblock format, and growing the LVM volume afterwards so that I can actually use the new space will be somewhat of a pain.

    (I suppose that a single pvresize is not that much of a pain, provided that it works as advertised.)

  2. create a new RAID-1 on the 750s, add it as a new physical volume, and pvmove from the old RAID-1 physical volume to the new RAID-1 PV (there's a command sketch of this after the list).

    (I did pilot trials of pvmove in a virtual machine and it worked fine even with a significant IO load on the LVM group being moved, which gives me the confidence to think about this even after my bad experience many years ago.)

  3. as above, but set up LVM mirroring between the old and the new disks instead of immediately pvmove'ing to the new disks and using them alone.

    (Done with the right magic this might leave an intact, usable copy of all of the logical volumes behind on the 320 GB drives when I finally break the mirror.)
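
To make the second option concrete, the commands would look something like this; the device and volume group names here (/dev/md1 on the new 750s, partitions /dev/sdc2 and /dev/sdd2, and a VG called 'main') are made up for illustration:

    # create the new RAID-1 on the 750s and make it a physical volume
    mdadm -C /dev/md1 --metadata=1.2 -l raid1 -n 2 /dev/sdc2 /dev/sdd2
    pvcreate /dev/md1

    # add it to the volume group and migrate all extents off the old PV
    vgextend main /dev/md1
    pvmove /dev/md0 /dev/md1

    # once pvmove finishes, retire the old RAID-1 from the VG
    vgreduce main /dev/md0
    pvremove /dev/md0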

The drawback of the second approach is that if the 750 GB drives turn out to be flaky or have errors, I don't have a quick and easy way to go back to the current situation; I would have to re-pvmove back in the face of disk problems. And, to make me nervous, I already had one 750 become flaky after it was just sitting in my machine for a bit of time.

(I've already changed to having the root filesystem on the new drives, but I have an acceptable fallback for that and anyways it's less important than my actual data.)

The drawback of the third approach is that I would have to trust LVM mirroring, which is undoubtedly far less widely used than software RAID-1. But it's temptingly easier (and better) than just adding two more mirrors to the current RAID-1 array. If it worked without problems, it would clearly be the best answer; it has the best virtues of both of the other two solutions.
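
If I do go the LVM mirroring way, I believe the mechanics look something like the following, although I haven't tested this and lvconvert's options vary somewhat between LVM versions (the names are again made up):

    # add a mirror leg on the new PV to each logical volume
    # (may need '--mirrorlog core' if there's no spare space for a log)
    lvconvert -m 1 main/somelv /dev/md1

    # ... much later, once the 750s have proven themselves, drop the
    # legs on the old PV, leaving the LVs entirely on the new drives
    lvconvert -m 0 main/somelv /dev/md0

    # alternately, splitting instead of dropping should leave a usable
    # copy behind on the old drives (newer lvconvert versions only)
    lvconvert --splitmirrors 1 --name somelv_old main/somelv /dev/md0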

(This would be a terrible choice for a production server unless we really needed to change the RAID superblock format and couldn't afford any downtime. But this is my office workstation, so the stakes are lower.)

I suppose the right answer is to do a trial run of LVM mirroring in a virtual machine, just as I did a pilot run of pvmove. The drawback of that is having to wait longer to migrate to the 750s; ironically, a significant reason for the migration is so that I can have more space for virtual machine images.

LVMMirroringTemptation written at 18:02:58

2012-02-08

Choosing the superblock format for Linux's software RAID

Linux's software RAID implementation stores metadata about the RAID device in each physical device involved in the RAID, in what mdadm calls 'RAID superblocks' by analogy to the filesystem superblocks that describe filesystems. In modern versions of software RAID there are a number of different formats for these RAID superblocks with different tradeoffs involved in each one, and one of the decisions you need to make when you create a software RAID array is what format you want to use.

(Even if you don't actively make a decision, mdadm will pick a format for you. Sometimes it will whine irritatingly at you about the situation, which is how I discovered the whole issue.)

In my opinion, at the moment there are three sensible options to choose from: the 0.90 format and then two variants of the 'version-1' metadata format.

  • 0.90 is the original metadata format, which is widely understood and used. For most people, the most potentially important limitation of 0.90 metadata is that component devices can't be larger than 2 TB.

    The 0.90 superblock goes at the end of the underlying partition.

  • 1.0 puts the superblock at the end of the underlying partition.
  • 1.2 puts the superblock 4 KB from the start of the underlying partition. It's the sort-of default for modern versions of mdadm.

(You can see what format your current RAID arrays are using by looking at /proc/mdstat. If an array doesn't say 'super <something>' it's using 0.90 format metadata; otherwise, it's using whatever version it says it is. Many relatively modern systems, such as Ubuntu 10.04, either don't support anything past 0.90 or default to 0.90 in system setup.)
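
If you care which format you get, you can specify it explicitly when you create an array and then check what you wound up with; for example (with made-up device names):

    # create a RAID-1 with 1.0 format metadata (superblock at the end)
    mdadm -C /dev/md2 --metadata=1.0 -l raid1 -n 2 /dev/sda5 /dev/sdb5

    # check the format of an existing array or of a component device
    cat /proc/mdstat
    mdadm --detail /dev/md2 | grep Version
    mdadm --examine /dev/sda5 | grep Version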

Where the superblock goes is potentially important for RAID-1 arrays. A RAID-1 array with the superblock at the end can relatively easily have whatever filesystem it contains mounted read-only without the RAID running, because the filesystem will start at the start of the underlying raw partitions; this can be important sometimes. A RAID-1 array with the superblock at or near the start of the underlying partitions can't have the raw partitions used this way, because you have to look somewhat beyond the start of the raw partition to see the filesystem.
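
For example, assuming /dev/sdb3 is one half of a RAID-1 array with 0.90 or 1.0 format metadata and the array isn't currently assembled, something like this works from a rescue environment:

    # the filesystem starts at the start of the partition, so it can be
    # mounted directly; keep it read-only so the halves don't diverge
    mount -o ro /dev/sdb3 /mnt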

(Some versions of mdadm will explicitly warn you about this or even quiz you about it if you don't specify a format explicitly.)

If you want to use a modern format and are going to directly use the RAID-1 array for a filesystem, I would use 1.0 format (this is what I've done for my new / and /boot). For swap areas you might as well use 1.2 format; if you ever need to use swap without software RAID, you can just destroy the 1.2 superblocks with mkswap. For LVM physical volumes you can argue back and forth either way; right now I've chosen 1.2 format because I really don't want to think about what it would take to safely bring up an LVM physical volume without software RAID running.

(LVM physical volumes have their own metadata, which normally goes at the start of the 'raw' partition that LVM is using but which can be replicated to the end as well. See pvcreate's manpage.)
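
For instance, I believe asking for a second copy of the LVM metadata at the end of the PV looks something like this, although you should check your version's pvcreate manpage to be sure:

    pvcreate --pvmetadatacopies 2 /dev/md2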

As far as I know you can't change the superblock format of an array after it has been created, at least not without destroying it and recreating it. You can sort of do this without an extra disk with sufficient work, but really you want to get it right at creation time.

PS: note that in theory you can use dmsetup to gain access to filesystems or other sorts of data that don't begin at the start of a raw partition, so you can get at a filesystem embedded inside the raw partition of a RAID-1 array with 1.2 format metadata. However, this requires user-level intervention, which means that you're going to need a rescue environment or rescue disk of some sort.
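
As a sketch of how that might go (the 2048-sector data offset below is an assumption for illustration; the real value comes from 'mdadm --examine' on the component):

    # find where the RAID-1's data actually starts, in 512-byte sectors
    mdadm --examine /dev/sdb3 | grep 'Data Offset'

    # map everything past that offset as a new device-mapper device
    SECTORS=$(blockdev --getsz /dev/sdb3)
    echo "0 $((SECTORS - 2048)) linear /dev/sdb3 2048" | dmsetup create md0half

    # the embedded filesystem is now visible; mount it read-only
    mount -o ro /dev/mapper/md0half /mnt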

SoftwareRaidSuperblockFormats written at 01:20:38

