Wandering Thoughts archives

2012-05-28

How to do a very cautious LVM storage migration

A while back I wrote about how I was tempted by LVM mirroring when I wanted to migrate my LVM setup from a RAID mirror on some old disks to a new RAID mirror on some new(er) disks. Because I am some peculiar combination of cautious and daring, I recently gave in to this temptation. Now that the migration has more or less finished, it's time to report on how it went and how to do this.

The short summary is that using LVM mirroring to migrate my LVM volume group from disk to disk worked without problems, but the next time I need to do this I will probably just use pvmove, because establishing the actual mirrors was achingly slow and the whole process was a tedious pain in the rear. I don't know if pvmove would be faster, but I can hope.

(The mirrors seemed to perform decently once they were synchronized. But the initial synchronization of about 250 GB of data took literally days, and it was not limited by disk speed; LVM never drove the disks at their full bandwidth or full IOPS rates.)

There are two advantages to using LVM mirroring instead of pvmove, and I used both of them. First, you can run on both the new storage and the old storage at the same time for a while, to build up confidence in the new storage. Second, you keep a complete and usable copy of all of your data on the old storage, a copy that you can inspect, mount, and so on if you wind up having to. With pvmove, your data just moves; you wind up only on the new storage and nothing is left on the old storage.
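For comparison, the pvmove route that I'd lean towards next time would look something like this sketch. It uses the same placeholder names as the steps below and is written as a dry run that only prints each command, since I haven't actually done the migration this way:

```shell
#!/bin/sh
# Dry-run sketch of a pvmove-based migration (not what this entry did).
# Device and group names are the same placeholders used in the text.
# Each step is printed instead of executed; drop the echoes to run it.
pvmove_migration() {
    old="$1"; new="$2"; vg="$3"
    echo pvcreate "$new"        # make the new disk an LVM physical volume
    echo vgextend "$vg" "$new"  # add it to the volume group
    echo pvmove "$old" "$new"   # move every allocated extent off the old disk
    echo vgreduce "$vg" "$old"  # remove the now-empty old PV from the group
    echo pvremove "$old"        # wipe the LVM label from the old disk
}
pvmove_migration /dev/OLD /dev/NEW vg0
```

Note the difference in end state: after vgreduce and pvremove there is nothing left on the old disk, which is exactly the property the mirroring approach avoids.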

I read a number of writeups of how to do LVM mirroring on the web, but I found all of them to be a little bit unclear (partly because the logic of when you specified which disk device wasn't always clear). So here are the annotated steps that I used. First, let's say that the old disk space you're migrating away from is /dev/OLD, the new disk space is /dev/NEW, and you're migrating the LVM volume group vg0 with the single volume vg0/data, mounted on /data. Then:

  1. Initialize /dev/NEW as a LVM physical volume:
    pvcreate /dev/NEW
  2. Add it to the volume group:
    vgextend vg0 /dev/NEW

  3. Mirror each volume/filesystem to the new storage:
    lvconvert -m1 --mirrorlog mirrored --alloc anywhere vg0/data /dev/NEW

    This is the step that takes forever, and you have to repeat it for each filesystem (I did not try to lvconvert multiple volumes at once; I did them one at a time).

    It's possible that you will not need '--alloc anywhere'; leave it out the first time to see (if you do need it, LVM will report that it can't find space to put stuff). The important arguments are '-m1', which tells lvconvert to create a mirror (on /dev/NEW, because that's the physical volume we specified), and '--mirrorlog mirrored', which tells it to create a (mirrored) persistent on-disk log of which bits of the mirror are in sync.

    If I were doing this again I might just use '--mirrorlog disk', because as it happens LVM put both of my mirror log mirrors on /dev/NEW for its own inscrutable reasons (it's possible that '--alloc anywhere' influenced this). I didn't let this worry me because the whole situation was temporary and /dev/NEW was itself a mirrored RAID array, so it was already pretty reliable.

    (It's possible that a non-mirrored mirrorlog would speed things up.)

  4. Verify that everything looks good:
    lvs -a -o+devices

    What this should show is that vg0/data now has four internal subvolumes. The _mimage_N subvolumes are the actual mirrors (the original volume you started with and the mirror on the new storage), one on each of /dev/OLD and /dev/NEW, and you'll also have two additional subvolumes for the mirror log (ideally one on each disk, but see above).

    At this point you can run with full mirroring for as long as you want in order to build up confidence in the new disk(s). Once you're fully happy with them, it's time to complete the migration by splitting off the old disks.

  5. Split apart each volume, leaving the live version on the new disk and creating a new volume that is the data on the old disk. I read that this apparently goes better if the filesystem is unmounted at the time, so that's how I did it:
    umount /data
    lvconvert --splitmirrors 1 -n data-o vg0/data /dev/OLD
    mount /data

    The -n data-o gives the volume name of the 'new' volume (ie, the name you want for the original volume on the original disk). We specify /dev/OLD here to tell lvconvert that it should act on the mirror side that is on /dev/OLD.

    If you run 'lvs -a -o+devices' afterwards, you should see that all of those internal subvolumes have disappeared and you now have two volumes; vg0/data should be entirely on /dev/NEW and vg0/data-o should be entirely on /dev/OLD.

  6. After doing this for each filesystem you have one volume group using both /dev/OLD and /dev/NEW but all of your live volumes are on /dev/NEW; all of the volumes on /dev/OLD are unused. The final step is to split apart the volume group itself into two, the live one on /dev/NEW and a second volume group that is just all of the old volumes on /dev/OLD.

    First, we need to make all of the volumes on /dev/OLD inactive:

    lvchange -an vg0/data-o

    This should complete without complaints because none of these volumes should be in use; they should all be quiescent, unmounted, and so on.

    Then we can split the volume group itself:

    vgsplit vg0 vg0-o /dev/OLD

    Here vg0-o is the name of the 'new' volume group, ie the old copy of the data on the old storage. We specify /dev/OLD to tell vgsplit to act on the volumes (and physical volume and so on) on /dev/OLD.

    Running 'lvs -a -o+devices' should now show volumes in two volume groups, with vg0 using only /dev/NEW and vg0-o using only /dev/OLD.
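To recap, the whole mirror-based sequence from the numbered steps above, condensed into one dry-run sketch that only prints the commands (the names match the running example: vg0, vg0/data, /data, /dev/OLD, /dev/NEW):

```shell
#!/bin/sh
# Dry-run recap of the migration steps; printed, not executed.
mirror_migration() {
    echo pvcreate /dev/NEW                    # step 1: new physical volume
    echo vgextend vg0 /dev/NEW                # step 2: add it to the VG
    # step 3: mirror the volume onto the new disk (the slow part)
    echo lvconvert -m1 --mirrorlog mirrored --alloc anywhere vg0/data /dev/NEW
    echo lvs -a -o+devices                    # step 4: verify the mirror
    # step 5: split the old disk's copy off into vg0/data-o
    echo umount /data
    echo lvconvert --splitmirrors 1 -n data-o vg0/data /dev/OLD
    echo mount /data
    # step 6: deactivate the old copy and split the volume group
    echo lvchange -an vg0/data-o
    echo vgsplit vg0 vg0-o /dev/OLD
}
mirror_migration
```

Remember that in real life you can pause between steps 4 and 5 for as long as you want, running mirrored on both sets of disks.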

After this is done you can decommission vg0-o at your leisure. I haven't gotten around to doing that since I haven't quite reached the point where I want to physically remove the old disks (I still have my boot partition on them, partly because I need to figure out which physical SATA plug on the motherboard actually is sda, sdb, and so on).

(I don't know if you can just disconnect the disks without doing anything special in LVM. That would be the ideal way to do it since it would preserve vg0-o and its volumes completely intact for any future need, but LVM might get upset when you reboot your machine because a volume group it expects isn't there.)
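One possibility (my assumption, not something this entry tried) is to deactivate and export the old volume group before pulling the disks, so that LVM has it on record that the group is intentionally absent; sketched as a dry run:

```shell
#!/bin/sh
# Dry-run sketch (untested assumption): retire vg0-o before removing disks.
retire_old_vg() {
    echo vgchange -an vg0-o  # deactivate every volume in the old group
    echo vgexport vg0-o      # mark the group as exported, ie deliberately gone
}
retire_old_vg
```

An exported volume group can later be reattached with vgimport if you ever need the old copy of the data back.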

LVMCautiousMigration written at 00:33:42

2012-05-10

All your servers should have Linux's magic SysRq enabled

This is effectively another lesson learned from our recent building power shutdown. I will put it simply:

All of your servers should have magic SysRq enabled.

There are reasons not to do this on client machines (though not necessarily very good ones), but none on your servers (which should certainly have their hardware and consoles in a secure location).

What magic SysRq is good for on servers (above everything else) is giving you a last ditch chance to shut down or reboot the machine in something approaching an orderly way. I'm not just talking about if the system goes crazy, because it's also quite possible for ordinary system shutdowns to hang, especially if you're shutting down a group of systems that have complex NFS filesystem relationships and something went down out of order. If this happens and you don't have magic SysRq support available, you're plain out of luck; all you can do is pull the power and hope that nothing is going to explode because it hasn't been killed, had its data synced to disk, or whatever.

With magic SysRq you have at least a chance of doing something about this. You can force a kernel level sync, a kernel level unmount of as many filesystems as possible, and even hit processes with signals if you think it's going to do any good. And then you can reboot the machine (and afterwards, possibly pull the power to keep the machine down).

PS: you should explicitly enable magic SysRq in your standard server install setup, even if your distribution normally defaults to leaving it on; distribution defaults can change over time. Also, note that if you have a serial console you generally need a getty listening on it in order to make magic SysRq work.

(You can check to see if magic SysRq is enabled by looking at the value of /proc/sys/kernel/sysrq; a 1 means that it is, a 0 means that it isn't.)
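A dry-run sketch of both halves, printing the commands rather than running them: enabling SysRq persistently via sysctl, and the kind of last-ditch shutdown sequence described above driven through /proc/sysrq-trigger. The /proc paths and the kernel.sysrq knob are standard Linux; the sysctl.d file name is my own choice:

```shell
#!/bin/sh
# Dry-run sketch; printed, not executed (the real thing needs root).
enable_sysrq() {
    echo "sysctl -w kernel.sysrq=1"                             # on right now
    echo "echo kernel.sysrq = 1 > /etc/sysctl.d/99-sysrq.conf"  # on at boot
}
last_ditch_reboot() {
    # The console-keyboard equivalent is Alt-SysRq-s, then -u, then -b.
    echo "echo s > /proc/sysrq-trigger"  # sync all filesystems
    echo "echo u > /proc/sysrq-trigger"  # remount filesystems read-only
    echo "echo b > /proc/sysrq-trigger"  # reboot immediately
}
enable_sysrq
last_ditch_reboot
```

Writing to /proc/sysrq-trigger only helps while the machine still answers logins; the console key sequence is what saves you when it doesn't.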

ServersEnableMagicSysrq written at 16:28:49

2012-05-07

Third party Linux kernel modules should build against non-running kernels

For my sins, I have to deal with third-party kernel modules (some of them open source, some of them not so much). When you deal with third party modules, you get to rebuild them every time you update the kernel. At one level, almost everyone has this down pat these days; you run a command or two and you're done, with everything handled properly.

(I would like to say that this is a very welcome development. I've been around Linux long enough to remember when it was much more work and pain.)

At another level almost everyone gets this wrong, because they all force you to build their modules against the current running kernel and only the running kernel. This probably sounds fine for ordinary users but it makes sysadmins angry because it makes our lives more difficult. There are two problems.

First and obviously, it increases the service downtime. When you force us to build modules against the running kernel, the upgrade process is to install the new kernel, reboot the machine to activate it (which takes down the service, not just because of the reboot but because we're without your module), sit there rebuilding your modules with the service down, and then either activate the service or even reboot the machine again. If we could rebuild against a non-running version, we could install the new kernel, rebuild your modules, and then reboot; the only downtime would be the actual reboot.

Second, it makes kernel upgrades more dangerous. With third party modules there's always the chance that the module will not rebuild against your new kernel. If we can only rebuild against the running kernel, we have to reboot the machine (taking the service down) before we find out whether or not your modules will rebuild happily. If we can build against a non-running version, we can install the new kernel, try to rebuild, and if it fails we know we don't even have to schedule a downtime because there's no point.

(Yes, in theory everyone has test machines. This is still annoying, and sometimes you actually don't.)

I somewhat sympathize with the people building third party kernel modules; I'm sure that building against the running kernel simplifies your life and it's nice to be able to immediately test that your newly compiled module can be loaded. But don't make it mandatory and do provide an 'I know what I'm doing, just compile the thing against <X>' option. Sysadmins will thank you.
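For open source modules built with the normal kbuild machinery, this option effectively already exists: the standard out-of-tree make invocation takes any installed kernel's build tree, not just the running kernel's. A dry-run sketch (the kernel version below is a made-up example):

```shell
#!/bin/sh
# Dry-run sketch of building an out-of-tree module against an arbitrary
# installed kernel; printed, not executed. Works for kbuild-based modules.
build_module_for() {
    kver="$1"  # any version that has /lib/modules/$kver/build installed
    echo make -C "/lib/modules/$kver/build" M="$(pwd)" modules
}
build_module_for 3.2.0-2-amd64
```

The point is that nothing in kbuild itself requires `uname -r`; it's the vendor's wrapper scripts that hardcode the running kernel, and that's what needs the override option.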

As a side note, being able to rebuild kernel modules in advance makes it possible to do opportunistic kernel upgrades. If we have to reboot the machine for some other reason, or even if the machine crashes and reboots on its own, we can switch it into a new kernel in the process. This is especially valuable for unattended reboots at odd hours, where there won't be a sysadmin there to rebuild the modules by hand at the time.

(Another case is situations with mass reboots, where you don't want to be babysitting individual machines as they come up in order to get them back into service.)

BuildAgainstAlternateKernels written at 22:55:30
