Wandering Thoughts archives

2014-04-03

Shifting a software RAID mirror from disk to disk in modern Linux

Suppose that you have a software RAID mirror and you want to migrate one side of the mirror from one disk to another to replace the old disk. The straightforward way is to remove the old disk, put in the new disk, and resync the mirror. However, this leaves you without a mirror at all for the duration of the resync. If you can get all three disks online at once, what you'd like to do instead is add the new disk as a third mirror and then remove the old disk later. Modern Linux makes this a little bit complicated.

The core complication is that your software RAID devices know how many active mirrors they are supposed to have. If you add a device beyond that, it becomes a hot spare instead of being an active mirror. To activate it as a mirror you must add it and then grow the number of active devices in the mirror. Then to properly deactivate the old disk you need to do the reverse.

Here are the actual commands (for my future use if nothing else):

  1. Hot-add the new device:
    mdadm -a /dev/md17 /dev/sdd7

    If you look at /proc/mdstat afterwards you'll see it marked as a spare.

  2. 'Grow' the number of active devices in the mirror:
    mdadm -G -n 3 /dev/md17

  3. Wait for the mirror to resync. You may want to run the new disk in parallel with the old disk for a few days to make sure that all is well with it; this is fine. You may want to be wary about reboots during this time.

  4. Take the old disk out by first manually failing it and then actually removing it:
    mdadm --fail /dev/md17 /dev/sdb7
    mdadm -r /dev/md17 /dev/sdb7

  5. Finally, shrink the number of active devices in the mirror down to two again:
    mdadm -G -n 2 /dev/md17
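
The steps above can be collected into a small shell sketch. This is only an illustration, not something tested against a real array: the function names and the MD, NEW, OLD, and RUN variables are made up for this example, and the device names are simply the ones used in this entry. RUN defaults to echo, so by default the commands are printed rather than executed. The sketch is split into two functions because of step 3; the wait (and any days of running both disks in parallel) happens between them.

```shell
#!/bin/sh
# Hypothetical wrapper around the commands from this entry.
# MD, NEW, OLD and RUN are illustration-only names, not mdadm settings.
MD=${MD:-/dev/md17}     # the mirror being changed
NEW=${NEW:-/dev/sdd7}   # partition on the new disk
OLD=${OLD:-/dev/sdb7}   # partition on the disk being retired
RUN=${RUN-echo}         # set RUN="" to actually run the commands

# Steps 1 and 2: hot-add the new device, then grow to three active devices.
add_new_mirror() {
    $RUN mdadm -a "$MD" "$NEW" &&
    $RUN mdadm -G -n 3 "$MD"
}

# Steps 4 and 5: fail and remove the old device, then shrink back to two.
# Only call this once /proc/mdstat shows the resync has finished.
drop_old_mirror() {
    $RUN mdadm --fail "$MD" "$OLD" &&
    $RUN mdadm -r "$MD" "$OLD" &&
    $RUN mdadm -G -n 2 "$MD"
}
```

Calling add_new_mirror with the defaults just prints the two mdadm commands from steps 1 and 2, which is a cheap way to check the device names before setting RUN="" and running it for real as root.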

You really do want to explicitly shrink the number of active devices in the mirror. A mismatch between the number of actual devices and the number of expected devices can have various undesirable consequences. If a significant amount of time passed between steps three and four, make sure that your mdadm.conf still has the correct number of devices configured in it for all of the arrays (ie, two).
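
One way to spot such a mismatch is to look at the [n/m] field that /proc/mdstat shows on each array's status line, where n is the number of devices the array expects and m is the number currently active. The following is a minimal sketch of pulling that field out with awk; md_counts is a hypothetical helper name, and it assumes the usual mdstat layout where the status line directly follows the array's header line.

```shell
#!/bin/sh
# Hypothetical helper: print "expected/active" for one md array.
#   $1 = array name without /dev/ (for example md17)
#   $2 = file to read, defaulting to /proc/mdstat
md_counts() {
    awk -v md="$1" '
        # Header line of the array we want, e.g. "md17 : active raid1 ..."
        $1 == md && $2 == ":" { found = 1; next }
        # A blank line ends this array block; give up if nothing matched.
        found && NF == 0 { exit }
        # First [n/m] after the header is the expected/active device count.
        found && match($0, /\[[0-9]+\/[0-9]+\]/) {
            print substr($0, RSTART + 1, RLENGTH - 2)
            exit
        }
    ' "${2:-/proc/mdstat}"
}
```

After step 5, md_counts md17 should report 2/2. Something like 3/2 would mean the old device was removed but the array still expects three devices.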

Unfortunately marking the old disk as failed will likely get you warning email from mdadm's status monitoring about a failed device. This is the drawback of mdadm not having a way to directly do 'remove an active device' as a single action. I can understand why mdadm doesn't have an operation for this, but it's still a bit annoying.

(Looking at this old entry makes it clear that I've run into the need to grow and shrink the number of active mirror devices before, but apparently I didn't consider it noteworthy at that point.)

linux/SoftwareRaidShiftingMirror written at 19:51:05

The scariness of uncertainty

One of the issues that I'm facing right now (and have been for a while) is that being uncertain can be a daunting thing. As sysadmins we deal with uncertainty all of the time, of course, and if we were paralyzed by it in general we'd never get anywhere. It's usually easy enough to overcome uncertainty and move forward in small situations or important situations (for various reasons). Where uncertainty can dig in is in dauntingly big and complex projects that are not essential. If you don't have to have whatever and building anything is clearly a lot of work for an uncertain reward, it's very easy to defer and defer action in favour of various stalling measures (or other work).

All of this sounds rather hand-waving, so let me tell you about my project of gathering OS level performance statistics. Or rather my non-project.

If you look around, there are a lot of options for gathering, aggregating, and graphing OS performance stats (in tools, full systems, and ecologies of tools). Beyond a certain basic level it's unclear which ones of them are going to work best for us and which ones will be crawling failures, but at the same time it's also clear that any of them that look good are going to take a significant amount of work and time to set up and try out (and I'm going to have to try them in production).

As a result I have been circling around this project for literally years now. Every so often I poke and prod at the issue; I read more about some tool or another, I look at pretty pictures, I hear about something new, and so on and so forth. But I've never sat down to really do something. I've always found higher priority things to do or other excuses.

(Here in the academy this behavior in graduate students is well known and gets called 'thesis avoidance'.)

The scariness of uncertainty is not the only reason for this, of course, but it's a significant contributing factor. In a way it raises the stakes for making a choice.

(The uncertainty comes from two directions. One is simply trying to select which system to use; the other is whether or not the whole idea is going to be worthwhile. The latter is a bit stupid since we're probably not going to be left with a white elephant of a system that we ignore and then quietly abandon, but the possibility gnaws at me and feeds other uncertainties and doubts.)

I don't have any answers, but maybe writing this entry has made it more likely that I do something here. And maybe I should embrace the possibility of failure as a sign that I am finally taking enough risk.

(I feel divided about that idea but I need to think about it more and then write another entry on it.)

sysadmin/UncertaintyScariness written at 00:34:47


