How Linux starts non-system software RAID arrays during boot under systemd

April 15, 2019

In theory, you do not need to care about how your Linux software RAID arrays get assembled and started during boot because it all just works. In practice, sometimes you do, and on a modern systemd-based Linux this seems to be an unusually tangled situation. So here is what I can determine so far about how it works for software RAID arrays that are assembled and started outside of the initramfs, after your system has mounted your real root filesystem and is running from it.

(How things work for starting software RAID arrays in the initramfs varies quite a bit between Linux distributions. There is some distribution variation even for post-initramfs booting, but these days the master version of mdadm ships canonical udev and systemd scripts, services, and so on, and I think most distributions use them almost unchanged.)

As has been the case for some time, the basic work is done through udev rules. On a typical Linux system, the main udev rule file for assembly will be called something like 64-md-raid-assembly.rules and be basically the upstream mdadm version. Udev itself identifies block devices that are potentially Linux RAID members (probably mostly based on the presence of RAID superblocks), and mdadm's udev rules then run mdadm in a special incremental assembly mode on them. To quote the manpage:

This mode is designed to be used in conjunction with a device discovery system. As devices are found in a system, they can be passed to mdadm --incremental to be conditionally added to an appropriate array.

As array components become visible to udev and cause it to run mdadm --incremental on them, mdadm progressively adds them to the array. When the final device is added, mdadm starts the array. This makes the software RAID array and its contents visible to udev and to systemd, where it can be used to satisfy dependencies for things like /etc/fstab mounts and thus trigger them.
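To make this concrete, here is roughly what the relevant part of the upstream mdadm rules file looks like. This is abridged and from memory, so treat it as a sketch; your distribution's installed 64-md-raid-assembly.rules may differ in details (upstream templates the mdadm path, for instance):

```
# Only consider block devices that blkid identified as RAID members.
SUBSYSTEM!="block", GOTO="md_inc_end"
ENV{ID_FS_TYPE}!="linux_raid_member", GOTO="md_inc_end"

# Hand the device to mdadm's incremental assembly mode; --export makes
# mdadm print MD_* environment variables that later rules match on.
ACTION=="add", IMPORT{program}="/sbin/mdadm --incremental --export $devnode --offroot $env{DEVLINKS}"

# If assembly is merely 'unsafe' (a degraded start is possible) and the
# array belongs to this system, arm the last-resort timer for it.
ACTION=="add", ENV{MD_STARTED}=="*unsafe*", ENV{MD_FOREIGN}=="no", \
  ENV{SYSTEMD_WANTS}+="mdadm-last-resort@$env{MD_DEVNAME}.timer"

LABEL="md_inc_end"
```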

(There are additional mdadm udev rules for setting up device names, starting mdadm monitoring, and so on. And then there's a whole collection of general udev rules and other activities to do things like read the UUIDs of filesystems from new block devices.)

However, all of this only happens if all of the array's component devices show up in udev (and show up fast enough). If only some of the devices appear, the software RAID will be partially assembled by mdadm --incremental but not started, because it's incomplete. To deal with this situation and eventually start software RAID arrays in degraded mode, mdadm's udev rules start a systemd timer unit once enough of the array is present for it to run degraded, specifically the templated timer unit mdadm-last-resort@.timer (so for md0 the specific unit is mdadm-last-resort@md0.timer). If the RAID array still isn't assembled when the timer goes off, it triggers the corresponding templated systemd service unit, mdadm-last-resort@.service, which runs 'mdadm --run' on your degraded array to start it.
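The upstream units are short. The following is an abridged reconstruction (check /usr/lib/systemd/system on your own machine for the exact versions your distribution ships); %i expands to the array name, e.g. md0:

```
# mdadm-last-resort@.timer (abridged)
[Unit]
Description=Timer to wait for more drives before activating degraded array.
DefaultDependencies=no
Conflicts=sys-devices-virtual-block-%i.device

[Timer]
OnActiveSec=30

# mdadm-last-resort@.service (abridged)
[Unit]
Description=Activate md array even though degraded
DefaultDependencies=no
ConditionPathExists=!/sys/devices/virtual/block/%i/md/sync_action

[Service]
Type=oneshot
ExecStart=/sbin/mdadm --run /dev/%i
```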

(The timer unit is only started when mdadm's incremental assembly reports back that it's 'unsafe' to assemble the array, as opposed to impossible. Mdadm reports this only once there are enough component devices present to run the array in degraded mode; how many devices are required (and which ones) depends on the specific RAID level. A RAID-1 array, for example, needs only a single component device present to be 'unsafe'.)
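The per-level minimums can be sketched in a few lines. This is my own simplified model, not mdadm's actual enough() from util.c (which also handles details like raid10 layouts and which specific disks are present), but it captures the basic rule:

```python
# Simplified sketch of mdadm's "enough devices to run degraded" decision.
# Not mdadm's real code: real enough() also considers RAID-10 layouts and
# exactly which member slots are present, not just how many.
def enough(level: str, total_disks: int, avail_disks: int) -> bool:
    """Return True if avail_disks of total_disks suffice to run the array."""
    minimum = {
        "raid0": total_disks,      # striping: every disk is required
        "linear": total_disks,     # concatenation: every disk is required
        "raid1": 1,                # mirroring: any single copy will do
        "raid4": total_disks - 1,  # the one parity disk may be missing
        "raid5": total_disks - 1,  # can lose any one disk
        "raid6": total_disks - 2,  # can lose any two disks
    }[level]
    return avail_disks >= minimum
```

So a two-disk RAID-1 with one disk present is 'unsafe' (startable but degraded), while a RAID-0 missing any disk is simply impossible and never gets a last-resort timer.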

Because there's an obvious race potential here, the systemd timer and service both work hard not to act if the RAID array is actually present and already started. The timer conflicts with 'sys-devices-virtual-block-<array>.device', the systemd device unit representing the RAID array, and as an extra safety measure the service refuses to run if the RAID array appears to be present in /sys/devices. In addition, the udev rule that has systemd start the timer unit only acts on software RAID devices that appear to belong to this system, either because they're listed in your mdadm.conf or because their home host is this host.
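The service-side safety check boils down to a path-existence test. Here is a toy model of it (my own code and function name, not anything systemd runs; sysfs_root is parameterized purely so the sketch can be exercised against a scratch directory):

```python
import os

# Toy model of the last-resort service's safety check: refuse to
# force-start the array if it already appears under
# /sys/devices/virtual/block/. Not systemd code; for illustration only.
def last_resort_should_run(array: str, sysfs_root: str = "/sys") -> bool:
    path = os.path.join(sysfs_root, "devices", "virtual", "block", array)
    return not os.path.exists(path)
```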

(This is the MD_FOREIGN match in the udev rules. The environment variables come from mdadm's --export option, which is used during udev incremental assembly. Mdadm's code for incremental assembly, which also generates these environment variables, is in Incremental.c. The important enough() function is in util.c.)

As far as I know, none of this is documented or official; it's just how mdadm, udev, and systemd all behave and interact at the moment. However, this behavior appears to be pretty stable and long-standing, so it's probably going to keep being the case in the future.

PS: As far as I can tell, all of this means that there are no real user-accessible controls for whether or not degraded software RAID arrays are started on boot. If you want to specifically block degraded starts of some RAID arrays, it might work to 'systemctl mask' either or both of the last-resort timer and service units for the array (for md0, that would be mdadm-last-resort@md0.timer and mdadm-last-resort@md0.service). If you want to always start degraded arrays, well, the good news is that that's supposed to happen automatically.

Comments on this page:

There is more "fun" with software RAID. Imagine a software RAID array on a few disks that are passed through to a KVM virtual machine and assembled there.

systemd/udev/mdadm on the host will ... also assemble this array.

Somehow that doesn't corrupt the filesystems on it (which is surprising).

