How Linux starts non-system software RAID arrays during boot under systemd
In theory, you do not need to care about how your Linux software RAID arrays get assembled and started during boot because it all just works. In practice, sometimes you do, and on a modern systemd-based Linux this seems to be an unusually tangled situation. So here is what I can determine so far about how it works for software RAID arrays that are assembled and started outside of the initramfs, after your system has mounted your real root filesystem and is running from it.
(How things work for starting software RAID arrays in the initramfs is quite varied between Linux distributions. There is some distribution variation even for post-initramfs booting, but these days the master version of mdadm ships canonical udev and systemd scripts, services, and so on and I think most distributions use them almost unchanged.)
As has been the case for some time,
the basic work is done through
udev rules. On a typical Linux
system, the main udev rule file for assembly will be called something
like 64-md-raid-assembly.rules and be basically the upstream version.
Udev itself identifies block devices that are potentially Linux
RAID members (probably mostly based on the presence of RAID
superblocks), and mdadm's udev rules then run
mdadm in a special
incremental assembly mode on them. To quote the manpage:
This mode is designed to be used in conjunction with a device discovery system. As devices are found in a system, they can be passed to
mdadm --incremental to be conditionally added to an appropriate array.
As array components become visible to udev and cause it to run
mdadm --incremental on them,
mdadm progressively adds them to
the array. When the final device is added,
mdadm will start the
array. This makes the software RAID array and its contents visible to
udev and to systemd, where it will be used to satisfy dependencies for
/etc/fstab mounts and thus trigger them happening.
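To make the incremental behavior concrete, here is a toy Python model of it. This is only a sketch of the idea described above, not mdadm's actual logic; the ToyArray class and its fields are hypothetical names invented for the illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyArray:
    """Toy stand-in for one software RAID array being assembled incrementally."""
    name: str
    expected: int                              # how many component devices the array should have
    members: set = field(default_factory=set)  # components seen so far
    started: bool = False

    def add_device(self, dev: str) -> bool:
        """Record a newly discovered component; start the array once all are present."""
        self.members.add(dev)
        if not self.started and len(self.members) == self.expected:
            self.started = True
        return self.started


md0 = ToyArray("md0", expected=2)
print(md0.add_device("/dev/sda1"))  # only one component so far: not started
print(md0.add_device("/dev/sdb1"))  # final component arrives: array starts
```

In this model an array that never sees its last component simply stays unstarted forever, which is the gap the rest of the machinery has to cover.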
(There are additional mdadm udev rules for setting up device names, starting mdadm monitoring, and so on. And then there's a whole collection of general udev rules and other activities to do things like read the UUIDs of filesystems from new block devices.)
However, all of this only happens if all of the array component
devices show up in udev (and show up fast enough); if only some of
the devices show up, the software RAID will be partially assembled
by mdadm --incremental but not started because it's not complete.
To deal with this situation and eventually start software RAID
arrays in degraded mode, mdadm's udev rules start a systemd timer
when enough of the array is present to let it run degraded,
specifically the templated timer unit mdadm-last-resort@.timer
(so for md0 the specific unit is mdadm-last-resort@md0.timer). If
the RAID array isn't assembled and the timer goes off, it triggers
the corresponding templated systemd service unit, which runs
'mdadm --run' on your degraded array to start it.
(The timer unit is only started when mdadm's incremental assembly reports back that it's 'unsafe' to assemble the array, as opposed to impossible. Mdadm reports this only once there are enough component devices present to run the array in a degraded mode; how many devices are required (and what devices) depends on the specific RAID level. RAID-1 arrays, for example, only require one component device to be 'unsafe'.)
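The 'unsafe' versus impossible distinction can be sketched roughly as follows. This is a simplified model based on the usual redundancy of each RAID level, not a transcription of mdadm's code; the classify() function is a hypothetical name, and RAID-10 is left out because whether it can run depends on which devices are missing, not just how many.

```python
def classify(level: str, total: int, present: int) -> str:
    """Return 'complete', 'unsafe' (could run degraded) or 'impossible'."""
    if present >= total:
        return "complete"
    # Minimum number of devices needed to run at all, in this simplified model.
    minimum = {
        "raid0": total,      # no redundancy: every device is required
        "raid1": 1,          # any single mirror is enough
        "raid5": total - 1,  # tolerates one missing device
        "raid6": total - 2,  # tolerates two missing devices
    }[level]
    return "unsafe" if present >= minimum else "impossible"


print(classify("raid1", 2, 1))  # one of two mirrors present
print(classify("raid5", 3, 1))  # too few devices to run at all
```

Only the 'unsafe' result corresponds to the case where the last-resort timer gets started.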
Because there's an obvious race potential here, the systemd timer
and service both work hard to not act if the RAID array is actually
present and already started. The timer conflicts with
'sys-devices-virtual-block-<array>.device', the systemd device unit
representing the RAID array, and as an extra safety measure the
service refuses to run if the RAID array appears to be present in
/sys/devices. In addition, the udev rule that triggers systemd
starting the timer unit will only act on software RAID devices that
appear to belong to this system, either because they're listed in
mdadm.conf or because their home host is this host.
(This is the MD_FOREIGN match in the udev rules. The environment
variables come from mdadm's --export option, which is used during
udev incremental assembly. Mdadm's code for incremental assembly
also generates these environment variables, and the enough()
function, which judges whether enough devices are present, is in
util.c.)
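As a rough illustration of how a udev rule could act on exported variables, here is a Python sketch that parses KEY=VALUE lines and applies the two conditions described above. Only MD_FOREIGN is named in this entry; the MD_STARTED and MD_DEVICE names, the specific values, and the sample text are my assumptions for the sake of the example, so check them against your own mdadm and udev rules.

```python
def parse_export(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines into a dict, ignoring blank or malformed lines."""
    env = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep and key:
            env[key] = value
    return env


def should_arm_last_resort(env: dict[str, str]) -> bool:
    """True when the array could run degraded and is not foreign (toy rule)."""
    # Assumed variable names and values; see the caveats in the text.
    return "unsafe" in env.get("MD_STARTED", "") and env.get("MD_FOREIGN") == "no"


# Hypothetical sample output, made up for illustration only.
sample = "MD_DEVICE=md0\nMD_STARTED=unsafe\nMD_FOREIGN=no\n"
env = parse_export(sample)
print(env["MD_DEVICE"], should_arm_last_resort(env))
```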
As far as I know, none of this is documented or official; it's just how mdadm, udev, and systemd all behave and interact at the moment. However, this appears to be pretty stable and long-standing, so it's probably going to keep being the case in the future.
PS: As far as I can tell, all of this means that there are no real
user-accessible controls for whether or not degraded software RAID
arrays are started on boot. If you want to specifically block
degraded starts of some RAID arrays, it might work to 'mask'
either or both of the last-resort timer and service unit for
the array. If you want to always start degraded arrays, well, the
good news is that that's supposed to happen automatically.
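If you wanted to try the masking approach, the command would look something like what this small Python helper prints. It only builds the command line and does not run anything. The service unit name is my assumption that it follows the same mdadm-last-resort@ template as the timer, and I have not tested whether masking actually blocks degraded starts.

```python
import shlex


def last_resort_units(array: str) -> list[str]:
    """Timer and (assumed) service unit names for one array, e.g. 'md0'."""
    return [
        f"mdadm-last-resort@{array}.timer",
        f"mdadm-last-resort@{array}.service",  # assumed to mirror the timer's template name
    ]


def mask_command(array: str) -> list[str]:
    """Build, but do not execute, a 'systemctl mask' command for the array's units."""
    return ["systemctl", "mask", *last_resort_units(array)]


print(shlex.join(mask_command("md0")))
```

Undoing this would be the matching 'systemctl unmask' on the same unit names.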
Written on 15 April 2019.