Wandering Thoughts archives

2017-11-26

One way of capturing debugging state information in a systemd-based system

Suppose, not entirely hypothetically, that you have a systemd .service unit running something where the something (whatever it is) is mysteriously failing to start or run properly. In the most frustrating version of this, you can run the operation just fine after the system finishes booting and you can log in, but it fails during boot and you can't see why. In this situation you often want to gather information about the boot-time state of the system just before your daemon or program is started and fails; you might need to know things like what devices are available, the state of network interfaces and routes, what filesystems have been mounted, what other things are already running, and so on.

All of this information can be gathered by a shell script, but the slightly tricky bit is figuring out how to get it to run. I've taken two approaches here. The first one is to simply write a new .service file:

[Unit]
Description=Debug stuff
After=<whatever>
Before=<whatever else>

[Service]
Type=oneshot
RemainAfterExit=True
ExecStart=/root/gather-info

[Install]
WantedBy=multi-user.target

Here the actual information gathering script is /root/gather-info. I typically have it write its data into a file in /root as well. I use /root as a handy dumping ground that's on the root filesystem but not conceptually owned by the package manager in the way that /etc, /bin, and so on are; I can throw things in there without worrying that I'm causing (much) future problems.
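As a minimal sketch of setting this up, the helper below writes the unit above into a directory you name. The unit file name debug-gather.service, the function name, and its arguments are all made up for illustration; After= and Before= are passed in because they depend entirely on what you are debugging.

```shell
#!/bin/sh
# write_debug_unit DIR AFTER BEFORE
# Writes a hypothetical debug-gather.service into DIR. The unit name and
# this helper are illustrative assumptions, not anything systemd requires.
write_debug_unit() {
  dir=$1 after=$2 before=$3
  # Unquoted heredoc delimiter, so $after and $before are expanded.
  cat >"$dir/debug-gather.service" <<EOF
[Unit]
Description=Debug stuff
After=$after
Before=$before

[Service]
Type=oneshot
RemainAfterExit=True
ExecStart=/root/gather-info

[Install]
WantedBy=multi-user.target
EOF
}

# Real use would be roughly this, as root (some-daemon.service is a
# placeholder for whatever unit you are investigating):
#   write_debug_unit /etc/systemd/system network.target some-daemon.service
#   systemctl daemon-reload
#   systemctl enable debug-gather.service
```

Because the unit is pulled in through WantedBy=multi-user.target, it only runs at boot once it has been enabled.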

(If you use an ExecStop= instead of ExecStart= you can gather the same sort of information at shutdown.)

However, if you're interested in the state basically right before some other .service runs, the better approach is to modify that .service to add an extra ExecStartPre= line. In order to make sure I know what's going on, my approach is to copy the entire .service file to /etc/systemd/system (if necessary) and then edit it. As an example, suppose that your ZFS on Linux setup is failing to import pools on boot because the zfs-import-cache.service unit is failing.

Here I'd modify the .service like this:

[...]

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStartPre=/root/gather-info
ExecStartPre=/sbin/modprobe zfs
ExecStart=/sbin/zpool import -c /etc/zfs/zpool.cache -aN

[...]

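You can avoid copying the whole .service file by putting a drop-in file in /etc/systemd/system/zfs-import-cache.service.d/, but an ExecStartPre= added that way is appended after the unit's existing ExecStartPre= lines unless the drop-in first clears the list with an empty 'ExecStartPre=' and then restates every command. Copying the whole file keeps everything visible in one place, and I wouldn't want to trust it any other way.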
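The copying step can be sketched as a small shell helper. The function name and argument order here are assumptions for illustration, and where the packaged unit lives varies by distribution, which is why the source directories are passed in rather than hard-coded.

```shell
#!/bin/sh
# copy_unit_for_edit UNIT DEST SRCDIR...
# Copies UNIT from the first SRCDIR that has it into DEST, unless DEST
# already holds a copy (the "if necessary" case). The helper name and
# argument order are illustrative assumptions.
copy_unit_for_edit() {
  unit=$1 dest=$2
  shift 2
  if [ -e "$dest/$unit" ]; then
    echo "$dest/$unit already exists, leaving it alone"
    return 0
  fi
  for src in "$@"; do
    if [ -f "$src/$unit" ]; then
      cp -p "$src/$unit" "$dest/$unit" && echo "copied $src/$unit to $dest/$unit"
      return
    fi
  done
  echo "$unit not found in: $*" >&2
  return 1
}
```

For the ZFS example this would be called as something like 'copy_unit_for_edit zfs-import-cache.service /etc/systemd/system /usr/lib/systemd/system /lib/systemd/system', followed by editing the copy and running 'systemctl daemon-reload'.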

Possibly there's a better way to do this in the systemd world, but I've been sort of frustrated by how difficult it is to do various things here. For example, it would be nice if systemd would easily give you the names of systemd units that ran or failed, instead of their Description= texts. More than once I've had to resort to 'grep -rl <whatever> /usr/lib/systemd/system' in an attempt to find a unit file so I could see what it actually did.
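The grep approach can be wrapped up so it covers more than one unit directory. This is only a sketch: the function name is made up, and the default directories are an assumption about where a given distribution keeps its unit files (some use /lib/systemd/system instead of /usr/lib/systemd/system).

```shell
#!/bin/sh
# find_units_by_text TEXT [DIR...]
# Prints the unit files under the given directories that contain TEXT as
# a literal string, such as a Description= text seen on the boot console.
find_units_by_text() {
  text=$1
  shift
  # Fall back to two common unit directories when none are given.
  [ "$#" -gt 0 ] || set -- /etc/systemd/system /usr/lib/systemd/system
  for dir in "$@"; do
    # Skip directories that do not exist; no match is not an error.
    if [ -d "$dir" ]; then
      grep -rlF -- "$text" "$dir" || :
    fi
  done
  return 0
}
```

Once this turns up a candidate, 'systemctl cat <unit name>' prints the unit file along with any drop-ins that apply to it.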

Sidebar: My usual general format for information-gathering scripts

I tend to write them like this:

#!/bin/sh
( date;
  [... various commands ...]
  echo
) >>/root/somefile.txt

The things I've found important are the date stamp at the start, that I'm appending to the file instead of overwriting it, and the blank line at the end for some more visual separation. Appending instead of overwriting can really save things if for some reason I have to reboot twice instead of once, because it means information from the first reboot is still there.
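A fuller sketch in the same format might look like the following. The particular commands are only guesses at what is useful (devices, network interfaces and routes, mounts, and running processes), and taking the log file as an argument is an assumption of this sketch, not part of the format above.

```shell
#!/bin/sh
# Sketch of a gather script in the format above. The specific commands and
# the log path argument are illustrative assumptions; substitute your own.

section() {
  # Print a header, then run the command. A missing or failing command
  # is noted in the output instead of aborting the script.
  echo "--- $* ---"
  if command -v "$1" >/dev/null 2>&1; then
    "$@" 2>&1 || echo "(exit status $?)"
  else
    echo "($1 not found)"
  fi
}

gather_info() {
  # $1 is the log file; append so that earlier boots are preserved.
  ( echo "=== $(date) ==="
    section ls -l /dev/disk/by-id
    section ip addr
    section ip route
    section cat /proc/mounts
    section ps -e -o pid,args
    echo
  ) >>"$1" 2>&1
}

# Run only when a log file is named, e.g. "gather-info /root/somefile.txt".
if [ "$#" -gt 0 ]; then
  gather_info "$1"
fi
```

With this version the unit line would become something like 'ExecStartPre=/root/gather-info /root/somefile.txt'. Recording failures in the log matters at boot time, since a command that cannot run yet is itself a clue about the system's state.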

linux/SystemdCapturingBootState written at 02:09:14

