One way of capturing debugging state information in a systemd-based system

November 26, 2017

Suppose, not entirely hypothetically, that you have a systemd .service unit running something (whatever it is) that is mysteriously failing to start or run properly. In the most frustrating version of this, you can run the operation just fine after the system finishes booting and you can log in, but it fails during boot and you can't see why. In this situation you often want to gather information about the boot-time state of the system just before your daemon or program is started and fails; you might need to know things like what devices are available, the state of network interfaces and routes, what filesystems have been mounted, what other things are already running, and so on.

All of this information can be gathered by a shell script, but the slightly tricky bit is figuring out how to get it to run. I've taken two approaches here. The first one is to simply write a new .service file:

[Unit]
Description=Debug stuff
After=<whatever>
Before=<whatever else>

[Service]
Type=oneshot
RemainAfterExit=True
ExecStart=/root/gather-info

[Install]
WantedBy=multi-user.target

Here the actual information-gathering script is /root/gather-info. I typically have it write its data into a file in /root as well. I use /root as a handy dumping ground that's on the root filesystem but not conceptually owned by the package manager in the way that /etc, /bin, and so on are; I can throw things in there without worrying that I'm causing (many) future problems.
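
To have this actually run during boot you need to install and enable the unit. Assuming you've called it debug-stuff.service (the name here is arbitrary), that's something like:

cp debug-stuff.service /etc/systemd/system/
systemctl daemon-reload
systemctl enable debug-stuff.service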

(If you use an ExecStop= instead of ExecStart= you can gather the same sort of information at shutdown.)
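
A minimal sketch of the shutdown variant's [Service] section (systemd permits a oneshot unit with no ExecStart= at all, as long as RemainAfterExit is set):

[Service]
Type=oneshot
RemainAfterExit=True
ExecStop=/root/gather-info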

However, if you're interested in the state basically right before some other .service runs, the better approach is to modify that .service to add an extra ExecStartPre= line. In order to make sure I know what's going on, my approach is to copy the entire .service file to /etc/systemd/system (if necessary) and then edit it. As an example, suppose that your ZFS on Linux setup is failing to import pools on boot because the zfs-import-cache.service unit is failing.
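
The copy step itself is nothing special. Assuming your distribution ships its unit files in /usr/lib/systemd/system, it's something like:

cp /usr/lib/systemd/system/zfs-import-cache.service /etc/systemd/system/
vi /etc/systemd/system/zfs-import-cache.service
systemctl daemon-reload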

Here I'd modify the .service like this:

[...]

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStartPre=/root/gather-info
ExecStartPre=/sbin/modprobe zfs
ExecStart=/sbin/zpool import -c /etc/zfs/zpool.cache -aN

[...]

Unfortunately I don't think you can do this without copying the whole .service file, or at least I wouldn't want to trust it any other way; a drop-in that adds its own ExecStartPre= appends to the unit's existing list of commands, so the information gathering would run after /sbin/modprobe zfs instead of before it.

Possibly there's a better way to do this in the systemd world, but I've been sort of frustrated by how difficult it is to do various things here. For example, it would be nice if systemd would easily give you the names of systemd units that ran or failed, instead of their Description= texts. More than once I've had to resort to 'grep -rl <whatever> /usr/lib/systemd/system' in an attempt to find a unit file so I could see what it actually did.
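
One partial workaround is that 'systemctl list-units --all' prints both unit names and their Description= texts, so you can search its output to go from a description back to a unit name, and then 'systemctl cat' will show you the unit file once you know its name:

systemctl list-units --all | grep -i '<whatever>'
systemctl cat <some-unit>.service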

Sidebar: My usual general format for information-gathering scripts

I tend to write them like this:

#!/bin/sh
( date;
  [... various commands ...]
  echo
) >>/root/somefile.txt

The things I've found important are the date stamp at the start, that I'm appending to the file instead of overwriting it, and the blank line at the end for some more visual separation. Appending instead of overwriting can really save things if for some reason I have to reboot twice instead of once, because it means information from the first reboot is still there.
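
As an illustration, here is a hypothetical filled-in version for the sort of boot-time state mentioned at the start (the specific commands and the output file name are just examples):

#!/bin/sh
( date
  # block devices and mounted filesystems:
  cat /proc/partitions
  cat /proc/mounts
  # network interfaces and routes:
  ip addr
  ip route
  # what else is running at this point:
  ps axu
  echo
) >>/root/bootstate.txt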


Comments on this page:

You can make the scripts a little easier to read and write by using exec to redirect, instead of a subshell:

#!/bin/sh
exec >> /root/somefile.txt ; date

# ...
# ...
# ...

echo
From 78.60.211.195 at 2017-11-26 08:09:48:

In the most frustrating version of this, you can run the operation just fine after the system finishes booting and you can log in, but it fails during boot and you can't see why

Would systemd.confirm_spawn be useful for this task?

Unfortunately I don't think you can do this without copying the whole .service file, or at least I wouldn't want to trust it any other way.

Should be fine as long as you don't forget to undo it later (I've more than once found a 6-month-old local copy shadowing the updated original unit).

That's what systemctl edit --full and systemctl revert are for, anyway.
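
For the zfs-import-cache.service example, that's:

systemctl edit --full zfs-import-cache.service    # make and edit a local copy in /etc
systemctl revert zfs-import-cache.service         # later, drop the local copy again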

For example, it would be nice if systemd would easily give you the names of systemd units that ran or failed, instead of their Description= texts.

Sadly, unit authors still do things like "Description=A high-performance web server" even though the documentation has a specific mention of avoiding exactly that...

By cks at 2017-11-26 20:40:44:

I don't think systemd.confirm_spawn is going to help here for a number of reasons. Even with systemd set up with a root shell on another VT so you can get in to examine the system state, confirm_spawn seems likely to slow down the whole boot to the point where race conditions in things like device probing may well vanish.

(It would be different if you could say 'only pause to confirm these particular units', but I don't think systemd supports that.)
