Putting cron jobs into systemd user slices doesn't always work (on Ubuntu 16.04)

September 27, 2017

As part of dealing with our Ubuntu 16.04 shutdown problem, we now have our systems set up to put all user cron jobs into systemd user slices so that systemd will terminate them before it starts unmounting NFS filesystems. Since we made this change, we've rebooted all of our systems and thus had an opportunity to see how it works in practice in our environment.

Unfortunately, what we've discovered is that pam_systemd apparently doesn't always work right. Specifically, we've seen some user cron @reboot entries create processes that wound up still under cron.service, although other @reboot entries for the same user on the same machine wound up with their processes in user slices. When things fail, pam_systemd doesn't log any sort of errors that I can see in the systemd journal.
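One way to spot stray processes after a reboot is to read each process's /proc/<pid>/cgroup file and see where systemd has put it. What follows is a minimal sketch under some assumptions, not the tooling we actually use. The function names are invented for illustration. It assumes the cgroup v1 layout of Ubuntu 16.04, where the systemd hierarchy appears as a `name=systemd` line, and that the cron unit is called cron.service.

```python
import os

def systemd_cgroup(cgroup_text):
    """Return the systemd cgroup path from the contents of /proc/<pid>/cgroup.

    Accepts the cgroup v1 'N:name=systemd:/path' line and the unified
    '0::/path' line; returns None if neither is present.
    """
    for line in cgroup_text.splitlines():
        parts = line.split(":", 2)
        if len(parts) != 3:
            continue
        _, controllers, path = parts
        if controllers in ("name=systemd", ""):
            return path
    return None

def classify(path):
    """Classify a systemd cgroup path as 'cron', 'user', or 'other'."""
    if path is None:
        return "other"
    components = path.strip("/").split("/")
    if "cron.service" in components:
        return "cron"
    if components[0] == "user.slice":
        return "user"
    return "other"

def processes_under_cron(uid):
    """List (pid, cgroup path) for processes of uid still under cron.service."""
    found = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            # The owner of /proc/<pid> is normally the process's effective uid.
            if os.stat("/proc/" + entry).st_uid != uid:
                continue
            with open("/proc/%s/cgroup" % entry) as f:
                path = systemd_cgroup(f.read())
        except OSError:
            continue  # the process exited while we were looking
        if classify(path) == "cron":
            found.append((int(entry), path))
    return found
```

Calling `processes_under_cron(uid)` for a user with @reboot entries should return an empty list when pam_systemd has worked for all of that user's jobs.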

(Since no failures are logged, this doesn't seem related to the famous systemd issue where pam_systemd can't talk to systemd, eg systemd issue 2863 or this Ubuntu issue.)

The pam_systemd source code isn't very long and doesn't do very much itself. The most important function here appears to be pam_sm_open_session, and reading the code I can't spot a failure path that doesn't cause pam_systemd to log an error. The good news is that turning on debugging for pam_systemd doesn't appear to result in an overwhelming volume of extra messages, so we can probably do this on the machines where we've seen the problem in the hopes that something useful shows up.

(It will probably take a while, since we don't reboot these machines very often. I have not seen or reproduced this on test machines, at least so far.)

Looking through what 'systemctl list-dependencies' with various options says for cron.service, it's possible that we need an explicit dependency on systemd-logind.service (although systemd-analyze on one system says that systemd-logind started well before cron). In theory it looks like pam_systemd should be reporting errors if systemd-logind hasn't started, but in practice, who knows. We might as well adopt a cargo cult 'better safe than sorry' approach to unit dependencies, even if it feels like a very long shot.

(Life would be simpler if systemd had a simple way of discovering the relationship, if any, between two units.)
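In the absence of such a tool, a partial answer is to ask 'systemctl show' for a unit's ordering and requirement properties and look for the other unit in them. The sketch below does only that. The helper names are hypothetical, the choice of four properties is my assumption about which ones matter here, and it sees only direct relationships, not ones that run through intermediate units such as basic.target.

```python
import subprocess

# Properties of interest; an assumption about what matters for this question.
RELATION_PROPERTIES = ("After", "Before", "Requires", "Wants")

def parse_show_output(text):
    """Parse 'Key=unit1 unit2 ...' lines from 'systemctl show' into a dict of sets."""
    props = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            props[key] = set(value.split())
    return props

def direct_relations(props, other):
    """Return the sorted names of the properties in props that list the unit 'other'."""
    return sorted(key for key, units in props.items() if other in units)

def show_relations(unit, other):
    """Ask systemctl how 'unit' directly relates to 'other' (needs a running systemd)."""
    cmd = ["systemctl", "show", unit]
    for prop in RELATION_PROPERTIES:
        cmd += ["-p", prop]
    out = subprocess.check_output(cmd, universal_newlines=True)
    return direct_relations(parse_show_output(out), other)
```

For example, `show_relations("cron.service", "systemd-logind.service")` returns an empty list if cron.service has no direct ordering or requirement on systemd-logind.service, which says nothing about indirect ones.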


Comments on this page:

       /* Make this a NOP on non-logind systems */
       if (!logind_running())
               return PAM_SUCCESS;

Surely that's the problem.

logind doesn't seem to be dbus-activated, so pam_systemd can't take advantage of bus-activation to wait for it to start up. Instead it's just checking the filesystem:

       static inline bool logind_running(void) {
               return access("/run/systemd/seats/", F_OK) >= 0;
       }

getty@.service has the same bug as your cron :(. It runs after systemd-user-sessions.service, but has no ordering with systemd-logind.

I think systemd-user-sessions.service needs to run After=systemd-logind.service

Ah, that's not quite right.

logind_running() is lying; it actually means systemd_running(), because the directory it tests for is created by systemd-tmpfiles (tmpfiles.d/systemd.conf). I think it's even created if you disabled building logind (??).

So I still think there's a dependency bug, but it doesn't explain why you're not seeing any error messages.

By cks at 2017-09-28 08:42:39:

Although I linked to the upstream source, I was looking at the Ubuntu/Debian version of systemd-229 to see if I could spot a silent exit path. It turns out that Debian specifically removes that check, with the patch comment:

Don't make pam_sm_open_session() a NOP if logind is not running. Trying to access logind via D-Bus will start it on demand.

Also, in one case that I've looked at in detail, it was the second @reboot cron job for a user that wound up under cron.service, while the first one was properly handled. This suggests something different from a systemd-logind startup delay (or logind not running at the start); instead it feels more like logind might be overloaded and dropping things.

Thanks.

I misread logind. It does indeed have an alias for bus-activation; this is set up "statically" in /usr/lib/systemd/system. It's not set where I had looked in /etc/systemd/system. (`systemctl disable` will not work on it). So we wouldn't expect to need any explicit dependency.
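A quick way to see where such an alias lives is to look for the symlink in each unit directory. This is only a sketch: the function name is hypothetical, and the alias name `dbus-org.freedesktop.login1.service` and the list of directories to search are assumptions that may differ between distributions.

```python
import os

# Assumed name of the bus-activation alias for logind.
ALIAS = "dbus-org.freedesktop.login1.service"

def find_alias(unit_dirs, alias=ALIAS):
    """Return (path, symlink target) pairs for each directory that holds the alias."""
    hits = []
    for d in unit_dirs:
        path = os.path.join(d, alias)
        if os.path.islink(path):
            hits.append((path, os.readlink(path)))
    return hits
```

Something like `find_alias(["/etc/systemd/system", "/usr/lib/systemd/system", "/lib/systemd/system"])` shows which directory, if any, carries the alias and what it points to.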

Written on 27 September 2017.

Last modified: Wed Sep 27 23:58:12 2017