Putting cron jobs into systemd user slices doesn't always work (on Ubuntu 16.04)

September 27, 2017

As part of dealing with our Ubuntu 16.04 shutdown problem, we now have our systems set up to put all user cron jobs into systemd user slices so that systemd will terminate them before it starts unmounting NFS filesystems. Since we made this change, we've rebooted all of our systems and thus had an opportunity to see how it works in practice in our environment.

Unfortunately, what we've discovered is that pam_systemd apparently doesn't always work right. Specifically, we've seen some user cron @reboot entries create processes that wound up still under cron.service, although other @reboot entries for the same user on the same machine wound up with their processes in user slices. When things fail, pam_systemd doesn't log any sort of errors that I can see in the systemd journal.
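One way to check where a user's processes ended up is to read the systemd cgroup path from /proc/<pid>/cgroup for each of them. What follows is a minimal sketch rather than the check we actually ran. The function names are made up for illustration, and it assumes that the owner of /proc/<pid> is the user running the process, which is normally but not always true.

```python
import os


def systemd_cgroup(cgroup_text):
    """Return the systemd cgroup path from the text of /proc/<pid>/cgroup."""
    for line in cgroup_text.splitlines():
        # Each line is "hierarchy-ID:controller-list:path".
        parts = line.split(":", 2)
        if len(parts) != 3:
            continue
        hier_id, controllers, path = parts
        # cgroup v1 has a named "name=systemd" hierarchy; the unified
        # hierarchy appears as "0::<path>".
        if controllers == "name=systemd" or (hier_id == "0" and controllers == ""):
            return path
    return None


def classify(path):
    """Label a systemd cgroup path as cron.service, a user slice, or other."""
    if path is None:
        return "unknown"
    parts = path.strip("/").split("/")
    if "cron.service" in parts:
        return "cron.service"
    if parts[0] == "user.slice":
        return "user slice"
    return "other"


def report(uid):
    """Print the PID, classification, and cgroup path of each process owned by uid."""
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            # Assumption: the owner of /proc/<pid> is the process's user.
            if os.stat("/proc/" + entry).st_uid != uid:
                continue
            with open("/proc/%s/cgroup" % entry) as f:
                path = systemd_cgroup(f.read())
        except OSError:
            continue  # the process exited or is unreadable
        print(entry, classify(path), path)
```

Calling report() with a user's numeric UID lists that user's processes. Anything classified as 'cron.service' is a process that did not get moved into a user slice.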

(Since no failures are logged, this doesn't seem related to the famous systemd issue where pam_systemd can't talk to systemd, e.g. systemd issue 2863 or this Ubuntu issue.)

The pam_systemd source code isn't very long and doesn't do very much itself. The most important function here appears to be pam_sm_open_session, and reading the code I can't spot a failure path that doesn't cause pam_systemd to log an error. The good news is that turning on debugging for pam_systemd doesn't appear to result in an overwhelming volume of extra messages, so we can probably do this on the machines where we've seen the problem in the hopes that something useful shows up.

(It will probably take a while, since we don't reboot these machines very often. I have not seen or reproduced this on test machines, at least so far.)

Looking through what 'systemctl list-dependencies' with various options says for cron.service, it's possible that we need an explicit dependency on systemd-logind.service (although systemd-analyze on one system says that systemd-logind started well before cron). In theory it looks like pam_systemd should be reporting errors if systemd-logind hasn't started, but in practice, who knows. We might as well adopt a cargo cult 'better safe than sorry' approach to unit dependencies, even if it feels like a very long shot.

(Life would be simpler if systemd had a simple way of discovering the relationship, if any, between two units.)
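A partial workaround is to ask 'systemctl show' for one unit's dependency and ordering properties and look for the other unit in them. The sketch below only finds direct relationships. It will not notice an ordering that exists only through intermediate units, which is probably the more interesting case here. The helper names and the particular list of properties are my own choices for illustration.

```python
import subprocess

# Dependency and ordering properties to ask 'systemctl show' about.
PROPS = ("Requires", "Wants", "BindsTo", "PartOf", "After", "Before")


def parse_show(output):
    """Parse 'systemctl show' output into {property: set of unit names}."""
    props = {}
    for line in output.splitlines():
        name, sep, value = line.partition("=")
        if sep:
            props[name] = set(value.split())
    return props


def direct_relationships(show_output, other):
    """Return the sorted names of properties that mention the unit 'other'."""
    props = parse_show(show_output)
    return sorted(name for name, units in props.items() if other in units)


def query(unit, other):
    """Run 'systemctl show' on unit and report how it directly refers to other."""
    cmd = ["systemctl", "show"]
    for prop in PROPS:
        cmd += ["-p", prop]
    cmd.append(unit)
    out = subprocess.check_output(cmd, universal_newlines=True)
    return direct_relationships(out, other)
```

For example, query("cron.service", "systemd-logind.service") returns the properties of cron.service that name systemd-logind.service directly. An empty list only means there is no direct relationship, not that there is none at all.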

Last modified: Wed Sep 27 23:58:12 2017