Systemd's DynamicUser feature is (currently) dangerous

August 8, 2018

Yesterday I described how timesyncd couldn't be restarted on one of our Ubuntu 18.04 machines. The specific cause of the failure was timesyncd attempting to access /var/lib/private/systemd/timesync and failing, because /var/lib/private is only accessible by root, not by the UID that timesyncd was running as. My diagnostic efforts left me puzzled as to how this was supposed to work at all, but Trent Lloyd (@lathiat) pointed me to the answer, which is in Lennart Poettering's article Dynamic Users with systemd. The article introduces the overall system, explains the role of /var/lib/private, and covers how timesyncd is supposed to get access through an otherwise inaccessible directory. I'll quote the explanation for that:

[Access through /var/lib/private] is achieved by invoking the service process in a slightly modified mount name-space: it will see most of the file hierarchy the same way as everything else on the system ([...]), except for /var/lib/private, which is over-mounted with a read-only tmpfs file system instance, with a slightly more liberal access mode permitting the service read access. [...]

Since timesyncd is not able to get access through /var/lib/private, you might guess that something has gone wrong in the process of setting up this slightly modified mount namespace. Indeed this turned out to be the case. The machine that this happened on is an NFS client and (as is usual) its UID 0 is mapped to an unprivileged UID on our fileservers. On this machine there were some FUSE mounts in the home directories of users who have their $HOME not world readable (our default $HOME permissions are owner-only, to avoid accidents). When systemd was setting up the 'slightly modified mount name-space' it attempted to access these FUSE mounts as part of binding them into the namespace, but it failed because UID 0 had no permissions to look inside user home directories.

This failure caused systemd to give up attempting to set up the namespace. However, systemd did not abort unit activation or even log an error message. Instead it went on to try to start timesyncd without this special namespace, despite the fact that timesyncd uses both DynamicUser and StateDirectory and so starting it normally was essentially guaranteed to fail.

(Although my initial case was dangling FUSE mounts, it soon developed that any FUSE mounts would do it, for example an sshfs or smbfs mount in a user's NFS-mounted home directory when the home directory isn't world-accessible.)
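
One rough way to see which mounts might trip this up is to list every mount point visible to a root process and check which ones it cannot even stat(). The sketch below is an illustration only, not what systemd does internally. The function names are hypothetical, and it assumes the standard Linux /proc/self/mountinfo layout, where the fifth whitespace-separated field is the mount point and special characters are written as octal escapes.

```python
import errno
import os
import re


def unescape(field):
    """Decode mountinfo octal escapes such as \\040 (space)."""
    return re.sub(r"\\([0-7]{3})", lambda m: chr(int(m.group(1), 8)), field)


def mount_points(mountinfo_text):
    """Return the mount point (5th field) of every line of mountinfo text."""
    points = []
    for line in mountinfo_text.splitlines():
        fields = line.split()
        if len(fields) >= 5:
            points.append(unescape(fields[4]))
    return points


def unreachable_mounts(points, probe=os.stat):
    """Return (path, errno name) for each mount point the probe fails on.

    The probe defaults to os.stat; it is a parameter so the logic can be
    exercised without real mounts.
    """
    failed = []
    for path in points:
        try:
            probe(path)
        except OSError as e:
            failed.append((path, errno.errorcode.get(e.errno, str(e.errno))))
    return failed
```

Run as root, something like unreachable_mounts(mount_points(open("/proc/self/mountinfo").read())) would list the mount points that UID 0 can't reach on that machine. Be aware that a stat() on a stale or hung mount can block.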

Systemd's failure to handle errors in setting up the namespace here has been raised as systemd issue 9835. However, merely logging an error or aborting the unit activation would not actually fix the core problem; it would merely let you see exactly why your timesyncd or whatever service is failing to start. The core problem is that systemd's current design for DynamicUser intrinsically blows up if systemd and UID 0 don't have full access to every mount that's visible on the system.

(Well, DynamicUser plus StateDirectory, but the idea seems to be that pretty much every service using dynamic users will have a systemd managed state directory.)
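
To find out which services on a machine depend on this combination, one option is to scan unit file text for DynamicUser together with StateDirectory. The following is a minimal sketch with hypothetical function names. It deliberately ignores line continuations and other unit file subtleties, and the later of two assignments simply wins.

```python
# Values that systemd accepts as a boolean "true".
TRUE_VALUES = {"1", "yes", "true", "on"}


def service_settings(unit_text):
    """Collect key=value settings from the [Service] section of a unit."""
    settings = {}
    section = None
    for raw in unit_text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#;":
            continue  # skip blank lines and comments
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
            continue
        if section == "Service" and "=" in line:
            key, _, value = line.partition("=")
            settings[key.strip()] = value.strip()
    return settings


def depends_on_private_namespace(settings):
    """True if the unit sets DynamicUser and a non-empty StateDirectory."""
    dynamic = settings.get("DynamicUser", "").lower() in TRUE_VALUES
    has_state = bool(settings.get("StateDirectory", ""))
    return dynamic and has_state
```

The text to scan could come from 'systemctl cat <unit>', which prints a unit along with its drop-ins.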

In my opinion, this makes using DynamicUser surprisingly dangerous. A systemd service that is set to use it can't be reliably started or restarted on all systems; it only works on some systems, some of the time (but those happen to be the common case). If there's ever a problem setting up the special namespace that each such service requires, things fail. Machines that are NFS clients are the obvious case, since the client's UID 0 often has limited privileges, but I believe that there are likely to be others.

(And of course services can be restarted for random and somewhat unpredictable reasons, such as package updates or other services being restarted. You should not assume that you can always control these circumstances, or completely predict the state of the system when they happen.)

Comments on this page:

By Tom at 2018-08-09 04:36:29:

I wonder if setting ProtectHome=yes (and/or some variety of InaccessiblePaths) would cause systemd to not try to remount any of the paths under /home, thus not trying to bind the fuse mounts into the restricted namespace.

@Tom It doesn't seem to help in the current implementation. At least, systemd-timesyncd.service already has ProtectHome=yes, when I looked on upstream v237.

I think the problem is due to ProtectSystem=strict. This is effectively a superset of ProtectHome=read-only. Systemd tries to turn all the mounts it finds there into read-only bind mounts, and it chokes if it can't access them.

I'm not sure exactly why Chris's system hits this, since I think systemd should be running as root with CAP_DAC_READ_SEARCH at this point. But at least it was easy to contrive another situation where this is clearly dangerous. (FUSE --no-allow-other) "unprivilged users can break starting services which have DynamicUser=1 (and rely on StateDirectory=)"

Oh, I think I see what Chris is saying. The implication is that CAP_DAC_READ_SEARCH doesn't work on NFS, which sounds very plausible. Then there are places root can't go.

(It's also how my vague memories say that NFS root_squash works).

By cks at 2018-08-09 20:43:27:

This is indeed how root over NFS works, at least for NFS v3 and old style NFS v4 security. It doesn't matter what capabilities the client's kernel thinks UID 0 is supposed to have (or any particular use of any particular UID by any particular program with various capabilities). What matters is what UID the client's kernel puts in the NFS requests themselves and what the server feels about it, and by default NFS servers feel that UID 0 is actually UID <something> and that UID has no special privileges.

(The 'something' varies, but it is usually a predictable large UID that isn't supposed to be used by anything else.)
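
As a toy model of the above (not actual NFS server code), the server first rewrites UID 0 in the request and only then applies ordinary Unix permission checks. The function names here are made up for illustration, and the anonymous UID of 65534 is an assumption (it is a common default on Linux NFS servers). ACLs and newer NFS v4 security are ignored.

```python
import stat


def squash(uid, anon_uid=65534, root_squash=True):
    """Return the UID the server acts on for a request from `uid`.

    anon_uid=65534 is an assumed default; real servers may differ.
    """
    return anon_uid if (root_squash and uid == 0) else uid


def can_search(dir_mode, dir_uid, dir_gid, uid, gids=()):
    """Classic Unix check: may `uid` traverse (search) this directory?"""
    if uid == 0:
        return True  # an unsquashed root bypasses permission bits
    if uid == dir_uid:
        return bool(dir_mode & stat.S_IXUSR)
    if dir_gid in gids:
        return bool(dir_mode & stat.S_IXGRP)
    return bool(dir_mode & stat.S_IXOTH)
```

With an owner-only home directory (mode 0700, owned by UID 1000), can_search(0o700, 1000, 1000, squash(0)) is False, while the same call with root_squash=False is True. Client-side capabilities never enter into it.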

In theory Linux could (re)bind mounts inside NFS mounted directories without having to access them from the server, because how the mount namespace is set up is entirely up to the client. In practice systemd first attempts to do various other operations that look like conventional filesystem access, which the kernel has no choice but to ship off to the NFS server and get EACCES on.

I understand umount() can avoid touching the root of the FS it unmounts. And remount does the same. But AFAICT remount still needs permission to traverse the parent directories and find the mount point.

I don't think I want to try redesigning BindReadOnlyPaths= (hence ProtectHome=read-only and ProtectSystem=strict) as copying mount trees in userspace. That would let you omit the inaccessible mounts, yes. But it also means dealing with weird race conditions. Was a new sub-mount created after we read /proc/mounts, but before we mounted a copy of the parent mount and hence got automatic propagation of new sub-mounts?

(Mount propagation is generally essential, in order to allow unmounting of removable devices etc in the main namespace).

IIRC, when the mount program was still tracking mounts in /etc/mtab, they got rather unhappy with bugs related to stuff like recursive binds.

The fix proposed for this sticks with 'MS_REC|MS_BIND' (as in 'mount --rbind'), and just accepts that it won't be able to change the inaccessible submounts to readonly bind mounts.

If someone thinks ProtectSystem=strict was a marginal idea to start with, this issue may strengthen their opinion :). I commented that I think the docs are misleading about how reliable it can be for custom mounts in general. It feels like it's more a way to describe relationships between different services and the system, than to improve security.

Gah. Sorry, I see you were hedging on what Linux could theoretically do, without specifically saying it would be possible on current kernels. Makes sense.

Written on 08 August 2018.

Last modified: Wed Aug 8 21:51:36 2018