It matters where (or when) your programs ask questions
The other day, I wrote about how we belatedly evolved our account creation script so that it could now just assume we wanted the defaults for most everything, and how this simple change had been a real quality of life improvement for us. This improvement isn't just because we interact with the script a lot less; it's also because we changed where we interact with it. Specifically, we now only interact with the script right at the start and all the way at the end; before, we had to periodically interact with the script all the way through its run.
The problem with periodic interactions is that they slow down the whole process considerably, and they often make it feel draining and demanding. What happens in practice is that you start the process, let it run, get bored waiting for it to ask you a question, look away to do something else, don't notice immediately that the process has paused with a question, go back to it, answer the question, get bored again, and repeat until the whole thing is over. If you instead wind up constantly looking over to check on the process, or focusing on it while you wait for it to ask you something, it feels draining and demanding. You're not doing anything, but you have to pay attention and wait.
When you shift all of the questions and interaction to the start and the end, you wipe most of this away. You start the process, it immediately asks you a question or two, and then you can go away. When it finishes, you may have a final question or two to answer, but at that point it's actually done. You don't have to constantly pay it some amount of attention in order to keep it moving along; it becomes a fire and mostly forget thing. Maybe you look over every so often to see if it's finished yet, but you know that you're not really delaying it by not paying enough attention.
As a result of our experiences with this script (and similar ones that need to ask us questions or have us do things by hand), I've come to be strongly biased about where I want to put any interactions in my scripts. If I have to ask questions, I'm going to do my best to put them as early as possible. If I can't ask them right at the start, I'm at least going to ask them all at once, so there's only one pause and interruption, and once that's over I know I can basically ignore the script for a while.
(Our local scripts are not perfect here, and perhaps we should change one that asks its questions early but not right away. But that script does at least ask all its questions all at once.)
PS: You might wonder how you wind up with a bunch of questions scattered through your script. Through good intentions, basically. If you have a bunch of different operations to do and a tacit custom of manually confirming operations, you can easily wind up with a pattern where people adding operations put an 'okay, should I do this/tell me what option to take' question right before the operation itself. Then you wind up with a stop-start script that keeps pausing to ask you questions.
Go's net package doesn't have opaque errors, just undocumented ones
I continue to be irritated by how opaque important Go errors are. I should not have to do string comparisons to discover that my network connection failed due to 'host is unreachable'.
The standard library net package has a general error type, *net.OpError, that's returned from most network operations. If you read through the package documentation straightforwardly, as I did in this tweet, you will likely conclude that the only reasonable way to see if a net.Dial() call to something has failed because your Unix is reporting 'no route to host' is to perform a string match against the string value of the error you get back.

(You want to do that string match against the *net.OpError's .Err field, as that's what gets you the constant error string without varying bits like the remote host and port you're trying to connect to.)
As I discovered when I started digging into things in the process
of writing a different version of this entry, things are somewhat
more structured under the hood. In fact, the error that you get back
from net.Dial() is likely to be made up entirely of officially exported types, and
you can do a more precise check than string comparisons (at least
on Unix), but you have to reach through several layers to see what
is going on. It goes like this:
- net.Dial() is probably returning a *net.OpError, which wraps another error that is stored in its .Err field.
- if you have a connection failure (or some other specific OS-level error), the *net.OpError.Err value is probably an *os.SyscallError. This is itself a wrapper around an underlying error, stored in its .Err field (and the syscall that failed is in .Syscall; you could verify that it's 'connect').
- this underlying error is probably a syscall.Errno, which can be compared against the various E* errno constants that are also defined in syscall. Here, I'd want to check for syscall.EHOSTUNREACH, which is 'no route to host'.
So we have a syscall.Errno inside an *os.SyscallError inside a *net.OpError. This wrapping sequence is not documented and thus not covered by any compatibility guarantees (neither is the string comparison, of course). Since all of these errors are declared as type error instead of concrete types, unwrapping the whole nesting requires a bunch of checked type casts.
If I was doing this regularly, I would probably bother to write a function to check 'is this errno <X>', or perhaps a list of errnos. As a one-off check, I don't feel particularly guilty about doing the string check even now that I know it's possible to get the specific details if you dig hard enough. Pragmatically it works just as well, it's probably just as reliable, and it's easier.
(You still need to do a checked type cast to *net.OpError to get at its .Err field, but that's as far as you need to go. If you don't even want to bother with that, you could just string-ify the whole error and then use a substring match instead of an exact one. For my purposes I wanted to check some other parts of the *net.OpError, so I needed the type cast anyway.)
In my view, the general shape of this sequence of wrapped errors should be explicitly documented. Like it or not, the relative specifics of network errors are something that people care about in the real world, so they are going to go digging for this information one way or another, and I at least assume that Go would prefer we unwrap things to check explicitly rather than just string-ifying errors and matching strings. If there are cautions about future compatibility or present variations in behavior, document them explicitly so that people writing Go programs know what to look out for.
(Like it or not, the actual behavior of things creates a de facto standard, especially if you don't warn people away. Without better information, people will code to what the dominant implementation actually does, with various consequences if this ever changes.)
Our problem with HTTPS and user-created content
We have a departmental web server, where people can host their personal pages (eg) and pages for their research groups and so on, including user-run web servers behind reverse proxies. In other words, this web server has a lot of content, created by a lot of people, and essentially none of it is under our control. These days, in one sense this presents us with a bit of a problem.
Our departmental web server supports HTTPS (and has for years). Recent browser developments are clearly pushing websites from HTTP to HTTPS, even if perhaps not as much as has been heralded, and so it would be good if we were to actively switch over. But, well, there's an obvious problem for us, and the name of that problem is mixed content. A not insignificant number of pages on our web server refer to resources like CSS stylesheets using explicit HTTP URLs (either local ones or external ones), and so would and do break if loaded over HTTPS, where browsers generally block mixed content.
We are obviously not going to break user web pages just because the Internet would now kind of like to see us using HTTPS instead of HTTP; if we even proposed doing that, the users would get very angry at us. Nor is it feasible to get users to audit and change all of their pages to eliminate mixed content problems (and from the perspectives of many users, it would be make-work). The somewhat unfortunate conclusion is that we will never be able to do a general HTTP to HTTPS upgrade on our departmental web server, including things like setting HSTS. Some of the web server's content will always be in the long tail of content that will never migrate to HTTPS and will continue to be HTTP content for years to come.
This issue probably confronts anyone with significant amounts of user-created content, especially in situations where people wrote raw HTML, CSS, and so on. I suspect that a lot of these sites will stay HTTPS-optional for a long time to come.
(Our users can use a
.htaccess to force HTTP to HTTPS redirection
for their own content, although I don't expect very many people to
ever do that. I have set this up for my pages, partly just to make sure that
it worked properly, but I'm not exactly a typical person here.)
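For reference, a per-user redirect of this sort can be a short mod_rewrite stanza. This is a hedged sketch, not our actual configuration; it assumes mod_rewrite is enabled and that per-directory overrides are allowed, and the exact rules needed depend on the server setup.

```apache
# Hypothetical .htaccess fragment: send plain-HTTP requests to HTTPS.
RewriteEngine On
RewriteCond %{HTTPS} !=on
RewriteRule ^(.*)$ https://%{HTTP_HOST}%{REQUEST_URI} [R=301,L]
```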
The evolution of our account creation script
One of the things about system administration automation is that its evolution often follows the path of least resistance. This can leave you with interesting and peculiar historical remnants, and it can also create situations where it takes a relatively long time before a system does the obvious thing. As it happens, I have a story about this.
To go with our account request system, which handles people requesting new accounts and authorizing requested accounts, we have an actual script that we run to actually create Unix accounts. Until relatively recently that script asked you a bunch of questions, although they all had default answers that we'd accept essentially all of the time. The presence of these questions was both a historical remnant of the path that the script took and an illustration of how unquestioningly acclimatized we can all become to what we think of as 'normal'.
We have been running Unix systems and creating accounts on them for a very long time, and in particular we've been doing this since before the World Wide Web existed and was readily accessible. Back in the beginning of things, accounts were requested on printed forms; graduate students and suchlike filled out the form with the information, got their account sponsors to sign it, handed it to the system staff, and the system staff typed all of the information into a script that asked us questions like 'login?', 'name?', 'Unix group?', 'research group affiliation?', and so on.
At a certain point, the web became enough of a thing that having a CGI version of our paper account request form was an obvious thing to do. Not everyone was going to use the CGI form (or be able to), and anyway we already had the account creation script that knew all of the magic required to properly create an account around here, so we adapted the existing script to also work with the CGI. The CGI wrote out the submitted information into a file (basically as shell variable assignments) and this file was then loaded into the account creation script as the default answers to many of the questions that had originally been fields on the printed form. If the submitted information was good, you could just hit Return through many of the questions. After you created the account, you then had to email some important information about it (especially the temporary password) off to the person it was for; you did this by hand, because you generated the random password by hand outside of the script.
(For reasons lost to history, the data file that the CGI wrote and the script loaded was a m4 file that was then processed through m4 to create shell variable assignments.)
When we wrote our account request system to replace the basic CGI (and the workflow around it, which involved manually emailing account sponsors to ask them about approving accounts), the simple and easy way for it to actually get accounts created was to carefully write the same data file that the CGI had used (m4isms and all). The account request script remained basically unchanged, and in particular it kept asking us to confirm all of the 'default' answers, ie all of the information that the account request system had already validated and generated. More than that, we added a few more bits of special handling for some accounts, with their own questions.
(Although the account request system was created in 2011, it took until a 2016 major revision for a new version of Django for us to switch from generating m4 data files to just directly generating shell variable assignments that the script sourced with '.'.)
That we had to actually answer these questions and then write the 'you have a new account' email made the whole process of creating an account a tedious thing. You couldn't just start the script and go away for a while; you had to periodically interact with it, hitting Return, generating a password in another window and pasting it in to the password prompt, and composing email yourself. None of these things were actually necessary for the backend of the account request system, but they stayed for historical reasons (and because we needed them occasionally, because some accounts are created outside of the account request system). And we, the people who used the script, were so acclimatized to this situation that we didn't really think about it; in fact I built my own automation around writing the 'you have a new account' form email.
At this point I've forgotten what the exact trigger event was, but last year around this time, in the middle of creating a bunch of new graduate student accounts (where the existing script's behavior was at its most tedious), we realized that this could be fixed. I'll quote my commit messages:
    New 'fast create' mode for account creation that takes all the defaults and doesn't bother asking if we're really sure.

    For fast mode, add the ability to randomly generate or set the initial password at the start of the process.

    offer to send new-account greeting email.

    make sending greeting email be the default (if you just hit return).
(In theory we could make sending the greeting email happen automatically. In practice, asking a final question gives us an opportunity to look back at all the messages printed out just in case there's some problem that the script didn't catch and we want to pause to fix things up.)
This simple change has been a real quality of life improvement for us, turning a tedious slog into a mostly fire and forget exercise that we can casually run through. That it took so long to make our account creation script behave this way is an illustration not just of the power of historical paths but also of the power of habituation. We were so used to how the existing system worked that we never really questioned if it had to be that way; we just grumbled and accepted it.
(This is, in a sense, part of the power of historical paths. The path that something took to get where it is shapes what we see as 'normal' and 'just how things are', because it's what we get used to.)
Sidebar: There were some additional steps in there
There are a few questions in the account creation script where in theory we have a genuine choice to make; for example, some accounts have several options for what filesystem they get created in. Part of what made the no-questions version of the script possible was that we realized that in practice we always made a particular choice (for filesystems, we always picked the one with the most free space), so we revised the script to make this choice the default answer.
Had we not worked out default answers for all of these questions, we couldn't have made the creation script not even ask the questions. We might have done both at the same time if it was necessary, but in practice it certainly helped that everything already had default answers so the 'fast create' mode could just be 'take all of the default answers without requiring confirmation'.
A recent spate of ZIP attachments with everything
Our program for logging email attachment type information looks inside zip archives, including .jar files and one level of nested zip archives. Often what we see is routine, basically the sort of content you'd expect from either ordinary email or malware, but recently we've been seeing zip archives that are just stuffed with at least one of almost any file extension you can think of. A few days ago we logged an extreme example:
1fnnAC-0003dZ-EP attachment application/zip; MIME file ext: .zip; zip exts: .jar; inner zip exts: .abc .abl .acc .ach .adc .adz .afd .age .ago .agy .aht .ake .ala .alp .and .ans .aob .aor .app .apt .ara .ary .aud .aus .ave .axe .baa .bag .bap .bat .bde .bet .bin .bis .bkg .boe .bra .bsh .buz .bye .cai .cal .cat .caw .cdg .chm .cit .class .cli .clo .col .cop .cpl .crc .crs .cst .ctg .cto .cup .cwt .dad .dbl .dcb .der .det .dew .dey .dig .dil .dks .dur .dwt .dye .eft .ego .elb .elm .els .emf .emm .emu .err .esd .esq .ext .eyn .fax .fbi .fcs .fee .fei .fem .ffa .fgn .fig .flb .fly .foe .fog .fud .gab .gae .gal .gas .geb .gig .gin .gio .goa .gob .god .gon .goo .gox .gtc .gun .had .hah .hak .hao .hat .hau .hcb .hcl .hed .heh .hen .hes .hia .hip .hir .hld .hoc .hoe .hts .hug .hye .ibo .ide .ihp .ijo .ilk .imu .ing .ipr .iqs .ire .iwa .iyo .jah .jap .jay .jct .jem .jud .jur .kat .kaw .kay .key .khi .kop .kor .kos .kph .kyl .lab .lap .lcm .lea .lek .les .lib .lid .lit .llb .lou .lub .lxx .mao .map .maw .meu .mf .mix .mks .mog .mor .mot .mph .mus .nee .nef .nei .nep .nut .oak .obb .ofo .oki .one .oni .ops .ora .our .pan .pap .par .paw .pax .pay .pdq .peh .pep .pia .pie .pig .pit .pks .poh .pos .pot .ppa .pps .pre .pry .psi .pwr .pyr .rab .ram .rat .raw .rct .ref .reg .res .rfs .rig .rim .rix .rld .roc .roi .rpm .rut .rux .rwd .rwy .rye .sab .sau .sds .sed .sei .sel .sew .she .shr .sie .sil .sim .sip .six .sny .soe .sou .soy .sqq .stg .sum .sur .syd .tar .tat .tay .ted .tef .tem .tng .ton .tou .twa .udo .uns .urb .urn .uti .vac .vil .von .vum .wab .wae .wea .wop .wot .wro .wud .xii .xiv .xxi .xxv .xxx .yam .yay .yea .yeo .yer .yez .yoe .yrs .yun .zat .zen .zho .zig .zip .zod
(We deliberately log file extensions inside zip archives in alphabetical order, so it may well have had a much different order originally.)
This particular message was detected by Sophos PureMessage as 'Mal/DrodZp-A', which may be a relatively generic name. The Subject: of the message was the relatively generic 'Re: Invoice/Receipt', and I don't know what the overall MIME filename of the .zip was claimed to be. We've received a bunch of very similar attachments that were just .jars (not .zip in .jar) with giant lists of extensions. Many of them have been rejected for containing (nominal) bad file types, and their MIME filenames have been things like 'ORIGIAL SHIPPING DOCUMENTS.qrypted.jar' and "0042133704 _ PDF.jar".
(It's possible that these direct .jars would also be detected as Mal/DrodZp-A, but we reject for bad file types before we check for known viruses.)
I doubt that the attachment had genuine examples of these file types, especially things like .rpm (RPM packages) and .nef (Nikon camera RAWs, which are invariably anywhere from several megabytes to tens of megabytes for the latest high-resolution Nikon DSLRs). I'm sure that the malware has some reason for doing this spray of files and file extensions, but I have no idea what it might be. If there are some anti-virus products that give up if a .jar has enough different file extensions in it, that's kind of sad (among other things).
Sadly for any additional filtering we might consider doing, I suspect that the dangerous parts of this were in the actual Java stuff (eg the .class files) and everything else is a distraction. It'd be somewhat interesting to pick through a captured sample, because I am curious about what's in all of those files (or if they're just zero-length ones put in to pad things out) and also what file names they have. Did the malware make up some jumble of random file names, or did it embed a message in them or something clever? I'll never know, because it's not important enough to bother doing anything special for.
Fetching really new Fedora packages with Bodhi
Normal Fedora updates that have been fully released are available through the regular updates repository, which is (or should be) already configured into dnf on your Fedora system. More recent (and less well tested) updates are available through the updates-testing repository, which you can selectively enable in order to see if what you're looking for is there. Right now I'm interested in Rust 1.28, because it's now required to build the latest Firefox from source:

    # dnf --enablerepo=updates-testing check-update 'rust*'
    Last metadata expiration check: 0:00:56 ago on Fri 10 Aug 2018 02:12:32 PM EDT.
    #
However sometimes, as in this case and past ones, any update that actually exists is too new to even have made it into the updates-testing DNF repo. Fedora does their packaging stuff through Fedora Bodhi (see also), and as part of this packages can be built and available in Bodhi even before they're pushed to updates-testing, so if you want the very freshest bits you want to check in Bodhi.
There are two ways to check Bodhi: through the command line using the bodhi client (which comes from the bodhi-client package), or through the website. Perhaps I should use the client all the time, but I tend to reach for the website as my first check. The URL for a specific package on the website is of the form:

    https://bodhi.fedoraproject.org/updates/?packages=<package name>

For example, https://bodhi.fedoraproject.org/updates/?packages=rust is the URL for Rust (and there's an RSS feed if you care a lot about a particular package). For casual use, it's probably easier to just search from Bodhi's main page.
Through the command line, checking for and downloading an update looks like this:
    ; bodhi updates query --packages rust --releases f28 --status pending
    ============================= [...]
    rust-1.28.0-2.fc28
    ============================= [...]
    Update ID: FEDORA-2018-42024244f2
    [...]
    Notes: New versions of Rust and related tools -- see the release notes
         : for [1.28](https://blog.rust-lang.org/2018/08/02/Rust-1.28.html).
    Submitter: jistone
    Submitted: 2018-08-10 14:35:56
    [...]
We insist on the pending status because that cuts the listing down and normally gives us only one package, where we get to see detailed information about it; I believe that there's normally only one package in pending status for a particular Fedora release. If there are multiple ones, you get a less helpful summary listing that will give you only the full package name instead of the update ID. If you can't get the update ID through bodhi, you can always get it through the website by clicking on the link to the specific package version on the package's page.
To fetch all of the binary RPMs for an update, you can use either the update ID or the build name:

    ; cd /tmp/scratch
    ; bodhi updates download --updateid FEDORA-2018-42024244f2
    [...]

or:

    ; cd /tmp/scratch
    ; bodhi updates download --builds rust-1.28.0-2.fc28
    [...]
Both versions of the bodhi command download things to the current directory, which is why I change to a scratch directory first. Then you can do 'dnf update /tmp/scratch/*.rpm'. If the resulting packages work and you feel like it, you can leave feedback on the Bodhi page for the package, which may help get it released into the updates-testing repo and then eventually the updates repo.
(In theory you can leave feedback through the bodhi command too, but it requires more setup and I think has somewhat fewer options than the website.)
As far as I've seen, installing RPMs this way will cause dnf to remember that you installed them by hand, even when they later become available through the updates-testing or the updates repo. This is probably not important to you.
(I decided I wanted an actual entry on this process that I can find easily later, instead of having to hunt around for my postscript in this entry the next time I need it.)
The benefits of driving automation through cron
In light of our problem with timesyncd, we needed a different (and working)
solution for time synchronization on our Ubuntu 18.04 machines. The
obvious solution would have been to switch over to chrony; Ubuntu even has chrony set up so that if you run
it, timesyncd is automatically blocked. I like chrony so I was
tempted by this idea briefly, but then I realized that using chrony
would mean having yet another daemon that we have to care about.
Instead, our replacement for timesyncd is running ntpdate from cron.
There are a number of quiet virtues of driving automation out of
cron entries. The whole approach is simple and brute force, but
this creates a great deal of reliability. Cron basically never dies
and if it were ever to die it's so central to how our systems operate
that we'd probably notice fairly fast. If we're ever in any doubt,
cron logs when it runs things to syslog (and thus to our central
syslog server), and if jobs fail or produce output, cron has a very
reliable and well tested system for reporting that to us. A simple
cron entry that runs
ntpdate has no ongoing state that can get
messed up, so if cron is running at all, the
ntpdate is running
at its scheduled interval and so our clocks will stay synchronized.
If something goes wrong on one run, it doesn't really matter because
cron will run it again later. Network down temporarily? DNS resolution
broken? NTP servers unhappy? Cure the issue and we'll automatically
get time synchronization back.
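As a concrete sketch, such a cron entry can be a single line. The schedule, binary path, and NTP server names here are invented for illustration; they are not our actual configuration.

```crontab
# Hypothetical /etc/cron.d entry: resync the clock every 15 minutes.
# ntpdate's -s flag sends its output to syslog instead of stdout.
*/15 * * * * root /usr/sbin/ntpdate -s ntp1.example.com ntp2.example.com
```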
A cron job is simple blunt force; it repeats its activities over and over and over again, throwing itself at the system until it batters its way through and things work. Unless you program it otherwise, it's stateless and so indifferent to what happened the last time around. There's a lot to be said for this in many system tasks, including synchronizing the clock.
(Of course this can be a drawback if you have a cron job that's failing and generating email every failure, when you'd like just one email on the first failure. Life is not perfect.)
There's always a temptation in system administration to make things complicated, to run daemons and build services and so on. But sometimes the straightforward brute force way is the best answer. We could run a NTP daemon on our Ubuntu machines, and on a few of them we probably will (such as our new fileservers), but for everything else, a cron job is the right approach. Probably it's the right approach for some of our other problems, too.
(If timesyncd worked completely reliably on Ubuntu 18.04, we would likely stick with it simply because it's less work to use the system's default setup. But since it doesn't, we need to do something.)
PS: Although we don't actively monitor cron right now, there are ways to notice if it dies. Possibly we should add some explicit monitoring for cron on all of our machines, given how central it is to things like our password propagation system. Sure, we'd notice sooner or later anyway, but noticing sooner is good.
One simple general pattern for making sure things are alive
One perpetual problem in system monitoring is detecting when something goes away. Detecting the presence of something is often easy because it reports itself, but detecting absence is usually harder. For example, it generally doesn't work well to have some software system email you when it completes its once a day task, because the odds are only so-so that you'll actually notice on the day when the expected email isn't there in your mailbox.
One general pattern for dealing with this is what I'll call a staleness timer. In a staleness timer you have a timer that effectively slowly counts down; when the timer reaches 0, you get an alert. When systems report in that they're alive, this report resets their timer to its full value. You can implement this as a direct timer, or you can write a check that is 'if system last reported in more than X time ago, raise an alert' (and have this check run every so often).
(More generally, if you have an overall metrics system you can presumably write an alert for 'last metric from source <X> is more than <Y> old'.)
In a way this general pattern works because you've flipped the problem around. Instead of the default state being silence and exceptional things having to happen to generate an alert, the default state is an alert and exceptional things have to happen to temporarily suppress the alert.
There are all sorts of ways of making programs and systems report in, depending on what you have available and what you want to check. Traditional low rent approaches are touching files and sending email to special dedicated email aliases (which may write incoming email to a file, or simply run a program on incoming email that touches a relevant file). These can have the drawback that they depend on multiple different systems all working, but they often have the advantage that you have them working already (and sometimes it's a feature to verify all of the systems at once).
(If you have a real monitoring system, it hopefully already provides a full selection of ways to submit 'I am still alive' notifications to it. There probably is a very simple system that just does this based on netcat-level TCP messages or the like, too; it seems like the kind of thing sysadmins write every so often. Or perhaps we are just unusual in never having put together a modern, flexible, and readily customizable monitoring system.)
All of this is a reasonably obvious and well known thing around the general community, but for my own reasons I want to write it down explicitly.
Systemd's DynamicUser feature is (currently) dangerous
Yesterday I described how timesyncd couldn't be restarted on one of our Ubuntu 18.04 machines, where the specific thing that caused the failure was timesyncd attempting to access /var/lib/private/systemd/timesync and failing because /var/lib/private is only accessible by root, not the UID that timesyncd was running as. My diagnostic efforts left me puzzled as to how this was supposed to work at all, but Trent Lloyd (@lathiat) pointed me to the answer, which is in Lennart Poettering's article Dynamic Users with systemd, which introduces the overall system, explains the role of /var/lib/private, and covers how timesyncd is supposed to get access through an inaccessible directory. I'll quote the explanation for that:
    [Access through /var/lib/private] is achieved by invoking the service process in a slightly modified mount name-space: it will see most of the file hierarchy the same way as everything else on the system ([...]), except for /var/lib/private, which is over-mounted with a read-only tmpfs file system instance, with a slightly more liberal access mode permitting the service read access. [...]
Since timesyncd is not able to get access through /var/lib/private,
you might guess that something has gone wrong in the process of
setting up this slightly modified mount namespace. Indeed this
turned out to be the case. The machine that this happened on is an
NFS client and (as is usual) its UID 0 is mapped to an unprivileged
UID on our fileservers. On this
machine there were some FUSE mounts in the home directories of users
who have their
$HOME not world readable (our default
permissions are owner-only, to avoid accidents). When systemd was
setting up the 'slightly modified mount name-space' it attempted
to access these FUSE mounts as part of binding them into the
namespace, but it failed because UID 0 had no permissions to look
inside user home directories.
This failure caused systemd to give up attempting to set up the namespace. However, systemd did not abort unit activation or even log an error message. Instead it continued on to try to start timesyncd without this special namespace, despite the fact that timesyncd uses both DynamicUser and StateDirectory, and so starting it normally was essentially guaranteed to fail.
(Although my initial case was dangling FUSE mounts, it soon developed that any FUSE mounts would do it, for example a sshfs or smbfs mount in a user's NFS mounted home directory when the home directory isn't world-accessible.)
Systemd's failure to handle errors in setting up the namespace here has been raised as systemd issue 9835. However, merely logging an error or aborting the unit activation would not actually fix the core problem; it would merely let you see exactly why your timesyncd or whatever service is failing to start. The core problem is that systemd's current design for DynamicUser blows up if systemd and UID 0 don't have full access to every mount that's visible on the system.
(Well, DynamicUser plus StateDirectory, but the idea seems to be that pretty much every service using dynamic users will have a systemd managed state directory.)
In my opinion, this makes using DynamicUser surprisingly dangerous. A systemd service that is set to use it can't be reliably started or restarted on all systems; it only works on some systems, some of the time (but those happen to be the common case). If there's ever a problem setting up the special namespace that each such service requires, things fail. Machines that are NFS clients are the obvious case, since the client's UID 0 often has limited privileges, but I believe that there are likely to be others.
(And of course services can be restarted for random and somewhat unpredictable reasons, such as package updates or other services being restarted. You should not assume that you can always control these circumstances, or completely predict the state of the system when they happen.)
A timesyncd total failure and systemd's complete lack of debuggability
Last November, I wrote an entry about how we were switching to using systemd's timesyncd on our Ubuntu machines. Ubuntu 18.04 defaults to using timesyncd just as 16.04 does, and when we set up our standard Ubuntu 18.04 environment we stuck with that default behavior (although we customize the list of NTP servers). Then today I discovered that timesyncd had silently died on one of our 18.04 servers back on July 20th, and worse, it couldn't be restarted.
Specifically, it reported:
systemd-timesyncd: Failed to create state directory: Permission denied
The state directory it's complaining about is /var/lib/systemd/timesync, which is actually a symlink to /var/lib/private/systemd/timesync (at least on systems that are in good order; if the symlink has had something happen to it, you can apparently get other errors from timesyncd). I had a clever informed theory about what was wrong with things, but it turns out strace says I'm wrong.
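The directory layout involved can be rebuilt in a scratch directory to see the moving parts (the real paths are /var/lib/systemd/timesync and /var/lib/private/systemd/timesync; these scratch paths are just for illustration):

```shell
# Rebuild the symlink layout under a throwaway root to show why the
# symlink alone isn't enough: resolving it means traversing the
# 0700 root-owned 'private' directory.
root=$(mktemp -d)
mkdir -p "$root/var/lib/private/systemd/timesync" "$root/var/lib/systemd"
ln -s ../private/systemd/timesync "$root/var/lib/systemd/timesync"
chmod 700 "$root/var/lib/private"
# Where the symlink points:
readlink "$root/var/lib/systemd/timesync"
# The mode that blocks every UID except root from following it:
stat -c '%a' "$root/var/lib/private"
```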
(To my surprise, doing 'strace -f -p 1' on this system did not produce either explosions or an impossibly large amount of output. This would have been a very different thing on a system that was actually in use; this is basically an almost idle server being used as part of our testing of 18.04 before we upgrade our production servers to it.)
According to strace, what is failing is timesyncd's attempt to access /var/lib/private/systemd/timesync as its special UID (and GID) 'systemd-timesync'. This is failing for the prosaic reason that /var/lib/private is owner-only and owned by root. Since this works on all of our other Ubuntu 18.04 machines, presumably the actual failure is somewhere else.
The real problem here is that it is impossible to diagnose or debug this situation. Simply to get this far I had to read the systemd source code (to find the code in timesyncd that printed this specific error message) and then search through 25,000 lines of strace output. And I still don't know what the problem is or how to fix it. I'm not even confident that rebooting the server will change anything, especially when all the relevant pieces on this server seem to be just the same as the pieces on other, working servers.
(I do know that according to logs this failure started happening immediately after the systemd package was upgraded and re-executed itself. On the other hand, the systemd upgrade also happened on other Ubuntu 18.04 machines, and they didn't have their timesyncds explode.)
Since systemd has no clear diagnostic information here, I spent a great deal of time chasing the red herring that if you look at /var/lib/private/systemd/timesync on such a failing system, it will be owned by a numeric UID and GID, while on working systems it will be the magically special login and group 'systemd-timesync'. This is systemd's 'dynamic user' facility in action, combined with systemd itself creating the /var/lib/private/systemd/timesync directory (with the right login and group) before exec'ing the timesyncd binary. When timesyncd fails to start, systemd removes the login and group but leaves the directory behind, now not owned by any existing login or group.
(You might think that the 'failed to create state directory' error message would mean that timesyncd was the one actually creating the state directory, but strace says otherwise; the directory is created before the exec(), while the new process that will become timesyncd is still running systemd's code. timesyncd's code does try to create the directory, but presumably the internal systemd functions it's using are fine if the directory is already there with the right ownership and so on.)
I am rather unhappy about this situation, and I am even unhappier that there is effectively nothing that we can do about any aspect of it except to stop using timesyncd (which is now something that I will be arguing for, especially since this server drifted more than half a second out of synchronization before I found this issue entirely by coincidence). Reporting a bug to either systemd or to Ubuntu is hopeless (systemd will tell me to reproduce on the latest version, Ubuntu will ignore it as always). This is simply what happens when the systemd developers produce a design and an implementation that doesn't explain how it actually works and doesn't contain any real support for field diagnosis. Once again we get to return to the era of 'reboot the server, maybe that will fix it'. Given systemd's general current attitude, I don't expect this to change any time soon. Adding documentation of systemd's internals and diagnosis probes would be admitting that the internals can have bugs, problems, and issues, and that's just not supposed to happen.
PS: The extra stupid thing about the whole situation is that the only thing /var/lib/systemd/timesync is used for is to hold a zero-length file whose timestamp is used to track the last time the clock was synchronized, and non-root users can't even see this file on Ubuntu 18.04.
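The timestamp trick itself is tiny: the file's modification time, not its (empty) contents, is the record of the last synchronization. A sketch with a scratch file:

```shell
# Sketch of the zero-length 'clock file' trick: touching the file records
# "the clock was synchronized now" in its mtime; the contents stay empty.
clockfile=$(mktemp)
touch "$clockfile"          # done on each successful synchronization
stat -c '%s' "$clockfile"   # size stays 0; only the mtime matters
```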
Update: I've identified the cause of this problem, which is described in my new entry on how systemd's DynamicUser is dangerous. The short version is that systemd silently failed to set up a custom namespace that would have given timesyncd access to /var/lib/private because it could not deal with FUSE mounts in NFS mounted user home directories that were not world-accessible.