Wandering Thoughts


It matters where (or when) your programs ask questions

The other day, I wrote about how we belatedly evolved our account creation script so that it could now just assume we wanted the defaults for most everything, and how this simple change had been a real quality of life improvement for us. This improvement isn't just because we interact with the script a lot less; it's also because we changed where we interact with it. Specifically, we now only interact with the script right at the start and all the way at the end; before, we had to periodically interact with the script all the way through its run.

The problem with periodic interactions is that they have the end result of slowing down the whole process a bunch, and often they make it feel draining and demanding. What happens in practice is that you start the process, have it run, get bored with waiting for it to ask you a question, look away to do something else, don't notice immediately that the process has paused with a question, go back to it, answer the question, get bored again, and repeat until the whole thing is over. If and when you wind up constantly looking over to check on the process or focusing on it while you wait for it to ask you something, it feels draining and demanding. You're not doing anything, but you have to pay attention and wait.

When you shift all of the questions and interaction to the start and the end, you wipe most of this away. You start the process, it immediately asks you a question or two, and then you can go away. When it finishes, you may have a final question or two to answer, but at that point it's actually done. You don't have to constantly pay it some amount of attention in order to keep it moving along; it becomes a fire and mostly forget thing. Maybe you look over every so often to see if it's finished yet, but you know that you're not really delaying it by not paying enough attention.

As a result of our experiences with this script (and similar ones that need to ask us questions or have us do things by hand), I've come to be strongly biased about where I want to put any interactions in my scripts. If I have to ask questions, I'm going to do my best to put them as early as possible. If I can't ask them right at the start, I'm at least going to ask them all at once, so there's only one pause and interruption, and once that's over I know I can basically ignore the script for a while.

(Our local scripts are not perfect here, and perhaps we should change one that asks its questions early but not right away. But that script does at least ask all its questions all at once.)

PS: You might wonder how you wind up with a bunch of questions scattered through your script. Through good intentions, basically. If you have a bunch of different operations to do and you have a tacit custom that you want to manually confirm operations, you can easily wind up with a pattern where people adding operations add a 'okay, should I do this/tell me what option to take' question right before they start the operation itself. Then you wind up with a stop-start script that keeps pausing to ask you questions.

sysadmin/QuestionsWhereMatter written at 00:54:55; Add Comment


Go's net package doesn't have opaque errors, just undocumented ones

I tweeted:

I continue to be irritated by how opaque important Go errors are. I should not have to do string comparisons to discover that my network connection failed due to 'host is unreachable'.

The standard library net package has a general error type that's returned from most network operations. If you read through the package documentation straightforwardly, as I did in this tweet, you will likely conclude that the only reasonable way to see if your net.Dial() call to something has failed because your Unix is reporting 'no route to host' is to perform a string match against the string value of the error you get back.

(You want to do that string match against net.OpError.Err, since that's what gets you the constant error string without varying bits like the remote host and port you're trying to connect to.)

As I discovered when I started digging into things in the process of writing a different version of this entry, things are somewhat more structured under the hood. In fact the error that you get back from net.Dial() is likely to be all officially exported types and you can do a more precise check than string comparisons (at least on Unix), but you have to reach through several layers to see what is going on. It goes like this:

  • net.Dial() is probably returning a *net.OpError, which wraps another error that is stored in its .Err field.

  • if you have a connection failure (or some other specific OS level error), the *net.OpError.Err value is probably an *os.SyscallError. This is itself a wrapper around an underlying error, in .Err (and the syscall that failed is in .Syscall; you could verify that it's "connect").

  • this underlying error is probably a *syscall.Errno, which can be compared against the various E* errno constants that are also defined in syscall. Here, I'd want to check for EHOSTUNREACH.

So we have a *syscall.Errno inside an *os.SyscallError inside a *net.OpError. This wrapping sequence is not documented and thus not covered by any compatibility guarantees (neither is the string comparison, of course). Since all of these .Err fields are declared as type error instead of concrete types, unwrapping the whole nesting requires a bunch of checked type casts.

If I was doing this regularly, I would probably bother to write a function to check 'is this errno <X>', or perhaps a list of errnos. As a one-off check, I don't feel particularly guilty about doing the string check even now that I know it's possible to get the specific details if you dig hard enough. Pragmatically it works just as well, it's probably just as reliable, and it's easier.

(You still need to do a checked type cast to *net.OpError, but that's as far as you need to go. If you don't even want to bother with that, you could just string-ify the whole error and then use strings.HasSuffix(). For my purposes I wanted to check some other parts of the *net.OpError, so I needed the type cast anyway.)

In my view, the general shape of this sequence of wrapped errors should be explicitly documented. Like it or not, the relative specifics of network errors are something that people care about in the real world, so they are going to go digging for this information one way or another, and I at least assume that Go would prefer we unwrap things to check explicitly rather than just string-ifying errors and matching strings. If there are cautions about future compatibility or present variations in behavior, document them explicitly so that people writing Go programs know what to look out for.

(Like it or not, the actual behavior of things creates a de facto standard, especially if you don't warn people away. Without better information, people will code to what the dominant implementation actually does, with various consequences if this ever changes.)

programming/GoNetErrorsUndocumented written at 23:46:08; Add Comment

Our problem with HTTPS and user-created content

We have a departmental web server, where people can host their personal pages (eg) and pages for their research groups and so on, including user-run web servers behind reverse proxies. In other words, this web server has a lot of content, created by a lot of people, and essentially none of it is under our control. These days, in one sense this presents us with a bit of a problem.

Our departmental web server supports HTTPS (and has for years). Recent browser developments are clearly pushing websites from HTTP to HTTPS, even if perhaps not as much as has been heralded, and so it would be good if we were to actively switch over. But, well, there's an obvious problem for us, and the name of that problem is mixed content. A not insignificant number of pages on our web server refer to resources like CSS stylesheets using explicit HTTP URLs (either local ones or external ones), and so would and do break if loaded over HTTPS, where browsers generally block mixed content.

We are obviously not going to break user web pages just because the Internet would now kind of like to see us using HTTPS instead of HTTP; if we even proposed doing that, the users would get very angry at us. Nor is it feasible to get users to audit and change all of their pages to eliminate mixed content problems (and from the perspectives of many users, it would be make-work). The somewhat unfortunate conclusion is that we will never be able to do a general HTTP to HTTPS upgrade on our departmental web server, including things like setting HSTS. Some of the web server's content will always be in the long tail of content that will never migrate to HTTPS and will continue to be HTTP content for years to come.

(Yes, CSP has upgrade-insecure-requests, but that only helps for local resources, not external ones.)

Probably this issue is confronting anyone with significant amounts of user-created content, especially in situations where people wrote raw HTML, CSS, and so on. I suspect that a lot of these sites will stay HTTPS-optional for plenty of time to come.

(Our users can use a .htaccess to force HTTP to HTTPS redirection for their own content, although I don't expect very many people to ever do that. I have set this up for my pages, partly just to make sure that it worked properly, but I'm not exactly a typical person here.)

(This elaborates on an old tweet of mine, and I covered the 'visual noise' bit in this entry.)

web/HTTPSUserContentProblem written at 00:15:40; Add Comment


The evolution of our account creation script

One of the things about system administration automation is that its evolution often follows the path of least resistance. This can leave you with interesting and peculiar historical remnants, and it can also create situations where it takes a relatively long time before a system does the obvious thing. As it happens, I have a story about this.

To go with our account request system, which handles people requesting new accounts and authorizing requested accounts, we have an actual script that we run to actually create Unix accounts. Until relatively recently that script asked you a bunch of questions, although they all had default answers that we'd accept essentially all of the time. The presence of these questions was both a historical remnant of the path that the script took and an illustration of how unquestioningly acclimatized we can all become to what we think of as 'normal'.

We have been running Unix systems and creating accounts on them for a very long time, and in particular we've been doing this since before the World Wide Web existed and was readily accessible. Back in the beginning of things, accounts were requested on printed forms; graduate students and suchlike filled out the form with the information, got their account sponsors to sign it, handed it to the system staff, and the system staff typed all of the information into a script that asked us questions like 'login?', 'name?', 'Unix group?', 'research group affiliation?', and so on.

At a certain point, the web became enough of a thing that having a CGI version of our paper account request form was an obvious thing to do. Not everyone was going to use the CGI form (or be able to), and anyway we already had the account creation script that knew all of the magic required to properly create an account around here, so we adopted the existing script to also work with the CGI. The CGI wrote out the submitted information into a file (basically as setting shell environment variables) and this file was then loaded into the account creation script as the default answers to many of the questions that had originally been fields on the printed form. If the submitted information was good, you could just hit Return through many of the questions. After you created the account, you then had to email some important information about it (especially the temporary password) off to the person it was for; you did this by hand, because you generated the random password by hand outside of the script.

(For reasons lost to history, the data file that the CGI wrote and the script loaded was a m4 file that was then processed through m4 to create shell variable assignments.)

When we wrote our account request system to replace the basic CGI (and the workflow around it, which involved manually emailing account sponsors to ask them about approving accounts), the simple and easy way for it to actually get accounts created was to carefully write the same data file that the CGI had used (m4isms and all). The account request script remained basically unchanged, and in particular it kept asking us to confirm all of the 'default' answers, ie all of the information that the account request system had already validated and generated. More than that, we added a few more bits of special handling for some accounts, with their own questions.

(Although the account request system was created in 2011, it took until a 2016 major revision for a new version of Django for us to switch from generating m4 data files to just directly generating shell variable assignments that the script directly sourced with the . command.)

That we had to actually answer these questions and then write the 'you have a new account' email made the whole process of creating an account a tedious thing. You couldn't just start the script and go away for a while; you had to periodically interact with it, hitting Return, generating a password in another window and pasting it in to the password prompt, and composing email yourself. None of these things were actually necessary for the backend of the account request system, but they stayed for historical reasons (and because we needed them occasionally, because some accounts are created outside of the account request system). And we, the people who used the script, were so acclimatized to this situation that we didn't really think about it; in fact I built my own automation around writing the 'you have a new account' form email.

At this point I've forgotten what the exact trigger event was, but last year around this time, in the middle of creating a bunch of new graduate student accounts (where the existing script's behavior was at its most tedious), we realized that this could be fixed. I'll quote my commit messages:

New 'fast create' mode for account creation that takes all the defaults and doesn't bother asking if we're really sure.

For fast mode, add the ability to randomly generate or set the initial password at the start of the process.

offer to send new-account greeting email.
make sending greeting email be the default (if you just hit return).

(In theory we could make sending the greeting email happen automatically. In practice, asking a final question gives us an opportunity to look back at all the messages printed out just in case there's some problem that the script didn't catch and we want to pause to fix things up.)

This simple change has been a real quality of life improvement for us, turning a tedious slog into a mostly fire and forget exercise that we can casually run through. That it took so long to make our account creation script behave this way is an illustration not just of the power of historical paths but also of the power of habituation. We were so used to how the existing system worked that we never really questioned if it had to be that way; we just grumbled and accepted it.

(This is, in a sense, part of the power of historical paths. The path that something took to get where it is shapes what we see as 'normal' and 'just how things are', because it's what we get used to.)

Sidebar: There were some additional steps in there

There are a few questions in the account creation script where in theory we have a genuine choice to make; for example, some accounts have several options for what filesystem they get created in. Part of what made the no-questions version of the script possible was that we realized that in practice we always made a particular choice (for filesystems, we always picked the one with the most free space), so we revised the script to make this choice the default answer.

Had we not worked out default answers for all of these questions, we couldn't have made the creation script not even ask the questions. We might have done both at the same time if it was necessary, but in practice it certainly helped that everything already had default answers so the 'fast create' mode could just be 'take all of the default answers without requiring confirmation'.

sysadmin/AccountCreationScriptEvolution written at 22:58:21; Add Comment

A recent spate of ZIP attachments with everything

Our program for logging email attachment type information looks inside .zip and .jar archives, including one level of nesting. Often what we see in this is routine, with basically the sort of content you'd expect from either routine stuff or malware, but recently we've been seeing zip archives that are just stuffed with at least one of almost any file extension you can think of. A few days ago we logged an extreme example:

1fnnAC-0003dZ-EP attachment application/zip; MIME file ext: .zip; zip exts: .jar; inner zip exts: .abc .abl .acc .ach .adc .adz[2] .afd .age .ago .agy .aht .ake .ala .alp .and .ans .aob[2] .aor .app .apt .ara .ary .aud .aus .ave .axe .baa .bag .bap .bat .bde .bet .bin .bis .bkg .boe .bra .bsh .buz .bye .cai .cal .cat .caw .cdg .chm .cit .class[10] .cli[2] .clo .col .cop .cpl .crc .crs .cst .ctg .cto .cup .cwt .dad .dbl .dcb .der .det[2] .dew .dey .dig .dil[3] .dks[2] .dur .dwt .dye .eft .ego .elb[2] .elm .els[2] .emf .emm[2] .emu .err .esd .esq .ext .eyn .fax .fbi[2] .fcs .fee .fei .fem .ffa .fgn .fig .flb .fly[3] .foe .fog .fud .gab .gae .gal .gas .geb .gig .gin .gio[2] .goa .gob .god .gon .goo .gox .gtc .gun .had[2] .hah .hak[2] .hao .hat .hau .hcb .hcl[2] .hed .heh .hen[2] .hes[3] .hia .hip .hir .hld .hoc .hoe .hts .hug .hye .ibo .ide .ihp[2] .ijo .ilk .imu .ing[2] .ipr[2] .iqs .ire .iwa .iyo[2] .jah .jap .jay .jct .jem[2] .jud .jur .kat .kaw .kay .key .khi .kop .kor .kos .kph .kyl .lab[3] .lap .lcm .lea .lek .les .lib .lid .lit .llb .lou .lub .lxx .mao .map .maw .meu .mf .mix .mks .mog .mor .mot .mph .mus .nee .nef .nei .nep .nut .oak[2] .obb .ofo .oki .one .oni .ops .ora .our .pan .pap .par .paw .pax .pay .pdq .peh .pep .pia .pie .pig .pit .pks .poh .pos .pot .ppa .pps .pre .pry[2] .psi .pwr .pyr .rab .ram .rat .raw .rct .ref .reg .res .rfs .rig .rim .rix .rld .roc .roi .rpm .rut .rux .rwd .rwy .rye .sab .sau .sds .sed[2] .sei .sel .sew .she .shr .sie .sil .sim .sip .six .sny .soe .sou .soy .sqq .stg .sum .sur[2] .syd .tar .tat .tay .ted .tef .tem .tng .ton .tou .twa .udo .uns .urb .urn .uti .vac[2] .vil .von .vum .wab .wae .wea .wop[2] .wot .wro[2] .wud .xii[2] .xiv .xxi .xxv .xxx .yam[2] .yay .yea .yeo .yer .yez .yoe .yrs .yun .zat .zen .zho .zig .zip .zod

(We deliberately log file extensions inside zip archives in alphabetical order, so it may well have had a much different order originally.)

This particular message was detected by Sophos PureMessage as 'Mal/DrodZp-A', which may be a relatively generic name. The Subject: of the message was the relatively generic 'Re: Invoice/Receipt', and I don't know what the overall MIME filename of the .zip was claimed to be. We've received a bunch of very similar attachments that were just .jars (not .zip in .jar) with giant lists of extensions. Many of them have been rejected for containing (nominal) bad file types, and their MIME filenames have been things like 'ORIGIAL SHIPPING DOCUMENTS.qrypted.jar' and "0042133704 _ PDF.jar".

(It's possible that these direct .jars would also be detected as Mal/DrodZp-A, but we reject for bad file types before we check for known viruses.)

I doubt that the attachment had genuine examples of these file types, especially things like .rpm (RPM packages) and .nef (Nikon camera RAWs, which are invariably anywhere from several megabytes to tens of megabytes for the latest high-resolution Nikon DSLRs). I'm sure that the malware has some reason for doing this spray of files and file extensions, but I have no idea what it might be. If there are some anti-virus products that give up if a .jar has enough different file extensions in it, that's kind of sad (among other things).

Sadly for any additional filtering we might considering doing, I suspect that the dangerous parts of this were in the actual Java stuff (eg the .class files) and everything else is distraction. It'd be somewhat interesting to pick through a captured sample, because I am curious about what's in all of those files (or if they're just zero-length ones put in to pad things out) and also what file names they have. Did the malware make up some jumble of random file names, or is it embedded a message in them or something clever? I'll never know, because it's not important enough to bother doing anything special for.

spam/ZipAttachmentWithEverything written at 00:26:29; Add Comment


Fetching really new Fedora packages with Bodhi

Normal Fedora updates that have been fully released are available through the regular updates repository, which is (or should be) already configured into dnf on your Fedora system. More recent (and less well tested) updates are available through the updates-testing repository, which you can selectively enable in order to see if what you're looking for is there. Right now I'm interested in Rust 1.28, because it's now required to build the latest Firefox from source, so:

# dnf --enablerepo=updates-testing check-update 'rust*'
Last metadata expiration check: 0:00:56 ago on Fri 10 Aug 2018 02:12:32 PM EDT.

However sometimes, as in this case and past ones, any update that actually exists is too new to even have made it into the updates-testing DNF repo. Fedora does their packaging stuff through Fedora Bodhi (see also), and as part of this packages can be built and available in Bodhi even before they're pushed to updates-testing, so if you want the very freshest bits you want to check in Bodhi.

There are two ways to check Bodhi; through the command line using the bodhi client (which comes from the bodhi-client package), or through the website. Perhaps I should use the client all the time, but I tend to reach for the website as my first check. The URL for a specific package on the website is of the form:

https://bodhi.fedoraproject.org/updates/?packages=<source package>

For example, https://bodhi.fedoraproject.org/updates/?packages=rust is the URL for Rust (and there's a RSS feed if you care a lot about a particular package). For casual use, it's probably easier to just search from Bodhi's main page.

Through the command line, checking for and downloading an update looks like this:

; bodhi updates query --packages rust --releases f28 --status pending
============================= [...]
============================= [...]
   Update ID: FEDORA-2018-42024244f2
       Notes: New versions of Rust and related tools -- see the release notes
            : for [1.28](https://blog.rust-lang.org/2018/08/02/Rust-1.28.html).
   Submitter: jistone
   Submitted: 2018-08-10 14:35:56

We insist on the pending status because that cuts the listing down and normally gives us only one package, where we get to see detailed information about it; I believe that there's normally only one package in pending status for a particular Fedora release. If there's multiple ones, you get a less helpful summary listing that will give you only the full package name instead of the update ID. If you can't get the update ID through bodhi, you can always get it through the website by clicking on the link to the specific package version on the package's page.

To fetch all of the binary RPMs for an update:

; cd /tmp/scratch
; bodhi updates download --updateid FEDORA-2018-42024244f2


; cd /tmp/scratch
; bodhi updates download --builds rust-1.28.0-2.fc28

Both versions of the bodhi command download things to the current directory, which is why I change to a scratch directory first. Then you can do 'dnf update /tmp/scratch/*.rpm'. If the resulting packages work and you feel like it, you can leave feedback on the Bodhi page for the package, which may help get it released into the updates-testing repo and then eventually the updates repo.

(In theory you can leave feedback through the bodhi command too, but it requires more setup and I think has somewhat less options than the website.)

As far as I've seen, installing RPMs this way will cause things to remember that you installed them by hand, even when they later become available through the updates-testing or the updates repo. This is probably not important to you.

(I decided I wanted an actual entry on this process that I can find easily later, instead of having to hunt around for my postscript in this entry the next time I need it.)

PS: For my future use, here is the Bodhi link for the kernel, which is probably the package I'm most likely to want to fish out of Bodhi regularly. And just in case, openssl and OpenSSH.

linux/FedoraBodhiGetPackages written at 14:58:56; Add Comment

The benefits of driving automation through cron

In light of our problem with timesyncd, we needed a different (and working) solution for time synchronization on our Ubuntu 18.04 machines. The obvious solution would have been to switch over to chrony; Ubuntu even has chrony set up so that if you run it, timesyncd is automatically blocked. I like chrony so I was tempted by this idea briefly, but then I realized that using chrony would mean having yet another daemon that we have to care about. Instead, our replacement for timesyncd is running ntpdate from cron.

There are a number of quiet virtues of driving automation out of cron entries. The whole approach is simple and brute force, but this creates a great deal of reliability. Cron basically never dies and if it were ever to die it's so central to how our systems operate that we'd probably notice fairly fast. If we're ever in any doubt, cron logs when it runs things to syslog (and thus to our central syslog server), and if jobs fail or produce output, cron has a very reliable and well tested system for reporting that to us. A simple cron entry that runs ntpdate has no ongoing state that can get messed up, so if cron is running at all, the ntpdate is running at its scheduled interval and so our clocks will stay synchronized. If something goes wrong on one run, it doesn't really matter because cron will run it again later. Network down temporarily? DNS resolution broken? NTP servers unhappy? Cure the issue and we'll automatically get time synchronization back.

A cron job is simple blunt force; it repeats its activities over and over and over again, throwing itself at the system until it batters its way through and things work. Unless you program it otherwise, it's stateless and so indifferent to what happened the last time around. There's a lot to be said for this in many system tasks, including synchronizing the clock.

(Of course this can be a drawback if you have a cron job that's failing and generating email every failure, when you'd like just one email on the first failure. Life is not perfect.)

There's always a temptation in system administration to make things complicated, to run daemons and build services and so on. But sometimes the straightforward brute force way is the best answer. We could run a NTP daemon on our Ubuntu machines, and on a few of them we probably will (such as our new fileservers), but for everything else, a cron job is the right approach. Probably it's the right approach for some of our other problems, too.

(If timesyncd worked completely reliably on Ubuntu 18.04, we would likely stick with it simply because it's less work to use the system's default setup. But since it doesn't, we need to do something.)

PS: Although we don't actively monitor cron right now, there are ways to notice if it dies. Possibly we should add some explicit monitoring for cron on all of our machines, given how central it is to things like our password propagation system. Sure, we'd notice sooner or later anyway, but noticing sooner is good.

sysadmin/CronAutomationBenefits written at 13:37:44; Add Comment

One simple general pattern for making sure things are alive

One perpetual problem in system monitoring is detecting when something goes away. Detecting the presence of something is often easy because it reports itself, but detecting absence is usually harder. For example, it generally doesn't work well to have some software system email you when it completes its once a day task, because the odds are only so-so that you'll actually notice on the day when the expected email isn't there in your mailbox.

One general pattern for dealing with this is what I'll call a staleness timer. In a staleness timer you have a timer that effectively slowly counts down; when the timer reaches 0, you get an alert. When systems report in that they're alive, this report resets their timer to its full value. You can implement this as a direct timer, or you can write a check that is 'if system last reported in more than X time ago, raise an alert' (and have this check run every so often).

(More generally, if you have an overall metrics system you can presumably write an alert for 'last metric from source <X> is more than <Y> old'.)

In a way this general pattern works because you've flipped the problem around. Instead of the default state being silence and exceptional things having to happen to generate an alert, the default state is an alert and exceptional things have to happen to temporarily suppress the alert.

There are all sorts of ways of making programs and systems report in, depending on what you have available and what you want to check. Traditional low rent approaches are touching files and sending email to special dedicated email aliases (which may write incoming email to a file, or simply run a program on incoming email that touches a relevant file). These can have the drawback that they depend on multiple different systems all working, but they often have the advantage that you have them working already (and sometimes it's a feature to verify all of the systems at once).

(If you have a real monitoring system, it hopefully already provides a full selection of ways to submit 'I am still alive' notifications to it. There probably is a very simple system that just does this based on netcat-level TCP messages or the like, too; it seems like the kind of thing sysadmins write every so often. Or perhaps we are just unusual in never having put together a modern, flexible, and readily customizable monitoring system.)

All of this is a reasonably obvious and well known thing around the general community, but for my own reasons I want to write it down explicitly.

sysadmin/SimpleAliveCheckPattern written at 00:42:26; Add Comment


Systemd's DynamicUser feature is (currently) dangerous

Yesterday I described how timesynd couldn't be restarted on one of our Ubuntu 18.04 machines, where the specific thing that caused the failure was timesyncd attempting to access /var/lib/private/systemd/timesync and failing because /var/lib/private is only accessible by root, not the UID that timesyncd was running as. My diagnostic efforts left me puzzled as to how this was supposed to work at all, but Trent Lloyd (@lathiat) pointed me to the answer, which is in Lennart Poettering's article Dynamic Users with systemd, which introduces the overall system, explains the role of /var/lib/private, and covers how timesyncd is supposed to get access through an inaccessible directory. I'll quote the explanation for that:

[Access through /var/lib/private] is achieved by invoking the service process in a slightly modified mount name-space: it will see most of the file hierarchy the same way as everything else on the system ([...]), except for /var/lib/private, which is over-mounted with a read-only tmpfs file system instance, with a slightly more liberal access mode permitting the service read access. [...]

Since timesyncd is not able to get access through /var/lib/private, you might guess that something has gone wrong in the process of setting up this slightly modified mount namespace. Indeed this turned out to be the case. The machine that this happened on is an NFS client and (as is usual) its UID 0 is mapped to an unprivileged UID on our fileservers. On this machine there were some FUSE mounts in the home directories of users who have their $HOME not world readable (our default $HOME permissions are owner-only, to avoid accidents). When systemd was setting up the 'slightly modified mount name-space' it attempted to access these FUSE mounts as part of binding them into the namespace, but it failed because UID 0 had no permissions to look inside user home directories.

This failure caused systemd to give up attempting to set up the namespace. However, systemd did not abort unit activation or even log an error message. Instead it continued on to try to start timesyncd without this special namespace, despite the fact that timesyncd uses both DynamicUser and StateDirectory and so starting it normally was essentially absolutely guaranteed to fail.

(Although my initial case was dangling FUSE mounts, it soon developed that any FUSE mounts would do it, for example a sshfs or smbfs mount in a user's NFS mounted home directory when the home directory isn't world-accessible.)

Systemd's failure to handle errors in setting up the namespace here has been raised as systemd issue 9835. However, merely logging an error or aborting the unit activation would not actually fix the core problem; it would merely let you see exactly why your timesyncd or whatever service is failing to start. The core problem is that systemd's current design for DynamicUser intrinsically blows up if systemd and UID 0 don't have full access to every mount that's visible on the system.

(Well, DynamicUser plus StateDirectory, but the idea seems to be that pretty much every service using dynamic users will have a systemd managed state directory.)

In my opinion, this makes using DynamicUser surprisingly dangerous. A systemd service that is set to use it can't be reliably started or restarted on all systems; it only works on some systems, some of the time (but those happen to be the common case). If there's ever a problem setting up the special namespace that each such service requires, things fail. Machines that are NFS clients are the obvious case, since the client's UID 0 often has limited privileges, but I believe that there are likely to be others.

(And of course services can be restarted for random and somewhat unpredictable reasons, such as package updates or other services being restarted. You should not assume that you can always control these circumstances, or completely predict the state of the system when they happen.)

linux/SystemdDynamicUserDangerous written at 21:51:36; Add Comment

A timesyncd total failure and systemd's complete lack of debugability

Last November, I wrote an entry about how we were switching to using systemd's timesyncd on our Ubuntu machines. Ubuntu 18.04 defaults to using timesyncd just as 16.04 does, and when we set up our standard Ubuntu 18.04 environment we stuck with that default behavior (although we customize the list of NTP servers). Then today I discovered that timesyncd had silently died on one of our 18.04 servers back on July 20th, and worse it couldn't be restarted.

Specifically, it reported:

systemd-timesyncd[10940]: Failed to create state directory: Permission denied

The state directory it's complaining about is /var/lib/systemd/timesync, which is actually a symlink to /var/lib/private/systemd/timesync (at least on systems that are in good order; if the symlink has had something happen to it, you can apparently get other errors from timesyncd). I had a clever informed theory about what was wrong with things, but it turns out strace says I'm wrong.

(To my surprise, doing 'strace -f -p 1' on this system did not produce either explosions or an impossibly large amount of output. This would have been a very different thing on a system that was actually in use; this is basically an almost idle server being used as part of our testing of 18.04 before we upgrade our production servers to it.)

According to strace, what is failing is timesyncd's attempts to access /var/lib/private/systemd/timesync as its special UID (and GID) 'systemd-timesync'. This is failing for the prosaic reason that /var/lib/private is owner-only and owned by root. Since this works on all of our other Ubuntu 18.04 machines, presumably the actual failure is somewhere else.

The real problem here is that it is impossible to diagnose or debug this situation. Simply to get this far I had to read the systemd source code (to find the code in timesyncd that printed this specific error message) and then search through 25,000 lines of strace output. And I still don't know what the problem is or how to fix it. I'm not even confident that rebooting the server will change anything, especially when all the relevant pieces on this server seem to be just the same as the pieces on other, working servers.

(I do know that according to logs this failure started happening immediately after the systemd package was upgraded and re-executed itself. On the other hand, the systemd upgrade also happened on other Ubuntu 18.04 machines, and they didn't have their timesyncds explode.)

Since systemd has no clear diagnostic information here, I spent a great deal of time chasing the red herring that if you look at /var/lib/private/systemd/timesync on such a failing system, it will be owned by a numeric UID and GID, while on working systems it will be the magically special login and group 'systemd-timesync'. This is systemd's 'dynamic user' facility in action, combined with systemd itself creating the /var/lib/private/systemd/timesync directory (with the right login and group) before exec'ing the timesyncd binary. When timesyncd fails to start, systemd removes the login and group but leaves the directory behind, now not owned by any existing login or group.

(You might think that the 'failed to create state directory' error message would mean that timesyncd was the one actually creating the state directory, but strace says otherwise; the mkdir() happens before the exec() does, while the new process that will become timesyncd is still in systemd's code. timesyncd's code does try to create the directory, but presumably the internal systemd functions it's using are fine if the directory is already there with the right ownership and so on.)

I am rather unhappy about this situation, and I am even unhappier that there is effectively nothing that we can do about any aspect of it except to stop using timesyncd (which is now something that I will be arguing for, especially since this server drifted more than half a second out of synchronization before I found this issue entirely by coincidence). Reporting a bug to either systemd or to Ubuntu is hopeless (systemd will tell me to reproduce on the latest version, Ubuntu will ignore it as always). This is simply what happens when the systemd developers produce a design and an implementation that doesn't explain how it actually works and doesn't contain any real support for field diagnosis. Once again we get to return to the era of 'reboot the server, maybe that will fix it'. Given systemd's general current attitude, I don't expect this to change any time soon. Adding documentation of systemd's internals and diagnosis probes would be admitting that the internals can have bugs, problems, and issues, and that's just not supposed to happen.

PS: The extra stupid thing about the whole situation is that the only thing /var/lib/systemd/timesync is used for is to hold a zero-length file whose timestamp is used to track the last time the clock was synchronized, and non-root users can't even see this file on Ubuntu 18.04.

Update: I've identified the cause of this problem, which is described in my new entry on how systemd's DynamicUser feature is dangerous. The short version is that systemd silently failed to set up a custom namespace that would have given timesyncd access to /var/lib/private because it could not deal with FUSE mounts in NFS mounted user home directories that were not world-accessible.

linux/SystemdTimesyncdFailure written at 01:52:59; Add Comment

(Previous 10 or go back to August 2018 at 2018/08/06)

Page tools: See As Normal.
Login: Password:
Atom Syndication: Recent Pages, Recent Comments.

This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.