Wandering Thoughts

2019-11-11

An apparent hazard of removing Linux software RAID mirror devices

One of the data disks in my home machine has been increasingly problematic for, well, a while. I eventually bought a replacement HD, then even more eventually put it in the machine alongside the current two data disks, partitioned it, and added it as a third mirror to my software RAID partitions. After running everything as a three-way mirror for a while, I decided that problems on the failing disk were affecting system performance enough that I'd take the main software RAID partition on the disk out of service.

I did this as, roughly:

mdadm --manage /dev/md53 --fail /dev/sdd4
mdadm --manage /dev/md53 --remove /dev/sdd4
mdadm --grow /dev/md53 --raid-devices=2

(I didn't save the exact commands, so this is an approximation. The failing drive is sdd.)

The main software RAID device immediately stopped using /dev/sdd4 and everything was happy (and my Prometheus monitoring of disk latency no longer showed drastic latency spikes for sdd). The information in /proc/mdstat said that md53 was fine, with two out of two mirrors.

Then, today, my home machine locked up and rebooted (because it's the first significantly cold day in Toronto and I have a little issue with that). When it came back, I took a precautionary look at /proc/mdstat to see if any of my RAID arrays had decided to resync themselves. To my very large surprise, mdstat reported that md53 had two out of three failed devices and the only intact device was the outdated /dev/sdd4.

(The system then started the outdated copy of the LVM volume group that sdd4 held, mounted outdated copies of the filesystems in it, and let things start writing to them as if they were the right copy of those filesystems. Fortunately I caught this very soon after boot and could immediately shut the system down to avoid further damage.)

This was not a disk failure; all of my other software RAID arrays on those disks showed three out of three devices, spanning the old sdc and sdd drives and the new sde drive. But rather than assemble the two-device new version of md53 with both mirrors fully available on sdc4 and sde4, the Fedora udev boot and software RAID assembly process had decided to assemble the old three-device version visible only on sdd4 with one out of three mirrors. Nor is this my old case of not updating my initramfs to have the correct number of RAID devices, because I never updated either the real /etc/mdadm.conf or the version in the initramfs to claim that any of my RAID arrays had three devices instead of two.

As I said on Twitter, I'm sufficiently used to ZFS's reliable behavior on device removal that I never even imagined that this could happen with software RAID. I can sort of see how it did (for a start, I expect that marking a device as failed leaves its RAID superblock untouched), but I don't know why, and the logs I have available contain no clues from udev and mdadm about their decision process for which array component to pick.

The next time I do this sort of device removal, I guess I will have to explicitly erase the software RAID superblock on the removed device with 'mdadm --zero-superblock'. I don't like doing this because if I make a mistake in the device name (and it is only a letter or a number away from something live), I've probably just blown things up.
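In sketch form (using the same device names as above, and with an --examine first to double check that I'm pointing --zero-superblock at the right partition), the removal would then look something like:

mdadm --manage /dev/md53 --fail /dev/sdd4
mdadm --manage /dev/md53 --remove /dev/sdd4
mdadm --grow /dev/md53 --raid-devices=2
mdadm --examine /dev/sdd4
mdadm --zero-superblock /dev/sdd4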

The obvious conclusion is that mdadm should have an explicit way to say 'take this device out of service in this disk array', one that makes sure to update everything so that this can't happen even if the device remains physically present in the system. I don't care whether that involves adding a special mark to the device's RAID superblock or erasing it; I just want it to work. Perhaps what I did should already work in theory; if so, I regret to say that it didn't in practice.

(My short term solution is to physically disconnect sdd, the failing disk drive. This reduces the other three-way mirrors to two-way ones and I don't know what I'll do with the pulled sdd; it's probably not safe to let my home machine see it in any state at any time in the future. But at least this way I have working software RAID arrays.)

Sidebar: Why mdadm's --replace is not a solution for me

I explicitly wanted to run my new drive alongside the existing two drives for a while, in case of infant mortality. Thus I wanted to run with three-way mirrors, instead of replacing one disk in a two-way mirror with another one.

SoftwareRaidRemovingDiskGotcha written at 22:28:46

2019-11-07

Some notes on getting email when your systemd timer services fail

Suppose, not hypothetically, that you have some things that are implemented through systemd timers instead of traditional cron.d jobs, and you would like to get email if and when they fail. The lack of this email by default is one of the known issues with turning cron.d entries into systemd timers and people have already come up with ways to do this with systemd tricks, so for full details I will refer you to the Arch Wiki section on this (brought to my attention by keur's comment on my initial entry) and this serverfault question and its answers (via @tvannahl on Twitter). This entry is my additional notes from having set this up for our Certbot systemd timers.

Systemd timers come in two parts: a .timer unit that controls timing and a .service unit that does the work. What we generally really care about is the .service unit failing. To detect this and get email about it, we add an OnFailure= to the timer's .service unit that triggers a specific instance of a template .service that sends email. So if we have certbot.timer and certbot.service, we add a .conf file in /etc/systemd/system/certbot.service.d that contains, say:

[Unit]
OnFailure=cslab-status-email@%n.service

Due to the use of '%n', this is generic; the stanza will be the same for anything we want to trigger email from on failure. The '%n' will expand to the full name of the service, eg 'certbot.service' and be available in the cslab-status-email@.service template unit. My view is that you should always use %n here even if you're only doing this for one service, because it automatically gets the unit name right for you (and why risk errors when you don't have to). In the cslab-status-email@.service unit, the full name of the unit triggering it will be available as '%i', as shown in the Arch Wiki's example. Here that will be 'certbot.service'.

(With probably excessive cleverness you could encode the local address to email to into what the template service will get as %i by triggering, eg, cslab-status-email@root-%n.service. We just hard code 'root' all through.)
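For completeness, a minimal sketch of what the cslab-status-email@.service template unit itself can look like (modeled on the Arch Wiki's example; the script path is a stand-in for wherever you install your mail script):

[Unit]
Description=Send a failure status email for %i

[Service]
Type=oneshot
ExecStart=/usr/local/sbin/cslab-status-email %i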

The Arch Wiki's example script uses 'systemctl status --full <unit>'. Unfortunately this falls into the trap that by default systemd truncates the log output at the most recent ten lines. We found that we definitely wanted more; our script currently uses 'systemctl status --full -n 50 <unit>' (and also contains a warning postscript that it may be incomplete and to see journalctl on the system for full details). Having a large value here is harmless as far as I can tell, because systemd seems to only show the log output from the most recent activation attempt even if there's (much) less than your 50 lines or whatever.

(Unfortunately as far as I can see there is no easy way to get just the log output without the framing 'systemctl status' information about the unit, much of which is not particularly useful. We live with this.)

As with the Arch Wiki's example script, you definitely want to put the hostname into the email message if you have a fleet. We also embed more information into the Subject and From, and add a MIME-Version:

From: $HOSTNAME root <root@...>
Subject: $1 systemd unit failed on $HOSTNAME
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Content-Type: text/plain; charset=UTF-8

You definitely want to label the email as UTF-8, as 'systemctl status' puts a UTF-8 '●' in its output. The subject could be incorrect (we can't be sure the template unit was triggered through an 'OnFailure=', even though that's how it's supposed to be used), but it's much more useful in the case where everything is working as intended. My bias is towards putting as much context as possible into emails like this, because by the time we get one we'll have forgotten all about the issue and we don't want to be wondering why we got this weird email.

The Arch Wiki contains a nice little warning about how systemd may wind up killing child processes that the mail submission program creates (as noticed by @lathiat on Twitter). I decided that the easiest way for our script to ward this off was to just sleep for 10 or 15 seconds at the end. Having it exit immediately is not exactly critical, and this is the easy (if brute force) way to hopefully work around any problems.
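Putting the pieces together, the script that the template unit runs can be roughly like this sketch (ours differs in details, and the sendmail path and the From address here are stand-ins):

#!/bin/sh
# Sketch only. $1 is the full unit name, passed in as %i; we hard code 'root'.
HOSTNAME=$(hostname)

/usr/sbin/sendmail -t -oi <<EOF
To: root
From: $HOSTNAME root <root@example.org>
Subject: $1 systemd unit failed on $HOSTNAME
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Content-Type: text/plain; charset=UTF-8

$(systemctl status --full -n 50 "$1")

(This log output may be incomplete; see journalctl on $HOSTNAME for full details.)
EOF

# Linger so systemd doesn't kill off child processes of the mail submission
# program when this unit finishes.
sleep 15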

Finally, as the Arch Wiki kind of notes, this is not quite the same thing as what cron does. Cron will send you email if your job produces any output, whether or not it fails; this will send you the logged output (if any) if the job fails. If the job succeeds but produces output, that output will go only to the systemd journal and you will get no notification. As far as I know there's no good way to completely duplicate cron's behavior here.

(Also, on failure the journal messages you get will include both actual stuff printed by the service and also, I believe, anything it logged to places like syslog; with cron you only get the former. This is probably a useful feature.)

SystemdTimersMailNotes written at 23:30:42

2019-11-06

Systemd needs official documentation on best practices

Systemd is reasonably well documented on the whole, although there are areas that are less well covered than others (some of them probably deliberately). For example, as far as I know everything you can put in a unit file is covered somewhere in the manpages. However, as was noted in the comments on my entry on how timer units can hide errors, much of this information is split across multiple places (eg, systemd.unit, systemd.service, systemd.exec, systemd.resource-control, and systemd.kill). This split is okay at one level, because the systemd manpages are explicitly reference documentation and the split makes perfect sense there; things that are common to all units are in systemd.unit, things that are common to running programs (wherever from) are in systemd.exec, and so on and so forth. Systemd even gives us an index, in systemd.directives, which is more than some documentation does.

But having reference documentation alone is not enough. Reference documentation tells you what you can do, but it doesn't tell you what you should do (and how you should do it). Systemd is a complex system with many interactions between its various options, and there are many ways to write systemd units that are bad ideas or that hide subtle (or not so subtle) catches and gotchas. We saw one of them yesterday, with using timer units to replace /etc/cron.d jobs. There is nothing in the current systemd documentation that will point out the potential drawbacks of doing this (although there is third party documentation if you stumble over it, cf).

This is why I say that systemd needs official documentation on best practices and how to do things. This would (or should) cover what you should do and not do when creating units, what the subtle issues you might not think about are, common mistakes people make in systemd units, and what sort of things you should think about when considering replacing traditional things like cron.d jobs with systemd specific things like timer units. Not having anything on best practices invites people to do things like the Certbot packagers have done, where on systemd systems errors from automatic Certbot renewal attempts mostly vanish instead of actually being clearly communicated to the administrator.

(You cannot expect people to carefully read all of the way through all of the systemd reference documentation and assemble a perfect picture of how their units will operate and what the implications of that are. That is simply too complex for people to keep full track of, and anyway people don't work that way outside of very rare circumstances.)

SystemdNeedsBestPractices written at 01:04:34

2019-11-04

Systemd timer units have the unfortunate practical effect of hiding errors

We've switched over to using Certbot as our Let's Encrypt client. As packaged for Ubuntu in their PPA, this is set up as a modern systemd-based package. In particular, it uses a systemd timer unit to trigger its periodic certificate renewal checks, instead of a cron job (which would be installed as a file in /etc/cron.d). This weekend, the TLS certificates on one of our machines silently failed to renew on schedule (at 30 days before they would expire, so this was not anywhere close to a crisis).

Upon investigation, we discovered a setup issue that had caused Certbot to error out (and then fixed it). However, this is not a new issue; in fact, Certbot has been reporting errors since October 22nd (every time certbot.service was triggered from certbot.timer, which is twice a day). That we hadn't heard about them points out a potentially significant difference between cron jobs and systemd timers, which is that cron jobs email you their errors and output, but systemd timers quietly swallow all errors and output into the systemd journal. This is a significant operational difference in practice, as we just found out.

(Technically it is the systemd service unit associated with the timer unit.)

Had Certbot been using a cron job, we would have gotten email on the morning of October 22nd when Certbot first found problems. But since it was using a systemd timer unit, that error output went to the journal and was effectively invisible to us, lost within a flood of messages that we don't normally look at and cannot possibly routinely monitor. We only found out about the problem when the symptoms of Certbot not running became apparent, ie when a certificate failed to be renewed as expected.

Unfortunately there's no good way to fix this, at least within systemd. The systemd.exec StandardOutput= setting has many options but none of them is 'send email to', and I don't think there's any good way to add mailing the output with a simple drop-in (eg, there is no option for 'send standard output and standard error through a pipe to this other command'). Making certbot.service send us email would require a wholesale replacement of the command it runs, and at that point we might as well disable the entire Certbot systemd timer setup and supply our own cron job.

(We do monitor the status of some systemd units through Prometheus's host agent, so perhaps we should be setting an alert for certbot.service being in a failed state. Possibly among other .service units for important timer units, but then we'd have to hand-curate that list as it evolves in Ubuntu.)
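A sketch of what such an alert rule could look like (this assumes the host agent is run with its systemd collector enabled, so that it exports node_systemd_unit_state):

groups:
  - name: certbot
    rules:
      - alert: CertbotServiceFailed
        expr: node_systemd_unit_state{name="certbot.service", state="failed"} == 1
        for: 30m
        annotations:
          summary: "certbot.service is in a failed state on {{ $labels.instance }}"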

PS: I think that you can arrange to get emailed if certbot.service fails, by using a drop in to add an 'OnFailure=' that starts a unit that sends email when triggered. But I don't think there's a good way to dig the actual error messages from the most recent attempt to start the service out of the journal, so the email would just be 'certbot.service failed on this host, please come look at the logs to see why'. This is an improvement, but it isn't the same as getting emailed the actual output and error messages. And I'm not sure if OnFailure= has side effects that would be undesirable.

SystemdTimersAndErrors written at 23:02:04

2019-10-30

Chrony has been working well for us (on Linux, where we use it)

We have a variety of machines around here that run NTP servers, for various reasons. In the beginning they all ran some version of the classic NTP daemon, NTPD, basically because that was your only option and was what everyone provided. Later, OpenBSD changed over to OpenNTPD and so our OpenBSD machines followed along as they were upgraded. Then various Linuxes started switching their default NTP daemon to chrony, and eventually that spread to our usage (first for me personally and then for our servers). These days, when we need to set up an NTP daemon on one of our Ubuntu machines, we reach for chrony. It's what we use on our Ubuntu fileservers and also on an additional machine that we use to provide time to firewalls that are on one of our isolated management subnets.

At the moment this means we have three different NTP daemon implementations running in our environment. An assortment of OpenBSD machines of various versions run various versions of OpenNTPD, a small number of CentOS 7 machines run NTPD version '4.2.6p5' (plus whatever modifications Red Hat has done), and a number of Ubuntu machines run chrony. This has given us some interesting cross comparisons of how all of these work for us in practice, and the quick summary is that chrony is the least troublesome of the three implementations.

Our experience with the CentOS 7 NTPD is that it takes a surprisingly long time after the daemon is started or restarted (including from a system reboot) for the daemon to declare that it has good time. Chrony seems to synchronize faster, or at least be more willing to declare that it has good time (since what we get to see is mostly what chrony reports through SNTP). Chrony also appears to update the system clock the most frequently out of these three NTP implementations, which turns out to sometimes matter for ntpdate.

(I don't want to draw any conclusions from our OpenNTPD experience, since our primary experience is with versions that are many years out of date by now.)

I do mildly wish that Linux distributions could agree on where to put chrony's configuration file; Ubuntu puts it in /etc/chrony, while Fedora just puts it in /etc. But this only affects me, since all of our servers with chrony are Ubuntu (although we may someday get some CentOS 8 servers, which will presumably follow Fedora here).

(Chrony also has the reassuring property that it will retry failed DNS lookups. Normally this is not an issue for us, but we've had two power failures this year where our internal DNS infrastructure wasn't working afterward until various things got fixed. Hopefully this isn't a concern for most people.)

ChronyWorksWell written at 23:20:17

2019-10-29

Netplan's interface naming and issues with it

Back in March, I wrote an entry about our problem with Netplan and routes on Ubuntu 18.04. In a comment on the entry, Trent Lloyd wrote a long and quite detailed reply that covered how netplan actually works here. If you use Netplan to any deep level, it is well worth reading in whole. My short and perhaps inaccurate summary is that Netplan is mostly a configuration translation layer on top of networkd, and its translation is relatively blind and brute force. This straight translation then puts limits on what alterations and matchings you can do, because of how Netplan will translate these to networkd directives and how they will work (or not work).

One of the things that this creates is a confusing interface naming problem. Suppose that you have a standardly created Netplan YAML file that looks like this:

network:
  version: 2
  renderer: networkd
  ethernets:
    eno1:
      addresses: [...]

The eno1 looks like it is an interface name, but it is actually two things at once; it is both a Netplan section name (this is my name for it; Netplan generally calls it a 'device name') and a network interface name. This section will cause Netplan to create a file /run/systemd/network/10-netplan-eno1.network (where eno1 is being used as a section name) that will start out with:

[Match]
Name=eno1

My original problem with routes doesn't actually require us to attach routes to an interface by name, as I thought when I wrote the entry. Instead it requires us to attach routes to a Netplan section by name, and it is just that Ubuntu creates a Netplan configuration where the two are silently the same.

(This split is also part of my confusion over what was possible with netplan wildcards. Netplan wildcards are for matching interface names, not section names. Because of how Netplan creates networkd configuration files and how networkd works, all things that are going to apply to a given interface must have the same section name, as I understand the situation.)
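To make this concrete, adding routes in a separate Netplan file means repeating the section name, not matching the interface (the file name and addresses here are made up):

# /etc/netplan/60-routes.yaml
network:
  version: 2
  ethernets:
    eno1:
      routes:
        - to: 172.16.0.0/16
          via: 128.100.3.1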

Trent Lloyd ends his comment (except for a parenthetical) by asking:

[...] Perhaps we should look at changing the default configurations to show a functional 'name' so that this kind of task is more obvious to the average user?

I endorse this. I think that it would make things clearer and simpler if there was a visible split in the default configuration between the section name and the interface name, so that my previous example would be:

network:
  version: 2
  renderer: networkd
  ethernets:
    mainif:
      match:
        name: eno1
      addresses: [...]

This is more verbose for a simple case, but that is the YAML bed that Netplan has decided to lie in.

This would make it possible to write generic Netplan rules that applied to your main interface regardless of what it was called, and provide silent guidance for what I now feel are the best practices for any additional interfaces you might later set up.

(Then it would be good to document the merging rules for sections, such as that you absolutely have to use 'mainif:' (or whatever) for everything that you are going to merge together and there is no wildcard matching on that level. In general the Netplan documentation suffers badly from not actually describing what is actually going on; since what is actually going on strongly affects what you can do and what will and won't work, this is a serious issue.)

Another approach would be to allow defining a Netplan level 'section alias', so your section would still be called 'eno1' but it could have the alias of 'mainif', and then any other Netplan configuration for 'mainif' would be folded in to it when Netplan wrote out the networkd configuration files that actually do things.

PS: Since Netplan has two backends, networkd and NetworkManager, your guess is as good as mine for how this would get translated in a NetworkManager based setup. This uncertainty is one of the problems of making Netplan so tightly coupled to its backend in what I will politely call an underdocumented way.

PPS: None of this changes my general opinion of Netplan, which is that I hope it goes away.

NetplanNamingProblem written at 23:00:05

2019-10-23

The DBus daemon and out of memory conditions (and systemd)

We have a number of systems where for reasons beyond the scope of this entry, we enable strict overcommit. In this mode, when you reach the system's memory limits the Linux kernel will deny memory allocations but usually not trigger the OOM killer to terminate processes. It's up to programs to deal with failed memory allocations as best they can, which doesn't always go very well. In our current setup on the most common machines we operate this way, we've set the vm.admin_reserve_kbytes sysctl to reserve enough space for root so that most or all of our system management scripts keep working and we at least don't get deluged in email from cron about jobs failing. This mostly works.

(The sysctl is documented in vm.txt.)
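For concreteness, the settings involved look something like this (the numbers are illustrative, not our exact values):

# strict overcommit: deny allocations instead of invoking the OOM killer
vm.overcommit_memory = 2
vm.overcommit_ratio = 80
# reserve this many KB for processes with cap_sys_admin (ie, root's tools)
vm.admin_reserve_kbytes = 262144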

Recently several of these machines hit an interesting failure mode that required rebooting them, even after the memory pressure had gone away. The problem is DBus, or more specifically the DBus daemon. The direct manifestation of the problem is that dbus-daemon logs an error message:

dbus-daemon[670]: [system] dbus-daemon transaction failed (OOM), sending error to sender inactive

After this error message is logged, attempts to do certain sorts of systemd-related DBus operations hang until they time out (if the software doing them has a timeout). Logins over SSH take quite a while to give you a shell, for example, as they fail to create sessions:

pam_systemd(sshd:session): Failed to create session: Connection timed out

The most relevant problem for us on these machines is that attempts to query metrics from the Prometheus host agent start hanging, likely because we have it set to pull information from systemd and this is done over DBus. Eventually there are enough hung metric probes so that the host agent starts refusing our attempts immediately.

The DBus daemon is not easy to restart (systemd will normally refuse to let you do it directly, for example), so I haven't found any good way of clearing this state. So far my method of recovering a system in this state is to reboot it, which I generally have to do with 'reboot -f' because a plain 'reboot' hangs (it's probably trying to talk to systemd over DBus).

I believe that part of what creates this issue is that the DBus daemon is not protected by vm.admin_reserve_kbytes. That sysctl specifically reserves space for UID 0 processes, but dbus-daemon doesn't run as UID 0; it runs as its own UID (often messagebus), for good security related reasons. As far as I know, there's no way to protect an arbitrary UID through vm.admin_reserve_kbytes; it specifically applies only to processes that hold a relatively powerful Linux security capability, cap_sys_admin. And unified cgroups (cgroup v2) don't have a true guaranteed memory reservation, just a best effort one (and we're using cgroup v1 anyway, which doesn't have anything here).

We're probably making this DBus issue much more likely to happen by having the Prometheus host agent talk to systemd, since this generates DBus traffic every time our Prometheus setup pulls host metrics from the agent (currently, every 15 seconds). At the same time, the systemd information is useful to find services that are dead when they shouldn't be and other problems.

(It would be an improvement if the Prometheus host agent would handle this sort of DBus timeout during queries, but that would only mean we got host metrics back, not that DBus was healthy again.)

PS: For us, all of this is happening on Ubuntu 18.04 with their version of systemd 237 and dbus 1.12.2. However I suspect that this isn't Ubuntu specific. I also doubt that this is systemd specific; I rather suspect that any DBus service using the system bus is potentially affected, and it's just that the most commonly used ones are from systemd and its related services.

(In fact on our Ubuntu 18.04 servers there doesn't seem to be much on the system bus apart from systemd related things, so if there are DBus problems at all, it's going to be experienced with them.)

DBusAndOOM written at 22:31:07

2019-10-19

Ubuntu LTS is (probably) still the best Linux for us and many people

I write a certain amount of unhappy things about Ubuntu here. This is not because I hate Ubuntu, contrary to what it may appear like; I don't write about things that I hate, because I try to think about them as little as possible (cf spam). Ubuntu is a tool for us, and I actually think it is a good tool, which is part of why we use it and keep using it. So today I'm going to write about the attractions of Ubuntu, specifically Ubuntu LTS, for people who want to get stuff done with their servers and for their users without too much fuss and bother (that would be us).

In no particular order:

  • It has a long support period, which reduces churn and the make-work of rebuilding and testing a service that is exactly the same except on top of a new OS and a new version of packages. We routinely upgrade many of our machines every other LTS version (which means reinstalling them), so we get around four years of life out of a given install (I wrote about this years ago here).

    (We have a whole raft of machines that were installed in the summer and fall of 2016, when 16.04 was fresh, and which will be rebuilt in the summer and fall of 2020 on 20.04.)

  • It has a regular and predictable release schedule, which is good for our planning in various ways. This includes figuring out if we want to hold off on building a new service up right now so that we can wait to base it on the next LTS release.

    (This regularity and predictability is one reason our Linux ZFS fileservers are based on Ubuntu instead of CentOS. 18.04 was there at the time, and CentOS 8 was unknown and uncertain.)

  • It has a large collection of packages (which mostly work, despite my grumbling). Building local copies of software is a pain in the rear and we want to do it as little as possible, ideally not at all.

  • It has relatively current software and refreshes its software on a regular basis (every two years, due to the LTS release cadence), which lets us avoid the problems caused by using zombie Linux distributions. This regular refresh is part of the appeal of the regular and predictable release schedule.

  • Since it's popular, it's well supported by software (often along with Debian). For two examples that are relevant to us, Grafana provides .debs and Certbot is available through a PPA.

  • Debian has made a number of good, sysadmin friendly decisions about how to organize configuration files for applications and Ubuntu has inherited them. For example, they have the right approach to Apache configuration.

I don't know of another Linux distribution that has all of these good things, and that includes both Debian and CentOS (despite what I said about Debian only a year ago). CentOS has very long support but neither predictable releases nor current software, and even with EPEL's improved state it may not have the package selection. Debian has unpredictable releases and a shorter support period.

(As a purely pragmatic matter we're unlikely to switch to something that is simply about as good as Ubuntu, even if it existed. Since switching or using two Linuxes has real costs, the new thing would have to be clearly better. We do use CentOS for some rare machines because the extremely long support period is useful enough for them.)

UbuntuLTSStillBestChoice written at 22:58:39

2019-10-15

The Ubuntu package roulette

Today I got to re-learn a valuable lesson, which is that just because something is packaged in Ubuntu doesn't mean that it actually works. Oh, it's probably not totally broken, but there's absolutely no guarantee that the package will be fully functional or won't contain problems that cause cron to email you errors at least once a day because of an issue that's been known since 2015.

I know the technical reasons for this, which are that Ubuntu pretty much blindly imports packages from Debian and Debian is an anarchy where partially broken packages can rot quietly. Possibly completely non-functional packages can rot too; I don't actually know how Debian handles that sort of situation. Ubuntu's import is mostly blind because Ubuntu doesn't have the people to do any better. This is also where people point out that the package in question is clearly in Ubuntu's universe repository, which the fine documentation euphemistically describes as 'community maintained'.

(I have my opinions on Ubuntu's community nature or lack thereof, but this is not the right entry for that.)

All of this doesn't matter; it is robot logic. What matters is the experience for people who attempt to use Ubuntu packages. Once you enable universe (and you probably will), Ubuntu's command line package management tools don't particularly make it clear where your packages live (not in the way that Fedora's dnf clearly names the repository that every package you install will come from, for example). It's relatively difficult to even see this after the fact for installed packages. The practical result is that an Ubuntu package is an Ubuntu package, and so most random packages are a spin on the roulette wheel with an uncertain bet. Probably it will pay off, but sometimes you lose.

(And then if you gripe about it, some people may show up to tell you that it's your fault for using something from universe. This is not a great experience for people using Ubuntu, either.)
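For what it's worth, the closest thing I know of to checking after the fact is to ask apt directly, for example:

apt-cache policy <package>

Its version table will at least list which repository each version comes from, but you have to go looking for it and know how to read the output.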

I'm not particularly angry about this particular case; this is why I set up test machines. I'm used to this sort of thing from Ubuntu. I'm just disappointed, and I'm sad that Ubuntu has created a structure that gives people bad experiences every so often.

(And yes, I blame Ubuntu here, not Debian, for reasons beyond the scope of this entry.)

UbuntuPackageRoulette written at 23:25:14

2019-09-29

Understanding when to use and not use the -F option for flock(1)

A while back I wrote some notes on understanding how to use flock(1), but those notes omitted a potentially important option, partly because that option was added somewhere in between util-linux version 2.27.1 (which is what Ubuntu 16.04 has) and version 2.31.1 (Ubuntu 18.04). That is the -F option, which is described in the manpage as:

Do not fork before executing command. Upon execution the flock process is replaced by command which continues to hold the lock. [...]

This option is incompatible with -o, as mentioned in the manpage.

The straightforward situation where you very much want to use -F is if you're trying to run a program that reacts specially to Control-C. If you run 'flock program', there will still be a flock process; it will get Control-C and exit, and undesirable things will probably happen. If you use 'flock -F program', there is only the program and it can react properly to Control-C without any side effects on other processes.

(I'm assuming here that if you ran flock and the program from inside a shell script, you ran it with 'exec flock ...'. If you're in a situation where you have to do things in your shell script after the program finishes, you can't solve the Control-C problem just with this.)
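As a concrete sketch of the shell script case (the lock file and program name are stand-ins):

exec flock -x -n -F /some/lockfile some-program --its-args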

However, there is also a situation where you don't want to use -F, and to see it we need to understand how the flock lock is continued to be held by the command. As covered in the first note, flock(1) works through flock(2), which means that the lock is 'held' by having the flock()'d file descriptor still be open. Most programs are indifferent to inheriting extra file descriptors, so this additional descriptor from flock just hangs around, keeping the lock held. However, some programs actively seek out and close file descriptors they may have inherited, often to avoid leaking them into child processes. If you use 'flock -F' with such a program, your lock will be released prematurely (before the program exits) when the program does this.

(The existence of such programs is probably part of why flock -F is not the default behavior.)

Sidebar: Faking 'flock -F' if you don't have it

If you have a shell script that has to run on Ubuntu 16.04 and you need this behavior, you can fake it with flock's file descriptor number form. It goes like this:

exec 9>>/some/lockfile
flock -x -n 9 || exit 0
exec program ...

Since 'flock -F' locks some file descriptor and then exec's the program, we can imitate it by doing the same manually; we pick a random file descriptor number, get the shell to open a file on that file descriptor and leave it open, flock that file descriptor, and then have the shell exec our program. Our program will inherit the locked fd 9 and the lock remains for as long as fd 9 is open. When the program exits, all of its file descriptors will be closed, including fd 9, and the lock will be released.

FlockUsageNotesII written at 00:59:06
