Wandering Thoughts archives

2019-09-29

Understanding when to use and not use the -F option for flock(1)

A while back I wrote some notes on understanding how to use flock(1), but those notes omitted a potentially important option, partly because that option was added somewhere between util-linux version 2.27.1 (which is what Ubuntu 16.04 has) and version 2.31.1 (Ubuntu 18.04). That is the -F option, which is described in the manpage as:

Do not fork before executing command. Upon execution the flock process is replaced by command which continues to hold the lock. [...]

This option is incompatible with -o, as mentioned in the manpage.

The straightforward situation where you very much want to use -F is if you're trying to run a program that reacts specially to Control-C. If you run 'flock program', there will still be a flock process, it will get Control-C and exit, and undesirable things will probably happen. If you use 'flock -F program', there is only the program and it can react properly to Control-C without any side effects on other processes.

(I'm assuming here that if you ran flock and the program from inside a shell script, you ran it with 'exec flock ...'. If you're in a situation where you have to do things in your shell script after the program finishes, you can't solve the Control-C problem just with this.)

However, there is also a situation where you don't want to use -F, and to see it we need to understand how the command continues to hold the flock lock. As covered in the first note, flock(1) works through flock(2), which means that the lock is 'held' by having the flock()'d file descriptor still be open. Most programs are indifferent to inheriting extra file descriptors, so this additional descriptor from flock just hangs around, keeping the lock held. However, some programs actively seek out and close file descriptors they may have inherited, often to avoid leaking them into child processes. If you use 'flock -F' with such a program, your lock will be released prematurely (before the program exits) when the program does this.
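To see this descriptor-based behavior directly, here is a minimal sketch. The temporary lock file, the function name, and the choice of fd 9 are arbitrary assumptions for illustration. It locks a file through fd 9, checks that a second flock on the same file is refused, then closes fd 9 the way a descriptor-closing program would and checks that the lock has gone away:

```shell
# Show that a flock(2) lock lasts exactly as long as the locked fd stays open.
demo_lock_lifetime() {
    demo_lockfile=$(mktemp) || return 1
    (
        exec 9>>"$demo_lockfile"     # open the lock file on fd 9
        flock -x -n 9 || exit 1      # take the lock through that descriptor

        # A separate open of the same file is refused while fd 9 is open.
        if flock -x -n "$demo_lockfile" true; then echo "free"; else echo "held"; fi

        exec 9>&-                    # close fd 9, as an fd-closing program would

        # With the descriptor closed, the lock has been released.
        if flock -x -n "$demo_lockfile" true; then echo "released"; else echo "still held"; fi
    )
    rm -f "$demo_lockfile"
}

demo_lock_lifetime
```

The second check is what happens to you if the program you ran with 'flock -F' closes its inherited descriptors while it is still running.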

(The existence of such programs is probably part of why flock -F is not the default behavior.)

Sidebar: Faking 'flock -F' if you don't have it

If you have a shell script that has to run on Ubuntu 16.04 and you need this behavior, you can fake it with the form of flock that takes a file descriptor number instead of a file. It goes like this:

exec 9>>/some/lockfile
flock -x -n 9 || exit 0
exec program ...

Since 'flock -F' locks some file descriptor and then exec's the program, we can imitate it by doing the same manually; we pick an arbitrary file descriptor number, get the shell to open a file on that file descriptor and leave it open, flock that file descriptor, and then have the shell exec our program. Our program will inherit the locked fd 9 and the lock remains for as long as fd 9 is open. When the program exits, all of its file descriptors will be closed, including fd 9, and the lock will be released.

FlockUsageNotesII written at 00:59:06

2019-09-27

Some field notes on imposing memory resource limits on users on Ubuntu 18.04

As I mentioned in my entry on how we implement per-user CPU and memory limits, we have a number of shared general use servers where we've decided we need to impose limits on everyone all of the time so no one person can blow up the machine. Over the course of doing this, we've built up some practical experience and discovered a surprise or two.

As discussed, we impose our memory limits by setting systemd's MemoryLimit. In theory perhaps we should use MemoryMax, but there are two issues. First, our scripts still need to work on Ubuntu 16.04, where MemoryMax isn't supported. Second, it's not clear if systemd makes this work if you're not using unified cgroups (cgroup v2), and the Ubuntu 18.04 default is to use the original v1 cgroups instead of the new cgroups. Since my impression is that there are still assorted issues with v2 cgroups, we're not inclined to switch away from the Ubuntu default here.

As documented, systemd's MemoryLimit sets the memory.limit_in_bytes cgroup attribute, which is sort of documented in the kernel's memory.txt. The important thing to know, which is only implicitly discussed in memory.txt, is that this only limits the amount of RAM that you can use, not the amount of swap space. In the Ubuntu 18.04 configuration of cgroup v1, there is simply no way to limit swap space usage, and on top of that systemd doesn't expose the property that you'd need.

Our experience is that this doesn't seem to matter for processes that use a lot of memory very rapidly; they run into their user's MemoryLimit almost immediately without causing swap thrashing and get killed by the cgroups OOM killer. However, processes that slowly grow in memory usage over time will wind up pushing things out to swap, themselves included, and as a result their actual memory usage can significantly exceed your MemoryLimit setting if you have enough swap. So far, we haven't experienced swap thrashing as a result of this, but I suspect that it's possible. Obviously, how much swap space you have strongly affects how much total memory a user can use before the cgroups OOM killer triggers. All of this can make your memory limit much more generous than you expect.

(We normally don't configure much swap on our servers, but a few have several gigabytes of it for various reasons. And even with only one GB of swap, that might be close to a GB more of 'memory' usage than you may have expected.)
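As a rough illustration of the arithmetic, here is a small sketch of the worst case, assuming one user manages to fill all of swap on top of their MemoryLimit. The function name and the example numbers are made up for illustration; the only real input is SwapTotal from /proc/meminfo, which is used when no swap size is given:

```shell
# Worst-case total 'memory' (RAM plus swap) one user might reach, in bytes.
# $1: MemoryLimit in bytes
# $2: swap size in kB (optional; defaults to SwapTotal from /proc/meminfo)
worst_case_bytes() {
    limit_bytes=$1
    swap_kb=${2:-$(awk '/^SwapTotal:/ {print $2}' /proc/meminfo)}
    echo $(( limit_bytes + swap_kb * 1024 ))
}

# Example: an 8 GiB MemoryLimit on a machine with 1 GiB of swap.
worst_case_bytes $((8 * 1024 * 1024 * 1024)) $((1024 * 1024))
```

With those example numbers the result is 9663676416 bytes, or 9 GiB of potential usage under a nominal 8 GiB limit.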

PS: I was going to say that fast-growing processes don't seem to swap much, but our Prometheus system stats suggest that that's wrong and we do see significant and rapid swap usage. Since much of our swap is on SSDs these days, I suppose that I shouldn't be too impressed with how fast our systems can write it out; a GB or three over a minute is not all that fast in today's world, and SSDs are very good at random IO.

Sidebar: What I expect us to set with systemd v2 cgroups

If Ubuntu switches to v2 cgroups by default, I currently think we'd set a per-user MemorySwapMax that was at most a GB or half our swap space, whichever was smaller, make our current MemoryLimit be MemoryMax, and set MemoryHigh to a value a GB or so lower than MemoryMax. The thing I'm least certain about is what we'd want to set the swap limit to.
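Here is a hedged sketch of how those values might be computed for one user. The function name and the example sizes are hypothetical, and the function only prints the 'systemctl --runtime set-property' command instead of running it. It assumes the current MemoryLimit is comfortably more than a GB:

```shell
# Print (not run) the v2 settings sketched above for one user slice.
# $1: uid, $2: current MemoryLimit in bytes, $3: total swap in bytes
v2_limits() {
    uid=$1
    mem_max=$2
    swap_total=$3
    gib=$((1024 * 1024 * 1024))

    # Swap limit: a GB or half of swap, whichever is smaller.
    swap_max=$((swap_total / 2))
    if [ "$swap_max" -gt "$gib" ]; then
        swap_max=$gib
    fi

    # MemoryHigh: a GB below MemoryMax.
    mem_high=$((mem_max - gib))

    echo "systemctl --runtime set-property user-${uid}.slice" \
         "MemoryMax=${mem_max} MemoryHigh=${mem_high} MemorySwapMax=${swap_max}"
}

# Example: uid 1000, an 8 GiB limit, 4 GiB of swap.
v2_limits 1000 $((8 * 1024 * 1024 * 1024)) $((4 * 1024 * 1024 * 1024))
```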

SystemdUserMemoryLimits written at 22:34:45

2019-09-25

Our workaround for Ubuntu 16.04 and 18.04 failing to reliably reboot some of our servers

A few years ago I wrote about how and why systemd on Ubuntu 16.04 couldn't reliably reboot some of our servers. At the time I finished off the entry by suggesting that we'd live with the intermittent failures that caused some of our systems to hang during reboot attempts, forcing us to go power cycle them. Shortly afterward, we changed our minds and decided to work around the situation by resorting to a bigger hammer. These days we use our bigger hammer on both Ubuntu 16.04 and Ubuntu 18.04; the latter may have improved some aspects of the shutdown situation, but our experience is that it hasn't fixed all of them.

The fundamental problem is that systemd can leave descendant processes running even when it has nominally terminated a systemd service, such as Apache, cron, or Exim. These lingering processes are not killed (or attempted to be killed) until very late and can cause a variety of problems during NFS unmounts, turning off swap, or various other portions of system shutdown. To deal with this, we use the big hammer of doing it ourselves; during shutdown, we run a script to kill lingering processes from various service units.

The script has a list of systemd services. For each service, it first looks in the systemd cgroup hierarchy to see if there are still processes associated with the service, by counting how many lines there are in /sys/fs/cgroup/systemd/system.slice/<what>.service/tasks. If there are processes still associated with the service, it kills them with SIGTERM and then SIGKILL (if necessary), using systemd itself to do the work with:

systemctl --kill-who=all --signal=SIG... kill <what>.service

(The actual implementation is slightly more complicated.)
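A minimal sketch of the per-service step might look like the following. This is not our actual script; the CGROOT variable, the function names, and the five second pause between signals are assumptions made for illustration, and the real script's logging is left out:

```shell
# Where systemd's v1 cgroup hierarchy keeps system services (overridable for testing).
CGROOT=${CGROOT:-/sys/fs/cgroup/systemd/system.slice}

# Count the processes still attached to a service's cgroup (0 if the cgroup is gone).
service_tasks() {
    tasks="$CGROOT/$1.service/tasks"
    if [ -r "$tasks" ]; then
        wc -l < "$tasks" | tr -d ' '
    else
        echo 0
    fi
}

# Send SIGTERM to a service's leftover processes, then SIGKILL if any survive.
kill_lingering() {
    svc=$1
    [ "$(service_tasks "$svc")" -gt 0 ] || return 0
    systemctl --kill-who=all --signal=SIGTERM kill "$svc.service"
    sleep 5   # assumed grace period before escalating
    [ "$(service_tasks "$svc")" -gt 0 ] || return 0
    systemctl --kill-who=all --signal=SIGKILL kill "$svc.service"
}
```

A full script would then call kill_lingering once for each service in its list, for example for cron, apache2, and exim4.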

The script has a bunch of logging to report on whether it had to do anything, what it did, and what the process tree looked like before and after it did various killing (as reported through systemd-cgls, because that will show us what systemd units the stray processes are under).

All of this is driven by a systemd .service unit with the following relevant bits:

[Unit]
After=remote-fs.target
Before=cron.service apache2.service exim4.service atd.service slurmd.service

[Service]
Type=oneshot
RemainAfterExit=True
ExecStop=/path/to/script

We set After so that our stop action is run before NFS unmounting starts, and Before so that the stop action happens after those listed services are shut down. Not all of those services exist and are enabled on all machines, but listing a Before service that isn't enabled is harmless. The Before list is basically 'what has caused us problems'; we add things to it as we run into problem services.

(Slurmd is a recent addition, for example.)

Right now the list of 'before' services is duplicated between the script and the systemd unit. It feels tempting to try to eliminate that, but on the other hand I'm not sure I want to be introspecting systemd too much during shutdown. We could also try to be more general by sniffing around the cgroup hierarchy to find stray processes from any unit we don't whitelist (or at least any unit that's theoretically been shut down). However, that might not be very useful on modern systems, where 'KillMode=control-group' is the default.

The good news is that the script's logging suggests that it usually doesn't need to do anything during system shutdown on our 18.04 machines. But usually isn't always, which is what prompted the addition of slurmd.service.

Sidebar: A potential alternate approach

Basically this is making these units behave as if they were set to 'KillMode=control-group' during shutdown. You can change systemd unit properties on the fly and only for the current system boot (with 'systemctl --runtime set-property', which we use for our per-user CPU and memory limits), so perhaps it would work to switch to this KillMode on the relevant service units early in the shutdown process.

This option didn't even occur to me until I wrote this entry, and in general it seems more uncertain and chancy than just killing things (even if we're killing things indirectly through systemd). But it'd give you a much smaller and simpler script.

SystemdUbuntuRebootWorkaround written at 00:44:54

2019-09-24

How we implement per-user CPU and memory resource limits on Ubuntu

A while back I wrote about imposing temporary CPU and memory limits on a user, using cgroups and systemd's features to fiddle around with them. Since then we have wound up with a number of shared general use machines where we've decided it's wiser to impose limits on everyone all of the time, so that one person can't blow up a general use server through either excessive CPU usage or excessive memory usage. We've done this on Ubuntu 16.04 and now 18.04, and with some limitations it works well to keep our systems from having too many problems.

We've found that at least on 18.04, it's impossible to implement this without running a script at user login time (or more generally when a session is established). We run our script through pam_exec in Ubuntu's /etc/pam.d/common-session:

session optional pam_exec.so type=open_session /path/to/script.sh

The 'optional' bit here is really important. If you leave it out and you ever have an error in your script, you will have just locked everyone out of the machine (yourself included). As they say, ask me how I know (fortunately I did this on a test virtual machine, not a live server).

(Because we've found that limiting cron and at jobs to be necessary in our environment, we've also put this into /etc/pam.d/cron and /etc/pam.d/atd. This requires cron and at jobs to be in user sessions, but we were already doing that for other reasons.)

The script has to do two things. First, it has to turn on fair share scheduling if it's not already on. You have to check this on every session startup, because if all existing user sessions go away (ie there's no one logged in and so on), the whole fair share scheduling setup disappears. Because we want to limit both CPU and memory usage, we set both 'CPUAccounting=true' and 'MemoryAccounting=true' for user.slice itself, all currently existing 'user-*.slice' slices, and all currently existing 'session-*.scope' scopes. It's possible that some of this is overkill.

Second, we set appropriate per-user limits (based on the various bits of information about the size of the machine) by setting appropriate 'CPUQuota=...%' and 'MemoryLimit=...' values on 'user-${PAM_UID}.slice'. We also set a TasksMax. As we currently have our script implemented, it blindly overwrites any existing settings for the user's slice any time the user starts a new session, which has both advantages and drawbacks.

(All of this setting is done with 'systemctl --runtime set-property'.)

We've chosen to not do any of this for sessions for system users, including root. If the script sees that ${PAM_UID} is outside our regular user UID range, it does nothing, so root's logins are unrestricted. We could have implemented this in the pam.d file itself, using pam_succeed_if, but I feel that scripts are a better place for conditional logic like this if possible.
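Put together, the core of such a script could be sketched as follows. This is an illustration, not our real script: the UID range, the limit values, the SETPROP variable, and the function name are all hypothetical, and for brevity it only sets accounting on user.slice and the user's own slice, not on every existing slice and scope:

```shell
# Hypothetical values; a real script would pick these based on the machine.
MIN_UID=1000          # assumed start of the regular user UID range
MAX_UID=60000         # assumed end of the regular user UID range
SETPROP=${SETPROP:-systemctl --runtime set-property}

# $1: the UID of the user whose session is starting
apply_user_limits() {
    uid=$1

    # System users, root included, are left unrestricted.
    if [ "$uid" -lt "$MIN_UID" ] || [ "$uid" -gt "$MAX_UID" ]; then
        return 0
    fi

    # Re-assert accounting on every session start, since it can disappear.
    $SETPROP user.slice CPUAccounting=true MemoryAccounting=true
    $SETPROP "user-${uid}.slice" CPUAccounting=true MemoryAccounting=true

    # Per-user limits, blindly overwriting whatever was set before.
    $SETPROP "user-${uid}.slice" CPUQuota=400% MemoryLimit=8G TasksMax=1000
}
```

The script would call apply_user_limits with the UID of the session's user. Setting SETPROP to 'echo' gives a dry run that just prints what would be set, which is a much safer way to test than locking yourself out.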

In the future, some of this may be possible to do through systemd drop-ins for user.slice and individual user slices. However, it certainly won't be as flexible as you can be in a script, especially if you want to behave differently for different UIDs and you have enough users that you don't want to create and maintain individual files for each of them. It would be nice to be able to reliably set fair share scheduling once, though, and not have to keep re-setting it through the script.

Ubuntu1804SystemdUserLimits written at 00:20:37

2019-09-22

The increasing mess of libreadline versions in Linux distributions

GNU Readline is normally supplied and used by programs as a shared library (even if it's possible to statically link it, almost no one does). Some or perhaps many of those programs are not from the distribution; instead they're your local programs or third party programs. Shared libraries have major and minor versions (and also symbol versioning, but let's ignore that for now). The minor version of a shared library can be changed without upsetting programs linked to it, but the major version can't be; different major versions of a shared library are considered to be entirely different things. If your system has libreadline.so.6 and libreadline.so.8, and you're trying to run a program that was linked against libreadline.so.7, you're out of luck.

(Major shared library version differences are required by ABI differences even if the API is the same and the program's source code can be immediately rebuilt against a different version of the shared library with no code changes.)

Unfortunately, two things are true. First, the GNU Readline people apparently do things that change the ABI on a regular basis, which causes new versions of the shared library to have new .so major versions (again, on a regular basis). Second, Linux distributions are increasingly providing an incomplete set of past libreadline shared library versions. This came up with Ubuntu 18.04 and libreadline.so.6, and recently I discovered that Fedora 30 has moved from libreadline.so.7 to libreadline.so.8 and not provided a compatibility package for version 7 (although they do provide compatibility packages for readline 6 and readline 5).

(I'm assuming here that the shared library version is changing due to genuine ABI incompatibility, instead of just the GNU Readline people deciding to call their latest release 'readline 8' and everyone following along in the .so versions.)

Just as with Ubuntu, the net effect is that it's impossible to build a local binary that uses GNU Readline that works on both Fedora 29 and Fedora 30, or even that can survive your system being upgraded from Fedora 29 to Fedora 30. If you upgrade your system, you get to immediately rebuild all programs using GNU Readline. I don't think you can even install the Fedora 29 readline 7 RPM on a Fedora 30 system without blowing things up.

It's my strong opinion that this overall situation is asinine. Linux distributions are not KYSTY environments, where the system packages are only for system use; a Linux distribution can reasonably expect people to build software against system shared libraries. Right now, using GNU Readline in such software is pointing a gun at your own foot, and not using GNU Readline is annoying for people who use your software (people like those readline features, and for good reason).

(At the same time, the Linux distributions are not the only people you can blame. The GNU Readline people are presumably unlikely to do bug fixes and security updates for GNU Readline 7, because they've moved on to GNU Readline 8. Linux distributions don't want to have to take on the burden of maintaining a long tail of GNU Readline versions that are no longer supported upstream.)

As a side note, it's very easy to miss that this has happened to some of your binaries if you only run them once in a while. I generally assume that Linux binaries are quite stable and so don't run around testing and rebuilding things after Fedora upgrades; generally I don't even think about the possibility of things like missing shared libraries.
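One way to catch this after an upgrade is to sweep your local binaries with ldd and look for libraries it reports as not found. The following is a small sketch; the function names are made up, and the directory you scan (perhaps something like /usr/local/bin) is up to you. Since ldd may run code from the binary it inspects, this is only suitable for binaries you trust:

```shell
# Filter ldd output on stdin down to the names of libraries that were not found.
missing_libs() {
    awk '$2 == "=>" && $3 == "not" && $4 == "found" {print $1}'
}

# Report every executable file in a directory that has unresolved shared libraries.
# $1: directory to scan
scan_dir() {
    for f in "$1"/*; do
        [ -f "$f" ] && [ -x "$f" ] || continue
        libs=$(ldd "$f" 2>/dev/null | missing_libs)
        if [ -n "$libs" ]; then
            # Unquoted on purpose so several missing libraries print on one line.
            echo "$f:" $libs
        fi
    done
    return 0
}
```

A program that was linked against libreadline.so.7 on a system that now only has libreadline.so.8 would show up here with libreadline.so.7 listed after its name.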

PS: In their current development versions, Debian appears to have both libreadline7 and libreadline8; older versions of GNU Readline seem to be more spotty in general. The current stable Debian has libreadline7.

ReadlineDistroVersionMess written at 23:51:56

2019-09-12

The mystery of why my Fedora 30 office workstation was booting fine

The other day, I upgraded the kernel on my office workstation, much as I have any number of times before, and rebooted. Things did not go well:

So the latest Fedora 30 updates (including a kernel update) build an initramfs that refuses to bring up software RAID devices, including the one that my root filesystem is on. Things do not go well afterwards.

Then I said:

Fedora's systemd, Dracut and kernel parameters setup have now silently changed to require either rd.md.uuid for your root filesystem or rd.auto. The same kernel command line booted previous kernels with previous initramfs's.

The first part of this is wrong, and that leads to the mystery.

In Fedora 29, my kernel command line was specifying both the root filesystem device by name ('root=/dev/md20') and the software RAID arrays for the initramfs to bring up (as 'rd.md.uuid=...'). When I upgraded to Fedora 30 in mid-August, various things happened and I wound up removing both of those from the kernel command line, specifying the root filesystem device only by UUID ('root=UUID=...'). This kernel command line booted a series of Fedora 30 kernels, most recently 5.2.11 on September 4th, right up until yesterday.

However, it shouldn't have. As the dracut.cmdline manpage says, the default since Dracut 024 has been to not auto-assemble software RAID arrays in the absence of either rd.auto or rd.md.uuid. And the initramfs for older kernels (at least 5.2.11) was theoretically enforcing that; the journal for that September 4th boot contains a report of:

dracut-pre-trigger[492]: rd.md=0: removing MD RAID activation

But then a few lines later, md/raid1:md20 is activated:

kernel: md/raid1:md20: active with 2 out of 2 mirrors

(The boot log for the new kernel for a failed boot also had the dracut-pre-trigger line, but obviously no mention of the RAID being activated.)

I unpacked the initramfs for both kernels and as far as I can tell they're identical in terms of the kernel modules included and the configuration files and scripts (there are differences in some binaries, which is expected since systemd and some other things got upgraded between September 4th and now). Nor has the kernel configuration changed between the two kernels according to the config-* files in /boot.

So by all evidence, the old kernel and initramfs should not auto-assemble my root filesystem's software RAID and thus shouldn't boot. But, they do. In fact they did yesterday, because when the new kernel failed to boot the first thing I did was boot with the old one. I just don't know why, and that's the mystery.

My fix for my boot issue is straightforward; I've updated my kernel command line to have the 'rd.md.uuid=...' that it should have had all along. This works fine.
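If you want to check ahead of time whether a kernel command line asks Dracut to assemble software RAID at all, a rough sketch is below. The function name is made up, and it only checks that rd.auto or some rd.md.uuid is present; it can't tell you whether the UUID listed is actually the one for your root filesystem's array:

```shell
# Succeed if a kernel command line asks dracut to assemble software RAID,
# either through rd.auto or through at least one rd.md.uuid=... entry.
# $1: the kernel command line as a single string
cmdline_assembles_md() {
    for arg in $1; do
        case $arg in
            rd.auto|rd.auto=1|rd.md.uuid=*) return 0 ;;
        esac
    done
    return 1
}

# Check the command line of the currently running kernel.
if cmdline_assembles_md "$(cat /proc/cmdline 2>/dev/null)"; then
    echo "command line requests MD assembly"
else
    echo "no rd.auto or rd.md.uuid on the command line"
fi
```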

(My initial recovery from the boot failure was to use 'rd.auto', but I've decided that I don't want to auto-assemble anything and everything that the initramfs finds. I'll have the initramfs only assemble the bare minimum, just in case. While my swap is also on software RAID, I specifically decided to not assemble it in the initramfs; I don't really need it until later.)

Fedora30BootMystery written at 23:02:06


