Wandering Thoughts

2019-10-15

The Ubuntu package roulette

Today I got to re-learn a valuable lesson, which is that just because something is packaged in Ubuntu doesn't mean that it actually works. Oh, it's probably not totally broken, but there's absolutely no guarantee that the package will be fully functional or won't contain problems that cause cron to email you errors at least once a day because of an issue that's been known since 2015.

I know the technical reasons for this, which is that Ubuntu pretty much blindly imports packages from Debian and Debian is an anarchy where partially broken packages can rot quietly. Possibly completely non-functional packages can rot too; I don't actually know how Debian handles that sort of situation. Ubuntu's import is mostly blind because Ubuntu doesn't have the people to do any better. This is also where people point out that the package in question is clearly in Ubuntu's universe repository, which the fine documentation euphemistically describes as 'community maintained'.

(I have my opinions on Ubuntu's community nature or lack thereof, but this is not the right entry for that.)

All of this doesn't matter; it is robot logic. What matters is the experience for people who attempt to use Ubuntu packages. Once you enable universe (and you probably will), Ubuntu's command line package management tools don't particularly make it clear where your packages live (not in the way that Fedora's dnf clearly names the repository that every package you install will come from, for example). It's relatively difficult to even see this after the fact for installed packages. The practical result is that an Ubuntu package is an Ubuntu package, and so most random packages are a spin on the roulette wheel with an uncertain bet. Probably it will pay off, but sometimes you lose.

(And then if you gripe about it, some people may show up to tell you that it's your fault for using something from universe. This is not a great experience for people using Ubuntu, either.)
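
For what it's worth, you can dig out where a package comes from after the fact with 'apt-cache policy', although you have to read its version table; the package name here is just a stand-in:

apt-cache policy somepackage

The repository lines in its version table include the component, so a 'universe' there is the tell.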

I'm not particularly angry about this particular case; this is why I set up test machines. I'm used to this sort of thing from Ubuntu. I'm just disappointed, and I'm sad that Ubuntu has created a structure that gives people bad experiences every so often.

(And yes, I blame Ubuntu here, not Debian, for reasons beyond the scope of this entry.)

UbuntuPackageRoulette written at 23:25:14

2019-09-29

Understanding when to use and not use the -F option for flock(1)

A while back I wrote some notes on understanding how to use flock(1), but those notes omitted a potentially important option, partly because that option was added somewhere in between util-linux version 2.27.1 (which is what Ubuntu 16.04 has) and version 2.31.1 (Ubuntu 18.04). That is the -F option, which is described in the manpage as:

Do not fork before executing command. Upon execution the flock process is replaced by command which continues to hold the lock. [...]

This option is incompatible with -o, as mentioned in the manpage.

The straightforward situation where you very much want to use -F is if you're trying to run a program that reacts specially to Control-C. If you run 'flock program', there will still be a flock process, it will get Control-C and exit, and undesirable things will probably happen. If you use 'flock -F program', there is only the program and it can react properly to Control-C without any side effects on other processes.

(I'm assuming here that if you ran flock and the program from inside a shell script, you ran it with 'exec flock ...'. If you're in a situation where you have to do things in your shell script after the program finishes, you can't solve the Control-C problem just with this.)
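
As a concrete sketch, the difference looks like this in a script wrapper (the lock file and program here are made up, and you'd use only one of the two forms):

# without -F, an intermediate flock process sticks around holding the lock:
exec flock -x -n /run/lock/frobnicate.lock frobnicate
# with -F, flock replaces itself with the program, which then holds the lock:
exec flock -x -n -F /run/lock/frobnicate.lock frobnicate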

However, there is also a situation where you don't want to use -F, and to see it we need to understand how the flock lock continues to be held by the command. As covered in the first note, flock(1) works through flock(2), which means that the lock is 'held' by having the flock()'d file descriptor still be open. Most programs are indifferent to inheriting extra file descriptors, so this additional descriptor from flock just hangs around, keeping the lock held. However, some programs actively seek out and close file descriptors they may have inherited, often to avoid leaking them into child processes. If you use 'flock -F' with such a program, your lock will be released prematurely (before the program exits) when the program does this.

(The existence of such programs is probably part of why flock -F is not the default behavior.)

Sidebar: Faking 'flock -F' if you don't have it

If you have a shell script that has to run on Ubuntu 16.04 and you need this behavior, you can fake it by having the shell open and flock a file descriptor itself. It goes like this:

exec 9>>/some/lockfile
flock -x -n 9 || exit 0
exec program ...

Since 'flock -F' locks some file descriptor and then exec's the program, we can imitate it by doing the same manually; we pick a random file descriptor number, get the shell to open a file on that file descriptor and leave it open, flock that file descriptor, and then have the shell exec our program. Our program will inherit the locked fd 9 and the lock remains for as long as fd 9 is open. When the program exits, all of its file descriptors will be closed, including fd 9, and the lock will be released.

FlockUsageNotesII written at 00:59:06

2019-09-27

Some field notes on imposing memory resource limits on users on Ubuntu 18.04

As I mentioned in my entry on how we implement per-user CPU and memory limits, we have a number of shared general use servers where we've decided we need to impose limits on everyone all of the time so no one person can blow up the machine. Over the course of doing this, we've built up some practical experience and discovered a surprise or two.

As discussed, we impose our memory limits by setting systemd's MemoryLimit. In theory perhaps we should use MemoryMax instead, but there are two issues. First, our scripts still need to work on Ubuntu 16.04, where MemoryMax isn't supported. Second, it's not clear if systemd makes this work if you're not using unified cgroups (cgroup v2), and the Ubuntu 18.04 default is to use the original v1 cgroups instead of the new cgroups. Since my impression is that there are still assorted issues with v2 cgroups, we're not inclined to switch away from the Ubuntu default here.

As documented, systemd's MemoryLimit sets the memory.limit_in_bytes cgroup attribute, which is sort of documented in the kernel's memory.txt. The important thing to know, which is only implicitly discussed in memory.txt, is that this only limits the amount of RAM that you can use, not the amount of swap space. In the Ubuntu 18.04 configuration of cgroup v1, there is simply no way to limit swap space usage, and on top of that systemd doesn't expose the property that you'd need.
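
If you want to double check what limit actually got applied to someone, you can read the cgroup attribute directly; the path here assumes the standard v1 hierarchy and uses UID 1000 as a stand-in:

cat /sys/fs/cgroup/memory/user.slice/user-1000.slice/memory.limit_in_bytes

The value is in bytes, and a gigantic number means that no limit has been set.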

Our experience is that this doesn't seem to matter for processes that use a lot of memory very rapidly; they run into their user's MemoryLimit almost immediately without causing swap thrashing and get killed by the cgroups OOM killer. However, processes that slowly grow in memory usage over time will wind up pushing things out to swap, themselves included, and as a result their actual memory usage can significantly exceed your MemoryLimit setting if you have enough swap. So far, we haven't experienced swap thrashing as a result of this, but I suspect that it's possible. Obviously, how much swap space you have strongly affects how much total memory a user can use before the cgroups OOM killer triggers. All of this can make your memory limit much more generous than you expect.

(We normally don't configure much swap on our servers, but a few have several gigabytes of it for various reasons. And even with only one GB of swap, that might be close to a GB more of 'memory' usage than you may have expected.)

PS: I was going to say that fast-growing processes don't seem to swap much, but our Prometheus system stats suggest that that's wrong and we do see significant and rapid swap usage. Since much of our swap is on SSDs these days, I suppose that I shouldn't be too impressed with how fast our systems can write it out; a GB or three over a minute is not all that fast in today's world, and SSDs are very good at random IO.

Sidebar: What I expect us to set with systemd v2 cgroups

If Ubuntu switches to v2 cgroups by default, I currently think we'd set a per-user MemorySwapMax that was at most a GB or half our swap space, whichever was smaller, make our current MemoryLimit be MemoryMax, and set MemoryHigh to a value a GB or so lower than MemoryMax. The thing I'm least certain about is what we'd want to set the swap limit to.
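
In concrete terms, and with entirely made up numbers for a hypothetical user slice, that would be something like:

systemctl --runtime set-property user-1000.slice MemoryMax=8G MemoryHigh=7G MemorySwapMax=1G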

SystemdUserMemoryLimits written at 22:34:45

2019-09-25

Our workaround for Ubuntu 16.04 and 18.04 failing to reliably reboot some of our servers

A few years ago I wrote about how and why systemd on Ubuntu 16.04 couldn't reliably reboot some of our servers. At the time I finished off the entry by suggesting that we'd live with the intermittent failures that caused some of our systems to hang during reboot attempts, forcing us to go power cycle them. Shortly afterward, we changed our minds and decided to work around the situation by resorting to a bigger hammer. These days we use our bigger hammer on both Ubuntu 16.04 and Ubuntu 18.04; the latter may have improved some aspects of the shutdown situation, but our experience is that it hasn't fixed all of them.

The fundamental problem is that systemd can leave descendant processes running even when it has nominally terminated a systemd service, such as Apache, cron, or Exim. These lingering processes are not killed (or attempted to be killed) until very late and can cause a variety of problems during NFS unmounts, turning off swap, or various other portions of system shutdown. To deal with this, we use the big hammer of doing it ourselves; during shutdown, we run a script to kill lingering processes from various service units.

The script has a list of systemd services. For each service, it first looks in the systemd cgroup hierarchy to see if there are still processes associated with the service, by counting how many lines there are in /sys/fs/cgroup/systemd/system.slice/<what>.service/tasks. If there are processes still associated with the service, it kills them with SIGTERM and then SIGKILL (if necessary), using systemd itself to do the work with:

systemctl --kill-who=all --signal=SIG... kill <what>.service

(The actual implementation is slightly more complicated.)
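
A stripped down sketch of the core loop, without any of the error handling and reporting, looks something like this:

#!/bin/sh
# kill anything still lingering in these services' cgroups
for svc in cron apache2 exim4 atd slurmd; do
    tasks="/sys/fs/cgroup/systemd/system.slice/$svc.service/tasks"
    [ -r "$tasks" ] || continue
    if [ "$(wc -l <"$tasks")" -gt 0 ]; then
        systemctl --kill-who=all --signal=SIGTERM kill "$svc.service"
        sleep 2
        # anything that survived SIGTERM gets SIGKILL
        if [ "$(wc -l <"$tasks")" -gt 0 ]; then
            systemctl --kill-who=all --signal=SIGKILL kill "$svc.service"
        fi
    fi
done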

The script has a bunch of logging to report on whether it had to do anything, what it did, and what the process tree looked like before and after it did various killing (as reported through systemd-cgls, because that will show us what systemd units the stray processes are under).

All of this is driven by a systemd .service unit with the following relevant bits:

[Unit]
After=remote-fs.target
Before=cron.service apache2.service exim4.service atd.service slurmd.service

[Service]
Type=oneshot
RemainAfterExit=True
ExecStop=/path/to/script

We set After so that our stop action is run before NFS unmounting starts, and Before so that the stop action happens after those listed services are shut down. Not all of those services exist and are enabled on all machines, but listing a Before service that isn't enabled is harmless. The Before list is basically 'what has caused us problems'; we add things to it as we run into problem services.

(Slurmd is a recent addition, for example.)

Right now the list of 'before' services is duplicated between the script and the systemd unit. It feels tempting to try to eliminate that, but on the other hand I'm not sure I want to be introspecting systemd too much during shutdown. We could also try to be more general by sniffing around the cgroup hierarchy to find stray processes from any unit we don't whitelist (or at least any unit that's theoretically been shut down). However, that might not be very useful on modern systems, where 'KillMode=control-group' is the default.

The good news is that the script's logging suggests that it usually doesn't need to do anything during system shutdown on our 18.04 machines. But usually isn't always, which is what prompted the addition of slurmd.service.

Sidebar: A potential alternate approach

Basically this is making these units behave as if they were set to 'KillMode=control-group' during shutdown. You can change systemd unit properties on the fly and only for the current system boot (with 'systemctl --runtime set-property', which we use for our per-user CPU and memory limits), so perhaps it would work to switch to this KillMode on the relevant service units early in the shutdown process.

This option didn't even occur to me until I wrote this entry, and in general it seems more uncertain and chancy than just killing things (even if we're killing things indirectly through systemd). But it'd give you a much smaller and simpler script.

SystemdUbuntuRebootWorkaround written at 00:44:54

2019-09-24

How we implement per-user CPU and memory resource limits on Ubuntu

A while back I wrote about imposing temporary CPU and memory limits on a user, using cgroups and systemd's features to fiddle around with them. Since then we have wound up with a number of shared general use machines where we've decided it's wiser to impose limits on everyone all of the time, so that one person can't blow up a general use server through either excessive CPU usage or excessive memory usage. We've done this on Ubuntu 16.04 and now 18.04, and with some limitations it works well to keep our systems from having too many problems.

We've found that at least on 18.04, it's impossible to implement this without running a script at user login time (or more generally when a session is established). We run our script through pam_exec in Ubuntu's /etc/pam.d/common-session:

session optional pam_exec.so type=open_session /path/to/script.sh

The 'optional' bit here is really important. If you leave it out and you ever have an error in your script, you will have just locked everyone out of the machine (yourself included). As they say, ask me how I know (fortunately I did this on a test virtual machine, not a live server).

(Because we've found limiting cron and at jobs to be necessary in our environment, we've also put this into /etc/pam.d/cron and /etc/pam.d/atd. This requires cron and at jobs to be in user sessions, but we were already doing that for other reasons.)

The script has to do two things. First, it has to turn on fair share scheduling if it's not already on. You have to check this on every session startup, because if all existing user sessions go away (ie there's no one logged in and so on), the whole fair share scheduling setup disappears. Because we want to limit both CPU and memory usage, we set both 'CPUAccounting=true' and 'MemoryAccounting=true' for user.slice itself, all currently existing 'user-*.slice' slices, and all currently existing 'session-*.scope' scopes. It's possible that some of this is overkill.
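
Stripped of the enumeration of existing slices and scopes, the fair share part boils down to commands like these (the specific slice and scope names are just examples):

systemctl --runtime set-property user.slice CPUAccounting=true MemoryAccounting=true
systemctl --runtime set-property user-1000.slice CPUAccounting=true MemoryAccounting=true
systemctl --runtime set-property session-42.scope CPUAccounting=true MemoryAccounting=true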

Second, we set appropriate per-user limits (based on the various bits of information about the size of the machine) by setting appropriate 'CPUQuota=...%' and 'MemoryLimit=...' values on 'user-${PAM_UID}.slice'. We also set a TasksMax. As we currently have our script implemented, it blindly overwrites any existing settings for the user's slice any time the user starts a new session, which has both advantages and drawbacks.

(All of this setting is done with 'systemctl --runtime set-property'.)
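
With made up limit values, the second step amounts to roughly:

systemctl --runtime set-property "user-${PAM_UID}.slice" CPUQuota=200% MemoryLimit=8G TasksMax=512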

We've chosen to not do any of this for sessions for system users, including root. If the script sees that ${PAM_UID} is outside our regular user UID range, it does nothing, so root's logins are unrestricted. We could have implemented this in the pam.d file itself, using pam_succeed_if, but I feel that scripts are a better place for conditional logic like this if possible.

In the future, some of this may be possible to do through systemd drop-ins for user.slice and individual user slices. However, it certainly won't be as flexible as you can be in a script, especially if you want to behave differently for different UIDs and you have enough users that you don't want to create and maintain individual files for each of them. It would be nice to be able to reliably set fair share scheduling once, though, and not have to keep re-setting it through the script.

Ubuntu1804SystemdUserLimits written at 00:20:37

2019-09-22

The increasing mess of libreadline versions in Linux distributions

GNU Readline is normally supplied and used by programs as a shared library (while it's possible to statically link it, almost no one does). Some or perhaps many of those programs are not from the distribution; instead they're your local programs or third party programs. Shared libraries have major and minor versions (and also symbol versioning, but let's ignore that for now). The minor version of a shared library can be changed without upsetting programs linked to it, but the major version can't be; different major versions of a shared library are considered to be entirely different things. If your system has libreadline.so.6 and libreadline.so.8, and you're trying to run a program that was linked against libreadline.so.7, you're out of luck.

(A new major shared library version is required by an ABI change even if the API is the same and the program's source code could be rebuilt against the new version of the shared library with no code changes.)

Unfortunately, two things are true. First, the GNU Readline people apparently do things that change the ABI on a regular basis, which causes new versions of the shared library to have new .so major versions (again, on a regular basis). Second, Linux distributions are increasingly providing an incomplete set of past libreadline shared library versions. This came up with Ubuntu 18.04 and libreadline.so.6, and recently I discovered that Fedora 30 has moved from libreadline.so.7 to libreadline.so.8 and not provided a compatibility package for version 7 (although they do provide them for readline 6 and readline 5).

(I'm assuming here that the shared library version is changing due to genuine ABI incompatibility, instead of just the GNU Readline people deciding to call their latest release 'readline 8' and everyone following along in the .so versions.)

Just as with Ubuntu, the net effect is that it's impossible to build a local binary that uses GNU Readline that works on both Fedora 29 and Fedora 30, or even that can survive your system being upgraded from Fedora 29 to Fedora 30. If you upgrade your system, you get to immediately rebuild all programs using GNU Readline. I don't think you can even install the Fedora 29 readline 7 RPM on a Fedora 30 system without blowing things up.

It's my strong opinion that this overall situation is asinine. Linux distributions are not KYSTY environments, where the system packages are only for system use; a Linux distribution can reasonably expect people to build software against system shared libraries. Right now, using GNU Readline in such software is pointing a gun at your own foot, and not using GNU Readline is annoying for people who use your software (people like those readline features, and for good reason).

(At the same time, the Linux distributions are not the only people you can blame. The GNU Readline people are presumably unlikely to do bug fixes and security updates for GNU Readline 7, because they've moved on to GNU Readline 8. Linux distributions don't want to have to take on the burden of maintaining a long tail of GNU Readline versions that are no longer supported upstream.)

As a side note, it's very easy to miss that this has happened to some of your binaries if you only run them once in a while. I generally assume that Linux binaries are quite stable and so don't run around testing and rebuilding things after Fedora upgrades; generally I don't even think about the possibility of things like missing shared libraries.
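
One quick way to find out, assuming you remember to check, is to point ldd or objdump at the binaries you care about (the path here is hypothetical):

ldd /path/to/program | grep 'not found'
objdump -p /path/to/program | grep 'NEEDED.*readline'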

PS: In their current development versions, Debian appears to have both libreadline7 and libreadline8; older versions of GNU Readline seem to be more spotty in general. The current stable Debian has libreadline7.

ReadlineDistroVersionMess written at 23:51:56

2019-09-12

The mystery of why my Fedora 30 office workstation was booting fine

The other day, I upgraded the kernel on my office workstation, much as I have any number of times before, and rebooted. Things did not go well:

So the latest Fedora 30 updates (including a kernel update) build an initramfs that refuses to bring up software RAID devices, including the one that my root filesystem is on. Things do not go well afterwards.

Then I said:

Fedora's systemd, Dracut and kernel parameters setup have now silently changed to require either rd.md.uuid for your root filesystem or rd.auto. The same kernel command line booted previous kernels with previous initramfs's.

The first part of this is wrong, and that leads to the mystery.

In Fedora 29, my kernel command line was specifying both the root filesystem device by name ('root=/dev/md20') and the software RAID arrays for the initramfs to bring up (as 'rd.md.uuid=...'). When I upgraded to Fedora 30 in mid-August, various things happened and I wound up removing both of those from the kernel command line, specifying the root filesystem device only by UUID ('root=UUID=...'). This kernel command line booted a series of Fedora 30 kernels, most recently 5.2.11 on September 4th, right up until yesterday.

However, it shouldn't have. As the dracut.cmdline manpage says, the default since Dracut 024 has been to not auto-assemble software RAID arrays in the absence of either rd.auto or rd.md.uuid. And the initramfs for older kernels (at least 5.2.11) was theoretically enforcing that; the journal for that September 4th boot contains a report of:

dracut-pre-trigger[492]: rd.md=0: removing MD RAID activation

But then a few lines later, md/raid1:md20 is activated:

kernel: md/raid1:md20: active with 2 out of 2 mirrors

(The boot log for the new kernel for a failed boot also had the dracut-pre-trigger line, but obviously no mention of the RAID being activated.)

I unpacked the initramfs for both kernels and as far as I can tell they're identical in terms of the kernel modules included and the configuration files and scripts (there are differences in some binaries, which is expected since systemd and some other things got upgraded between September 4th and now). Nor has the kernel configuration changed between the two kernels according to the config-* files in /boot.
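
For the unpacking and comparison, dracut's lsinitrd is convenient; the image names here are illustrative:

lsinitrd /boot/initramfs-5.2.11-200.fc30.x86_64.img >old.lst
lsinitrd /boot/initramfs-5.2.15-200.fc30.x86_64.img >new.lst
diff -u old.lst new.lst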

So by all evidence, the old kernel and initramfs should not auto-assemble my root filesystem's software RAID and thus shouldn't boot. But, they do. In fact they did yesterday, because when the new kernel failed to boot the first thing I did was boot with the old one. I just don't know why, and that's the mystery.

My fix for my boot issue is straightforward; I've updated my kernel command line to have the 'rd.md.uuid=...' that it should have had all along. This works fine.
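
If you need to find the UUID to feed to rd.md.uuid, mdadm will report it; /dev/md20 here is my root array, and yours will differ:

mdadm --detail /dev/md20 | grep UUID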

(My initial recovery from the boot failure was to use 'rd.auto', but I've decided that I don't want to auto-assemble anything and everything that the initramfs needs. I'll have the initramfs only assemble the bare minimum, just in case. While my swap is also on software RAID, I specifically decided to not assemble it in the initramfs; I don't really need it until later.)

Fedora30BootMystery written at 23:02:06

2019-08-16

A gotcha with Fedora 30's switch of Grub to BootLoaderSpec based configuration

I upgraded my office workstation from Fedora 29 to Fedora 30 yesterday. In the past, such upgrades have been problem free, but this time around things went fairly badly, with the first and largest problem being that after the upgrade, booting any kernel gave me a brief burst of kernel messages, then a blank screen and after a few minutes a return to the BIOS and Grub main menu. To get my desktop to boot at all, I had to add 'nomodeset' to the kernel command line; among other consequences, this made my desktop a single display machine instead of a dual display one.

(It was remarkably disorienting to have my screen mirrored across both displays. I kept trying to change to the 'other' display and having things not work.)

The short version of the root cause is that my grub.cfg was rebuilt using outdated kernel command line arguments that came from /etc/default/grub, instead of the current command line arguments that had previously been used in my original grub.cfg. Because of how the Fedora 30 grub.cfg is implemented, these wrong command line arguments were then remarkably sticky and it wasn't clear how to change them.

In Fedora 29 and earlier, your grub.cfg is probably being maintained through grubby, Fedora's program for this. When grubby adds a menu entry for a new kernel, it more or less copies the kernel command line arguments from your current one. While there is a GRUB_CMDLINE_LINUX setting in /etc/default/grub, its contents are ignored until and unless you rebuild your grub.cfg from scratch, and there's nothing that tries to update it from what your current kernels in your current grub.cfg are actually using. This means that your /etc/default/grub version can wind up being very different from what you're currently using and actually need to make your kernels work.

One of the things that usually happens by default when you upgrade to Fedora 30 is that Fedora switches how grub.cfg is created and updated from the old way of doing it itself via grubby to using a Boot Loader Specification (BLS) based scheme; you can read about this switch in the Fedora wiki. This switch regenerates your grub.cfg using a shell script called (in Fedora) grub2-switch-to-blscfg, and this shell script of course uses /etc/default/grub's GRUB_CMDLINE_LINUX as the source of the kernel arguments.

(This is controlled by whether GRUB_ENABLE_BLSCFG is set to true or false in your /etc/default/grub. If it's not set at all, grub2-switch-to-blscfg adds a 'GRUB_ENABLE_BLSCFG=true' setting to /etc/default/grub for you, and of course goes on to regenerate your grub.cfg. grub2-switch-to-blscfg itself is run from the Fedora 30 grub2-tools RPM posttrans scriptlet if GRUB_ENABLE_BLSCFG is not already set to something in your /etc/default/grub.)

A regenerated grub.cfg has a default_kernelopts setting, and that looks like it should be what you want to change. However, it is not. The real kernel command line for normal BLS entries is actually in the Grub2 $kernelopts environment variable, which is loaded from the grubenv file, normally /boot/grub2/grubenv (which may be a symlink to /boot/efi/EFI/fedora/grubenv, even if you're not actually using EFI boot). The best way to change this is to use 'grub2-editenv - list' and 'grub2-editenv - set kernelopts="..."'. I assume that default_kernelopts is magically used by the blscfg Grub2 module if $kernelopts is unset, and possibly gets written back to grubenv by Grub2 in that case.
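
In concrete command form that's the following, with the actual kernel arguments elided since yours will differ:

grub2-editenv - list
grub2-editenv - set kernelopts="root=UUID=... ro rd.md.uuid=..."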

(You can check that your kernels are using $kernelopts by inspecting an entry in /boot/loader/entries and seeing that it has 'options $kernelopts' instead of anything else. You can manually change that for a specific entry if you want to.)

This is going to make it more interesting (by which I mean annoying) if and when I need to change my standard kernel options. I think I'm going to have to change all of /etc/default/grub, the kernelopts in grubenv, and the default_kernelopts in grub.cfg, just to be sure. If I was happy with the auto-generated grub.cfg, I could just change /etc/default/grub and force a regeneration, but I'm not and I have not yet worked out how to make its handling of the video modes and the menus agree with what I want (which is a basic text experience).

(While I was initially tempted to leave my system as a non-BLS system, I changed my mind because of long term issues. Fedora will probably drop support for grubby based setups sooner or later, so I might as well get on the BLS train now.)

To give credit where it's due, one (lucky) reason that I was able to eventually work out all of this is that I'd already heard about problems with the BLS transition in Fedora 30 in things like Fedora 30: When grub2-mkconfig Doesn’t Work, and My experiences upgrading to Fedora 30. Without that initial awareness of the existence of the BLS transition in Fedora 30 (and the problems it caused people), I might have been flailing around for even longer than I was.

PS: As a result of all of this, I've discovered that you no longer need to specify the root device in the kernel command line arguments. I assume the necessary information for that is in the dracut-built initramfs. As far as the blank screen and kernel panics go, I suspect that the cause is either or both of 'amdgpu.dpm=0' and 'logo.nologo', which were still present in the /etc/default/grub arguments but which I'd long since removed from my actual kernel command lines.

(I could conduct more experiments to try to find out which kernel argument is the fatal one, but my interest in more reboots is rather low.)

Update, August 21st: I needed to reboot my machine to apply a Fedora kernel update, so I did some experiments and the fatal kernel command line argument is amdgpu.dpm=0, which I needed when the machine was new but had turned off since then.

Fedora30GrubBLSGotcha written at 20:58:09

Systemd and waiting until network interfaces or addresses are configured

One of the things that systemd is very down on is the idea of running services after 'the network is up', whatever that means; the systemd people have an entire web page on the subject. This is all well and good in theory, but in practice there are plenty of situations where I need to only start certain things after either a named network interface is present or an IP address exists. For a concrete example, you can't set up various pieces of policy based routing for an interface until the interface actually exists. If you're configuring this on boot in a systemd based system (especially one using networkd), you need some way to ensure the ordering. Similarly, sometimes you need to listen only on some specific IP addresses and the software you're using doesn't have Linux specific hacks to do that when the IP address doesn't exist yet.

(As a grumpy sysadmin, I actually don't like the behavior of binding to an IP address that doesn't exist, because it means that daemons will start and run even if the system will never have the IP address. I would much rather delay daemon startup until the IP address exists.)

Systemd does not have direct native support for any of this, of course. There's no way to directly say that you depend on an interface or an IP address, and in general the dependency structure has long been under-documented. The closest you can get to waiting until a named network interface exists is to specify an After= and perhaps a Wants= or a Requires= on the pseudo-unit for the network interface, 'sys-subsystem-net-devices-<iface>.device'. However, as I found out, the lack of a .device unit doesn't always mean that the interface doesn't exist.
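
In unit file terms that looks something like this, with eno1 standing in for whatever interface you care about:

[Unit]
Wants=sys-subsystem-net-devices-eno1.device
After=sys-subsystem-net-devices-eno1.device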

You might think that in order to wait for an IP address to exist, you could specify an After= on the .device unit for the network interface that the address is configured on. However, this has historically had issues for me; under at least some versions of systemd, the .device unit would be created before the IP address was configured. In my particular situation, what worked at the time was to wait for a VLAN interface .device that was on top of the real interface that had the IP address (and yes, I mix tagged VLANs with an untagged network). By the time the VLAN .device existed, the IP address had relatively reliably been set up.

If you're using systemd-networkd and care about network interfaces, the easiest approach is probably to rely on systemd-networkd-wait-online.service; how it works and what it waits for is probably about as good as you can get. For IP addresses, as far as I know there's no native thing that specifically waits until some or all of your static IP addresses are present. Waiting for systemd-networkd-wait-online is probably going to be good enough for most circumstances, but if I needed better I would probably write a shell script (and a .service unit for it) that simply waited until the IP addresses I needed were present.
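
A minimal sketch of such a script, with a documentation IP address standing in for a real one, might be:

#!/bin/sh
# wait up to 60 seconds for a specific IP address to be configured
addr="192.0.2.10"
i=0
while [ "$i" -lt 60 ]; do
    if ip -o addr show | grep -q " $addr/"; then
        exit 0
    fi
    sleep 1
    i=$((i + 1))
done
echo "timed out waiting for $addr" >&2
exit 1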

(I continue to think that it's a real pity that you can't configure networkd .network files to have 'network up' and 'network down' scripts, especially since their stuff for routing and policy based routing is really very verbose.)

PS: One of the unfortunate effects of the under-documented dependency structure and the lack of clarity of what to wait on is a certain amount of what I will call 'superstitious dependencies', things that you've put into your systemd units without fully understanding whether or not you needed them, and why (often also without fully documenting them). This is fine most of the time, but then one day an unnecessary dependency fails to start or perhaps exist and then you're unhappy. That's part of why I would like explicit and reliable ways to do all of this.

SystemdNetworkThereIssue written at 00:26:26

2019-08-12

Linux can run out of memory without triggering the Out-Of-Memory killer

If you have a machine with strict overcommit turned on, your memory allocation requests will start to fail once enough virtual address space has been committed, because that's what you told the kernel to do. Hitting your strict overcommit limit doesn't trigger the Out-Of-Memory killer, because the two care about different things; strict memory overcommit cares about committed address space, while the global OOM killer cares about physical RAM. Hitting the commit limit may kill programs anyway, because many programs die if their allocations fail. Also, in the right circumstances, you can trigger the OOM killer on a machine set to strict overcommit.
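
For reference, strict overcommit is vm.overcommit_memory mode 2, with the commit limit then set by vm.overcommit_ratio (or vm.overcommit_kbytes); the ratio here is just an example:

sysctl -w vm.overcommit_memory=2
sysctl -w vm.overcommit_ratio=80

The commit limit is then roughly your swap space plus that percentage of your RAM.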

Until recently, if you had asked me about how Linux behaved in the default 'heuristic overcommit' mode, I would have told you that ordinary memory allocations would never fail in it; instead, if you ran out of memory (really RAM), the OOM killer would trigger. We've recently found out that this is not the case, at least in the Ubuntu 18.04 LTS '4.15.0' kernel. Under (un)suitable loads, various of our systems can run out of memory without triggering the OOM killer and persist in this state for some time. When it happens, the symptoms are basically the same as what happens under strict overcommit; all sorts of things can't fork, can't map shared libraries, and so on. Sometimes the OOM killer is eventually invoked, other times the situation resolves itself, and every so often we have to reboot a machine to recover it.

I would like to be able to tell you why and how this happens, but I can't. Based on the kernel code involved, the memory allocations aren't being refused because of heuristic overcommit, which still has its very liberal limits on how much memory you can ask for (see __vm_enough_memory in mm/util.c). Instead something else is causing forks, mmap()s of shared libraries, and so on to fail with 'out of memory' errno values, and whatever that something is it doesn't trigger the OOM killer during the failure and doesn't cause the kernel to log any other messages, such as the ones you can see for page allocation failures.

(Well, the messages you see for certain page allocations. Page allocations can be flagged as __GFP_NOWARN, which suppresses these.)

PS: Unlike the first time we saw this, the recent cases have committed address space rising along with active anonymous pages, and the kernel's available memory dropping in sync and hitting zero at about the time we see failures start.

NoMemoryButNoOOM written at 22:23:33
