Wandering Thoughts

2019-06-23

What it takes to run a 32-bit x86 program on a 64-bit x86 Linux system

Suppose that you have a modern 64-bit x86 Linux system (often called an x86_64 environment) and that you want to run an old 32-bit x86 program on it (a plain x86 program). What does this require from the overall system, both the kernel and the rest of the environment?

(I am restricting this to ELF programs, not very old a.out ones.)

At a minimum, this requires that the (64-bit) kernel support programs running in 32-bit mode ('IA32') and making 32-bit kernel calls. Supporting this is a configuration option in the kernel (or actually a whole collection of them, but they mostly depend on one called IA32_EMULATION). Supporting 32-bit calls on a 64-bit kernel is not entirely easy because many kernel calls involve structures; those structures must be translated back and forth between the kernel's native 64-bit version and the emulated 32-bit version. This can raise questions of how to handle native values that exceed what can fit in the fields of the 32-bit structures. The kernel also has a barrel full of fun in the form of ioctl(), which smuggles a lot of structures in and out of the kernel in relatively opaque ways. A 64-bit kernel does want to support at least some 32-bit ioctls, such as the ones that deal with (pseudo-)terminals.
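
As a quick check, you can see whether your running kernel was built with this support by looking at its configuration (a sketch; the config file location can vary by distribution):

$ grep CONFIG_IA32_EMULATION /boot/config-$(uname -r)
CONFIG_IA32_EMULATION=y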

(I suspect that there are people in the Linux kernel community who hope that all of this emulation and compatibility code can someday be removed.)

A modern kernel dealing with modern 32-bit programs also needs to provide a 32-bit vDSO, and the necessary information to let the program find it. This requires the kernel to carry around a 32-bit ELF image, which has to be generated somehow (at some point). The vDSO is mapped into the memory space of even statically compiled 32-bit programs, although they may or may not use it.

(In ldd output on dynamically linked 32-bit programs, I believe this often shows up as a magic 'linux-gate.so.1'.)

This is enough for statically compiled programs, but of course very few programs are statically compiled. Instead, almost all 32-bit programs that you're likely to encounter are dynamically linked and so require a collection of additional compiled things. Running a dynamically linked program requires at least a 32-bit version of its dynamic linker (the 'ELF interpreter'), which is usually 'ld-linux.so.2'. Generally the 32-bit program will then go on to require additional 32-bit shared libraries, starting with the 32-bit C library ('libc.so.6' and 'libdl.so.2' for glibc) and expanding from there. The basic shared libraries usually come from glibc, but you can easily need additional ones from other packages for things like curses or the collection of X11 shared libraries. C++ programs will need libstdc++, which comes from GCC instead of glibc.

(The basic dynamic linker, ld-linux.so.2, is also from glibc.)
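
To illustrate what all of this looks like in practice, here is roughly what ldd reports for a dynamically linked 32-bit program (a made-up example; exact paths, addresses, and output format will vary by distribution and glibc version):

$ ldd ./some-32bit-program
	linux-gate.so.1 (0xf7f3d000)
	libc.so.6 => /lib/libc.so.6 (0xf7d4e000)
	/lib/ld-linux.so.2 (0xf7f3f000)

The linux-gate.so.1 entry is the 32-bit vDSO, libc.so.6 is the 32-bit C library, and ld-linux.so.2 is the 32-bit dynamic linker that actually gets loaded to start the program.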

In order to do things like hostname lookups correctly, a 32-bit program will also need 32-bit versions of any NSS modules that are used in your /etc/nsswitch.conf, since all of these are shared libraries that are loaded dynamically by glibc. Some of these modules come from glibc itself, but others are provided by a variety of additional software packages. I'm not certain what happens to your program's name lookups if a relevant NSS module is not available, but at a minimum you won't be able to correctly resolve names that come from that module.

(You may not get any result for the name, or you might get an incorrect or incomplete result if another configured NSS module also has an answer for you. Multiple NSS modules are common for things like hostname resolution.)
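
One way to see what your configuration calls for and whether 32-bit NSS modules are actually present is something like this (a sketch that assumes a Fedora-style layout, where 32-bit libraries live in /usr/lib and 64-bit ones in /usr/lib64; Debian and Ubuntu use multiarch directories instead):

$ grep '^hosts:' /etc/nsswitch.conf
hosts:      files dns myhostname
$ ls /usr/lib/libnss_*.so.2
$ ls /usr/lib64/libnss_*.so.2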

I believe that generally all of these 32-bit shared libraries will have to be built with a 32-bit compiler toolchain in an environment that itself looks and behaves as 32-bit as possible. Building 32-bit binaries from a 64-bit environment is theoretically possible and nominally simple, but in practice there have been problems, and on top of that many build systems don't support this sort of cross-building.

(Of course, many people and distributions already have a collection of 32-bit shared libraries that have been built. But if they need to be rebuilt or updated for some reason, this might be an issue. And of course the relevant shared library needs to (still) support being built as 32-bit instead of 64-bit, as does the compiler toolchain.)
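
For the simple cases, the compiler side of this is just the -m32 flag, assuming a 32-bit capable GCC and the 32-bit C library development files are installed:

$ gcc -m32 -o hello32 hello.c
$ file hello32
hello32: ELF 32-bit LSB executable, Intel 80386, ..., interpreter /lib/ld-linux.so.2, ...

The problems come with larger software whose build system assumes that it is building for the same sort of system that it is running on.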

32BitProgramOn64BitSystem written at 22:19:30; Add Comment

2019-06-12

My weird problem with the Fedora 29 version of Firefox 67

On Twitter, I said:

So the Fedora version of Firefox 67 (or perhaps all versions of Firefox 67) have a little issue where starting Firefox with a URL, as 'firefox SOME-URL', will sometimes start Firefox without loading the web page properly. This is very irritating for me, so back to Firefox 66.

While the effect here is reproducible from the command line (well, for me), it comes up for me because I have a bunch of tools, dmenu setups and window manager automation that ends up doing the equivalent of this. One case is transferring a URL to my Javascript-enabled Firefox profile; the JS-enabled Firefox is not usually running, so the transfer process runs into this some of the time. As you can imagine, trying to open a URL and getting a blank page or some other failure is kind of annoying. I want things to work the first time around, not have to be repeated (occasionally more than once).

I also did some testing of the Fedora Firefox 67 with a completely new $HOME (by doing 'export HOME=/tmp/fox-scratch; mkdir $HOME') and I got some really weird results by starting it with a URL, quitting, and repeating the process. For a while at home, I could get the Fedora version of Firefox 67 to report that Environment Canada's website had an invalid TLS certificate because Entrust was an unknown CA.

Initially I wasn't sure if this was Firefox 67 in general or not, but yesterday I got sufficiently interested and irritated to fetch the official Mozilla build of Firefox 67 and try it instead ('installed', ie unpacked, into a non-system location). This official version of 67.0.2 works completely fine for me both at home and at work, with all of my normal profiles and extensions, so the problem seems to be in some way specific to Fedora's build of Firefox 67 (I saw it with both 67.0-2 and 67.0-4 on Fedora 29). On the other hand, the Fedora 29 Firefox 67 seems to work fine in my Cinnamon session on my laptop.

(I haven't tried the latest unreleased version of Fedora's Firefox that's available through the Bodhi page for Firefox or via the steps in Fetching really new Fedora packages with Bodhi. As I write this it's only 19 hours old, so I'll let the dust settle on those packages a bit.)

PS: I've not yet upgraded to Fedora 30 for various reasons, but certainly one of them is this bug. I suspect it may be July before I make the leap.

PPS: For interested parties, the bug I filed with Fedora is Fedora Bugzilla #1713924.

FedoraFirefox67Problem written at 22:18:46; Add Comment

An interesting Fedora 29 DNF update loop with the createrepo package

For a while, my Fedora 29 home and work machines have been producing a very peculiar complaint during 'dnf update':

# dnf update
Last metadata expiration check: 0:09:16 ago [...]
Dependencies resolved.

 Problem: cannot install both createrepo_c-0.11.1-1.fc29.x86_64 and createrepo_c-0.13.2-2.fc29.x86_64
  - cannot install the best update candidate for package createrepo_c-0.13.2-2.fc29.x86_64
  - cannot install the best update candidate for package createrepo-0.10.3-15.fc28.noarch
========= [....]
 Package        Architecture  Version        Repository  Size
Skipping packages with conflicts:
(add '--best --allowerasing' to command line to force their upgrade):
 createrepo_c   x86_64        0.11.1-1.fc29  fedora      59 k

(Using '--best --allowerasing' did not in fact force the upgrade.)

For a while I've been ignoring this or taking very light stabs at trying to fix it, but tonight I got irritated enough to finally do a full investigation. To start with, the initial situation is that I have both createrepo 0.10.3-15 and createrepo_c 0.13.2-2 installed. DNF is trying to upgrade createrepo to createrepo_c 0.11.1-1, but this is naturally conflicting with the more recent version of createrepo_c that I already have installed.

I don't know quite how I got into the underlying situation, but I believe that interested parties can reproduce it on a current Fedora 29 system (and possibly on a Fedora 30 one as well) by first installing mock, which pulls in createrepo_c 0.13.2-2, and then the older and now apparently obsolete mach, which will pull in createrepo. At this point, a 'dnf update' will likely produce what you see here. To get out of the situation, you must DNF remove mach and createrepo (conveniently, 'dnf remove createrepo' will ripple through to remove mach as well).
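
In command form, the sequence I believe gets you into (and out of) this state is roughly the following (treat it as a sketch; I haven't re-run it from scratch on a clean system):

# dnf install mock        # pulls in the current createrepo_c (0.13.2-2)
# dnf install mach        # requires 'createrepo', so pulls in createrepo 0.10.3-15
# dnf update              # now reports the conflict shown above
# dnf remove createrepo   # the way out; this removes mach as well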

To start understanding the situation, let's do an additional DNF command:

# dnf provides createrepo
createrepo-0.10.3-15.fc28.noarch : Creates a common metadata repository
[...]
Provide    : createrepo = 0.10.3-15.fc28

createrepo_c-0.11.1-1.fc29.x86_64 : Creates a common metadata repository
[...]
Provide    : createrepo = 0.11.1-1.fc29

In the beginning, there was createrepo, written in Python, and it was used by various programs and packages that wanted to create local RPM repositories, including both Mach and Mock. As a result of this, the Fedora packages for various things explicitly required 'createrepo'. Eventually the RPM people decided that they needed a version of createrepo written in C, so they created createrepo_c. In Fedora 29, Fedora appears to have switched which createrepo implementation they used to the C version. Likely to ease the transition, they made the initial version or versions of their createrepo_c RPM also pretend that it was createrepo, by explicitly doing an RPM provides of that name. This made createrepo_c 0.11.1-1 both a substitute for the createrepo RPM and an upgrade candidate for it, since it has a more recent version (this is the surprise of 'Provides' in RPM).

(The RPM changelog says this was introduced in 0.10.0-20, for which the only change is 'Obsolete and provide createrepo'.)
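
You can inspect both the changelog and the provides directly with rpm, for example:

$ rpm -q --provides createrepo_c | grep createrepo
$ rpm -q --changelog createrepo_c | less

(This only works for the version you have installed; for other versions you would have to query the repositories instead.)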

Over time, most RPMs were updated to require createrepo_c instead of createrepo, including the mock RPM. However, the mach RPM was not updated, probably because Mach itself is neglected and likely considered obsolete or abandoned. Then at some point the Fedora people stopped having their createrepo_c RPM fill in for createrepo this way. Based on the RPM changelog for createrepo_c, this happened in 0.13.2-1, which includes a cryptic changelog line of:

  • Do not obsolete createrepo on Fedora < 31

Presumably the Fedora people have their reasons, and if I wanted to trawl the Fedora Bugzilla I might even find them. However, the effect of this change is that older createrepo_c RPMs in Fedora 29 are updates for createrepo but newer ones aren't.

So, if you 'dnf install mock', you will get mock and the current version of createrepo_c, which doesn't provide createrepo. If you then 'dnf install mach', it requires createrepo and the best version it can actually install is the actual createrepo 0.10.3-15 RPM that was built on Fedora 28. However, once that is installed, DNF will see the 0.11.1-1 version of createrepo_c from the Fedora 29 release package set as an update candidate for it, but that can't be installed because you already have a more recent version of createrepo_c.

(I suspect that if you install mach first and mock second, you will get only createrepo_c but will be unable to upgrade it past 0.11.1-1 without erasing mach.)

FedoraCreaterepoUpdateLoop written at 00:51:43; Add Comment

2019-05-28

Distribution packaging of software needs to be informed (and useful)

In my recent entry on a RPM name clash between grafana.com's and Fedora's packaging of Grafana, Josef "Jeff" Sipek wrote in a comment:

I think that's a bit backwards. I always consider the distro as the owner of the package namespace. So, it's really up to grafana.com to use a non-colliding name.

The distro provides a self-contained ecosystem, if a 3rd party package wants to work with it, then it should be up to the 3rd party to do all the work. There's no reasonable way for Fedora (or any other distro) to know about every other possible package on the web that may get installed.

[...]

I fundamentally disagree with this view as it plays out in the case of the 'grafana' RPM package name (where grafana.com packaged it long before Fedora did). At the immediate level, when the upstream already distributes a package for the thing (either as a standalone package, the case here, or through an addon repository), it is up to the distribution, as the second person on the scene, to make sure that they work with the existing situation unless it is completely untenable to do so.

(On a purely practical level, saying 'the distribution can take over your existing package name any time it feels like and cause arbitrary problems for current users' is a great way to get upstreams to never make their own packages for a distribution and to only ever release tarballs. I feel strongly that this would be a loss; tarballs are strongly inferior to proper packages for various reasons.)

More broadly, when a distribution creates a package for something, they absolutely should inform themselves about how the thing is currently packaged, distributed, installed, and used on their distribution, if it is, and how it will be used and evolve in the future. Fundamentally, a distribution should be creating useful packages of programs and being informed is part of that. Blindly grabbing a release of something and packaging it as an official distribution package is not necessarily creating a useful thing for anyone, either current users or potential future users who might install the distribution's package. Doing a good, useful package fundamentally requires understanding things like how the upstream distributes things, what their release schedule is like, how they support old releases (if they do), and so on. It cannot be done blindly, even in cases where the upstream is not already providing its own packages.

(For example, if you package and freeze a version of something that will have that version abandoned immediately by the upstream and not have fixes, security updates and so on backported by you, you are not creating a useful package; instead, you're creating a dangerous one. In some cases this means that you cannot create a distribution package that is both in compliance with distribution packaging policies and useful to your users; in that case, you should not package it at all. If users keep asking, set up a web page for 'why we cannot provide a package for this'.)

PS: Some of this is moot if the upstream does not distribute their own pre-built binaries, but even then you really want to know the upstream's release schedule, length of version support, degree of version to version change, and so on. If the upstream believes in routine significant change, no support of old versions, and frequent releases, you probably do not want to touch that minefield. In the modern world, it is an unfortunate fact of life that not every upstream project is suitable for being packaged by distributions, even if this leaves your users to deal with the problem themselves. It's better to be honest about the upstream project being incompatible with what your users expect from your packages.

PackagingMustBeInformed written at 23:57:21; Add Comment

2019-05-27

Something that Linux distributions should not do when packaging things

Right now I am a bit unhappy at Fedora for a specific packaging situation, so let me tell you a little story of what I, as a system administrator, would really like distributions to not do.

For reasons beyond the scope of this blog entry, I run a Prometheus and Grafana setup on both my home and office Fedora Linux machines (among other things, it gives me a place to test out various things involving them). When I set this up, I used the official upstream versions of both, because I needed to match what we are running (or would soon be). The Grafana people supply Grafana in a variety of package formats, and because Grafana has a bunch of files and paths I opted to use their RPM package instead of their tarball. The Grafana people give their RPM package the package name of 'grafana', which is perfectly reasonable of them.

(We use the .deb on our Ubuntu 18.04 based production server for the same reason. Life is too short to spend patiently setting tons of command line switches or configuration file paths to tell something where to find all of its bits if the people provide a nice pre-packaged artifact.)

Recently, Fedora decided to package Grafana themselves (as a RPM), and they called this RPM package 'grafana'. Since the two different packages are different versions of the same thing as far as package management tools are concerned, Fedora basically took over the 'grafana' package name from Grafana. This caused my systems to offer to upgrade me from the Grafana.com 'grafana-6.1.5-1' package to the Fedora 'grafana-6.1.6-1.fc29' one, which I actually did after taking reasonable steps to make sure that the Fedora version of 6.1.6 was compatible with the file layouts and so on from the Grafana version of 6.1.5.

So far, I have no objection to what Fedora did. They provided the latest released version of Grafana, and their new package was a drop-in replacement for the upstream Grafana RPM. The problem is what happened next, which is that the Grafana people released Grafana 6.2 on May 22nd (cf) and currently there is no sign of any update to the Fedora package (the Bodhi page for grafana has no activity since 6.1.6, for example). At this point it is unclear to me if Fedora has any plans to update from 6.1.6 at all; perhaps they have decided to freeze on this initial version.

Why is this a problem? It's simple. If you're going to take over a package name from the upstream, you should keep up with the upstream releases. If you take over a package name and don't keep up to date or keep up to date only sporadically, you cause all sorts of heartburn for system administrators who use the package. The least annoying future of this situation is that Fedora has abandoned Grafana at 6.1.6 and I am going to 'upgrade' it with the upstream 6.2.1, which will hopefully be a transparent replacement and not blow up in my face. The most annoying future is that Fedora and Grafana keep ping-ponging versions back and forth, which will make 'dnf upgrade' into a minefield (because it will frequently try to give me a 'grafana' upgrade that I don't want and that would be dangerous to accept). And of course this situation turns Fedora version upgrades into their own minefield, since now I risk an upgrade to Fedora 30 actually reverting the 'grafana' package version on me.

You can hardly miss that Grafana.com already supplies a 'grafana' RPM; it's right there on their download page. In this situation I feel that the correct thing for a Linux distribution to do is to pick another package name, one that doesn't clash with the upstream's established packaging. If you can't stand doing this, don't package the software at all.

(Fedora's packaging of Prometheus itself is fairly amusing in a terrible way, since they only provide the extremely obsolete 1.8.0 release (which is no longer supported upstream or really by anyone). Prometheus 2.x is a major improvement that everyone should be using, and 2.0.0 was released way back in November of 2017, more than a year and a half ago. At this point, Fedora should just remove their Prometheus packages from the next version of Fedora.)

PackageNameClashProblem written at 21:57:51; Add Comment

2019-05-15

An infrequent odd kernel panic on our Ubuntu 18.04 fileservers

I have in the past talked about our shiny new Prometheus based metrics system and some interesting things we've seen due to its metrics, especially its per-host system metrics (collected through node_exporter, its host agent). What I haven't mentioned is that we're not running the host agent on one important group of our machines, namely our new Linux fileservers. This isn't because we don't care about metrics from those machines. It's because when we do run the host agent, we get very infrequent but repeating kernel panics, or I should say what seems to be a single panic.

The panic we see is this:

BUG: unable to handle kernel NULL pointer dereference at 000000000000000c
IP: __atime_needs_update+0x5/0x190
[...]
CPU: 7 PID: 10553 Comm: node_exporter Tainted: P  O  4.15.0-30-generic #32-Ubuntu
RIP: 0010:__atime_needs_update+0x5/0x190
[...]
Call Trace:
 ? link_path_walk+0x3e4/0x5a0
 ? path_init+0x177/0x2f0
 path_openat+0xe4/0x1770
[... sometimes bogus frames here ...]
 do_filp_open+0x9b/0x110
 ? __check_object_size+0xaf/0x1b0
 do_sys_open+0x1bb/0x2c0
 ? do_sys_open+0x1bb/0x2c0
 ? _cond_resched+0x19/0x40
 SyS_openat+0x14/0x20
 do_syscall_64+0x73/0x130
 entry_SYSCALL_64_after_hwframe+0x3d/0xa2

The start and the end of this call trace is consistent between panics; the middle sometimes has various peculiar and apparently bogus frames.

This panic occurs only on our ZFS fileservers, which are a small minority of the servers where we have the host agent running, and generally only after server-months of operation (including intensive pounding of the host agent on test servers). The three obvious things that are different about our ZFS fileservers are that they are our only machines with this particular set of SuperMicro hardware, they are the only machines with ZFS, and they are our only 18.04 NFS servers. However, this panic has happened on a test server with no ZFS pools and no NFS exports.

If I believe the consistent portions of the call trace, this panic happens while following a symlink during an openat() system call. I have strace'd the Prometheus host agent and there turn out to not be very many such things it opens; my notes say /proc/mounts, /proc/net, some things under /sys/class/hwmon, and some things under /sys/devices/system/cpu/cpu*/cpufreq. Of these, the /proc entries are looked at on all machines and seem unlikely suspects, while the hwmon stuff is definitely suspect. In fact we have another machine where trying to look at those entries produces constant kernel reports about ACPI problems:

ACPI Error: SMBus/IPMI/GenericSerialBus write requires Buffer of length 66, found length 32 (20170831/exfield-427)
ACPI Error: Method parse/execution failed \_SB.PMI0._PMM, AE_AML_BUFFER_LIMIT (20170831/psparse-550)
ACPI Exception: AE_AML_BUFFER_LIMIT, Evaluating _PMM (20170831/power_meter-338)

(ACPI is an area I suspect because it's part of the BIOS and so varies from system to system.)

However, it doesn't seem to be the hwmon stuff alone. You can tell the Prometheus host agent to not try to look at it (with a command line argument), and while running the host agent in this mode, we have had a crash on one of our test fileservers. Based on /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver being 'acpi-cpufreq', I suspect that ACPI is involved in this area as well on these machines.

(There is some documentation on kernel CPU frequency stuff in user-guide.txt, in the cpu-freq kernel documentation directory.)

Even if motherboard-specific ACPI stuff is what triggers this panic, the panic itself is worryingly mysterious. The actual panic is clearly a dereference of a NULL pointer, as __atime_needs_update attempts to refer to a struct field (cf). Based on this happening very early on in the function and the code involved, this is probably the path argument, since this is used almost immediately. However, I can't entirely follow how we get there, especially with a NULL path. Some of the context of the call is relatively clear; the call path probably runs from part of link_path_walk through an inlined call to get_link to a mysteriously not listed touch_atime and then to __atime_needs_update.

(I would be more confident of this if I knew how to use gdb to disassemble bits of the Ubuntu kernel to verify and map back the reported raw byte positions in these functions.)
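
(For what it's worth, I believe the way to do this is to install Ubuntu's debug symbol ('dbgsym') package for the exact kernel version and then point gdb at the unstripped kernel image, something like:

$ gdb /usr/lib/debug/boot/vmlinux-4.15.0-30-generic
(gdb) disassemble __atime_needs_update
(gdb) list *(__atime_needs_update+0x5)

The second gdb command should map a 'function+offset' location from a panic back to a source line. I haven't actually done this on these kernels, so consider it a sketch.)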

I admit that this is the kind of situation that makes me yearn for crash dumps. Having a kernel crash dump to poke around in might well give us a better understanding of what's going on, possibly including a better call trace. Unfortunately even if it's theoretically possible to get kernel crash dumps out of Linux with the right setup, it's not standard for installers to actually set that up or offer it as a choice so as a practical matter it's mostly not there.

PS: We haven't tried upgrading the kernel version we're using on the fileservers because stable fileservers are more important to us than host metrics, and we know they're stable on this specific kernel because that's what we did extensive testing on. We might consider upgrading if we could find a specific bug fix for this, but so far I haven't spotted any smoking guns.

Ubuntu1804OddKernelPanic written at 02:00:57; Add Comment

2019-05-13

Fixing Alpine to work over NFS on Ubuntu 18.04 (and probably other modern Linuxes)

Last September we discovered that the Ubuntu 18.04 LTS version of Alpine was badly broken when used over NFS, and eventually traced this to a general issue with NFS in Ubuntu 18.04's kernel and probably all modern Linux kernels. Initially we thought that this was a bug in the Linux NFS client, but after discussions on the Linux NFS mailing list it appears that this is a feature, although I was unable to get clarity on what NFS client behavior is guaranteed in general. To cut a long story short, in the end we were able to find a way to change Alpine to fix our problem, at least on Ubuntu 18.04's normal 4.15.x based server kernel.

To explain the fix, I'll start with short version of how to reproduce the problem:

  1. Create a file that is not an exact multiple of 4K in size.
  2. On a NFS client, open the file read-write and read all the way to the end. Keep the file open.
  3. On another machine, append to the file.
  4. In your program on the first NFS client, wait until stat() says the file's size has changed.
  5. Try to read the new data. The new data from the end of the old file up to the next 4 KB boundary will be zero bytes.

The general magic fix is to flock() the file after stat() says the file size has changed; in other words, flock() between steps four and five. If you do this, the 18.04 kernel NFS client code magically forgets whatever 'bad' state information it has cached and (re-)reads the real data from the NFS server. It's possible that other variations of this sequence might work, such as flock()'ing after you've finished reading the file but before it changes, but we haven't tested them.

(We haven't tested the flock() behavior on later kernels, either 18.04's HWE kernels or others, and as mentioned I could not get the Linux NFS people to say whether or not this is guaranteed behavior or just a coincidence of the current implementation, as the non-flock() version working properly was.)
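
To make the order of operations concrete, here is a minimal sketch in C of the reader side, including the workaround. This is not Alpine's code or our test program, just an illustration, and error handling is omitted:

/* reader on the first NFS client */
#include <sys/file.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>

int main(int argc, char **argv)
{
	char buf[4096];
	struct stat st;
	off_t seen;
	int fd = open(argv[1], O_RDWR);

	stat(argv[1], &st);
	seen = st.st_size;
	/* step 2: read all the way to the current end of the file */
	while (read(fd, buf, sizeof(buf)) > 0)
		;

	/* step 4: wait until another machine has appended to the file */
	do {
		sleep(1);
		stat(argv[1], &st);
	} while (st.st_size <= seen);

	/* the workaround: flock() before reading the new data */
	flock(fd, LOCK_EX);
	lseek(fd, seen, SEEK_SET);
	ssize_t n = read(fd, buf, sizeof(buf));
	flock(fd, LOCK_UN);
	printf("read %zd new bytes\n", n);
	return 0;
}

Without the flock() call, the bytes between the old end of file and the next 4 KB boundary come back as zeros on an affected kernel; with it, you get the real data.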

Even better, Alpine turns out to already flock() mailboxes in general. The reason this is not happening here is that Alpine specifically disables flock() on NFS filesystems on Linux (see flocklnx.c) due to a bug that has now been fixed for more than ten years (really). So all we need to do to Alpine to fix the whole issue (on kernels where the flock() fix works in general) is to take out the check for being on a NFS filesystem and always flock() mailboxes regardless of what filesystem they're on, which as a bonus makes the code simpler (and avoids a fstatfs()).

To save people the effort of developing a patch for this themselves, I have added the patch we use for the Ubuntu 18.04 LTS Alpine package to my directory of stuff on this issue; you want cslab-flock.patch. If you build an updated Alpine package, you will want to put a dpkg hold on Alpine after installing your version, because an errant update to a new version of the stock package would create serious problems.

If you're going to use this on something other than Ubuntu 18.04 LTS, you should use my nfsnulls.py test program to test that the problem exists (and that you can reproduce it) and to verify that using flock() fixes it (with the --flock command line argument). I would welcome reports on what happens on kernels more recent than Ubuntu 18.04's 4.15.x.

For reasons beyond the scope of this entry, so far we have not attempted to report this issue or propagate this change to any of Ubuntu's official Alpine package, Debian's official Alpine package, or the upstream Alpine project and their git repository. I welcome other, more energetic people doing so. My personal view is that using flock() on Linux NFS mounts is the right behavior in Alpine in general, entirely independent of this bug; Alpine flock()s in all other filesystems on Linux, and disabled it on NFS only due to a very old bug (from the days before flock() even worked on NFS).

(I'm writing this entry partly because we've received a few queries about this Alpine issue, because other people turn out to have run into the problem too. Somewhat to my surprise, I never explicitly wrote up our solution, so here it is.)

AlpineOverNFSFix written at 22:52:24; Add Comment

2019-05-12

Committed address space versus active anonymous pages in Linux: a mystery

In Linux, there are at least two things that can happen when your system runs out of memory (or the kernel at least thinks it has); the kernel can activate the Out-of-Memory killer, killing one or more processes but leaving the rest alone, or it can start denying new allocation requests, which causes a random assortment of programs to start failing. As I found out recently, systems with strict overcommit on can still trigger the OOM killer, depending on your settings for how much memory the system uses (see here). Normally systems with strict overcommit turned off don't get themselves into situations where they're so out of memory that they start denying allocation requests.

Starting early this morning, some of our compute servers have periodically been reporting 'out of memory, cannot allocate/fork/etc' sorts of errors. There are two things that make this unusual. The first is that these are single-user compute servers, where we turn strict overcommit off; as a result, I would expect them to trigger the OOM killer but never actually run out of memory and start refusing allocations. The second is that according to all of the data I have, these machines have only modest and flat use of committed address space, which is my usual proxy for 'how much memory programs have allocated'.

(The kernel tracks committed address space even when strict overcommit is off, and while it doesn't necessarily represent how much memory programs actually need, it should normally be an upper bound on how much they can use. In fact until today I would have asserted that it definitely was.)

These machines have 96 GB of RAM, and during an incident I can see the committed address space be constant at 3.7 GB while /proc/meminfo's MemAvailable declines to 0 and its Active and Active(anon) numbers climb up to 90 GB or so. I find this quite mysterious, because as far as I understand Linux memory accounting, it should be impossible to have anonymous pages that are not part of the committed address space. You get anonymous pages by operations such as a MAP_ANONYMOUS mmap(), and those are exactly the operations that the kernel is supposed to carefully account for in working out Committed_AS, for obvious reasons.

Inspecting /proc/<pid>/smaps and other data for the sole gigantic Python process currently running on such a machine says that it has a resident set size of 91 GB, a significant number of 'rw-' anonymous mappings (roughly 96 GB worth, mostly in 64 MB mappings), and on hand inspection, a surprising number of those mappings have a VmFlags: field that does not have the ac flag that apparently is associated with an 'accountable area' (per the proc(5) manpage and other documentation). I don't know if not having an ac flag causes an anonymous mapping to not count against committed address space, but it seems plausible, or at least the best theory I currently have.
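
A quick and dirty way to hunt for such mappings is to scan smaps for mapping entries whose VmFlags line doesn't include ac, with something like this sketch (substitute the real process ID for PID):

$ awk '/^[0-9a-f]+-[0-9a-f]+ /{hdr=$0} /^VmFlags:/ && $0 !~ / ac/ {print hdr; print}' /proc/PID/smaps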

(It would help if I could create such mappings myself to test what happens to the committed address space and so on, but so far I have only a vague theory that perhaps they can be produced through use of mremap() with MAP_PRIVATE and MREMAP_MAYMOVE on a MAP_SHARED region. This is where I need to write a C test program, because sadly I don't think I can do this through something like Python. Python can do a lot of direct OS syscall testing, but playing around with memory remapping is asking a bit much of it.)
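
For illustration, a sketch of the sort of C test harness this calls for looks something like the following: perform a mapping operation and watch how Committed_AS in /proc/meminfo moves. Whether this particular mmap()/mremap() recipe actually produces unaccounted mappings is exactly what the experiment would have to determine, so this is an illustration of the method, not a reproduction of the problem:

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/* Committed_AS from /proc/meminfo, in kB */
static long committed_as(void)
{
	FILE *fp = fopen("/proc/meminfo", "r");
	char line[128];
	long kb = -1;

	while (fgets(line, sizeof(line), fp))
		if (sscanf(line, "Committed_AS: %ld kB", &kb) == 1)
			break;
	fclose(fp);
	return kb;
}

int main(void)
{
	size_t sz = 64UL << 20;		/* 64 MB, like the observed mappings */
	long base = committed_as();

	void *p = mmap(NULL, sz, PROT_READ | PROT_WRITE,
		       MAP_SHARED | MAP_ANONYMOUS, -1, 0);
	printf("after mmap:   Committed_AS grew by %ld kB\n",
	       committed_as() - base);

	/* grow the region, letting the kernel move it if it has to */
	p = mremap(p, sz, sz * 2, MREMAP_MAYMOVE);
	printf("after mremap: Committed_AS grew by %ld kB\n",
	       committed_as() - base);

	memset(p, 1, sz * 2);		/* actually touch all of the pages */
	printf("after touch:  Committed_AS grew by %ld kB\n",
	       committed_as() - base);
	return 0;
}

(Error checking is omitted; in a real test you would also want to pause and look at /proc/<pid>/smaps at each step.)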

CommittedASVsActiveAnon written at 00:22:15; Add Comment

2019-05-08

A Linux machine with a strict overcommit limit can still trigger the OOM killer

We've been running our general use compute servers with strict overcommit handling for total virtual memory for years, because on compute servers we feel we have to assume that if you ask for a lot of memory, you're going to use it for your compute job. As we discovered last fall, hitting the strict overcommit limit doesn't trigger the OOM killer, which can be inconvenient since instead all sorts of random processes start failing since they can't get any more memory. However, we've recently also discovered that our machines with strict overcommit turned on can still sometimes trigger the OOM killer.

At first this made no sense to me and I thought that something was wrong, but then I realized what is probably going on. You see, strict overcommit really has two parts, although we don't often think about the second one; there's the setting itself, ie having vm.overcommit_memory be 2, and then how much your commit limit is, set by vm.overcommit_ratio as your swap space plus some percentage of RAM. Because we couldn't find an overcommit percentage that worked for us across our disparate fleet of compute servers with very varying amounts of RAM, we set this to '100' some years ago, theoretically allowing our machines with strict overcommit to use all of RAM plus swap space. Of course, this is not actually possible in practice, because the kernel needs some amount of memory to operate itself; how much memory is unpredictable and possibly unknowable.
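
To make the arithmetic concrete, the limit that strict overcommit enforces shows up in /proc/meminfo as CommitLimit, and with vm.overcommit_ratio it is (roughly, ignoring hugepages) swap space plus that percentage of RAM. With our settings, one of our machines looks something like this (illustrative numbers):

$ sysctl vm.overcommit_memory vm.overcommit_ratio
vm.overcommit_memory = 2
vm.overcommit_ratio = 100
$ grep -E 'MemTotal|SwapTotal|CommitLimit|Committed_AS' /proc/meminfo
MemTotal:       98842460 kB
SwapTotal:       2097148 kB
CommitLimit:   100939608 kB
Committed_AS:    3712340 kB

Here CommitLimit is SwapTotal plus 100% of MemTotal, which is more than the kernel can ever actually hand out.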

This gap between what we set and what's actually possible creates three states the system can wind up in. If you ask for as much memory as you can allocate (or in general enough memory), you run the system into the strict overcommit limit; either your request fails immediately or other processes start failing later when their memory allocation requests fail. If you don't ask for too much memory, everything is happy; what you asked for plus what the kernel needs fits into RAM and swap space. But if you ask for just the right large amount of memory, you push the system into a narrow middle ground; you're under the strict overcommit limit so your allocations succeed, but over what the kernel can actually provide, so when processes start trying to use enough memory, the kernel will trigger the OOM killer.

There is probably no good way to avoid this for us, so I suspect we'll just live with the little surprise of the OOM killer triggering every so often and likely terminating a RAM-heavy compute process. I don't think it happens very often, and these days we have a raft of single-user compute servers that avoid the problem.

Sidebar: The problems with attempting to turn down the memory limit

First, we don't have any idea how much memory we'd need to reserve for the kernel to avoid OOM. Being cautious here means that some of the RAM will go idle unless we add a bunch of swap space (and risk death through swap thrashing).

Further, not only would the vm.overcommit_ratio setting be machine specific and have to be derived on the fly from the amount of memory, but it's probably too coarse-grained. 1% of RAM on a 256 GB machine is 2.5 GB, although I suppose perhaps the kernel might need that much reserved to avoid OOM. We could switch to using the more recent vm.overcommit_kbytes (cf), but since its value is how much RAM to allow instead of how much RAM to reserve for the kernel, we would definitely have to make it machine specific and derived from how much RAM is visible when the machine boots.

On the whole, living with the possibility of OOM is easier and less troublesome.

StrictOvercommitCanOOM written at 00:58:04; Add Comment

2019-04-24

How we're making updated versions of a file rapidly visible on our Linux NFS clients

Part of our automounter replacement is a file with a master list of all NFS mounts that client machines should have, which we hold in our central administrative filesystem that all clients NFS mount. When we migrate filesystems from our old fileservers to our new fileservers, one of the steps is to regenerate this list with the old filesystem mount not present, then run a mount update on all of the NFS clients to actually unmount the filesystem from the old fileserver. For a long time, we almost always had to wait a bit of time before all of the NFS clients would reliably see the new version of the NFS mounts file, which had the unfortunate effect of slowing down filesystem migrations.

(The NFS mount list is regenerated on the NFS fileserver for our central administrative filesystem, so the update is definitely known to the server once it's finished. Any propagation delays are purely on the side of the NFS clients, who are holding on to some sort of cached information.)

In the past, I've made a couple of attempts to find a way to reliably get the NFS clients to see that there was a new version of the file by doing things like flock(1)'ing it before reading it. These all failed. Recently, one of my co-workers discovered a reliable way of making this work, which was to regenerate the NFS mount list twice instead of once. You didn't have to delay between the two regenerations; running them back to back was fine. At first this struck me as pretty mysterious, but then I came up with a theory for what's probably going on and why this makes sense.

You see, we update this file in a NFS-safe way that leaves the old version of the file around under a different name so that programs on NFS clients that are reading it at the time don't have it yanked out from underneath them. As I understand it, Linux NFS clients cache the mapping from filesystem names to NFS filehandles for some amount of time, to reduce various sorts of NFS lookup traffic (now that I look, there is a discussion pointing to this in the nfs(5) manpage). When we do one regeneration of our nfs-mounts file, the cached filehandle that clients have for that name mapping is still valid (and the file's attributes are basically unchanged); it's just that it's for the file that is now nfs-mounts.bak instead of the new file that is now nfs-mounts. Client kernels are apparently still perfectly happy to use it, and so they read and use the old NFS mount information. However, when we regenerate the file twice, this file is removed outright and the cached filehandle is no longer valid. My theory and assumption is that modern Linux kernels detect this situation and trigger some kind of revalidation that winds up with them looking up and using the correct nfs-mounts file (instead of, say, failing with an error).
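
For illustration, the general shape of this kind of NFS-safe update, sketched in shell form (this is not our actual tooling, and 'generate-nfs-mounts' is just a stand-in for however the new file gets built), is:

$ generate-nfs-mounts > nfs-mounts.new
$ ln -f nfs-mounts nfs-mounts.bak     # keep the old version under another name
$ mv -f nfs-mounts.new nfs-mounts     # atomically replace the published name

Run once, the old file survives as nfs-mounts.bak and a client holding its NFS filehandle can quietly keep using it. Run a second time, the first run's nfs-mounts.bak is itself replaced, the original file loses its last name and goes away, and the cached filehandle finally becomes stale.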

(It feels ironic that apparently the way to make this work for us here in our NFS environment is to effectively update the file in an NFS-unsafe way for once.)

PS: All of our NFS clients here are using either Ubuntu 16.04 or 18.04, using their stock (non-HWE) kernels, so various versions of what Ubuntu calls '4.4.0' (16.04) and '4.15.0' (18.04). Your mileage may vary on different kernels and in different Linux environments.

NFSClientFileVisible written at 23:20:58; Add Comment


