What it takes to run a 32-bit x86 program on a 64-bit x86 Linux system
Suppose that you have a modern 64-bit x86 Linux system (often called an x86_64 environment) and that you want to run an old 32-bit x86 program on it (a plain x86 program). What does this require from the overall system, both the kernel and the rest of the environment?
At a minimum, this requires that the (64-bit) kernel support
programs running in 32-bit mode ('IA32') and making 32-bit kernel
calls. Supporting this is a configuration option in the kernel
(or actually a whole collection of them, but they mostly depend
on one called IA32_EMULATION). Supporting 32-bit calls on a
64-bit kernel is not entirely easy because many kernel calls involve
structures; those structures must be translated back and forth
between the kernel's native 64-bit version and the emulated 32-bit
version. This can raise questions of how to handle native values
that exceed what can fit in the fields of the 32-bit structures. The
kernel also has a barrel full of fun in the form of
ioctl(), which smuggles a
lot of structures in and out of the kernel in relatively opaque ways. A
64-bit kernel does want to support at least some 32-bit ioctls, such as
the ones that deal with (pseudo-)terminals.
(I suspect that there are people in the Linux kernel community who hope that all of this emulation and compatibility code can someday be removed.)
A modern kernel dealing with modern 32-bit programs also needs to provide a 32-bit vDSO, and the necessary information to let the program find it. This requires the kernel to carry around a 32-bit ELF image, which has to be generated somehow (at some point). The vDSO is mapped into the memory space of even statically compiled 32-bit programs, although they may or may not use it.
(In ldd output on dynamically linked 32-bit programs, I believe
this often shows up as a magic 'linux-gate.so.1'.)
This is enough for statically compiled programs, but of course very few programs are statically compiled. Instead, almost all 32-bit programs that you're likely to encounter are dynamically linked and so require a collection of additional compiled things. Running a dynamically linked program requires at least a 32-bit version of its dynamic linker (the 'ELF interpreter'), which is usually 'ld-linux.so.2'. Generally the 32-bit program will then go on to require additional 32-bit shared libraries, starting with the 32-bit C library ('libc.so.6' and 'libdl.so.2' for glibc) and expanding from there. The basic shared libraries usually come from glibc, but you can easily need additional ones from other packages for things like curses or the collection of X11 shared libraries. C++ programs will need libstdc++, which comes from GCC instead of glibc.
(The basic dynamic linker, ld-linux.so.2, is also from glibc.)
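As an aside, you can check which flavor a given binary is by looking at its ELF header; the file starts with the magic bytes '\x7fELF', and byte 4 (EI_CLASS) is 1 for a 32-bit binary and 2 for a 64-bit one. Here is a minimal illustrative sketch of the check in Python (the function name is my own invention):

```python
def elf_class(path):
    # Read the start of the ELF identification bytes: the first four
    # bytes are the magic '\x7fELF', and byte 4 (EI_CLASS) is 1 for a
    # 32-bit binary and 2 for a 64-bit one.
    with open(path, "rb") as f:
        ident = f.read(5)
    if len(ident) < 5 or ident[:4] != b"\x7fELF":
        return None          # not an ELF file at all
    return {1: 32, 2: 64}.get(ident[4])
```

(file(1) and readelf report the same information; this is just the minimal core of the check.)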
In order to do things like hostname lookups correctly, a 32-bit
program will also need 32-bit versions of any NSS modules that are used
in /etc/nsswitch.conf, since all of these are shared libraries
that are loaded dynamically by glibc. Some of these modules come
from glibc itself, but others are provided by a variety of additional
software packages. I'm not certain what happens to your program's
name lookups if a relevant NSS module is not available, but at a
minimum you won't be able to correctly resolve names that come from
the sources that module handles.
(You may not get any result for the name, or you might get an incorrect or incomplete result if another configured NSS module also has an answer for you. Multiple NSS modules are common for things like hostname resolution.)
I believe that generally all of these 32-bit shared libraries will have to be built with a 32-bit compiler toolchain in an environment that itself looks and behaves as 32-bit as possible. Building 32-bit binaries from a 64-bit environment is theoretically possible and nominally simple, but in practice there have been problems, and on top of that many build systems don't support this sort of cross-building.
(Of course, many people and distributions already have a collection of 32-bit shared libraries that have been built. But if they need to be rebuilt or updated for some reason, this might be an issue. And of course the relevant shared library needs to (still) support being built as 32-bit instead of 64-bit, as does the compiler toolchain.)
My weird problem with the Fedora 29 version of Firefox 67
On Twitter, I said:
So the Fedora version of Firefox 67 (or perhaps all versions of Firefox 67) have a little issue where starting Firefox with a URL, as 'firefox SOME-URL', will sometimes start Firefox without loading the web page properly. This is very irritating for me, so back to Firefox 66.
I also did some testing of the Fedora Firefox 67 with a completely
scratch $HOME (by doing 'export HOME=/tmp/fox-scratch; mkdir $HOME')
and I got some really weird results by repeatedly starting it with
an URL, quitting, and repeating it. For a while at home, I could
get the Fedora version of Firefox 67 to report that the Environment
Canada website had an invalid TLS certificate because Entrust was
an unknown certificate authority.
Initially I wasn't sure if this was Firefox 67 in general or not, but yesterday I got sufficiently interested and irritated to fetch the official Mozilla build of Firefox 67 and try it instead ('installed', ie unpacked, into a non-system location). This official version of 67.0.2 works completely fine for me both at home and at work, with all of my normal profiles and extensions, so the problem seems to be in some way specific to Fedora's build of Firefox 67 (I saw it with both 67.0-2 and 67.0-4 on Fedora 29). On the other hand, the Fedora 29 Firefox 67 seems to work fine in my Cinnamon session on my laptop.
(I haven't tried the latest unreleased version of Fedora's Firefox that's available through the Bodhi page for Firefox or via the steps in Fetching really new Fedora packages with Bodhi. As I write this it's only 19 hours old, so I'll let the dust settle on those packages a bit.)
PS: I've not yet upgraded to Fedora 30 for various reasons, but certainly one of them is this bug. I suspect it may be July before I make the leap.
PPS: For interested parties, the bug I filed with Fedora is Fedora Bugzilla #1713924.
An interesting Fedora 29 DNF update loop with the createrepo package
For a while, my Fedora 29 home and work machines have been responding
to any 'dnf update' with a very peculiar complaint:
# dnf update
Last metadata expiration check: 0:09:16 ago
[...]
Dependencies resolved.

 Problem: cannot install both createrepo_c-0.11.1-1.fc29.x86_64 and createrepo_c-0.13.2-2.fc29.x86_64
  - cannot install the best update candidate for package createrepo_c-0.13.2-2.fc29.x86_64
  - cannot install the best update candidate for package createrepo-0.10.3-15.fc28.noarch
=========
[....]
 Package        Architecture  Version        Repository  Size
Skipping packages with conflicts:
(add '--best --allowerasing' to command line to force their upgrade):
 createrepo_c   x86_64        0.11.1-1.fc29  fedora      59 k
(Using '--best --allowerasing' did not in fact force the upgrade.)
For a while I've been ignoring this or taking very light stabs
at trying to fix it, but tonight I got irritated enough to finally
do a full investigation. To start with, the initial situation is
that I have both createrepo 0.10.3-15 and createrepo_c 0.13.2-2
installed. DNF is trying to upgrade createrepo to createrepo_c
0.11.1-1, but this is naturally conflicting with the more recent
createrepo_c that I already have installed.
I don't know quite how I got into the underlying situation, but I
believe that interested parties can reproduce it on a current Fedora
29 system (and possibly on a Fedora 30 one as well) by first
installing mock, which pulls in createrepo_c 0.13.2-2, and
then installing the older and now apparently obsolete
mach, which will pull in createrepo. At this point, a '
dnf update' will likely produce
what you see here. To get out of the situation, you must DNF remove
createrepo (conveniently, '
dnf remove createrepo'
will ripple through to remove
mach as well).
To start understanding the situation, let's do an additional DNF command:
# dnf provides createrepo
createrepo-0.10.3-15.fc28.noarch : Creates a common metadata repository
[...]
Provide    : createrepo = 0.10.3-15.fc28

createrepo_c-0.11.1-1.fc29.x86_64 : Creates a common metadata repository
[...]
Provide    : createrepo = 0.11.1-1.fc29
In the beginning, there was createrepo, which was written
in Python, and it was used by various programs and packages that
wanted to create local RPM repositories, including both Mach and Mock. As a result of
this, the Fedora packages for various things explicitly required
createrepo'. Eventually the RPM people decided that they needed
a version of createrepo written in C, so they created
createrepo_c. In Fedora
29, Fedora appears to have switched which createrepo implementation
they used to the C version. Likely to ease the transition, they
made the initial version or versions of their createrepo_c RPM
also pretend that it was
createrepo, by explicitly doing an RPM
provides of that name. This made createrepo_c 0.11.1-1 both a
substitute for the
createrepo RPM and an upgrade candidate for
it, since it has a more recent version (this is the surprise of
'Provides' in RPM).
(The RPM changelog says this was introduced in 0.10.0-20, for which the only change is 'Obsolete and provide createrepo'.)
Over time, most RPMs were updated to require createrepo_c instead
of createrepo, including the
mock RPM. However, the mach
RPM was not updated, probably because Mach itself is neglected and
likely considered obsolete or abandoned. Then at some point the
Fedora people stopped having their createrepo_c RPM fill in for
createrepo this way. Based on the RPM changelog for createrepo_c,
this happened in 0.13.2-1, which includes a cryptic changelog line of:
- Do not obsolete createrepo on Fedora < 31
Presumably the Fedora people have their reasons, and if I wanted
to trawl the Fedora Bugzilla I might even find them. However, the
effect of this change is that older
createrepo_c RPMs in Fedora
29 are updates for
createrepo but newer ones aren't.
So, if you '
dnf install mock', you will get
mock and the current
createrepo_c, which doesn't provide createrepo.
If you then 'dnf install mach', it requires createrepo, and
the best version DNF can actually install is the actual createrepo
0.10.3-15 RPM that was built on Fedora 28. However, once that is
installed, DNF will see the 0.11.1-1 version of createrepo_c
from the Fedora 29 release package set as an update candidate for it,
but that can't be installed because you already have a more recent
createrepo_c.
(I suspect that if you install
mach first and
mock second, you
will get only the older
createrepo_c but will be unable to upgrade it
past 0.11.1-1 without erasing createrepo and mach.)
Distribution packaging of software needs to be informed (and useful)
I think that's a bit backwards. I always consider the distro as the owner of the package namespace. So, it's really up to grafana.com to use a non-colliding name.
The distro provides a self-contained ecosystem, if a 3rd party package wants to work with it, then it should be up to the 3rd party to do all the work. There's no reasonable way for Fedora (or any other distro) to know about every other possible package on the web that may get installed.
I fundamentally disagree with this view as it plays out in the case
of the '
grafana' RPM package name (where grafana.com packaged it
long before Fedora did). At the immediate level, when the upstream
already distributes a package for the thing (either as a standalone
package, the case here, or through an addon repository), it is up
to the distribution, as the second person on the scene, to make
sure that they work with the existing situation unless it is
completely untenable to do so.
(On a purely practical level, saying 'the distribution can take over your existing package name any time it feels like and cause arbitrary problems for current users' is a great way to get upstreams to never make their own packages for a distribution and to only ever release tarballs. I feel strongly that this would be a loss; tarballs are strongly inferior to proper packages for various reasons.)
More broadly, when a distribution creates a package for something, they absolutely should inform themselves about how the thing is currently packaged, distributed, installed, and used on their distribution, if it is, and how it will be used and evolve in the future. Fundamentally, a distribution should be creating useful packages of programs and being informed is part of that. Blindly grabbing a release of something and packaging it as an official distribution package is not necessarily creating a useful thing for anyone, either current users or potential future users who might install the distribution's package. Doing a good, useful package fundamentally requires understanding things like how the upstream distributes things, what their release schedule is like, how they support old releases (if they do), and so on. It cannot be done blindly, even in cases where the upstream is not already providing its own packages.
(For example, if you package and freeze a version of something that will have that version abandoned immediately by the upstream and not have fixes, security updates and so on backported by you, you are not creating a useful package; instead, you're creating a dangerous one. In some cases this means that you cannot create a distribution package that is both in compliance with distribution packaging policies and useful to your users; in that case, you should not package it at all. If users keep asking, set up a web page for 'why we cannot provide a package for this'.)
PS: Some of this is moot if the upstream does not distribute their own pre-built binaries, but even then you really want to know the upstream's release schedule, length of version support, degree of version to version change, and so on. If the upstream believes in routine significant change, no support of old versions, and frequent releases, you probably do not want to touch that minefield. In the modern world, it is an unfortunate fact of life that not every upstream project is suitable for being packaged by distributions, even if this leaves your users to deal with the problem themselves. It's better to be honest about the upstream project being incompatible with what your users expect from your packages.
Something that Linux distributions should not do when packaging things
Right now I am a bit unhappy at Fedora for a specific packaging situation, so let me tell you a little story of what I, as a system administrator, would really like distributions to not do.
For reasons beyond the scope of this blog entry, I run a Prometheus and Grafana setup
on both my home and office Fedora Linux machines (among other things,
it gives me a place to test out various things involving them).
When I set this up, I used the official upstream versions of both,
because I needed to match what we are running (or would soon be). The Grafana people supply
Grafana in a variety of package formats, and because Grafana has a
bunch of files and paths I opted to use their RPM package instead
of their tarball. The Grafana people give their RPM package the
package name of '
grafana', which is perfectly reasonable of them.
(We use the .deb on our Ubuntu 18.04 based production server for the same reason. Life is too short to spend patiently setting tons of command line switches or configuration file paths to tell something where to find all of its bits if the people provide a nice pre-packaged artifact.)
Recently, Fedora decided to package Grafana themselves (as a RPM),
and they called this RPM package '
grafana'. Since the two different
packages are different versions of the same thing as far as package
management tools are concerned, Fedora basically took over the
'grafana' package name from Grafana. This caused my systems to
offer to upgrade me from the Grafana.com 'grafana-6.1.5-1' package
to the Fedora 'grafana-6.1.6-1.fc29' one, which I actually did after
taking reasonable steps to make sure that the Fedora version of
6.1.6 was compatible with the file layouts and so on from the Grafana
version of 6.1.5.
So far, I have no objection to what Fedora did. They provided the latest released version of Grafana, and their new package was a drop in replacement for the upstream Grafana RPM. The problem is what happened next, which is that the Grafana people released Grafana 6.2 on May 22nd (cf) and currently there is no sign of any update to the Fedora package (the Bodhi page for grafana has no activity since 6.1.6, for example). At this point it is unclear to me if Fedora has any plans to update from 6.1.6 at all, for example; perhaps they have decided to freeze on this initial version.
Why is this a problem? It's simple. If you're going to take over
a package name from the upstream, you should keep up with the
upstream releases. If you take over a package name and don't keep
up to date or keep up to date only sporadically, you cause all sorts
of heartburn for system administrators who use the package. The
least annoying future of this situation is that Fedora has abandoned
Grafana at 6.1.6 and I am going to 'upgrade' it with the upstream
6.2.1, which will hopefully be a transparent replacement and not
blow up in my face. The most annoying future is that Fedora and
Grafana keep ping-ponging versions back and forth, which will make
'dnf upgrade' into a minefield (because it will frequently try
to give me a '
grafana' upgrade that I don't want and that would
be dangerous to accept). And of course this situation turns Fedora
version upgrades into their own minefield, since now I risk an
upgrade to Fedora 30 actually reverting the 'grafana'
version on me.
You can hardly miss that Grafana.com already supplies a 'grafana'
RPM; it's right there on their download page. In this situation I feel
that the correct thing for a Linux distribution to do is to pick
another package name, one that doesn't clash with the upstream's
established packaging. If you can't stand doing this, don't package
the software at all.
(Fedora's packaging of Prometheus itself is fairly amusing in a terrible way, since they only provide the extremely obsolete 1.8.0 release (which is no longer supported upstream or really by anyone). Prometheus 2.x is a major improvement that everyone should be using, and 2.0.0 was released way back in November of 2017, more than a year and a half ago. At this point, Fedora should just remove their Prometheus packages from the next version of Fedora.)
An infrequent odd kernel panic on our Ubuntu 18.04 fileservers
I have in the past talked about our shiny new Prometheus based metrics system and some interesting
things we've seen due to
its metrics, especially its per-host system metrics (collected by
its host agent). What I haven't mentioned is that we're not running
the host agent on one important group of our machines, namely our
new Linux fileservers. This isn't because
we don't care about metrics from those machines. It's because when
we do run the host agent, we get very infrequent but repeating
kernel panics, or I should say what seems to be a single panic.
The panic we see is this:
BUG: unable to handle kernel NULL pointer dereference at 000000000000000c
IP: __atime_needs_update+0x5/0x190
[...]
CPU: 7 PID: 10553 Comm: node_exporter Tainted: P O 4.15.0-30-generic #32-Ubuntu
RIP: 0010:__atime_needs_update+0x5/0x190
[...]
Call Trace:
 ? link_path_walk+0x3e4/0x5a0
 ? path_init+0x177/0x2f0
 path_openat+0xe4/0x1770
 [... sometimes bogus frames here ...]
 do_filp_open+0x9b/0x110
 ? __check_object_size+0xaf/0x1b0
 do_sys_open+0x1bb/0x2c0
 ? do_sys_open+0x1bb/0x2c0
 ? _cond_resched+0x19/0x40
 SyS_openat+0x14/0x20
 do_syscall_64+0x73/0x130
 entry_SYSCALL_64_after_hwframe+0x3d/0xa2
The start and the end of this call trace is consistent between panics; the middle sometimes has various peculiar and apparently bogus frames.
This panic occurs only on our ZFS fileservers, which are a small minority of the servers where we have the host agent running, and only generally after server-months of operation (including intensive pounding of the host agent on test servers). The three obvious things that are different about our ZFS fileservers are that they are our only machines with this particular set of SuperMicro hardware, they are the only machines with ZFS, and they are our only 18.04 NFS servers. However, this panic has happened on a test server with no ZFS pools and no NFS exports.
If I believe the consistent portions of the call trace, this panic
happens while following a symlink during an
openat() system call.
I strace'd the Prometheus host agent and there turn out to
not be very many such things it opens; my notes say things under
/proc/net, some things under
/sys/class/hwmon, and some things under
/sys/devices/system/cpu/cpu*/cpufreq. Of these, the /proc/net
entries are looked at on all machines and seem unlikely suspects,
but the hwmon stuff is definitely suspect. In fact we have
another machine where trying to look at those entries produces
constant kernel reports about ACPI problems:
ACPI Error: SMBus/IPMI/GenericSerialBus write requires Buffer of length 66, found length 32 (20170831/exfield-427)
ACPI Error: Method parse/execution failed \_SB.PMI0._PMM, AE_AML_BUFFER_LIMIT (20170831/psparse-550)
ACPI Exception: AE_AML_BUFFER_LIMIT, Evaluating _PMM (20170831/power_meter-338)
(ACPI is an area I suspect because it's part of the BIOS and so varies from system to system.)
However, it doesn't seem to be the
hwmon stuff alone. You can
tell the Prometheus host agent to not try to look at it (with a
command line argument), and while running the host agent in this
mode, we have had a crash on one of our test fileservers. Based
on the cpufreq driver being 'acpi-cpufreq', I suspect that ACPI is
involved in this area as well on these machines.
Even if motherboard-specific ACPI stuff is what triggers this panic,
the panic itself is worryingly mysterious. The actual panic is
clearly a dereference of a NULL pointer, as __atime_needs_update()
attempts to refer to a struct field (cf).
Based on this happening very early on in the function and the
small offset involved, the NULL pointer is probably the
path argument, since this is used almost
immediately. However, I can't entirely follow how we get there,
especially with a NULL
path. Some of the context of the call is
relatively clear; the call path probably runs from part of
link_path_walk() through an inlined call to a helper function
that is mysteriously not listed in the trace and then to
__atime_needs_update() itself.
(I would be more confident of this if I knew how to
disassemble bits of the Ubuntu kernel to verify and map back the
reported raw byte positions in these functions.)
I admit that this is the kind of situation that makes me yearn for crash dumps. Having a kernel crash dump to poke around in might well give us a better understanding of what's going on, possibly including a better call trace. Unfortunately even if it's theoretically possible to get kernel crash dumps out of Linux with the right setup, it's not standard for installers to actually set that up or offer it as a choice so as a practical matter it's mostly not there.
PS: We haven't tried upgrading the kernel version we're using on the fileservers because stable fileservers are more important to us than host metrics, and we know they're stable on this specific kernel because that's what we did extensive testing on. We might consider upgrading if we could find a specific bug fix for this, but so far I haven't spotted any smoking guns.
Fixing Alpine to work over NFS on Ubuntu 18.04 (and probably other modern Linuxes)
Last September we discovered that the Ubuntu 18.04 LTS version of Alpine was badly broken when used over NFS, and eventually traced this to a general issue with NFS in Ubuntu 18.04's kernel and probably all modern Linux kernels. Initially we thought that this was a bug in the Linux NFS client, but after discussions on the Linux NFS mailing list it appears that this is a feature, although I was unable to get clarity on what NFS client behavior is guaranteed in general. To cut a long story short, in the end we were able to find a way to change Alpine to fix our problem, at least on Ubuntu 18.04's normal 4.15.x based server kernel.
To explain the fix, I'll start with short version of how to reproduce the problem:
- Create a file that is not an exact multiple of 4K in size.
- On a NFS client, open the file read-write and read all the way to the end. Keep the file open.
- On another machine, append to the file.
- In your program on the first NFS client, wait until stat() says the file's size has changed.
- Try to read the new data. The new data from the end of the old file up to the next 4 KB boundary will be zero bytes.
The general magic fix is to
flock() the file after
the file size has changed; in other words,
flock() between steps
four and five. If you do this, the 18.04 kernel NFS client code
magically forgets whatever 'bad' state information it has cached
and (re-)reads the real data from the NFS server. It's possible
that other variations of this sequence might work, such as
flock()ing the file after you've finished reading it but before it changes, but
we haven't tested them.
(We haven't tested the
flock() behavior on later kernels, either
18.04's HWE kernels
or others, and as mentioned I
could not get the Linux NFS people to say whether or not this is
guaranteed behavior or just a coincidence of the current implementation,
as the non-
flock() version working properly was.)
Even better, Alpine
turns out to already
flock() mailboxes in general. The reason
this is not happening here is that Alpine specifically disables
flock() on NFS filesystems on Linux (see flocklnx.c)
due to a bug that has now been fixed for more than ten years (really). So
all we need to do to Alpine to fix the whole issue (on kernels where
the flock() fix works in general) is to take out the check for
being on a NFS filesystem and always
flock() mailboxes regardless
of what filesystem they're on, which as a bonus makes the code
simpler (and avoids an extra check).
To save people the effort of developing a patch for this themselves, I have added the patch we use for the Ubuntu 18.04 LTS Alpine package to my directory of stuff on this issue; you want cslab-flock.patch. If you build an updated Alpine package, you will want to put a dpkg hold on Alpine after installing your version, because an errant update to a new version of the stock package would create serious problems.
If you're going to use this on something other than Ubuntu 18.04
LTS, you should use my nfsnulls.py test program to
test that the problem exists (and that you can reproduce it) and
to verify that using
flock() fixes it (with the relevant command
line argument). I would welcome reports on what happens on kernels
more recent than Ubuntu 18.04's 4.15.x.
For reasons beyond the scope of this entry, so far we have not
attempted to report this issue or propagate this change to any of
Ubuntu's official Alpine package, Debian's official Alpine package,
or the upstream Alpine project
and their git repository. I welcome
other, more energetic people doing so. My personal view is that
using flock() on Linux NFS mounts is the right behavior in Alpine
in general, entirely independent of this bug; Alpine already flock()s
all other filesystems on Linux, and disabled it on NFS only due to
a very old bug (from the days before
flock() even worked on NFS).
(I'm writing this entry partly because we've received a few queries about this Alpine issue, because other people turn out to have run into the problem too. Somewhat to my surprise, I never explicitly wrote up our solution, so here it is.)
Committed address space versus active anonymous pages in Linux: a mystery
In Linux, there are at least two things that can happen when your system runs out of memory (or the kernel at least thinks it has); the kernel can activate the Out-of-Memory killer, killing one or more processes but leaving the rest alone, or it can start denying new allocation requests, which causes a random assortment of programs to start failing. As I found out recently, systems with strict overcommit on can still trigger the OOM killer, depending on your settings for how much memory the system uses (see here). Normally systems with strict overcommit turned off don't get themselves into situations where they're so out of memory that they start denying allocation requests.
Starting early this morning, some of our compute servers have periodically been reporting 'out of memory, cannot allocate/fork/etc' sorts of errors. There are two things that make this unusual. The first is that these are single-user compute servers, where we turn strict overcommit off; as a result, I would expect them to trigger the OOM killer but never actually run out of memory and start refusing allocations. The second is that according to all of the data I have, these machines have only modest and flat use of committed address space, which is my usual proxy for 'how much memory programs have allocated'.
(The kernel tracks committed address space even when strict overcommit is off, and while it doesn't necessarily represent how much memory programs actually need, it should normally be an upper bound on how much they can use. In fact until today I would have asserted that it definitely was.)
These machines have 96 GB of RAM, and during an incident I can see
the committed address space be constant at 3.7 GB while
MemAvailable declines to 0 and its Active and Active(anon) numbers
climb up to 90 GB or so. I find this quite mysterious, because as
far as I understand Linux memory accounting, it should be impossible
to have anonymous pages that are not part of the committed address
space. You get anonymous pages by operations such as an anonymous
mmap(), and those are exactly the operations that the kernel is
supposed to carefully account for in working out Committed_AS,
for obvious reasons.
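As a quick way to watch this accounting in action, the following Linux-specific sketch (my own, not part of the original investigation) maps some private anonymous memory and reads Committed_AS from /proc/meminfo before and after:

```python
import mmap
import re

def committed_as_kb():
    # Read the kernel's current Committed_AS figure (in kB) from
    # /proc/meminfo.  This is system-wide, so it is somewhat noisy.
    with open("/proc/meminfo") as f:
        return int(re.search(r"Committed_AS:\s+(\d+) kB", f.read()).group(1))

before = committed_as_kb()
# Map 256 MB of private anonymous memory; this is exactly the sort of
# operation that the kernel is supposed to charge against the committed
# address space, even though no page has been touched yet.
m = mmap.mmap(-1, 256 * 1024 * 1024,
              flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
after = committed_as_kb()
m.close()
print(after - before)   # the mapping alone should move the figure
```

If the mysterious mappings described above really escape this accounting, the equivalent experiment with them would show no such jump.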
Looking at /proc/<pid>/smaps and other data for
the sole gigantic Python process currently running on such a machine
says that it has a resident set size of 91 GB, a significant number
of 'rw-' anonymous mappings (roughly 96 GB worth, mostly in 64
MB mappings), and on hand inspection, a surprising number of those
mappings have a VmFlags: field that does not have the 'ac' flag
that is apparently associated with an 'accountable area' (per the
proc(5) manual page and other documentation). I don't know if not
having 'ac' causes an anonymous mapping to not count against committed
address space, but it seems plausible, or at least it's the best
theory I currently have.
(It would help if I could create such mappings myself to test what
happens to the committed address space and so on, but so far I
have only a vague theory that perhaps they can be produced through
mremap() with MREMAP_MAYMOVE on a
MAP_SHARED region. This is where I need to write a C test program,
because sadly I don't think I can do this through something like
Python. Python can do a lot of direct OS syscall testing, but playing
around with memory remapping is asking a bit much of it.)
A Linux machine with a strict overcommit limit can still trigger the OOM killer
We've been running our general use compute servers with strict overcommit handling for total virtual memory for years, because on compute servers we feel we have to assume that if you ask for a lot of memory, you're going to use it for your compute job. As we discovered last fall, hitting the strict overcommit limit doesn't trigger the OOM killer, which can be inconvenient since instead all sorts of random processes start failing since they can't get any more memory. However, we've recently also discovered that our machines with strict overcommit turned on can still sometimes trigger the OOM killer.
At first this made no sense to me and I thought that something was
wrong, but then I realized what is probably going on. You see,
strict overcommit really has two parts, although we don't often
think about the second one; there's the setting itself, ie having
vm.overcommit_memory be 2, and then how much your commit limit
is, set by
vm.overcommit_ratio as your swap space plus some
percentage of RAM. Because we couldn't find an overcommit percentage
that worked for us across our disparate fleet of compute servers
with very varying amounts of RAM, we set this to '100' some years
ago, theoretically allowing our machines with strict overcommit to
use all of RAM plus swap space. Of course, this is not actually
possible in practice, because the kernel needs some amount of memory
to operate itself; how much memory that is can be unpredictable and
possibly variable.
This gap between what we set and what's actually possible creates three states the system can wind up in. If you ask for as much memory as you can allocate (or in general enough memory), you run the system into the strict overcommit limit; either your request fails immediately or other processes start failing later when their memory allocation requests fail. If you don't ask for too much memory, everything is happy; what you asked for plus what the kernel needs fits into RAM and swap space. But if you ask for just the right large amount of memory, you push the system into a narrow middle ground; you're under the strict overcommit limit so your allocations succeed, but over what the kernel can actually provide, so when processes start trying to use enough memory, the kernel will trigger the OOM killer.
There is probably no good way to avoid this for us, so I suspect we'll just live with the little surprise of the OOM killer triggering every so often and likely terminating a RAM-heavy compute process. I don't think it happens very often, and these days we have a raft of single-user compute servers that avoid the problem.
Sidebar: The problems with attempting to turn down the memory limit
First, we don't have any idea how much memory we'd need to reserve for the kernel to avoid OOM. Being cautious here means that some of the RAM will go idle unless we add a bunch of swap space (and risk death through swap thrashing).
Further, not only would the
vm.overcommit_ratio setting be
machine specific and have to be derived on the fly from the amount
of memory, but it's probably too coarse-grained. 1% of RAM on a 256
GB machine is about 2.5 GB, although I suppose perhaps the kernel
might need that much reserved to avoid OOM.
We could switch to using the more recent
vm.overcommit_kbytes (cf), but since
its value is how much RAM to allow instead of how much RAM to reserve
for the kernel, we would definitely have to make it machine specific
and derived from how much RAM is visible when the machine boots.
On the whole, living with the possibility of OOM is easier and less troublesome.
How we're making updated versions of a file rapidly visible on our Linux NFS clients
Part of our automounter replacement is a file with a master list of all NFS mounts that client machines should have, which we hold in our central administrative filesystem that all clients NFS mount. When we migrate filesystems from our old fileservers to our new fileservers, one of the steps is to regenerate this list with the old filesystem mount not present, then run a mount update on all of the NFS clients to actually unmount the filesystem from the old fileserver. For a long time, we almost always had to wait a bit of time before all of the NFS clients would reliably see the new version of the NFS mounts file, which had the unfortunate effect of slowing down filesystem migrations.
(The NFS mount list is regenerated on the NFS fileserver for our central administrative filesystem, so the update is definitely known to the server once it's finished. Any propagation delays are purely on the side of the NFS clients, who are holding on to some sort of cached information.)
In the past, I've made a couple of attempts to find a way to reliably
get the NFS clients to see that there was a new version of the file
by doing things like
flock(1)'ing it before
reading it. These all failed. Recently, one of my co-workers
discovered a reliable way of making this work, which was to regenerate
the NFS mount list twice instead of once. You didn't have to delay
between the two regenerations; running them back to back was fine.
At first this struck me as pretty mysterious, but then I came up
with a theory for what's probably going on and why this makes sense.
You see, we update this file in a NFS-safe way that leaves the old version
of the file around under a different name so that programs on NFS
clients that are reading it at the time don't have it yanked out
from underneath them.
As I understand it, Linux NFS clients cache the mapping from
file names to NFS filehandles for some amount of time, to reduce
various sorts of NFS lookup traffic (now that I look, there is a
discussion of this in the nfs(5) manpage). When we do one
regeneration of our
nfs-mounts file, the cached filehandle that
clients have for that name mapping is still valid (and the file's
attributes are basically unchanged); it's just that it's for the
file that is now
nfs-mounts.bak instead of the new file that is
nfs-mounts. Client kernels are apparently still perfectly
happy to use it, and so they read and use the old NFS mount
information. However, when we regenerate the file twice, this file
is removed outright and the cached filehandle is no longer valid.
My theory and assumption is that modern Linux kernels detect this
situation and trigger some kind of revalidation that winds up with
them looking up and using the correct
nfs-mounts file (instead of,
say, failing with an error).
(It feels ironic that apparently the way to make this work for us here in our NFS environment is to effectively update the file in an NFS-unsafe way for once.)
PS: All of our NFS clients here are using either Ubuntu 16.04 or 18.04, using their stock (non-HWE) kernels, so various versions of what Ubuntu calls '4.4.0' (16.04) and '4.15.0' (18.04). Your mileage may vary on different kernels and in different Linux environments.