Wandering Thoughts archives

2019-05-28

Distribution packaging of software needs to be informed (and useful)

In my recent entry on a RPM name clash between grafana.com's and Fedora's packaging of Grafana, Josef "Jeff" Sipek wrote in a comment:

I think that's a bit backwards. I always consider the distro as the owner of the package namespace. So, it's really up to grafana.com to use a non-colliding name.

The distro provides a self-contained ecosystem, if a 3rd party package wants to work with it, then it should be up to the 3rd party to do all the work. There's no reasonable way for Fedora (or any other distro) to know about every other possible package on the web that may get installed.

[...]

I fundamentally disagree with this view as it plays out in the case of the 'grafana' RPM package name (where grafana.com packaged it long before Fedora did). At the immediate level, when the upstream already distributes a package for the thing (either as a standalone package, the case here, or through an addon repository), it is up to the distribution, as the second person on the scene, to make sure that they work with the existing situation unless it is completely untenable to do so.

(On a purely practical level, saying 'the distribution can take over your existing package name any time it feels like and cause arbitrary problems for current users' is a great way to get upstreams to never make their own packages for a distribution and to only ever release tarballs. I feel strongly that this would be a loss; tarballs are strongly inferior to proper packages for various reasons.)

More broadly, when a distribution creates a package for something, they absolutely should inform themselves about how the thing is currently packaged, distributed, installed, and used on their distribution, if it is, and how it will be used and evolve in the future. Fundamentally, a distribution should be creating useful packages of programs, and being informed is part of that. Blindly grabbing a release of something and packaging it as an official distribution package is not necessarily creating a useful thing for anyone, either current users or potential future users who might install the distribution's package. Making a good, useful package fundamentally requires understanding things like how the upstream distributes things, what their release schedule is like, how they support old releases (if they do), and so on. It cannot be done blindly, even in cases where the upstream is not already providing its own packages.

(For example, if you package and freeze a version of something that will have that version abandoned immediately by the upstream and not have fixes, security updates and so on backported by you, you are not creating a useful package; instead, you're creating a dangerous one. In some cases this means that you cannot create a distribution package that is both in compliance with distribution packaging policies and useful to your users; in that case, you should not package it at all. If users keep asking, set up a web page for 'why we cannot provide a package for this'.)

PS: Some of this is moot if the upstream does not distribute their own pre-built binaries, but even then you really want to know the upstream's release schedule, length of version support, degree of version to version change, and so on. If the upstream believes in routine significant change, no support of old versions, and frequent releases, you probably do not want to touch that minefield. In the modern world, it is an unfortunate fact of life that not every upstream project is suitable for being packaged by distributions, even if this leaves your users to deal with the problem themselves. It's better to be honest about the upstream project being incompatible with what your users expect from your packages.

PackagingMustBeInformed written at 23:57:21; Add Comment

2019-05-27

Something that Linux distributions should not do when packaging things

Right now I am a bit unhappy at Fedora for a specific packaging situation, so let me tell you a little story of what I, as a system administrator, would really like distributions to not do.

For reasons beyond the scope of this blog entry, I run a Prometheus and Grafana setup on both my home and office Fedora Linux machines (among other things, it gives me a place to test out various things involving them). When I set this up, I used the official upstream versions of both, because I needed to match what we were running (or soon would be). The Grafana people supply Grafana in a variety of package formats, and because Grafana has a bunch of files and paths I opted to use their RPM package instead of their tarball. The Grafana people give their RPM package the package name of 'grafana', which is perfectly reasonable of them.

(We use the .deb on our Ubuntu 18.04 based production server for the same reason. Life is too short to spend patiently setting tons of command line switches or configuration file paths to tell something where to find all of its bits when the upstream provides a nice pre-packaged artifact.)

Recently, Fedora decided to package Grafana themselves (as a RPM), and they called this RPM package 'grafana'. Since the two different packages are different versions of the same thing as far as package management tools are concerned, Fedora basically took over the 'grafana' package name from Grafana. This caused my systems to offer to upgrade me from the Grafana.com 'grafana-6.1.5-1' package to the Fedora 'grafana-6.1.6-1.fc29' one, which I actually did after taking reasonable steps to make sure that the Fedora version of 6.1.6 was compatible with the file layouts and so on from the Grafana version of 6.1.5.

So far, I have no objection to what Fedora did. They provided the latest released version of Grafana, and their new package was a drop-in replacement for the upstream Grafana RPM. The problem is what happened next, which is that the Grafana people released Grafana 6.2 on May 22nd (cf) and currently there is no sign of any update to the Fedora package (the Bodhi page for grafana has no activity since 6.1.6, for example). At this point it is unclear to me whether Fedora has any plans to update from 6.1.6 at all; perhaps they have decided to freeze on this initial version.

Why is this a problem? It's simple. If you're going to take over a package name from the upstream, you should keep up with the upstream releases. If you take over a package name and don't keep up to date or keep up to date only sporadically, you cause all sorts of heartburn for system administrators who use the package. The least annoying future of this situation is that Fedora has abandoned Grafana at 6.1.6 and I am going to 'upgrade' it with the upstream 6.2.1, which will hopefully be a transparent replacement and not blow up in my face. The most annoying future is that Fedora and Grafana keep ping-ponging versions back and forth, which will make 'dnf upgrade' into a minefield (because it will frequently try to give me a 'grafana' upgrade that I don't want and that would be dangerous to accept). And of course this situation turns Fedora version upgrades into their own minefield, since now I risk an upgrade to Fedora 30 actually reverting the 'grafana' package version on me.

You can hardly miss that Grafana.com already supplies a 'grafana' RPM; it's right there on their download page. In this situation I feel that the correct thing for a Linux distribution to do is to pick another package name, one that doesn't clash with the upstream's established packaging. If you can't stand doing this, don't package the software at all.

(Fedora's packaging of Prometheus itself is fairly amusing in a terrible way, since they only provide the extremely obsolete 1.8.0 release (which is no longer supported upstream or really by anyone). Prometheus 2.x is a major improvement that everyone should be using, and 2.0.0 was released way back in November of 2017, more than a year and a half ago. At this point, Fedora should just remove their Prometheus packages from the next version of Fedora.)

PackageNameClashProblem written at 21:57:51; Add Comment

2019-05-15

An infrequent odd kernel panic on our Ubuntu 18.04 fileservers

I have in the past talked about our shiny new Prometheus based metrics system and some interesting things we've seen due to its metrics, especially its per-host system metrics (collected through node_exporter, its host agent). What I haven't mentioned is that we're not running the host agent on one important group of our machines, namely our new Linux fileservers. This isn't because we don't care about metrics from those machines. It's because when we do run the host agent, we get very infrequent but repeating kernel panics, or I should say what seems to be a single panic.

The panic we see is this:

BUG: unable to handle kernel NULL pointer dereference at 000000000000000c
IP: __atime_needs_update+0x5/0x190
[...]
CPU: 7 PID: 10553 Comm: node_exporter Tainted: P  O  4.15.0-30-generic #32-Ubuntu
RIP: 0010:__atime_needs_update+0x5/0x190
[...]
Call Trace:
 ? link_path_walk+0x3e4/0x5a0
 ? path_init+0x177/0x2f0
 path_openat+0xe4/0x1770
[... sometimes bogus frames here ...]
 do_filp_open+0x9b/0x110
 ? __check_object_size+0xaf/0x1b0
 do_sys_open+0x1bb/0x2c0
 ? do_sys_open+0x1bb/0x2c0
 ? _cond_resched+0x19/0x40
 SyS_openat+0x14/0x20
 do_syscall_64+0x73/0x130
 entry_SYSCALL_64_after_hwframe+0x3d/0xa2

The start and the end of this call trace is consistent between panics; the middle sometimes has various peculiar and apparently bogus frames.

This panic occurs only on our ZFS fileservers, which are a small minority of the servers where we have the host agent running, and generally only after server-months of operation (including intensive pounding of the host agent on test servers). The three obvious things that are different about our ZFS fileservers are that they are our only machines with this particular set of SuperMicro hardware, they are the only machines with ZFS, and they are our only 18.04 NFS servers. However, this panic has happened on a test server with no ZFS pools and no NFS exports.

If I believe the consistent portions of the call trace, this panic happens while following a symlink during an openat() system call. I have strace'd the Prometheus host agent and there turn out to not be very many such things it opens; my notes say /proc/mounts, /proc/net, some things under /sys/class/hwmon, and some things under /sys/devices/system/cpu/cpu*/cpufreq. Of these, the /proc entries are looked at on all machines and seem unlikely suspects, while the hwmon stuff is definitely suspect. In fact we have another machine where trying to look at those entries produces constant kernel reports about ACPI problems:

ACPI Error: SMBus/IPMI/GenericSerialBus write requires Buffer of length 66, found length 32 (20170831/exfield-427)
ACPI Error: Method parse/execution failed \_SB.PMI0._PMM, AE_AML_BUFFER_LIMIT (20170831/psparse-550)
ACPI Exception: AE_AML_BUFFER_LIMIT, Evaluating _PMM (20170831/power_meter-338)

(ACPI is an area I suspect because it's part of the BIOS and so varies from system to system.)

However, it doesn't seem to be the hwmon stuff alone. You can tell the Prometheus host agent to not try to look at it (with a command line argument), and while running the host agent in this mode, we have had a crash on one of our test fileservers. Based on /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver being 'acpi-cpufreq', I suspect that ACPI is involved in this area as well on these machines.

(There is some documentation on kernel CPU frequency stuff in user-guide.txt, in the cpu-freq kernel documentation directory.)

Even if motherboard-specific ACPI stuff is what triggers this panic, the panic itself is worryingly mysterious. The actual panic is clearly a dereference of a NULL pointer, as __atime_needs_update attempts to refer to a struct field (cf). Based on this happening very early on in the function and the code involved, this is probably the path argument, since this is used almost immediately. However, I can't entirely follow how we get there, especially with a NULL path. Some of the context of the call is relatively clear; the call path probably runs from part of link_path_walk through an inlined call to get_link to a mysteriously not listed touch_atime and then to __atime_needs_update.

(I would be more confident of this if I knew how to use gdb to disassemble bits of the Ubuntu kernel to verify and map back the reported raw byte positions in these functions.)

I admit that this is the kind of situation that makes me yearn for crash dumps. Having a kernel crash dump to poke around in might well give us a better understanding of what's going on, possibly including a better call trace. Unfortunately, even if it's theoretically possible to get kernel crash dumps out of Linux with the right setup, it's not standard for installers to actually set that up or offer it as a choice, so as a practical matter it's mostly not there.

PS: We haven't tried upgrading the kernel version we're using on the fileservers because stable fileservers are more important to us than host metrics, and we know they're stable on this specific kernel because that's what we did extensive testing on. We might consider upgrading if we could find a specific bug fix for this, but so far I haven't spotted any smoking guns.

Ubuntu1804OddKernelPanic written at 02:00:57; Add Comment

2019-05-13

Fixing Alpine to work over NFS on Ubuntu 18.04 (and probably other modern Linuxes)

Last September we discovered that the Ubuntu 18.04 LTS version of Alpine was badly broken when used over NFS, and eventually traced this to a general issue with NFS in Ubuntu 18.04's kernel and probably all modern Linux kernels. Initially we thought that this was a bug in the Linux NFS client, but after discussions on the Linux NFS mailing list it appears that this is a feature, although I was unable to get clarity on what NFS client behavior is guaranteed in general. To cut a long story short, in the end we were able to find a way to change Alpine to fix our problem, at least on Ubuntu 18.04's normal 4.15.x based server kernel.

To explain the fix, I'll start with a short version of how to reproduce the problem:

  1. Create a file that is not an exact multiple of 4K in size.
  2. On a NFS client, open the file read-write and read all the way to the end. Keep the file open.
  3. On another machine, append to the file.
  4. In your program on the first NFS client, wait until stat() says the file's size has changed.
  5. Try to read the new data. The new data from the end of the old file up to the next 4 KB boundary will be zero bytes.

The general magic fix is to flock() the file after stat() says the file size has changed; in other words, flock() between steps four and five. If you do this, the 18.04 kernel NFS client code magically forgets whatever 'bad' state information it has cached and (re-)reads the real data from the NFS server. It's possible that other variations of this sequence might work, such as flock()'ing after you've finished reading the file but before it changes, but we haven't tested them.
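
To make this concrete, here is a minimal Python sketch of the client side of the sequence (it's an illustration only, not the nfsnulls.py test program mentioned later, and the file path is a made-up example); it opens a file read-write, reads to the end, waits for another machine to append to it, and then flock()s before reading the new data:

  import fcntl
  import os
  import time

  # Hypothetical NFS path; the file should not be an exact multiple of 4K in size.
  PATH = "/nfs/mail/testbox"

  fd = os.open(PATH, os.O_RDWR)
  old_size = os.fstat(fd).st_size
  os.pread(fd, old_size, 0)        # read all the way to the end; keep the file open

  # Wait until stat() says the file has grown (someone appended on another machine).
  while os.stat(PATH).st_size == old_size:
      time.sleep(1)

  # The workaround: flock() after the size change and before reading the new data.
  # Without it, the bytes from old_size up to the next 4 KB boundary come back as
  # NULs on the Ubuntu 18.04 4.15.x kernel NFS client.
  fcntl.flock(fd, fcntl.LOCK_EX)
  new_data = os.pread(fd, os.stat(PATH).st_size - old_size, old_size)
  fcntl.flock(fd, fcntl.LOCK_UN)
  os.close(fd)
  print("NUL bytes in new data:", new_data.count(b"\x00"))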

(We haven't tested the flock() behavior on later kernels, either 18.04's HWE kernels or others, and as mentioned I could not get the Linux NFS people to say whether or not this is guaranteed behavior or just a coincidence of the current implementation, as the non-flock() version working properly was.)

Even better, Alpine turns out to already flock() mailboxes in general. The reason this is not happening here is that Alpine specifically disables flock() on NFS filesystems on Linux (see flocklnx.c) due to a bug that has now been fixed for more than ten years (really). So all we need to do to Alpine to fix the whole issue (on kernels where the flock() fix works in general) is to take out the check for being on a NFS filesystem and always flock() mailboxes regardless of what filesystem they're on, which as a bonus makes the code simpler (and avoids a fstatfs()).

To save people the effort of developing a patch for this themselves, I have added the patch we use for the Ubuntu 18.04 LTS Alpine package to my directory of stuff on this issue; you want cslab-flock.patch. If you build an updated Alpine package, you will want to put a dpkg hold on Alpine after installing your version, because an errant update to a new version of the stock package would create serious problems.

If you're going to use this on something other than Ubuntu 18.04 LTS, you should use my nfsnulls.py test program to test that the problem exists (and that you can reproduce it) and to verify that using flock() fixes it (with the --flock command line argument). I would welcome reports on what happens on kernels more recent than Ubuntu 18.04's 4.15.x.

For reasons beyond the scope of this entry, so far we have not attempted to report this issue or propagate this change to Ubuntu's official Alpine package, Debian's official Alpine package, or the upstream Alpine project and its git repository. I welcome other, more energetic people doing so. My personal view is that using flock() on Linux NFS mounts is the right behavior in Alpine in general, entirely independent of this bug; Alpine flock()s on all other filesystems on Linux, and disabled it on NFS only due to a very old bug (from the days before flock() even worked on NFS).

(I'm writing this entry partly because we've received a few queries about this Alpine issue, because other people turn out to have run into the problem too. Somewhat to my surprise, I never explicitly wrote up our solution, so here it is.)

AlpineOverNFSFix written at 22:52:24; Add Comment

2019-05-12

Committed address space versus active anonymous pages in Linux: a mystery

In Linux, there are at least two things that can happen when your system runs out of memory (or at least the kernel thinks it has); the kernel can activate the Out-of-Memory killer, killing one or more processes but leaving the rest alone, or it can start denying new allocation requests, which causes a random assortment of programs to start failing. As I found out recently, systems with strict overcommit on can still trigger the OOM killer, depending on your settings for how much memory the system uses (see here). Normally, systems with strict overcommit turned off don't get themselves into situations where they're so out of memory that they start denying allocation requests.

Starting early this morning, some of our compute servers have periodically been reporting 'out of memory, cannot allocate/fork/etc' sorts of errors. There are two things that make this unusual. The first is that these are single-user compute servers, where we turn strict overcommit off; as a result, I would expect them to trigger the OOM killer but never actually run out of memory and start refusing allocations. The second is that according to all of the data I have, these machines have only modest and flat use of committed address space, which is my usual proxy for 'how much memory programs have allocated'.

(The kernel tracks committed address space even when strict overcommit is off, and while it doesn't necessarily represent how much memory programs actually need, it should normally be an upper bound on how much they can use. In fact until today I would have asserted that it definitely was.)

These machines have 96 GB of RAM, and during an incident I can see the committed address space stay constant at 3.7 GB while /proc/meminfo's MemAvailable declines to 0 and its Active and Active(anon) numbers climb up to 90 GB or so. I find this quite mysterious, because as far as I understand Linux memory accounting, it should be impossible to have anonymous pages that are not part of the committed address space. You get anonymous pages by operations such as a MAP_ANONYMOUS mmap(), and those are exactly the operations that the kernel is supposed to carefully account for in working out Committed_AS, for obvious reasons.
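
As an illustration of this accounting (just a sketch, nothing rigorous), a private, writable anonymous mapping is normally charged to Committed_AS at mmap() time, before any of its pages are touched:

  import mmap

  def committed_as_kb():
      # Committed_AS is reported in kB in /proc/meminfo.
      with open("/proc/meminfo") as f:
          for line in f:
              if line.startswith("Committed_AS:"):
                  return int(line.split()[1])

  SIZE = 1 << 30   # 1 GiB

  before = committed_as_kb()
  # A private, writable anonymous mapping is charged against the commit
  # accounting at mmap() time, even though we never touch any of its pages.
  m = mmap.mmap(-1, SIZE, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
  after = committed_as_kb()
  print("Committed_AS grew by roughly %d MB" % ((after - before) // 1024))
  m.close()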

Inspecting /proc/<pid>/smaps and other data for the sole gigantic Python process currently running on such a machine says that it has a resident set size of 91 GB, a significant number of 'rw-' anonymous mappings (roughly 96 GB worth, mostly in 64 MB mappings), and on hand inspection, a surprising number of those mappings have a VmFlags: field that does not have the ac flag that apparently is associated with an 'accountable area' (per the proc(5) manpage and other documentation). I don't know if not having an ac flag causes an anonymous mapping to not count against committed address space, but it seems plausible, or at least the best theory I currently have.
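
If you want to repeat this sort of inspection, here's a rough Python sketch that walks a process's /proc/<pid>/smaps and tallies its writable anonymous mappings by whether or not their VmFlags include ac:

  import sys

  pid = sys.argv[1]
  counts = {True: [0, 0], False: [0, 0]}   # has 'ac' -> [mappings, kB]
  is_anon_rw = False
  size_kb = 0

  with open("/proc/%s/smaps" % pid) as f:
      for line in f:
          fields = line.split()
          if not fields[0].endswith(":"):
              # Mapping header: "start-end perms offset dev inode [pathname]".
              perms = fields[1]
              pathname = fields[5] if len(fields) > 5 else ""
              is_anon_rw = perms.startswith("rw-") and pathname == ""
          elif fields[0] == "Size:":
              size_kb = int(fields[1])
          elif fields[0] == "VmFlags:" and is_anon_rw:
              has_ac = "ac" in fields[1:]
              counts[has_ac][0] += 1
              counts[has_ac][1] += size_kb

  for has_ac in (True, False):
      n, kb = counts[has_ac]
      print("anon rw- mappings with ac=%s: %d (%d MB)" % (has_ac, n, kb // 1024))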

(It would help if I could create such mappings myself to test what happens to the committed address space and so on, but so far I have only a vague theory that perhaps they can be produced through use of mremap() with MAP_PRIVATE and MREMAP_MAYMOVE on a MAP_SHARED region. This is where I need to write a C test program, because sadly I don't think I can do this through something like Python. Python can do a lot of direct OS syscall testing, but playing around with memory remapping is asking a bit much of it.)

CommittedASVsActiveAnon written at 00:22:15; Add Comment

2019-05-08

A Linux machine with a strict overcommit limit can still trigger the OOM killer

We've been running our general use compute servers with strict overcommit handling for total virtual memory for years, because on compute servers we feel we have to assume that if you ask for a lot of memory, you're going to use it for your compute job. As we discovered last fall, hitting the strict overcommit limit doesn't trigger the OOM killer, which can be inconvenient since instead all sorts of random processes start failing because they can't get any more memory. However, we've recently also discovered that our machines with strict overcommit turned on can still sometimes trigger the OOM killer.

At first this made no sense to me and I thought that something was wrong, but then I realized what is probably going on. You see, strict overcommit really has two parts, although we don't often think about the second one; there's the setting itself, ie having vm.overcommit_memory be 2, and then how much your commit limit is, set by vm.overcommit_ratio as your swap space plus some percentage of RAM. Because we couldn't find an overcommit percentage that worked for us across our disparate fleet of compute servers with very varying amounts of RAM, we set this to '100' some years ago, theoretically allowing our machines with strict overcommit to use all of RAM plus swap space. Of course, this is not actually possible in practice, because the kernel needs some amount of memory to operate itself; how much memory is unpredictable and possibly unknowable.
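
The commit limit arithmetic itself is easy to check; here's a small Python sketch that computes it from /proc/meminfo and /proc/sys/vm/overcommit_ratio the same way the kernel does for the CommitLimit line (swap plus overcommit_ratio percent of RAM, ignoring hugetlb pages):

  def meminfo_kb(field):
      # Everything we care about in /proc/meminfo is reported in kB.
      with open("/proc/meminfo") as f:
          for line in f:
              if line.startswith(field + ":"):
                  return int(line.split()[1])

  with open("/proc/sys/vm/overcommit_ratio") as f:
      ratio = int(f.read())

  limit_kb = meminfo_kb("SwapTotal") + meminfo_kb("MemTotal") * ratio // 100
  print("computed commit limit: %d MB" % (limit_kb // 1024))
  print("kernel's CommitLimit:  %d MB" % (meminfo_kb("CommitLimit") // 1024))
  print("currently committed:   %d MB" % (meminfo_kb("Committed_AS") // 1024))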

This gap between what we set and what's actually possible creates three states the system can wind up in. If you ask for as much memory as you can allocate (or in general enough memory), you run the system into the strict overcommit limit; either your request fails immediately or other processes start failing later when their memory allocation requests fail. If you don't ask for too much memory, everything is happy; what you asked for plus what the kernel needs fits into RAM and swap space. But if you ask for just the right large amount of memory, you push the system into a narrow middle ground; you're under the strict overcommit limit so your allocations succeed, but over what the kernel can actually provide, so when processes start trying to use enough memory, the kernel will trigger the OOM killer.

There is probably no good way for us to avoid this, so I suspect we'll just live with the little surprise of the OOM killer triggering every so often and likely terminating a RAM-heavy compute process. I don't think it happens very often, and these days we have a raft of single-user compute servers that avoid the problem.

Sidebar: The problems with attempting to turn down the memory limit

First, we don't have any idea how much memory we'd need to reserve for the kernel to avoid OOM. Being cautious here means that some of the RAM will go idle unless we add a bunch of swap space (and risk death through swap thrashing).

Further, not only would the vm.overcommit_ratio setting be machine specific and have to be derived on the fly from the amount of memory, but it's probably too coarse-grained. 1% of RAM on a 256 GB machine is about 2.5 GB, although I suppose perhaps the kernel might need that much reserved to avoid OOM. We could switch to using the more recent vm.overcommit_kbytes (cf), but since its value is how much RAM to allow instead of how much RAM to reserve for the kernel, we would definitely have to make it machine specific and derived from how much RAM is visible when the machine boots.
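
If we ever did go down the vm.overcommit_kbytes road, the per-machine derivation would look something like this hypothetical boot-time sketch (the 2 GB kernel reserve here is a pure guess, which is exactly the problem):

  import subprocess

  # Hypothetical: reserve a guessed 2 GB for the kernel and allow the rest.
  KERNEL_RESERVE_KB = 2 * 1024 * 1024

  def meminfo_kb(field):
      with open("/proc/meminfo") as f:
          for line in f:
              if line.startswith(field + ":"):
                  return int(line.split()[1])

  allow_kb = meminfo_kb("MemTotal") - KERNEL_RESERVE_KB
  # Needs root; sets how much RAM (plus swap) committed address space may use.
  subprocess.run(["sysctl", "vm.overcommit_kbytes=%d" % allow_kb], check=True)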

On the whole, living with the possibility of OOM is easier and less troublesome.

StrictOvercommitCanOOM written at 00:58:04; Add Comment



This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.