Wandering Thoughts

2018-10-10

Even systemd services and dependencies are not self-documenting

I tweeted:

I'm sure that past-me had a good reason for configuring my Wireguard tunnel to only start during boot after the VMWare modules had been loaded. I just wish he'd written it down for present-me.

Systemd units are really easy to write, straightforward to read, and quite easy to hack on and modify. But, just like everything else in system administration, they aren't really self-documenting. Systemd units will generally tell you clearly what they're doing, but they won't (and can't) tell you why you set them up that way, and one of the places where this can be very acute is in what their dependencies are. Sometimes those dependencies are entirely obvious, and sometimes they are sort of obvious and also sort of obviously superstitious. But sometimes, as in this case, they are outright mysterious, and then your future self (if no one else) is going to have a problem.

(Systemd dependencies are often superstitious because systemd generally still lacks clear documentation of its standard dependencies, of the 'depend on this if you want to be started only when <X> is ready' sort. Admittedly, some of this is because the systemd people disagree with everyone else about how to handle certain sorts of issues, like services that want to activate only when networking is nicely set up and the machine has all its configured static IP addresses or has acquired its IP address via DHCP.)

Dependencies are also dangerous for this because it is so easy to add another one. If you're in a hurry and you're slapping dependencies on in an attempt to get something to work right, adding a comment to explain yourself is proportionally much more work than it would be if adding the dependency itself already took a fair bit of work. Since it's so much extra work, it's that much more tempting to not write a comment explaining it, especially if you're in a hurry or can talk yourself into believing that it's obvious (or both). I'm going to have to be on the watch for this, and in general I should take more care to document my systemd dependency additions and other modifications in the future.
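
Even a minimal drop-in with a one-line comment would have saved present-me the mystery. Something like this sketch, where the unit and file names are made up for illustration and are not my actual configuration:

# /etc/systemd/system/wg-tunnel.service.d/vmware.conf (hypothetical)
[Unit]
# 2018: theory is that VMWare's network startup takes down interfaces
# it doesn't recognize, including 'wireguard' ones, so only start our
# tunnel after VMWare is fully up.
After=vmware.service

Two comment lines, and the mystery never happens.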

(This is one of the things that version controlled configuration files are good for. Sooner or later you'll have to write a commit message for your change, and when you do, hopefully you'll get pushed to explain it.)

As for this particular case, I believe that what happened is that I added the VMWare dependency back when I was having mysterious Wireguard issues on boot because, it eventually turned out, I had forgotten to set some important .service options. When I was working on the issue, one of my theories was that Wireguard was setting up its networking, then VMWare's own networking stuff was starting up and taking Wireguard's interface down because the VMWare code didn't recognize this 'wireguard' type interface. So I set a dependency so that Wireguard would start after VMWare, then when I found the real problem I never went back to remove the spurious dependency.

(I uncovered this issue today as part of trying to make my machine boot faster, which is partially achieved now.)

DocumentStartupDependencies written at 01:37:27; Add Comment

2018-10-09

Something systemd is missing for diagnosing oddly slow boots

In yesterday's entry, I mentioned that my office machine has always been oddly slow to boot (and this goes back years, as far back as when I switched to systemd's networkd). Over the years I have made various attempts to figure out why this was so and what I could do about it, generally not getting anywhere with them. As part of this, I have of course poked around at what systemd-analyze could tell me, looking at things like 'systemd-analyze blame', 'systemd-analyze critical-path', and 'systemd-analyze plot'. None of them have given me any useful understanding of what's going on.

(One of the things that drives this is that I have a very similar Fedora system at home, and it has always booted visibly faster. This has left me with not just the general question of 'why does my office machine boot what feels like slowly' but also the specific question of 'what makes my office machine slower than my home machine'. However, I've just realized that I've never systematically explored the details of this difference, for example by carefully comparing 'systemd-analyze blame' on both machines.)
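
If I ever do that systematic comparison, one way to line the two machines up would be something like this sketch, which assumes the usual 'time unit' format of 'systemd-analyze blame' output:

# on each machine, put the per-unit times into 'unit time' order:
systemd-analyze blame | awk '{print $2, $1}' | sort >/tmp/office-blame
# ... the same on the home machine, into /tmp/home-blame, then:
join -a1 -a2 /tmp/office-blame /tmp/home-blame

Joining on the unit name puts each unit's two times side by side, so the big differences should stand out.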

Based on 'systemd-analyze blame', I suspect that some of the delay is happening inside the startup of specific units and even specific programs (for example, starting up ZFS seems to take significantly longer on my office machine than at home). Diagnosing this is something that systemd can only partially help with, although it could do more than it currently does. It would be quite helpful to get the timings of each of the separate Exec steps in a single .service unit, so that I could see, for example, which bit of Unbound's multi-step .service is making it take over 12 seconds to start.

(In Unbound's case I suspect that it's 'unbound-anchor' that's so slow, but it would be nice to know for sure. This information may inconsistently be available in the giant blob of stuff you get from 'systemd-analyze dump'; it is on one machine and it isn't on the other. It may depend on how long ago the system was booted.)
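One partial possibility that I've looked at: 'systemctl show' will dump per-command information for a unit, and on my machines the ExecStart and ExecStartPre entries include start_time and stop_time fields (when systemd still remembers them):

systemctl show -p ExecStartPre -p ExecStart unbound.service

Whether those fields are actually filled in seems to be about as inconsistent as the dump output.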

However, there's something important that systemd could do that it doesn't, which is record what it was that finally allowed (or caused) a service to be started. Every boot-time unit has a set of dependencies and preconditions, and it only starts when all of them are ready; the whole thing forms an ordered graph. But right now, as far as I can tell, systemd does not explicitly record the information required to reconstruct the critical path through the graph for any given service. To put it another way, you can't reliably ask 'why didn't this service start before then', at least not easily. In an ideal world you could load up a plot of your entire boot, click on any unit, and have at least its critical predecessor shown.

(In theory 'systemd-analyze critical-chain' might answer this question. In practice it gives me answers that are clearly off; for example, it currently claims that one of the services on the critical chain for getty@tty1.service started more than ten seconds after it says that the getty did. Also, the manual page specifically mentions various cautions.)

In a related issue, I also wish that systemd-analyze would give you the unit start time and duration information in text form, instead of only embedded in an SVG through 'systemd-analyze plot'. This would make it easier to pick through and do my own analysis of this sort of thing, and the information is clearly available. It's just that systemd-analyze has apparently decided that the only form we could want it in is an SVG.
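
In the meantime, the information can be crudely scraped out of the SVG itself, since it's text. This sketch assumes that the plot's text labels look like 'foo.service (1.234s)':

systemd-analyze plot >/tmp/boot.svg
# pull out just the text labels, which carry the per-unit times:
grep -o '<text[^>]*>[^<]*</text>' /tmp/boot.svg | sed 's/<[^>]*>//g' | grep '\.service ('

This is obviously fragile, which is rather my point.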

Sidebar: Deciding on what you care about for startup duration

If you run 'systemd-analyze critical-chain', it gives you the critical chain for multi-user.target by default. But is this actually what you care about? In my case, I've decided that one thing that I care about is when I can log in to my just-rebooted machine, which seems to happen well before multi-user.target is considered officially finished. That's why I've also been looking at getty@tty1.service, which is a reasonable indication of reaching this point.
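
Getting the critical chain for that instead of the default target is straightforward:

systemd-analyze critical-chain getty@tty1.service

(Subject to the accuracy cautions mentioned earlier, of course.)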

SystemdBootTimingWish written at 00:03:41; Add Comment

2018-10-07

It's good to check systemd for new boot-time things every so often

I tweeted:

Oh. The reason my office workstation always took a somewhat unusually long time to (re)boot is that I had an ancient, nonexistent iSCSI target configured, so finding all disk devices had to wait for the iSCSI connection attempt to time out. That was silly.

This turns out to not quite be the case. While my ancient iSCSI target configuration had been there for a very long time, I only installed the iSCSI initiator programs recently, on September 4th, when I seem to have installed a bunch of QEMU-related packages. The dependencies pulled in the Fedora iSCSI initiator package, and that package decided that it should enable itself by default.

(Specifically it enabled the iscsi.service unit, which attempts to automatically login to all suitably configured iSCSI targets. Also, I'm not sure my diagnosis in the tweet is correct, although I saw it delaying the boot in general; I just jumped to that explanation when I made the tweet.)
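
Once a unit like this is spotted, turning it back off is a one-liner that disables it and stops any running instance in one step:

systemctl disable --now iscsi.service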

What this experience taught me is that it's a good idea to go look at what boot-time things are enabled every so often, to spot new ones that I don't want. On Fedora, clearly a certain number of things feel free to activate themselves on install, on the assumption that they're harmless (I may disagree).

I think that the easiest and most official way is to go look for recently changed things in /etc/systemd/system and its important .wants subdirectories, especially multi-user.target.wants and sockets.target.wants. Conveniently, enabling units creates symlinks, and the modification time of those symlinks is when the unit was enabled. A quick 'ls -lt' will thus turn up any recent new activations.
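
Concretely, something like this (which .wants directories matter can vary from system to system):

cd /etc/systemd/system
ls -lt multi-user.target.wants sockets.target.wants | head -20

Anything with a suspiciously recent timestamp is worth a second look.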

On the one hand, this feels sort of like a hack. On the other hand, it's a perfectly good Unix way of answering this question in a system where state is represented in the filesystem in straightforward ways. There's no need to provide an API and a command and so on when existing Unix tools can already answer the question perfectly well.

Now that I've stubbed my toe on this, I'm going to see if I can remember to check for new boot-time services every so often. I've already turned up one surprise on my office workstation, where sssd.service got enabled on August 8th (which was when I upgraded the machine from Fedora 27 to Fedora 28, although this didn't happen on my home machine when I upgraded it). This is probably harmless, so I've let it be so far.

(Since the iSCSI stuff has only been there for a month or so, it's not the long term cause of my slower than expected boots (although it may have been delaying them lately). But that's a discussion for another entry.)

PS: This isn't specific to systemd, of course. Any init system is subject to having new or updated packages decide that they absolutely should be enabled. Systemd does give you a somewhat richer set of paths to this due to socket activation, and any dependency-based system can hide extra activations because they can happen concurrently and not obviously delay your boot (most of the time).

SystemdCheckNewServices written at 22:48:06; Add Comment

2018-09-30

Why updating my Fedora kernels is a complicated multi-step affair

I tweeted:

Ah, the delicate yak shaving dance of updating my Fedora system's kernel. Life will be easier when WireGuard is in the upstream kernel. (It would be even easier if ZFS was in the upstream, but that's sadly probably never going to happen unless Oracle gets very strange.)

You might wonder why updating the kernel is such a complicated thing for me. Part of the answer is that updating the kernel is, in one sense, not complicated at all. If all I want to do on my home machine is update the kernel by itself, all I need to do is run 'dnf update "kernel*"' and then wait a while as the RPMs are installed and then DKMS rebuilds the kernel modules for WireGuard and ZFS on Linux for me. In this sense, having both upstream in the main kernel wouldn't really change anything (although it would mean no DKMS rebuilds).

But usually this is not all that I want to do, because I'm almost always taking the opportunity of a kernel update to also update WireGuard and ZFS on Linux. I get WireGuard from its Fedora COPR repo, where it updates pretty frequently; there's almost always at least one update by the time there's a new Fedora kernel. ZFS on Linux makes new releases only infrequently, but I don't run release versions; instead I use the latest development versions straight from the ZoL Github repo, where it sees ongoing development.

(It appears that WireGuard currently updates roughly once a week. There have been four updates this September, dated the 4th, the 10th, the 18th, and the 25th.)

Once I'm updating additional DKMS-based packages, DKMS adds its own wrinkle here, because you need to force things to update in the right order. If you install a new kernel in Fedora, part of the kernel update process ensures that DKMS will rebuild all current DKMS modules for the new kernel. But if you install or update a DKMS-based package, the Fedora setup only has DKMS (re)build kernel modules for the current kernel, not for any kernels that you may have installed and be waiting to reboot into. So if you're going to update some DKMS packages and also your kernel, you must update the DKMS packages first and the kernel second.

So my whole update process looks like this:

  1. Sync up my local ZFS on Linux build repo and build new RPMs.
  2. Use 'dnf update zfs/*.rpm' to install the new version, which (slowly) builds the new DKMS kernel modules for my current kernel.
  3. Do 'dnf update "wireguard*"' to update the WireGuard packages, which will build its new kernel module for my current kernel.
  4. Finally do 'dnf update "kernel*"', which installs the new kernel and has DKMS rebuild the just-installed new ZFS and WireGuard kernel modules for the new kernel.
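
As a sketch, the whole dance could be scripted something like this; the ZFS repo location and the RPM build step are assumptions about my local setup, not anything standard:

#!/bin/sh
set -e
cd "$HOME/src/zfs" && git pull
make rpm                      # step 1: build new ZFS on Linux RPMs
dnf update ./*.rpm            # step 2: DKMS builds ZFS modules for the current kernel
dnf update "wireguard*"       # step 3: ditto for the new WireGuard module
dnf update "kernel*"          # step 4: new kernel; DKMS rebuilds both for it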

When WireGuard goes into the kernel, what will be different is that it will stop changing so much, or at least the version I use will stop changing so much, because it will be whatever version is in the Fedora kernel. The same would be true if ZFS was somehow put into the Linux kernel; I would stop running the latest development version and just stick with whatever the Fedora kernel had.

So, really, I've done this to myself. I could run a released version of ZFS on Linux, which would basically stop that from updating all the time, and I could probably just freeze my WireGuard version for a month or two if I wanted to.

(Updating the kernel on my work machine takes an extra step because I have not yet put in the effort to DKMS-ize the special kernel modules for VMWare Workstation, so I have to run a couple more commands to rebuild them every time. These days I can at least do this before I reboot into the new kernel; I used to have to reboot into the new kernel, rebuild the modules, and then reboot again to get everything fully activated with the right magic.)

MyKernelUpdateSteps written at 00:21:55; Add Comment

2018-09-28

Using a very old ZFS filesystem can give you a kernel panic on Linux

I recently wrote an entry saying that how you migrate ZFS filesystems matters, because if you use ZFS's native features for copying filesystems around a great deal of ZFS's internal on-disk data is retained completely intact. If that data has problems, you have just replicated those problems from your old environment to your new one. Unfortunately this turns out not to be a theoretical problem; some people have hit a situation where having used 'zfs send' to copy very old ZFS filesystems on to a modern Linux kernel has now given them a filesystem with files that cause kernel panics when accessed.

The full details are in ZFS on Linux issue #7910. Summarized, it starts with the fact that ZFS has had a number of different ways to store ACLs in ZFS filesystems, including a very old one that was used (we think) relatively early on in Solaris (sufficiently early that it seems to predate ZFS system attributes, which are from 2010). This old ACL format embeds sufficiently small ACLs into their own section of ZFS dnodes, where there is room for 72 bytes of ACL data, and the ACL format also has a (byte) size for how many of those 72 bytes are actually used by 'ACEs' (Access Control Entries, apparently).

Some of these very old Solaris ZFS filesystems contain files with old format ACLs that claim that they both use embedded ACLs and their ACLs are longer than 72 bytes. Given that embedded ACLs can only be at most 72 bytes long, this is impossible, but that's what the on-disk data says (one example claims to have a 112 byte set of ACEs). The ZFS code believes this sufficiently to try to copy however many bytes from the fixed size, 72-byte embedded ACE area to an ACL buffer when you try to access a file and ZFS has to check its ACL. Solaris, OmniOS, and old Linux kernels did not notice that this memory copy was copying some number of random bytes at the end (anything over 72 bytes), and so just sailed blithely onward to checking the resulting 'ACL' (which is some amount of real ACL and some amount of random bytes), which apparently generally worked well enough that no one noticed anything wrong. Modern Linux kernels are generally built with a special kernel configuration option that detects this sort of memory over-copying and panics as a safety and security precaution.
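
In schematic C, the problem pattern looks something like this; all of the names here are invented for illustration and are not the actual ZFS code:

#include <stdint.h>
#include <string.h>

#define EMBEDDED_ACE_ROOM 72    /* bytes of embedded ACE space in the dnode */

struct oldacl {
    uint32_t ace_bytes;                   /* on-disk claim of the ACE size */
    uint8_t  embedded[EMBEDDED_ACE_ROOM];
};

static void copy_acl(uint8_t *aclbuf, const struct oldacl *oa)
{
    /* If the on-disk ace_bytes claims, say, 112, this reads 40 bytes
       past the end of embedded[]. Old kernels sailed on regardless;
       a kernel built with the over-copy detection option discussed
       below notices the over-read and panics. */
    memcpy(aclbuf, oa->embedded, oa->ace_bytes);
}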

(Specifically, this is the FORTIFY_SOURCE kernel configuration option, which was apparently added in kernel 4.13, in September of 2017 or so, per Kees Cook's entry on security things in 4.13. Ubuntu 18.04 LTS has a kernel that is recent enough to contain this option and Ubuntu built their kernel with it turned on, while the normal 16.04 LTS server kernel is far too old to have it. As a result, some people upgrading from 16.04 to 18.04 hit this panic on their existing ZFS filesystems.)

Most people are extremely unlikely to run into this specific problem; it requires a decade-old ZFS filesystem that has some specific and apparently fairly rare malfunctions in it. But it makes an uncomfortably good illustration of how 'zfs send | zfs recv' will faithfully replicate even things that you don't want, and this can cause real problems if you're sufficiently unlucky.

(We're probably going to still use ZFS send streams for our filesystem migrations, even though some of our filesystems are old enough that they have files with these very old format ACLs. We've never actually used ACLs, so hopefully we're not going to have any files with too-large ones.)

PS: I expect that at some point there will be a proposed change to ZFS on Linux that avoids this kernel panic, but since there are a number of open questions about how best to handle this situation, it will probably take a while before one appears.

(This sort of elaborates on a tweet of mine, which I believe I made when I determined that some of our filesystems did have these very old format ACLs.)

ZFSOldFilesystemPanic written at 01:11:20; Add Comment

2018-09-25

A problem with unmounting FUSE mount points that are on NFS filesystems

If you have NFS filesystems with directories that are not world accessible, and people mount FUSE filesystems at spots under those directories, you are probably going to have problems in modern Linux kernels. Specifically, it is very difficult to unmount these FUSE mounts (even if the program providing the FUSE filesystem exits).

Here is what happens:

; ls -ld .; stat -f -c '%T' .
drwx--S--- 108 cks itdirgrp 242 Sep 25 22:55 ./
nfs
; sshfs cks@nosuchhost: demo
read: Connection reset by peer
fusermount: failed to chdir to /h/281/cks: Permission denied
; ls demo
ls: cannot access 'demo': Transport endpoint is not connected

[... become root ...]
# strace -e trace=umount2 /root/umount2 /h/281/cks/demo
umount2("/h/281/cks/demo", MNT_FORCE|MNT_DETACH) = -1 EACCES (Permission denied)

(umount2 here is a little program that does a umount2() on its argument; I'm using it to completely eliminate anything else that various programs are trying to do and failing at. fusermount and /bin/umount also fail, but in more elaborate ways.)

What appears to be happening here is that modern Linux kernels have decided that they will do a full lookup through the path you give them to unmount (I assume that they have good reasons for this). Since umount2() must be done as root, this path walking is done with root's permissions. On NFS mounts, UID 0 generally has no special privileges, so if the path to unmount goes through a restricted NFS directory, the kernel's traversal of the path will fail and the kernel will reject the umount2().

In an ideal world, the initial FUSE mount would have failed for the same reason, which would at least limit the damage. In this world, as we can see, the initial FUSE mount succeeds for some reason and you wind up with a stuck FUSE mount. This stuck FUSE mount will then block unmounting the NFS filesystem, because you can't unmount a filesystem that has a mount inside it.

There is a way around this but it requires a very special trick and I'm not certain it's going to work forever. The Linux kernel has an extra, non-portable notion of a 'filesystem user ID', which is the UID that the kernel uses for all accesses to the filesystem. With appropriate privileges, you can set this with setfsuid(2). The kernel uses the filesystem UID during the umount2() path walk, so if you have a program that setfsuid()s to the necessary target user and then calls umount2(), it will work (when run by root, since you need to be root to unmount things (or have the CAP_SYS_ADMIN capability, which is often pretty close to root)).
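
A minimal sketch of such a program, close in spirit to ours though not the actual code:

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/fsuid.h>
#include <sys/mount.h>

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s uid mountpoint\n", argv[0]);
        return 1;
    }
    /* Change only the filesystem UID; we keep root's other privileges,
       which umount2() itself still requires. */
    setfsuid(atoi(argv[1]));
    if (umount2(argv[2], MNT_FORCE | MNT_DETACH) == -1) {
        perror("umount2");
        return 1;
    }
    return 0;
}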

We now have a little program that does just this. However, we've decided that our real solution to this problem is to remove the Ubuntu 'fuse' package so people can't mount FUSE filesystems in the first place, because sshfs and its friends are not widely used here and we don't want to deal with the hassles.

(I was hoping to be able to just blacklist the FUSE kernel module, but Ubuntu builds FUSE directly into their kernels.)

PS: I really wish that FUSE filesystems were automatically unmounted when their transport endpoints died. Naturally this should be in the kernel and not involve path walking shenanigans, since as we've seen those can fail.

(This elaborates on some tweets of mine.)

FUSEOnNFSUnmounting written at 23:32:46; Add Comment

2018-09-20

Ubuntu pretty much is the 'universe' repository for us

Recently, HL left a comment on my entry about Ubuntu bug reports being useless, pointing out that part of the problem is Ubuntu's "Universe" repository. To quote HL:

The issue is of course that Universe is essentially a static snapshot of Debian unstable at the time of the relevant Ubuntu release. Some updates do happen, generally if someone contributes them, but Ubuntu does not consider the huge set of packages that only exist in Universe to actually be supported. [...] (In many cases it'd be best to just steer clear of Universe entirely, but most users have no idea about this distinction.)

In my book, Universe is one of Ubuntu's big weaknesses, and it goes way beyond bug reports.

At one level I can't argue with this. I don't think that Ubuntu does much bug fixing even for packages in "Main" (our experience has not been positive), but a package being in "Universe" basically guarantees that bugs are going to be there for the lifetime of the LTS release.

(It turns out that I looked into some factors here for Ubuntu and Debian a while back, in my entry on how much support you get from distributions.)

At another level, "Universe" is a large part of the reason that we use Ubuntu. Around half of the packages installed on our user-facing Ubuntu 18.04 LTS machines come from "Universe", and looking at the list of those packages suggests that users would notice if they all disappeared (certainly I would, since both xterm and urxvt are in "Universe", and let's ignore that Amanda is also a "Universe" package). For us, the attraction of Ubuntu is that it has three things in combination: reasonably frequent releases so that the most recent release is never too far behind, the five year nominal support period for any specific LTS release, and the large package selection. As far as I know, no other major Linux distribution has all of them.

(CentOS has more of them than it used to, since EPEL is fairly robust, but it misses out on release frequency.)

If we take "Universe" out of Ubuntu, as mentioned we lose about half of the packages that we install. What remains is not useless (and it includes more packages than I expected), but I don't think it would really meet our needs. If we had to drop "Universe", I suspect that we would wind up with a mix of Debian, for user-facing machines that we update every two years anyway, and CentOS, for machines that we can mostly freeze and live with old packages.

(Like Ubuntu, Debian has different sections of their package repository, but I believe they are much more likely to actually update "contrib" packages than Ubuntu is to update "Universe". If nothing else, there are more steps involved in Ubuntu than in Debian; you need a Debian update and then someone to push it through the Ubuntu processes as well.)

PS: I'm only mildly interested in hearing if there's a real alternative to Ubuntu that meets all three of our criteria and improves on Ubuntu's support record and so on, because even if there is we might well not switch over to it. It's not that we love Ubuntu, but it mostly works okay and we already know how to make it do what we need (and have various pieces of tooling built for it and so on). It's our default Linux and Unix.

Sidebar: How I looked at the package split on our machines

First we generate a list of packages in each section:

cd /var/lib/apt/lists
cat *_main_*Packages | grep '^Package: ' | field 2 | sort -u >/tmp/main-pkgs
cat *_universe_*Packages | grep '^Package: ' | field 2 | sort -u >/tmp/univ-pkgs
cat *_multiverse_*Packages | grep '^Package: ' | field 2 | sort -u >/tmp/multi-pkgs

(field is one of my little utility scripts; you can find it here. Several years ago I evidently thought that you'd need to explicitly download some package index files, but it turns out that apt already has copies stashed away.)

Now we need a list of local packages:

dpkg --get-selections | field 1 | sed 's/:.*//' | sort -u >/tmp/inst-pkgs

The sed transforms package names like 'gcc-8-base:amd64' into plain 'gcc-8-base'. For my crude purposes, it's okay to crush multiple architectures together this way.

Then you can use comm to count up more or less how many packages come from, eg, "Main":

comm -12 /tmp/inst-pkgs /tmp/main-pkgs | wc -l

This will have a lot of library packages, so you may want to crudely exclude them with 'grep -v "^lib"' or the like.
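
Putting it together, a little loop reports all three counts at once, reusing the files built above and the library exclusion just mentioned:

for s in main univ multi; do
  echo "$s: $(comm -12 /tmp/inst-pkgs /tmp/${s}-pkgs | grep -vc '^lib')"
done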

We turn out to have a few packages installed from "Multiverse", somewhat to my surprise. I think I knew that rar was non-free, but the OCaml documentation is a little bit of a surprise.

UbuntuUniverseImportance written at 00:43:39; Add Comment

2018-09-11

The Linux kernel's internals showing through in the specifics of an NFS bug

On Mastodon, I said:

What's fascinating about this particular kernel bug to me is how clearly I can see the kernel's implementation poking through in what the bug is and what's required to reproduce it. The more I refine things, the more I can guess where the problem probably is.

Let me give you the details on this, so you can see how the kernel's implementation is poking through. I'll start with the bug and its reproduction.

We recently ran into a deadly problem with Alpine on Ubuntu 18.04 that is actually a general kernel NFS client problem. After I refined my test program down far enough, here is what is required to manifest the bug:

  1. On a NFS client, open a file read-write and read all the way to the end of the file. It's probably sufficient to just read the last N bytes or KB of the file, for some value of N (it might even be enough to read the last byte of the file).
  2. In your program, keep the file open read-write and wait for it to grow in size.
  3. On another machine (either another NFS client or the fileserver), append data to the end of the file.
  4. In your program, attempt to read the new data after the old end of file. The new data from immediately after the old end of file up to the next 4 KB boundary will be zero bytes; after that, it will be regular contents.

You must hold the file open in read-write mode while you wait in order for this bug to manifest; if you close the file or hold it open read-only, this doesn't happen (even if you open it read-write again after you detect the size change). This happens with both NFSv3 and NFSv4, and the OS of the NFS fileserver doesn't matter.
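A condensed C sketch of the client side of this reproduction (not the actual test program, which is linked later in this entry and takes more options):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
    char buf[65536];
    struct stat st;
    ssize_t n;
    int fd;

    if (argc != 2)
        return 1;
    fd = open(argv[1], O_RDWR);   /* read-write is the crucial detail */
    fstat(fd, &st);
    off_t oldsize = st.st_size;

    /* Step 1: read all the way to the current end of file. */
    while (read(fd, buf, sizeof(buf)) > 0)
        ;
    /* Step 2: wait, file still open read-write, for it to grow;
       something on another machine appends to it meanwhile. */
    do {
        sleep(1);
        fstat(fd, &st);
    } while (st.st_size == oldsize);
    /* Step 3: read the new data and look for the bogus zero bytes. */
    n = read(fd, buf, sizeof(buf));
    for (ssize_t i = 0; i < n; i++)
        if (buf[i] == 0)
            printf("zero byte at offset %lld\n", (long long)(oldsize + i));
    return 0;
}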

So now let's talk about how this shows the bones of the kernel in action (assuming that I'm correct about what's going on inside the kernel).

Like pretty much everyone these days, the Linux kernel caches file data in memory, in the page cache. As you might suspect from the name, the page cache stores things in units of pages, which are almost always 4 KB (at least on x86 machines). However, files are not always even multiples of 4 KB in size, which means that the very end of a file, when cached in memory, will not take up all of a page; what you have is a partial page, where some amount of the front of the page is valid but the rest is not. It seems both plausible and likely that the kernel zeroes page cache pages (at least partial ones) before trying to put data in them, rather than leaving random stale bytes sitting around in them (not zeroing them would be a great way to accidentally leak kernel memory).

In NFS, file data can change behind the client kernel's back, and in particular a file can be extended. When the NFS client code has a partial page from the end of the file in the page cache and the file's size grows, it has to remember that the rest of the file's data is not in the page but must be filled in from the server. When you don't hold the file open read-write, this process of filling in clearly works correctly. When you hold the file open read-write, for some reason the kernel appears to lose track of the fact that it has to fill in the rest of the partial page from the server; instead it believes that it has a full page and so it gives you whatever data is in the remainder of the page. This data is, fortunately, all zero bytes.

(I say fortunately because this means that it's both obvious and not a kernel memory data leak. If the kernel gave you whatever random bytes were in the physical page of RAM from its previous use, this could be very bad.)

This doesn't happen for local files (at least normally) because local files are coherent; all writes go through the page cache, so when you extend a file the new data fills in the existing partial page in the page cache. I suspect that this local coherence is part of how the bug happens, and perhaps there is a bit of the general kernel code that assumes that this incomplete partial page situation just can't happen for files open read-write; if the file says its length is X, and that X fully covers a page in the page cache, all the contents of that page are always valid.

PS: Interested parties can find a program to demonstrate this here. It takes various arguments so you can play around with some things to reproduce or not reproduce the bug. I have deliberately resisted my natural temptation to provide and explore all possible permutations of what the test program does, because I don't think the permutations matter. I put in some things because I wanted to test them (and doing so was useful, because it discovered that keeping the file in read-write mode was a crucial element), and other things because I wanted to demonstrate that they don't matter.

(This entry is partly a dry run for sending a bug report to the Linux NFS mailing list; I wanted to make sure I could explain it reasonably coherently and that I had things straight in my head.)

KernelNFSPageBug written at 00:37:00; Add Comment

2018-08-31

Making Ubuntu bug reports seems to be useless (or pointless)

I've mentioned this in passing in a few places, so I might as well say it here: I've mostly given up on making Ubuntu bug reports for the simple reason that doing so seems to be useless. Every so often I'll file one for no particularly strong reason (eg one that I'd actually forgotten I'd filed), but even when I do file a bug I usually don't expect anything. When I say that I don't expect anything, I mean more than that I don't expect a response to my bug; I mean that I expect my bug report will have basically no effect not only on the bug in this Ubuntu LTS version but also on whether or not it's in future ones.

(I'm also much less likely to file bugs that might require me to argue with someone, such as over 18.04's packaging of libreadline.)

It's possible that Ubuntu makes use of bug reports for some internal purposes and so filing them is not technically pointless. But from my perspective as an outsider, filing Ubuntu bugs is certainly useless and to me that makes them pointless as well. There really isn't much more to say about the situation than that. Ubuntu can run its bug tracker however it wants to, and it's not being actively hostile to people submitting bug reports in the way that some environments are. It's just that Ubuntu has created a situation where there's no point in submitting bug reports, so I'm mostly not going to bother.

(Of course, Ubuntu has never been a distribution that did very many bugfix updates. A long time ago I wrote a grumpy entry about this lack of such updates, and nothing has changed since then.)

Many Ubuntu packages are inherited more or less untouched from Debian and Debian is generally reasonably responsive to bug reports. It's potentially worth keeping a Debian system around so you can reproduce bugs and submit them to Debian in the hopes that an update will trickle through to at least the next Ubuntu release (or LTS release, if that's the only Ubuntu version you use). You likely need to use Debian 'testing' for this, since it's generally what Ubuntu draws packages from.

(Our Amanda packaging bug was fixed in Debian, for example.)

PS: Occasionally useful discussions do break out in Ubuntu bug reports between a group of people with the problem who are working together to diagnose it and perhaps come up with fixes; I think I've seen one or two. But the odds are that no bug report that I make will spark such a discussion.

(I might feel more motivated to file bug reports so that other people with the problem could find them if Launchpad's search wasn't basically terrible as far as I've seen. If I want other people to be able to find my reports, I'm probably better off writing up things here on Wandering Thoughts and hoping that search engines index them. That's one reason I've taken to putting exact error messages in entries.)

UbuntuBugReportsUseless written at 00:15:03; Add Comment

2018-08-28

An illustration of why it's hard to port outside code into the Linux kernel

Sometimes, people show up with useful kernel-side things that were originally written for other systems and try to put them into Linux as (GPLv2) donations. Often these have been filesystems (SGI and XFS, IBM and JFS), but there have been other attempts at code drops. Most recently, Oracle made DTrace available under GPLv2 and integrated it into their kernel. I've said before that this is not an easy thing to do and can take years to actually get the code in to the Linux kernel (eg, XFS). Part of that is that the Linux kernel developers are picky about the code that they accept and require it to follow Linux kernel conventions (because they'll be supporting it for years), but part of that is because joining up code written for another environment with the Linux kernel is a hard, long process, with lots of subtle issues that can be overlooked despite very good people with the best of intentions.

As it happens, I have an illustration of this. Before I start, I want to say explicitly that I think ZFS on Linux is a very solid and well done project. In fact, that it's so solid and well done makes this case all the more useful as an illustration, because it shows how a bunch of very good people, working very hard (and for a long time) and doing a very good job, can still have something slip by in the interface between the Linux kernel and outside code. So here is the story.

Recently, someone on the ZFS on Linux mailing list reported a kernel panic related to snapshots (see also the ZoL issue report). The first background detail required is that under some circumstances, accessing snapshots can get ELOOP from the kernel's general filesystem code. The specific crash happened because when the kernel converted a NFS filehandle that was (apparently) for a file in a ZFS snapshot into a kernel dentry, the dentry pointer it got back from ZFS had the value of 0x28 (decimal 40). Although very low faulting addresses usually have causes related to NULL pointers, 40 happens to be the errno value for ELOOP, which is suspicious.

Internally, the Linux kernel has a somewhat odd way of handling errors. Probably in order to avoid confusing them with valid values that might be returned normally, errors are usually returned from functions as negative errno values; for example, if a pathname lookup fails because there's no such name in the directory, the relevant functions will return -ENOENT (and there's a whole infrastructure to smuggle these negative errno values in and out of what are normally pointers). Much of the rest of the world has kernel functions that return positive errnos to signal errors, and in particular the original Solaris/Illumos ZFS code uses positive errnos to return errors, so a ZFS function that wants to signal 'no such name' will return ENOENT.
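
A userspace imitation of this convention; the real macros are the kernel's ERR_PTR(), IS_ERR(), and PTR_ERR() from <linux/err.h>, and this standalone version just demonstrates the trick:

#include <errno.h>
#include <stdint.h>
#include <stdio.h>

#define MAX_ERRNO 4095
/* Addresses in the top page can't be valid kernel pointers, so values
   from -4095..-1 (seen as unsigned) can carry a negative errno instead. */
#define ERR_PTR(err) ((void *)(intptr_t)(err))
#define PTR_ERR(ptr) ((intptr_t)(ptr))
#define IS_ERR(ptr)  ((uintptr_t)(ptr) >= (uintptr_t)-MAX_ERRNO)

static void *lookup(int fail)
{
    static int thing = 42;
    return fail ? ERR_PTR(-ENOENT) : &thing;
}

int main(void)
{
    void *p = lookup(1);
    if (IS_ERR(p))
        printf("failed with errno %ld\n", (long)-PTR_ERR(p));  /* 2, ENOENT */
    return 0;
}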

The ZFS on Linux code tries to stay as close to the upstream Illumos ZFS code as possible to make it easier to port fixes and changes back and forth. As part of this, it has not changed to using Linux style negative errnos internally; instead, it converts from ZFS positive errnos (in its own code) to Linux kernel negative errnos when it returns results to the kernel, such as when it is converting a NFS filehandle into a ZFS dnode and then a kernel dentry, which may fail with errors like ESTALE. This conversion is done by negating the internal ZFS errno, turning a positive ZFS errno into a negative kernel errno (which is then smuggled inside a pointer).

All of this is fine except that there turns out to be a point in converting NFS filehandles where the ZFS on Linux code calls a kernel function to do path lookups and returns its error result unaltered. Since this is a kernel function, it returns negative errnos, which are passed up through the ZFS on Linux call stack and then carefully negated by ZoL before it returns them to the kernel. This careful negation turns what was a negative kernel errno into a positive number that the kernel thinks is not an error, but a dentry pointer. Things do not go well from here.
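
Schematically, and in runnable miniature (all of the function names here are invented):

#include <errno.h>
#include <stdio.h>

/* A 'kernel' helper returns negative errnos, Linux style. */
static int kernel_style_lookup(void) { return -ELOOP; }

/* A 'ZFS' function is supposed to return positive errnos, Solaris
   style, but here it passes the kernel helper's result through. */
static int zfs_style_lookup(void) { return kernel_style_lookup(); }

int main(void)
{
    int error = zfs_style_lookup();
    /* The exit path negates what it assumes is a positive errno: */
    long bogus_dentry = -(long)error;   /* -(-ELOOP) == 40 */
    printf("'dentry pointer': 0x%lx\n", bogus_dentry);  /* prints 0x28 */
    return 0;
}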

All of the code involved looks innocent and good on a normal inspection; you're calling functions, you're checking for errors, you're returning errors if there are any, everything looks normal and standard. You need a bunch of contextual knowledge to know that this one function call is special and returns a dangerously different result from everything else (if it encounters an error at all, which it usually doesn't), and it needs special handling. The commit that added this code is over a year old and was reviewed by one of the most experienced ZoL developers, and the code has passed tests and been used in production (where it worked because errors from this kernel function are very rare in this context).

This error is not in ZFS code and it is not in kernel code; it's at the seam between the two, where one world must be carefully converted to the other. Here, one little spot was missed and joined imperfectly, and the result was a kernel panic a year later. And that's part of why porting outside code into the Linux kernel is hard and takes a long time, one way or another.

(It also makes a good illustration of why the Linux kernel developers generally insist that outside code be converted over to use kernel idioms before they'll accept it into the kernel tree.)

PortingKernelCodeChallenging written at 01:12:38; Add Comment
