Wandering Thoughts


What affects automatically removing old kernels on Ubuntu

I have griped before (and recently) about how much of a pain it is to try to keep the number of kernels that Ubuntu installs on your machines under control. Writing your own script to remove obsolete kernels is fraught with challenges, but as it turns out I think we can do what we want with 'apt-get autoremove' and some extra work.

First, as Ewen McNeill said in a comment here back in 2015, it's the case that 'apt-get autoremove' will not remove a held package, kernel or otherwise. This makes a certain amount of sense, even if it's inconvenient. We can't keep kernels unheld in general for reasons covered here and here, but we probably can write a script that unholds them, runs 'apt-get autoremove', and holds the remaining kernels afterwards.

(Note that holding Ubuntu packages doesn't convert them from automatically installed packages to manually installed ones; it just holds them. You can see this with apt-mark, which also makes a handy way to hold and unhold packages on the command line.)

If you run apt-get autoremove with your kernel packages not held, you'll notice that it doesn't remove all of them. This naturally made me curious about what controlled this, and at least in Ubuntu the answer is in /etc/apt/apt.conf.d/01autoremove-kernels:

// DO NOT EDIT! File autogenerated by
// /etc/kernel/postinst.d/apt-auto-removal

This contains a list of kernel packages and package regular expressions that should not be autoremoved; generally it's going to contain your two most recent kernels. As the comment says, it's (re)created by a script when kernel packages are installed and removed. This script, /etc/kernel/postinst.d/apt-auto-removal, starts with a comment that does a pretty good job of explaining what it wants to do:

Mark as not-for-autoremoval those kernel packages that are:

  • the currently booted version
  • the kernel version we've been called for
  • the latest kernel version (as determined by debian version number)
  • the second-latest kernel version

In the common case this results in two kernels saved (booted into the second-latest kernel, we install the latest kernel in an upgrade), but can save up to four. Kernel refers here to a distinct release, which can potentially be installed in multiple flavours counting as one kernel.

The second rule here implies that if you install an old kernel by hand for some reason, it will get added to the manual exclusion list. Well, added to the current manual exclusion list, since the list is rebuilt on at least every kernel install.
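The four rules from the script's comment can be sketched in a few lines of shell. This is only an illustration, not the real apt-auto-removal script: the function name protected_kernels and its calling convention are made up here, and it orders versions with GNU sort -V, which only approximates the Debian version comparison the comment mentions.

```shell
# Rough sketch of the four rules quoted above; not the real script.
# protected_kernels is a hypothetical helper. Installed kernel versions
# arrive on stdin, one per line.
# Assumption: GNU 'sort -V' stands in for Debian version ordering.
protected_kernels() {
    pk_booted="$1"    # the currently booted version
    pk_called="$2"    # the version the hook was called for
    # The latest and second-latest installed versions.
    pk_newest="$(sort -V | tail -n 2)"
    # Merge everything, then drop empty lines and duplicates.
    printf '%s\n' "$pk_booted" "$pk_called" "$pk_newest" |
        grep -v '^$' | sort -V -u
}

# Common case: booted into the second-latest, just installed the latest.
printf '%s\n' 4.4.0-62 4.4.0-64 4.4.0-66 |
    protected_kernels 4.4.0-64 4.4.0-66
```

In the common case this prints two versions. Booting an old kernel and installing another old one by hand is how you would get up to four.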

Now, there is a very important gotcha with this whole setup: this list of kernels to never autoremove is only recreated when kernel packages are installed or otherwise manipulated. When you run 'apt-get autoremove', there is nothing that specifically preserves the kernel you are actually running right then. Normally you're probably booted into one of the preserved kernels. But you might not be; if you have to boot back into an old version for some reason and you then run 'apt-get autoremove', as far as I can see it's entirely possible for this to remove your kernel right out from underneath you. Possibly autoremove has specific safeguards against this, but if so I don't see them mentioned in the manpage and there's also this Ubuntu bug.

(As a result, our wrapper script is likely to specifically hold or keep held the actual running kernel.)
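A minimal sketch of what such a wrapper could look like follows. The function names are hypothetical, and it assumes that kernel package names end with the output of 'uname -r', as Ubuntu's linux-image packages do. It leaves anything belonging to the running kernel held the whole time, unholds the other held kernel packages, runs autoremove, and re-holds whatever is still installed.

```shell
# Sketch of a possible wrapper; the function names are hypothetical.

# Keep only names that look like kernel packages (stdin to stdout).
kernel_pkgs() {
    grep -E '^linux-(image|headers|modules)-' || true
}

# Drop packages that belong to the given kernel release, so that the
# running kernel is never unheld.
not_release() {
    grep -v -F -e "-$1" || true
}

autoremove_kernels() {
    ak_unhold="$(apt-mark showhold | kernel_pkgs | not_release "$(uname -r)")"
    [ -n "$ak_unhold" ] || { echo "no held kernels to consider"; return 0; }
    # Word splitting on $ak_unhold is intentional: one package per word.
    apt-mark unhold $ak_unhold
    apt-get -y autoremove
    # Re-hold whatever autoremove left installed.
    for ak_pkg in $ak_unhold; do
        if dpkg-query -W -f='${Status}' "$ak_pkg" 2>/dev/null |
            grep -q '^install ok installed'; then
            apt-mark hold "$ak_pkg"
        fi
    done
}

# Only act when asked to explicitly, e.g. 'sh wrapper.sh --run' as root.
if [ "${1:-}" = "--run" ]; then
    autoremove_kernels
fi
```

If something fails partway through, the kernels it unheld stay unheld, so a real version would want some error handling around the middle steps.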

(I got some of this information from this askubuntu question and its answers.)

PS: This suggests that maximum safety comes from writing your own script to explicitly work out what kernels you can remove based on local policy decisions. Using 'apt-get autoremove' will probably work much of the time, but it's the somewhat lazy way. We're lazy, though, so we'll probably use it.

UbuntuKernelAutoremove written at 00:57:14


Modern Linux kernel memory allocation rules for higher-order page requests

Back in 2012 I wrote an entry on why our Ubuntu 10.04 server had a page allocation failure, despite apparently having a bunch of memory free. The answer boiled down to the NFS code wanting to allocate a higher-order request of 64 Kb of (physically contiguous) memory and the kernel having some rather complicated and confusing rules for when this was permitted when memory was reasonably fragmented and low(-ish).

That was four and a half years ago, back in the days of kernel 3.5. Four years is a long time for the kernel. Today the kernel people are working on 4.11 and, unsurprisingly, things have changed around a bit in this area of code. The function involved is still called __zone_watermark_ok() in mm/page_alloc.c, but it is much simpler today. As far as I can tell from the code, the new general approach is nicely described by the function's current comment:

Return true if free base pages are above 'mark'. For high-order checks it will return true of the order-0 watermark is reached and there is at least one free page of a suitable size. Checking now avoids taking the zone lock to check in the allocation paths if no pages are free.

The 'order-0' watermark is the overall lowmem watermark (which I believe is low: from my old entry). This bounds all requests for obvious reasons; as the code says in a comment, if a request for a single page is not something that can go ahead, requests for more than one page certainly can't. Requests for order-0 pages merely have to pass this watermark; if they do, they get a page.

Requests for higher-order pages have to pass an obvious additional check, which is that there has to be a chunk of at least the required order that's still free. If you ask for a 64 Kb contiguous chunk, your request can't be satisfied unless there's at least one chunk of size 64 Kb or bigger left, but it's satisfied if there's even a single such chunk. Unlike in the past, as far as I can tell requests for higher-order pages can now consume all of those pages, possibly leaving only fragmented order-0 4 Kb pages free in the zone. There is no longer any attempt to have a (different) low water mark for higher-order allocations.
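The two-step check described above can be modelled in a few lines of shell. This is a toy model, not kernel code: the function watermark_ok and its argument layout are invented for illustration, and it ignores lowmem reserves, migrate types and the other adjustments that the real __zone_watermark_ok() makes.

```shell
# Toy model of the check described above, not kernel code.
# Usage: watermark_ok FREE_PAGES MARK ORDER N0 N1 N2 ...
# where Nk is the number of free blocks of order k in the zone.
watermark_ok() {
    wm_free="$1"; wm_mark="$2"; wm_order="$3"
    shift 3
    # Every request must first pass the order-0 watermark.
    [ "$wm_free" -gt "$wm_mark" ] || return 1
    # Order-0 requests need nothing more.
    [ "$wm_order" -eq 0 ] && return 0
    # Higher orders need one free block of that order or larger.
    wm_k=0
    for wm_n in "$@"; do
        if [ "$wm_k" -ge "$wm_order" ] && [ "$wm_n" -gt 0 ]; then
            return 0
        fi
        wm_k=$((wm_k + 1))
    done
    return 1
}

# A 64 KB request is order 4 with 4 KB pages (16 contiguous pages).
# Here there is exactly one free order-4 block, which is enough.
if watermark_ok 5016 1000 4 3000 600 150 25 1; then
    echo "order-4 request may proceed"
fi
```

Changing the last argument to 0 makes the same request fail even though thousands of smaller pages are free, which is the fragmentation case.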

This change happened in late 2015, in commit 97a16fc82a; as far as I can tell it comes after kernel 4.3 and before kernel 4.4-rc1. I believe it's one commit in a series by Mel Gorman that reworks various aspects of kernel memory management in this area. His commit message has an interesting discussion of the history of high-order watermarks and why they're apparently not necessary any more.

(Certainly I'm happy to have this odd kernel memory allocation failure mode eliminated.)

Sidebar: Which distributions have this change

Ubuntu 16.04 LTS uses kernel '4.4.0' (plus many Ubuntu patches); it has this change, although with some Ubuntu modifications from the stock 4.4.0 code. Ubuntu 14.04 LTS has kernel 3.13.0 so it shouldn't have this change.

CentOS 7 is using a kernel labeled '3.10.0'. Unsurprisingly, it does not have this change and so should have the old behavior, although Red Hat has been known to patch their kernels so much that I can't be completely sure that they haven't done something here.

Debian Stable has kernel 3.16.39, and thus should also be using the old code and the old behavior. Debian Testing ('stretch') has kernel 4.9.13, so it should have this change and so the next Debian stable release will include it.
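For a quick first guess on an arbitrary machine, a version comparison is enough. The helper below is hypothetical and only looks at the major and minor version numbers; it assumes a stock kernel, so heavily patched distribution kernels may give a misleading answer either way.

```shell
# Hypothetical helper: does a stock kernel of this version postdate the
# late-2015 change (4.4 or later)? Distribution kernels can carry
# backports or local patches, so treat the answer as a hint only.
has_simple_watermarks() {
    hw_major="${1%%.*}"             # "4.4.0-66-generic" -> "4"
    hw_rest="${1#*.}"               # -> "4.0-66-generic"
    hw_minor="${hw_rest%%[!0-9]*}"  # -> "4"
    [ "$hw_major" -gt 4 ] ||
        { [ "$hw_major" -eq 4 ] && [ "$hw_minor" -ge 4 ]; }
}

if has_simple_watermarks "$(uname -r)"; then
    echo "kernel version suggests the simplified checks"
else
    echo "kernel version suggests the old high-order watermarks"
fi
```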

ModernPageAllocRules written at 21:54:42


I wish you could whitelist kernel modules, instead of blacklisting them

There was another Ubuntu kernel security update released today, this time one for CVE-2017-2636. It's a double-free in the N_HDLC line discipline, which can apparently be exploited to escalate privileges (per the news about it). Another double-free issue was also the cause of CVE-2017-6074; based on what Andrey Konovalov said for the latter, these issues are generally exploitable with standard techniques. Both were found with syzkaller, and all of this suggests that we're going to see more such double-free and use after free issues found in the future.

You've probably never heard of the N_HDLC line discipline, which is probably related to HDLC; certainly I hadn't. You may well not have heard of DCCP either. The Linux kernel contains a great deal of things like this, and modern distributions generally build it all as loadable kernel modules, because why not? Modules are basically free, since they just take up some disk space, and building everything avoids issues that have bit people in the past.

Unfortunately, in a world where more and more obscure Linux kernel code is being subjected to more and more attention, modules are no longer free. All of those neglected but loadable modules are now potential security risks, and the available evidence is that every so often one of them is going to explode in your face. So I said on Twitter:

I'm beginning to think that we should explicitly blacklist almost all Ubuntu kernel modules that we don't use. Too many security issues.

(At the time I was busy adding a blacklist entry for the n_hdlc module to deal with CVE-2017-2636. Given that we now blacklist DCCP, n_hdlc, and overlayfs, things are starting to add up.)

A lot of kernel modules are for hardware that we don't have, which is almost completely harmless since the drivers won't ever be loaded automatically and even if you did manage to load them, they would immediately give up and go away because the hardware they need isn't there. But there are plenty of things like DCCP that will be loaded on demand through the actions of ordinary users, and which are then exposed to be exploited. Today, this is dangerous.

There are two problems with the approach I tweeted. The first is that the resulting blacklist will be very big, since there are a lot of modules (even if one skips device drivers). The second is that new versions of Linux generally keep adding new modules, which you have to hunt down and add to your blacklist. Obviously, what would be better is a whitelist; we'd check over our systems and whitelist only the modules that we needed or expected to need. All other modules would be blocked by default, perhaps with some way to log attempts to load a module so we could find out when one is missing from our whitelist.

(Modern Linux systems load a lot of modules; some of our servers have almost a hundred listed in /proc/modules. But even still, the whitelist would be smaller than any likely blacklist.)
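Drafting a candidate whitelist is at least easy, even if nothing enforces it. The sketch below uses two hypothetical helpers: one lists the modules in a /proc/modules-style file, and the other reports module names that are missing from a whitelist file (one name per line). It only reports; it does not block anything.

```shell
# Hypothetical helpers for drafting a whitelist.

# Print the module names from a /proc/modules-style file, sorted.
# The module name is the first space-separated field of each line.
loaded_modules() {
    cut -d ' ' -f 1 "${1:-/proc/modules}" | sort
}

# Read module names on stdin and print those missing from the
# whitelist file given as $1 (one module name per line).
not_whitelisted() {
    grep -v -x -F -f "$1" || true
}

# Usage sketch on one server:
#   loaded_modules > whitelist.txt
# and later, to spot modules that have appeared since:
#   loaded_modules | not_whitelisted whitelist.txt
```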

Unfortunately there doesn't seem to be any particular support for this in the Linux kernel module tools. Way back in 2010, Fedora had a planned feature for this and got as far as writing a patch to add this. Unfortunately the Fedora bugzilla entry appears to have last been active in 2012, so this is probably very dead by now (I doubt the patch still applies, for example).

KernelModuleWhitelistWish written at 00:30:10


What an actual assessment of Ubuntu kernel security updates looks like

Ubuntu recently released some of their usual not particularly helpful kernel security update announcements and I tweeted:

Another day, another tedious grind through Ubuntu kernel security announcements to do the assessment that Ubuntu should be doing already.

I have written about the general sorts of things we want to know about kernel security updates, but there's nothing like a specific example (and @YoloPerdiem asked). So here is essentially the assessment email that I sent to my co-workers.

First, the background. We currently have Ubuntu 16.04 LTS, 14.04 LTS, and 12.04 LTS systems, so we care about security updates for the mainline kernels for all of those (we aren't using any of the special ones). The specific security notices I was assessing are USN-3206-1 (12.04), USN-3207-1 (14.04), and USN-3208-1 (16.04). I didn't bother looking at CVEs that require hardware or subsystems that we don't have or use, such as serial-to-USB hardware (CVE-2017-5549) or KVM (several CVEs here). We also don't update kernels just for pure denial of service issues (eg CVE-2016-9191, which turns out to require containers anyway), because our users already have plenty of ways to make our systems crash if they want to.

So here is a slightly edited and cleaned up version of my assessment email:

Subject: Linux kernel CVEs and my assessment of them

16.04 is only affected by CVE-2017-6074, which we've mitigated, and CVE-2016-10088, which doesn't apply to us because we don't have people who can access /dev/sg* devices.

12.04 and 14.04 are both affected by additional CVEs that are use-after-frees. They are not explicitly exploitable so far, but CVE-2017-6074 is also a use-after-free and is said to be exploitable with an exploit released soon, so I think they are probably equally dangerous.

[Local what-to-do discussion elided.]



Andrey Konovalov discovered a use-after-free vulnerability in the DCCP implementation in the Linux kernel. A local attacker could use this to cause a denial of service (system crash) or possibly gain administrative privileges.

This is bad if not mitigated, with an exploit to be released soon (per here), but we should have totally mitigated it by blocking the DCCP modules. See my worklog on that.


Dmitry Vyukov discovered a use-after-free vulnerability in the sys_ioprio_get() function in the Linux kernel. A local attacker could use this to cause a denial of service (system crash) or possibly gain administrative privileges.

Links: 1, 2, 3.

The latter URL has a program that reproduces it, but it's not clear if this can be exploited to do more than crash. But CVE-2017-6074's use-after-free is apparently exploitable, so...


It was discovered that a use-after-free vulnerability existed in the block device layer of the Linux kernel. A local attacker could use this to cause a denial of service (system crash) or possibly gain administrative privileges.

Link: 1

Oh look, another use-after-free issue. Ubuntu's own link for the issue says 'allows local users to gain privileges by leveraging the execution of [...]' although their official release text is less alarming.


It was discovered that the generic SCSI block layer in the Linux kernel did not properly restrict write operations in certain situations. A local attacker could use this to cause a denial of service (system crash) or possibly gain administrative privileges.

Finally some good news! As far as I can tell from Ubuntu's actual CVE-2016-10088 page, this is only exploitable if you have access to a /dev/sg* device, and on our machines people don't.

(The actual email was plain text, so the various links were just URLs dumped into the text.)

As you can maybe see from this, doing a proper assessment requires reading at least the detailed Ubuntu CVE information in order to work out under what circumstances the issue can be triggered, for instance to know that CVE-2016-10088 requires access to a /dev/sg* device. Not infrequently you have to go chasing further; for example, only Andrey Konovalov's initial notice mentions that he will release an exploit in a few days. In this case we could mitigate the issue anyways by blacklisting the DCCP modules, but in other cases 'an exploit will soon be released' drastically raises the importance of a security exposure (at least for us).

The online USN pages usually link to Ubuntu's pages on the CVEs they include, but the email announcements that Ubuntu sends out don't. Ubuntu's CVE pages usually have additional links, but not a full set; often I wind up finding Debian's page on a CVE because they generally have a full set of search links for elsewhere (eg Debian's CVE-2016-9191 page). I find that sometimes the Red Hat or SuSE bug pages will have the most technical detail and thus help me most in understanding the impact of a bug and how exposed we are.

The amount of text that I wind up writing in these emails is generally way out of proportion to the amount of reading and searching I have to do to figure out what to write. Everything here is a sentence or two, but getting to the point where I could write those is the slog. And with CVE-2017-6074, I had to jump in to set up and test an entire mitigation of blacklisting all the DCCP modules via a new /etc/modprobe.d file and then propagating that file around to all of our Ubuntu machines.
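As an illustration of what such an /etc/modprobe.d file can contain, here is a small generator. It is a sketch under assumptions: it uses the 'install MODULE /bin/false' form, which makes any modprobe of the module fail, and the file described above may well have used different directives. The module names and the file name are examples only.

```shell
# Sketch: emit modprobe.d lines that block the named modules.
# Assumption: 'install <module> /bin/false' is used, so that any
# modprobe of the module runs /bin/false instead of loading it.
block_modules() {
    for bm_mod in "$@"; do
        printf 'install %s /bin/false\n' "$bm_mod"
    done
}

# The module names here are examples; check what your kernel ships
# under /lib/modules/$(uname -r) before relying on them.
block_modules dccp dccp_ipv4 dccp_ipv6 dccp_diag
# To deploy, redirect the output into a file such as the hypothetical
# /etc/modprobe.d/local-block-dccp.conf and copy it to each machine.
```

Testing the result is as simple as running 'modprobe dccp' as root and checking that it fails and that the module does not show up in /proc/modules.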

UbuntuKernelUpdateAssessment written at 23:26:07


Some notes on moving a software RAID-1 root filesystem around (to SSDs)

A while ago I got some SSDs for my kind of old home machine but didn't put them to immediate use for various reasons. Spurred on first by the feeling that I should get around to it sometime, before my delay got too embarrassing, and then by one of my system drives apparently going into slow IO mode for a while, I've now switched my root filesystem over to my new SSDs. I've done this process before, but this time around I want to write down notes for my future reference rather than having to re-derive all the steps each time. All of this is primarily for Fedora, currently Fedora 25; some steps will differ on other distributions such as Ubuntu.

I partitioned using GPT partitions, not particularly because I needed to with 750 GB SSDs but because it seemed like a good idea. I broadly copied the partitioning I have on my SSDs at work for no particularly strong reason, which means that I set it up this way:

Number  Size    Code  Name
1       256 MB  EF00  EFI System
2       1 MB    EF02  BIOS boot partition
3       100 GB  FD00  Linux RAID
4       1 GB    FD00  Linux RAID (swap)
5       <rest>  BF01  ZFS

Some of this is likely superstition by now, such as the BIOS boot partition.

With the pair of SSDs partitioned, I set up the software RAID-1 arrays for the new / and swap. Following my guide to RAID superblock formats I used version 1.0 format for the / array, since I'm going to end up with /boot on it. Having created them as /dev/md10 and /dev/md11 it was time to put them in /etc/mdadm.conf. The most convenient way is to use 'mdadm --examine --scan' and then copy the relevant output into mdadm.conf by hand. Once you have updated mdadm.conf, you also need to update the initramfs version of it by rebuilding the initramfs. Although you can do this for all kernel versions, I prefer to do it only for the latest one so that I have a fallback path if something explodes. So:

dracut --kver $(uname -r) --force

(This complained about a broken pipe for cat but everything seems to have worked.)

When I created the new RAID arrays, I took advantage of an mdadm feature to give them a name with -N; in particular I named them 'ssd root' and 'ssd swap'. It turns out that mdadm --examine --scan tries to use this name as the /dev/ name of the array and the initramfs doesn't like this, so on boot my new arrays became md126 and md127, instead of the names I wanted. To fix that I edited mdadm.conf to give them the proper names, and while I was there I added all of the fields that my other (much older) entries had:

ARRAY /dev/md10  metadata=1.0 level=raid1 num-devices=2 UUID=35d6ec50:bd4d1f53:7401540f:6f971527
ARRAY /dev/md11  metadata=1.2 level=raid1 num-devices=2 UUID=bdb83b04:bbdb4b1b:3c137215:14fb6d4e

(Note that specifying the number of devices may have dangerous consequences if you don't immediately rebuild your initramfs. It's quite possible that Fedora 25 would have been happy without it, but I don't feel like testing. There are only a finite number of times I'm interested in rebooting my home machine.)

After copying my root filesystem from its old home on SATA HDs to the new SSD filesystem, there were a number of changes I needed to make to actually use it (and the SSD-based swap area). First, we modify /etc/fstab to use the UUIDs of the new filesystem and swap area for / and, well, swap. The easiest way to get these UUIDs is to use blkid, as in 'blkid /dev/md10' and 'blkid /dev/md11'.
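If you would rather generate the fstab lines than paste UUIDs by hand, blkid can print just the UUID with '-s UUID -o value'. The helper below is a hypothetical sketch; the filesystem type, mount options and fsck pass numbers in the usage lines are placeholders to adjust, not what this machine necessarily uses.

```shell
# Hypothetical helper: format one /etc/fstab line from a UUID.
# Usage: format_fstab UUID MOUNTPOINT FSTYPE OPTIONS PASS
format_fstab() {
    printf 'UUID=%s %s %s %s 0 %s\n' "$1" "$2" "$3" "$4" "$5"
}

# Usage sketch (as root, since blkid needs to read the devices):
#   format_fstab "$(blkid -s UUID -o value /dev/md10)" / ext4 defaults 1
#   format_fstab "$(blkid -s UUID -o value /dev/md11)" swap swap defaults 0
```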

(For now I'm mounting the old HD-based root filesystem on /oldroot, but in the long run I'm going to be taking out those HDs entirely.)

But we're not done, because we need to make some GRUB2 changes in order to actually boot up with the new root filesystem. A normal kernel boot line in grub.cfg looks like this:

linux   /vmlinuz-4.9.9-200.fc25.x86_64 root=UUID=5c0fd462-a9d7-4085-85a5-643555299886 ro acpi_enforce_resources=lax audit=0 SYSFONT=latarcyrheb-sun16 LANG=en_US.UTF-8 KEYTABLE=us rd.md.uuid=d0ceb4ac:31ebeb12:975f015f:1f9b1c91 rd.md.uuid=c1d99f17:89552eec:ab090382:401d4214 rd.md.uuid=4e1c2ce1:92d5fa1d:6ab0b0e3:37a115b5 rootflags=rw,relatime,data=ordered rootfstype=ext4

This specifies two important things, the UUID of the root filesystem in 'root=...' and the (software RAID) UUIDs of the software RAID arrays that the initramfs should assemble in early boot in the 'rd.md.uuid=...' bits (per the dracut.cmdline manpage). We need to change the root filesystem UUID to the one we've already put into /etc/fstab and then add rd.md.uuid= settings for our new arrays. Fortunately mdadm has already reported these UUIDs for us and we can just take them from our mdadm.conf additions. Note that these two UUIDs are not the same; the UUID of a filesystem is different than the UUID of the RAID array that contains it, and one will (probably) not work in the place of the other.
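Since the array UUIDs are already sitting in mdadm.conf, the rd.md.uuid= arguments can be produced mechanically. The function below is a hypothetical sketch that reads ARRAY lines on stdin and prints the matching kernel command line arguments on one line, ready to paste into grub.cfg.

```shell
# Sketch: turn mdadm.conf ARRAY lines (stdin) into dracut rd.md.uuid=
# arguments, joined on one line. rd_md_args is a hypothetical name.
rd_md_args() {
    sed -n 's/^ARRAY .*UUID=\([0-9a-fA-F:]*\).*/rd.md.uuid=\1/p' |
        paste -s -d ' ' -
}

# Usage sketch, picking out only the new SSD arrays:
#   grep -E '^ARRAY /dev/md1[01] ' /etc/mdadm.conf | rd_md_args
```

These are the array UUIDs, not the filesystem UUIDs that blkid reports for root=, so the two sets of values should never match.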

(In the long run I will need to take out the rd.md.uuid settings for the old HD-based root and swap partitions, since they don't need to be assembled in early boot and will actively go away someday.)

The one piece of the transition that's incomplete is that /boot is still on the HDs. Migrating /boot is somewhat more involved than migrating the root filesystem, especially as I'm going to merge it into the root partition when I do move it. In the past I've written up two aspects of that move to cover the necessary grub.cfg changes and a BIOS configuration change I'll need to make to really make my new SSDs into the BIOS boot drives, but I've never merged /boot into / in the process of such a move and I'm sure there will be new surprises.

(This is where I cough in quiet embarrassment and admit that even on my work machine, which moved its / filesystem to SSDs some time ago, my /boot still comes from HDs. I really should fix that by merging /boot into the SSD / at some point. Probably I'll use doing it at work as a trial run for doing it at home, because I have a lot more options for recovery if something goes wrong at work.)

PS: The obvious thing to do for merging /boot into / is to build a Fedora 25 virtual machine with a separate /boot and then use it to test just such a merger. There's no reason to blow up my office workstation when I can work out most of the procedure beforehand. This does require a new custom-built Fedora 25 VM image, but it's still probably faster and less hassle than hacking up my office machine.

PPS: It's possible that grub2-mkconfig will do a lot of this work for me (even things like the rd.md.uuid and root= changes). But I have an old grub.cfg that I like and grub2-mkconfig would totally change it around. It's easier to hand modify grub.cfg than write the new config to a new file and then copy bits of it, and in the process I wind up with a better understanding of what's going on.

RootFilesystemSSDMigrationNotes written at 23:20:05

Some views on the Corebird Twitter client

I mentioned recently that my Fedora 25 version of choqok doesn't support some of the latest Twitter features, like quoted tweets (and this causes me to wind up with a bit of a Rube Goldberg environment to deal with it). In a comment, Georg Sauthoff suggested taking a look at Corebird, which is a (or the) native Gtk+ Twitter client. I've now done so and I have some views as a result, both good and bad.

The good first. Corebird is the best Linux client I've run into for quickly checking in on Twitter and skimming my feed; it comes quite close to the Tweetbot experience, which is my gold standard here. A lot of this is that Corebird understands and supports modern Twitter and does a lot directly in itself; you can see quoted tweets, you can see all of the images attached to a tweet and view them full sized with a click, and Corebird will even play at least some animations and videos. All of this is good for quickly skimming over things because you don't have to go outside the client.

Corebird doesn't quite have all of the aspects of the experience nailed in the way that Tweetbot does, especially in the handling of chains of tweets. Tweetbot shows you the current tweet in the middle, past tweets (tweets it was a reply to) above it, and future tweets (tweets that replied to it) below, and you can jump around to other tweets. Corebird shows only past tweets and shows them below, in reverse chronological order, which kind of irritates me; it should be above with the oldest tweet at the top. And you can't jump around.

However, for me Corebird is not what I want to use to actively follow Twitter on an ongoing basis, and I say this for two reasons. The first is that I tried to do it and it seems to have given me a headache (I'm not sure why, but I suspect something about font rendering and UI design). The second is that it's missing a number of features that I want for this, partly because I've found that the user interface for this matters a lot to me. Things that Corebird is missing for me include:

  • no unread versus read marker.
  • you can't have multiple accounts in a single tabbed window; you need either separate windows, one for each account, or to switch back and forth.
  • it doesn't minimize to (my) system tray the way Choqok does; instead you have to keep it running, which means keeping multiple windows iconified and eating up screen space with their icons.
  • it doesn't unobtrusively show a new message count, so I basically have to check periodically to see if there's more stuff to look at.

(With multiple accounts you don't want to quit out of Corebird on a regular basis, because when it starts up only one of those accounts will be open (in one window), and you'll get to open up windows for all of the other ones.)

Corebird will put up notifications if you want it to, but they're big obtrusive things. I don't want big obtrusive notifications about new unread Twitter messages; I just want to know if there are any and if so, roughly how many. Choqok's little count in its tray icon is ideal for this; I can glance over to get an idea if I want to check in yet or not. I also wish Corebird would scroll the timeline with keys, not just the mouse scrollwheel.

I'm probably going to keep Corebird around because it's good for checking in quickly and skimming things, and there's plenty of time when it's good for me to not actively follow Twitter (to put it one way, following Twitter is a great time sink). I'm definitely glad that I checked it out and that Georg Sauthoff mentioned it to me. But I'm going to keep using Choqok as my primary client because for my particular tastes, it works better.

PS: It turns out that Choqok 1.6 will support at least some of these new Twitter features, and it's on the way some time for Fedora. Probably not before Fedora 26, though, because of dependency issues (unless I want to build a number of packages myself, which I may decide to).

CorebirdViews written at 00:44:53


How to see and flush the Linux kernel NFS server's authentication cache

We're going to be running some Linux NFS (v3) servers soon (for reasons beyond the scope of this entry), and we want to control access to the filesystems that those servers will export by netgroup, because we have a number of machines that should have access. Linux makes this a generally very easy process, because unlike many systems you don't need NIS (YP) in order to use netgroups. All you need to do is change /etc/nsswitch.conf to say 'netgroup: files', and then you can just put things in /etc/netgroup.

However, using netgroups makes obvious the important question of how you get your NFS server to notice changes in netgroup membership, as well as more general changes in authorizations such as changes in the DNS. If you add or delete a machine from a netgroup, or change an IP's PTR record in DNS, you want your NFS server to notice and start using the new information.

I will skip to the conclusion: the kernel maintains a cache of mappings from IP addresses to 'authentication domains' that the IP address is a member of. When it needs to know information about an IP address that it doesn't already have, the kernel asks mountd and mountd adds an entry to the cache. Entries are generally added with a time-to-live, after which they'll be automatically expired and then re-validated; mountd hard-codes this TTL to 30 minutes.

(You can read more information about this and several other interesting things in the nfsd(7) manpage, which describes what you'll find in the special NFS server related bits of /proc and associated virtual filesystems.)

You can see the contents of this cache by looking at /proc/net/rpc/auth.unix.ip/content. Note that the cache includes both positive entries and negative ones (where mountd has declined to authorize a host, and so it's had mount permissions denied). To clear this cache and force everything to revalidate, you write a sufficiently large number to /proc/net/rpc/auth.unix.ip/flush. So, what is a sufficiently large number?

The nfsd(7) manpage describes flush this way:

When a number of seconds since epoch (1 Jan 1970) is written to this file, all entries in the cache that were last updated before that file become invalidated and will be flushed out. Writing 1 will flush everything. [...]

That bit about writing a 1 is incorrect and doesn't work (perhaps this is a bug, but it's also the reality on all of the kernels that you'll find on systems today). So you need to write something that is a Unix timestamp that's in the future, perhaps well in the future. If you feel like running a command to get such a number, the simple thing is to use GNU date's relative time feature:

$ date -d tomorrow +%s

The easier way is just to stack up 9s until everything gets flushed. Of course there have been so many seconds since the Unix epoch that you need quite a lot of 9s by now.

Probably we're going to wrap this up in a script and put a big comment at the start of /etc/netgroup (and possibly /etc/exports) about it. Fortunately I don't expect our netgroups to change very often for these fileservers. Their export lists will likely be mostly static, but we'll slowly add some additional machines to the netgroup.
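A minimal sketch of such a script is below. The function names and the FLUSH_FILE variable are assumptions made for illustration, and it computes the future timestamp with shell arithmetic rather than GNU date's -d option so that it does not depend on GNU date.

```shell
# Sketch of the flush script; names and defaults are assumptions.
FLUSH_FILE="${FLUSH_FILE:-/proc/net/rpc/auth.unix.ip/flush}"

# Print a Unix timestamp one day in the future.
future_timestamp() {
    echo $(( $(date +%s) + 86400 ))
}

# Write the timestamp to the flush file (needs root on a real server).
# An alternate file can be given as $1, which is handy for testing.
flush_auth_cache() {
    future_timestamp > "${1:-$FLUSH_FILE}"
}

# Usage, as root on the NFS server:
#   flush_auth_cache
#   cat /proc/net/rpc/auth.unix.ip/content
```

Looking at the content file afterwards is a cheap way to confirm that the old entries really did go away.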

NFSFlushingServerAuthCache written at 01:24:07


Systemd's slowly but steadily increased idealism about the world

When it started out, systemd was in many ways relentlessly pragmatic. My shining example of this is that the developers went to fairly great lengths to integrate both System V init scripts and /etc/fstab into systemd in a fairly deep and thus quite convenient way. The easy way would have been to just run things and mount filesystems through some compatibility shims and programs. Systemd went the extra distance to make them more or less real units, which means that you can do things like add extra dependencies to System V init scripts through /etc/systemd/system overrides, just as if they were native systemd units.

(This has not always worked seamlessly, especially for mounts, but it has gotten better over time.)

As well as being convenient for people using systemd, I suspect that this was a pragmatic decision. Being a better System V init than SysV init itself undoubtedly didn't hurt systemd's case to be the winning init system; it gave people a few more reasons to like systemd and approve of it and maybe even push for it.

Unfortunately, since then systemd developers have shown an unfortunate and increasing streak of idealism. More and more, systemd seems not to be interested in dealing with the world as it actually is, with all of its grubby inconvenient mess; instead it simply assumes some pure and ideal version of things. If the world does not measure up to how it is supposed to be, well, that is not systemd's problem. Systemd will do the nominally right thing no matter how that works out in practice, or doesn't.

Exhibit one for this is how systemd interprets LSB dependencies in System V init scripts. These dependencies are predictably wrong in any number of cases, because they've never really been used before. Ignoring them and just running init scripts in order (with some pauses in the middle) would be the pragmatic choice, but instead systemd chose the idealistic one of 'we will assume that declared dependencies are correct, gain a minor speed boost, and if things blow up it's not our fault'.

Exhibit two for me is the general non-resolution of our SATA port multiplier issue with device naming. The general systemd view seems to be that this is not their problem; either it should be in some vague diffuse other system that no one is writing today, or the kernel's sysfs should provide more direct information, or both. In no case is this going to be solved by systemd. Never mind that systemd is getting things blatantly wrong; it is not their problem to fix, even though they could. This once again is clear idealism and purity triumphing over actual usability on actual hardware and systems.

It seems clear to me that systemd is less and less a pragmatic system where the developers are willing to make compromises and go out of their way to deal with the grubby, messy world as it actually is, and more and more a project where the developers want to deal with a pure world where things are all done in the philosophically right way. We all know how this ends up, because we have seen this play out in security, among other places. If you're not solving problems in the real world, you're not really solving problems; you are just being smug and 'clever'.

(This elaborates on and explains an old tweet of mine.)

PS: Or perhaps systemd was always sort of like this, and I didn't really notice it before. You do need more than a little bit of idealism to think 'we will do an init system right this time', and certainly systemd had some idealistic beliefs right from the start. Socket activation and (not) handling things that wanted to depend on the network being up are the obvious cases. Systemd was arguably correct but certainly kind of annoying about them.

SystemdAndItsIdealism written at 22:46:15


Systemd should be better than it is, but it is still our best init system

It all started on Twitter:

@hirojin: unpopular opinion: systemd is the epitome of worse is better, and as such fits right into the unix philosophy.

@thatcks: Systemd has turned into the X Windows of init systems.

Some people will see my tweet as a slam on systemd, which is fair, because it is. Some people will see it as sort of praise for systemd, which is also fair, because it is that too. Just like X, systemd is a good illustration of Rob Pike's famous quote that "sometimes when you fill a vacuum, it still sucks".

Systemd didn't quite fill a vacuum, but then neither did X. Much like the aphorism that democracy is the worst form of government except for all the other forms, systemd is increasingly not particularly appealing while still being the best overall Linux init system that we have. Systemd still gets plenty of things right that other init systems mostly don't, and it's improved in some of those areas since I wrote about them in 2012. But at the same time systemd has increasingly picked up bad habits and gotten flabby, among other issues, and it always had an air of brute force 'we will get this done somehow' about it.

Like X, systemd is not elegant (or Unixy) and it has flaws, and time is magnifying those flaws. But also like X, systemd works and beats the currently available alternatives. I wish systemd was better than it is, but for all its issues and irritations I don't particularly want to go back to upstart or to System V init (or Solaris's SMF). I could, of course; I ran servers with both for years, and they worked okay. But systemd is generally nicer than they are. Even the systemd journal can be cool and useful periodically.

I care more about systemd's warts than I do about X's, but that's because as a sysadmin I work with systemd a lot more than I write X programs (I actually did write one once) or wrestle with window manager issues or the like.

As a corollary of this, I agree with @hirojin that systemd is a 'worse is better' system. I don't think that it fits into the Unix philosophy in one sense, but that's using the appealing and idealized version of the Unix philosophy. Systemd certainly fits into the philosophy of Unix that you can see exposed through its history in things like several of BSD Unix's changes, the kernel internals of V7 Unix, and so on, by which I mean that a great deal of Unix's history actually involves a bunch of pragmatism, getting things done, and brute force.

SystemdShouldBeBetter written at 02:45:38


How you can abruptly lose your filesystem on a software RAID mirror

We almost certainly just completely lost a software RAID mirror with no advance warning (we'll know for sure when we get a chance tomorrow to power-cycle the machine in the hopes that this revives a drive). This comes as very much a surprise to us, as we thought that this was not supposed to be possible short of a simultaneous two-drive failure out of the blue, which should be an extremely rare event. So here is what happened, as best we can reconstruct right now.

In December, both sides of the software RAID mirror were operating normally (at least as far as we know; unfortunately the filesystem we've lost here is /var). Starting around January 4th, one of the two disks began sporadically returning read errors to the software RAID code, which caused the software RAID to redirect reads to the other side of the mirror but not otherwise complain to us about the read errors beyond logging some kernel messages. Since nothing showed up about these read errors in /proc/mdstat, mdadm's monitoring never sent us email about it.

(It's possible that SMART errors were also reported on the drive, but we don't know; smartd monitoring turns out not to be installed by default on CentOS 7 and we never noticed that it was missing until it was too late.)

On the morning of January 27th, the other disk failed outright in a way that caused Linux to mark it as dead. The kernel software RAID code noticed this, of course, and duly marked it as failed. This transferred all IO load to the first disk, the one that had been seeing periodic errors since January 4th. It immediately fell over too; although the kernel has not marked it as explicitly dead, it now fails all IO. Our mirrored filesystem is dead unless we can somehow get one or the other of the drives to talk to us.

The fatal failure here is that nothing told us about the software RAID code having to redirect reads from one side of the mirror to the other due to IO errors. Sure, this information shows up in kernel messages, but so does a ton of other unstructured crap; the kernel message log is the unstructured dumping ground for all sorts of things and as a result, almost nothing attempts to parse it for information (at least not in a standard, regular installation).

Well, let me amend that. It appears that this information is actually available through sysfs, but nothing actually monitors it (in particular mdadm doesn't). There is an errors file in /sys/block/mdNN/md/dev-sdXX/ that contains a persistent counter of corrected read errors (this information is apparently stored in the device's software RAID superblock), so things like mdadm's monitoring could track it and tell you when there were problems. It just doesn't.

(So if you have software RAID arrays, I suggest that you put together something that monitors all of your errors files for increases and alerts you prominently.)
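As a rough illustration of what 'something that monitors your errors files' could look like, here is a minimal Python sketch. It assumes the sysfs layout described above (/sys/block/mdNN/md/dev-sdXX/errors) and that each errors file holds a single integer. The state file location and all of the function names are made up for this example; this is one way to do it, not a tested production tool.

```python
import glob
import json
import os

# Assumption: the sysfs layout described in the entry.
ERRORS_GLOB = "/sys/block/md*/md/dev-*/errors"
# Hypothetical place to remember the previous counts between runs.
STATE_FILE = "/var/lib/md-errors-state.json"


def read_counts(pattern=ERRORS_GLOB):
    """Return {path: count} for every errors file we can read and parse."""
    counts = {}
    for path in sorted(glob.glob(pattern)):
        try:
            with open(path) as f:
                counts[path] = int(f.read().strip())
        except (OSError, ValueError):
            # The device may have vanished mid-scan, or the file may not
            # hold a plain integer; skip it rather than crash.
            continue
    return counts


def find_increases(old, new):
    """Return [(path, old_count, new_count)] for counters that went up.

    A path we have never seen before is treated as starting from zero,
    so the first run reports any device that already has errors.
    """
    return [(p, old.get(p, 0), n)
            for p, n in sorted(new.items())
            if n > old.get(p, 0)]


def load_state(path=STATE_FILE):
    """Load the previously saved counts, or {} if there are none yet."""
    try:
        with open(path) as f:
            return json.load(f)
    except (OSError, ValueError):
        return {}


def save_state(counts, path=STATE_FILE):
    """Save the counts, replacing the old state file in one step."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(counts, f)
    os.replace(tmp, path)


def check(pattern=ERRORS_GLOB, state_file=STATE_FILE):
    """Compare current counts to the saved ones, save, and return increases."""
    old = load_state(state_file)
    new = read_counts(pattern)
    increases = find_increases(old, new)
    save_state(new, state_file)
    return increases
```

You would call check() from cron or your monitoring system and print or email whatever it returns; an empty list means no counter has gone up since the last run. Since the counter is described as persistent, comparing against saved state (instead of alerting on any nonzero value) keeps one old corrected error from nagging you forever.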

LosingMirroredRAIDViaDiskErrors written at 00:48:53
