Wandering Thoughts archives

2015-04-24

A DKMS problem I had with lingering old versions

I use the DKMS-based approach for my ZFS on Linux install, fundamentally because using DKMS makes upgrading kernels painless and convenient. It's worked well for a long time, but recently some DKMS commands, particularly 'dkms status', started erroring out with the odd message:

Error! Could not locate dkms.conf file.
File:  does not exist.

Since everything seemed to still work I shrugged my shoulders and basically ignored it. I don't know DKMS well myself; as far as I've been concerned, it's just as much magic as, oh, /bin/kernel-install (which, if you're not familiar with it, is what Fedora runs to set up new kernels). I did a little bit of Internet searching for the error message but turned up nothing that seemed particularly relevant. Then today I updated to a new Fedora kernel, got this message again, and in an excess of caution decided to make sure that I actually had the ZoL binary modules built and installed for the new kernel. Well, guess what? I didn't. Nor could I force them to be built for the new kernel; commands like 'dkms install ...' kept failing with this error message or variants of it.

(I felt very happy about checking before I rebooted the system into the new kernel and had it come up without my ZFS pools.)

I will cut to the chase. ZFS on Linux recently released version 0.6.4; I had previously been running development versions that still called themselves 0.6.3 for DKMS purposes. When I upgraded to 0.6.4, something in the whole process left behind some 0.6.3 directory hierarchies in a DKMS area, specifically /var/lib/dkms/spl/0.6.3 and /var/lib/dkms/zfs/0.6.3. Removing these lingering directory trees made DKMS happy with life and allowed me to eventually build and install the 0.6.4 SPL and ZFS modules for the new kernel.

(The dkms.conf file(s) that DKMS was looking for are normally found in /usr/src/<pkg>-<ver>. My theory is that the lingering directories in /var/lib/dkms were fooling DKMS into thinking that spl and zfs 0.6.3 were installed, and then it couldn't find their dkms.conf files under /usr/src and errored out.)
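For the record, the check and cleanup here boil down to something like the following sketch (the spl and zfs names and the 0.6.3 and 0.6.4 versions are from my setup, and the 'dkms install' invocations are the standard syntax as I understand it, not necessarily what your particular packages expect):

# see what DKMS thinks is registered; this is what was erroring out for me
dkms status

# compare the version trees DKMS has against the source actually under /usr/src
ls /var/lib/dkms/spl /var/lib/dkms/zfs
ls -d /usr/src/spl-* /usr/src/zfs-*

# remove the stale trees once you're sure nothing still needs them
rm -rf /var/lib/dkms/spl/0.6.3 /var/lib/dkms/zfs/0.6.3

# then build and install for the new kernel, eg
dkms install spl/0.6.4 -k <new kernel version>
dkms install zfs/0.6.4 -k <new kernel version>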

I have no idea if this is a general DKMS issue, something that I only ran into because of various somewhat eccentric things I wound up doing on my machine, or some DKMS-related thing that the ZoL packages are doing slightly wrong (which has happened before). At least I've solved it and 'dkms status' is now happy with life.

(I can't say I've deeply increased my DKMS knowledge in the process. DKMS badly needs a 'so you're a sysadmin and something has gone wrong with a DKMS-based package, here's what you do next' document. Also, this is obviously either a bad or a buggy error message.)

DKMSLingeringVersionProblem written at 01:13:20

2015-04-10

I wish systemd would get over its thing about syslog

Anyone who works with systemd soon comes to realize that systemd just doesn't like syslog very much. In fact systemd is so unhappy with syslog that it invented its own logging mechanism (in the form of journald). This is not news. What people who don't have to look deeply into the situation often don't realize is that this dislike runs deep enough that systemd just doesn't interact very well with syslog.

I won't say that bugs and glitches 'abound', because I've only run into two issues so far (although both issues are relatively severe). One was that systemd mis-filed kernel messages under the syslog 'user' facility instead of the 'kernel' one; this bug made it past testing and into RHEL 7 / CentOS 7. The other is that sometimes on boot, randomly, systemd will barf up a significant chunk of old journal messages (sometimes very old) and re-send them to syslog. If you don't scroll back far enough while watching syslog logs, this can lead you to believe that something really bad and weird has happened.

(This has actually happened to me several times.)

This is stupid and wrongheaded on systemd's part. Yes, systemd doesn't like syslog. But syslog is extremely well established and extremely useful, especially in the server space. Part of that is historical practice, part of it is that syslog is basically the only cross-platform logging technology we have, and part of it is that you can do things like forward syslog to other machines, aggregate logs from multiple machines on one, and so on (and do so in a cross-platform way). And a good part of it is that syslog is simple text, and it's always been easy to do a lot of powerful ad-hoc stuff with text. That systemd continually allows itself to ignore and interact badly with syslog makes everyone's life worse (except perhaps the systemd authors'). Syslog is not going away just because the systemd authors would like it to, and it is high time that systemd actually accepted that and started not just sort of working with syslog but working well with it.

One of systemd's strengths until now has been that it played relatively well (sometimes extremely well) with existing systems, warts and all. It saddens me to see systemd increasingly throw that away here.

(And I'll be frank, it genuinely angers me that systemd may feel that it can get away with this, that systemd is now so powerful that it doesn't have to play well with other systems and with existing practices. This sort of arrogance steps on real people; it's the same arrogance that leads people to break ABIs and APIs and then tell others 'well, that's your problem, keep up'.)

PS: If systemd people feel that systemd really does care about syslog and does its best to work well with it, well, you have two problems. The first is that your development process isn't managing to actually achieve this, and the second is that you have a perception problem among systemd users.

SystemdAndSyslog written at 23:42:47

2015-04-09

Probably why Fedora puts their release version in package release numbers

Packaging schemes like RPM and Debian debs split full package names up into three components: the name, the (upstream) version, and the (distribution) release of the package. Back when people started making RPM packages, the release component tended to be just a number, giving you full names like liferea-1.0.9-1 (this is release 1 of Liferea 1.0.9). As I mentioned recently, modern Fedora practice has changed so that the release number includes the distribution version. Today we have liferea-1.10.13-1.fc21 instead (on Fedora 21, as you can see). Looking at my Fedora systems, this appears to be basically universal.

Before I started writing this entry and really thinking about the problem, I thought there was a really good deep reason for this. However, now I think it's so that if you're maintaining the same version of a package on both Fedora 20 and Fedora 21, you can use the exact same .spec file. It also makes automated rebuilds of packages for (and in) new Fedora versions easier and makes upgrades work better (in that someone upgrading between Fedora versions will wind up with the new version's packages).

The simple magic is in the .spec file:

Release: 1%{?dist}

The RPM build process will substitute this in at build time with the Fedora version you're building on (or for), giving you release numbers like 1.fc20 and 1.fc21. Due to this substitution, any RPM .spec file that does releases this way can be automatically rebuilt on a new Fedora version without needing any .spec file changes (and you'll still get a new RPM version that will upgrade right, since RPM sees 1.fc21 as being more recent than 1.fc20).
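As a side note, you can ask rpm directly what the dist macro expands to on a particular machine; on a Fedora 21 system this should print '.fc21':

rpm --eval '%{?dist}'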

The problem that this doesn't really deal with (and I initially thought it did) is wanting to build an update to the Fedora 20 version of a RPM without updating the Fedora 21 version. If you just increment the release number of the Fedora 20 version, you get 2.fc20 and the old 1.fc21 and then upgrades won't work right (you'll keep the 2.fc20 version of the RPM). You'd have to change the F20 version to a release number of, say, '1.fc20.1'; RPM will consider this bigger than 1.fc20 but smaller than 1.fc21, so everything works out.
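If you want to double check how RPM will order a particular pair of versions before committing to a release number, my understanding is that the rpmdev-vercmp tool from the rpmdevtools package will compare two version-release strings for you, for example:

rpmdev-vercmp 1.0.9-1.fc20.1 1.0.9-1.fc20
rpmdev-vercmp 1.0.9-1.fc20.1 1.0.9-1.fc21

The first comparison should report that 1.fc20.1 is the newer release and the second that 1.fc21 is.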

(I suspect that the current Fedora answer here is 'don't try to do just a F20 rebuild; do a pointless F21 rebuild too, just don't push it as an update'. Really there aren't many situations where you'd need to do a rebuild without any changes in the source package, and if you change the source package, eg to add a new patch, you probably want to do a F21 update too. I wave my hands.)

PS: I also originally thought that Ubuntu does this too, but no; while Ubuntu embeds 'ubuntu' in a lot of their package release numbers, it's not specific to the Ubuntu version involved and any number of packages don't have it. I assume it marks packages where Ubuntu deviates from the upstream Debian package in some way, eg included patches and so on.

FedoraRPMReleaseNumberIssue written at 00:50:23

2015-04-07

How Ubuntu and Fedora each do kernel packages

I feel the need to say things about the Ubuntu (and I believe Debian) kernel update process, but before I do that I want to write down how kernel packages look on Ubuntu and Fedora from a sysadmin's perspective because I think a number of people have only been exposed to one or the other. The Fedora approach to kernel packages is also used by Red Hat Enterprise Linux (and CentOS) and probably other Linux distributions that use yum and RPMs. I believe that the Ubuntu approach is also used by Debian, but maybe Debian does it a bit differently; I haven't run a real Debian system.

Both debs and RPMs have the core concepts of a package having a name, an upstream version number, and a distribution release number. For instance, Firefox on my Fedora 21 machine is currently firefox, upstream version 37.0, and release 2.fc21 (increasingly people embed the distribution version in the release number for reasons beyond the scope of this entry).

On Fedora you have some number of kernel-... RPMs installed at once. These are generally all instances of the kernel package (the package name); they differ only in their upstream version number and their release number. Yum normally keeps the most recent five of them for you, deleting the oldest when a 'yum upgrade' installs a new version of the kernel package. This gives you a list of main kernel packages that looks like this:

kernel-3.18.8-201.fc21.x86_64
kernel-3.18.9-200.fc21.x86_64
kernel-3.19.1-201.fc21.x86_64
kernel-3.19.2-201.fc21.x86_64
kernel-3.19.3-200.fc21.x86_64

Here the kernel RPM with upstream version 3.19.3 and Fedora release version 200.fc21 is the most recent kernel I have installed (and this is a 64-bit machine as shown by the x86_64 architecture).

(This is a slight simplification. On Fedora 21, the kernel is actually split into three kernel packages: kernel, kernel-core, and kernel-modules. The kernel package for a specific version is just a meta-package that depends (through a bit of magic) on its associated kernel-core and kernel-modules packages. Yum knows how to manage all of this so you keep five copies not only of the kernel meta-package but also of the kernel-core and kernel-modules packages and so on. Mostly you can ignore the sub-packages in Fedora; I often forget about them. In RHEL up through RHEL 7, they don't exist and their contents are just part of the kernel package; the same was true of older Fedora versions.)
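As far as I know, the 'keep the most recent five' handling isn't magic; kernels are 'installonly' packages (installed side by side instead of upgraded in place) and how many of them yum keeps comes from the installonly_limit setting. The relevant /etc/yum.conf line looks something like this, although the exact default value may differ between versions:

installonly_limit=5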

Ubuntu is more complicated. There is a single linux-image-generic (meta-)package installed on your system and then some number of packages with the package name of linux-image-<version>-<release>-generic for various <version> and <release> values. Each of these packages has a deb upstream version of <version> and a release version of <release>.<number>, where the number varies depending on how Ubuntu built things. Each specific linux-image-generic package version depends on a particular linux-image-<v>-<r>-generic package, so when you upgrade to a new linux-image-generic version it pulls in that specific kernel (at whatever the latest package release of it is).

Because of all of this, Ubuntu systems wind up with multiple kernels installed at once by the side effects of updating linux-image-generic. A new package version of l-i-g will depend on and pull in an entirely new linux-image-<v>-<r>-generic package, leaving the old linux-image-*-generic packages just sitting there. Unlike with yum, nothing in plain apt-get limits how many old kernels you have sitting around; if you leave your server alone, you'll wind up with copies of all kernel packages you've ever used. As far as the Ubuntu package system sees it, these are not multiple versions of the same thing but entirely separate packages, each of which you have only one version of.

This gives you a list of packages that looks like this (splitting apart the package name and the version plus Ubuntu release, what 'dpkg -l' calls Name and Version):

linux-image-3.13.0-24-generic   3.13.0-24.47
linux-image-3.13.0-45-generic   3.13.0-45.74
linux-image-3.13.0-46-generic   3.13.0-46.79
linux-image-3.13.0-48-generic   3.13.0-48.80

linux-image-generic             3.13.0.48.55

(I'm simplifying again; on Ubuntu 14.04 there are also linux-image-extra-<v>-<r>-generic packages.)

On this system, the current 3.13.0.48.55 version of linux-image-generic depends on and thus requires the linux-image-3.13.0-48-generic package, which is currently 'at' the nominal upstream version 3.13.0 and Ubuntu release 48.80. Past Ubuntu versions of linux-image-generic depended on the other linux-image-*-generic packages and caused them to be installed at the time.
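You can see this dependency chain directly by running 'apt-cache depends linux-image-generic'; on a system like the one above the output should look something like this (trimmed down, and the exact versions will obviously differ):

linux-image-generic
  Depends: linux-image-3.13.0-48-generic
  [...]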

I find the Fedora/RHEL approach to be much more straightforward than the Ubuntu approach. With Fedora, you just have N versions of the kernel package installed at once; done. With Ubuntu, you don't really have multiple versions of any given package installed; you just have a lot of confusingly named packages, each of which has one version installed, and these packages get installed on your system as a side effect of upgrading another package (linux-image-generic). As far as I know the Ubuntu package system doesn't know that all of these different named packages are variants of the same thing.

(A discussion of some unfortunate consequences of this Ubuntu decision is beyond the scope of this entry. See also.)

Sidebar: kernel variants

Both Ubuntu and Fedora have some variants of the kernel; for instance, Fedora has a PAE variant of their 32-bit x86 kernel. On Fedora, these get a different package name, kernel-pae, and everything else works in the same way as for normal kernels (you can have both PAE and regular kernels installed at the same time, and yum will keep the most recent five of each).

On Ubuntu I believe these get a different meta-package that replaces linux-image-generic, for example linux-image-lowlatency, and versions of this package depend on specific kernel packages with different names, like linux-image-<v>-<r>-lowlatency. You can see the collection with 'apt-cache search linux-image'.

Both Fedora and Ubuntu have changed how they handled kernel variants over time; my memory is that Ubuntu had to change more in order to become more sane. Today their handling of variants strikes me as reasonably close to each other.

UbuntuVsFedoraKernelPackages written at 01:24:19

2015-04-04

An important note if you want to totally stop an IKE IPSec connection

Suppose, hypothetically, that you think your IPSec GRE tunnel may be contributing to some weird connection problem you're having. In order to get it out of the picture, you want to shut it down (which will still leave you able to reach things). There are three ways you can do this: you can use 'ipsec whack --terminate' to ask your local pluto to shut down this specific IKE connection (which you've engineered to stop the GRE tunnel), you can shut your local pluto down entirely with 'systemctl stop pluto' (or equivalent), or you can stop pluto on both ends.

I will skip to the punchline: if you have no *protoport set (so that you're doing IPSec on all traffic just because you might as well), you need to shut pluto down on both ends. Merely shutting down the IKE IPSec stuff for your GRE tunnel (and taking down the tunnel itself) will leave the overall IPSec security policy intact, and this policy specifically instructs the kernel to drop any non-IPSec packets between your left and right IPs. Only shutting down pluto itself will get rid of the security policy, and since you have to get rid of it on both ends, you need to shut down pluto on both.

(If pluto is handling more than one connection for you on one of the ends, you're going to need to do something more complicated. My situation is usefully simple here.)
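To put this in concrete terms (using the connection name from my configuration and assuming a pluto setup like mine), the difference is roughly:

# not enough on its own: this takes down the IKE connection (and with it
# my GRE tunnel), but it leaves the IPSec security policy in place
ipsec whack --terminate --name cksgre

# what actually removes the security policy; do this on *both* ends
systemctl stop pluto

# verify that no leftover policies remain
setkey -DP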

If you shut down pluto on only one end and then keep trying to test things, you can get into very puzzling and head-scratching problems. For instance, if you try to make a connection from the shut-down side to the side with pluto still running, tcpdump on both ends will tell you that SYN packets are being sent and arriving at their destination but are getting totally ignored despite there being no firewall rules and so on that would do this.

(If you have a selective *protoport set, any traffic that would normally be protected by IPSec will be affected by this because the security policy says 'drop any of this traffic that is not protected with IPSec'.)

PS: your current IPSec security policies can be examined with 'setkey -DP'. There's probably some way to get a counter of how many packets have been dropped for violating IPSec security policies, but I don't know what it is (maybe it's hiding somewhere in 'ip xfrm', which has low-level details of this stuff, although /proc/net/xfrm_stat doesn't seem to be it).

IKEShuttingDownConnection written at 03:30:31

A weird new IKE IPSec problem that I just had on Fedora 21's latest kernel

Back when I first wrote up my IKE configuration for my point-to-point GRE tunnel, I restricted the IKE IPSec configuration so that it would only apply IPSec to the GRE traffic with:

conn cksgre
   [...]
   leftprotoport=gre
   rightprotoport=gre
   [...]

I only did this restriction out of caution and to match my old manual configuration. A while later I decided that it was a little silly; although I basically sent no unencrypted traffic to the special GRE touchdown IP address I use at the work end, I might as well fully protect the traffic since it was basically free. So I took the *protoport restrictions out, slightly increasing my security, and things worked fine for quite some time.
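In concrete terms the connection definition just lost those two lines, leaving something like this (with everything else unchanged):

conn cksgre
   [...]
   # no leftprotoport/rightprotoport, so the IPSec policy now covers
   # all traffic between the left and right IPs, not just GRE
   [...]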

Today this change quietly blew up in my face. The symptoms were that often (although not always) a TCP connection specifically between my home machine and the GRE touchdown IP would stall after it transferred some number of bytes (it's possible that the transfer direction mattered but I haven't tested extensively). Once I narrowed down what was going on from the initial problems I saw, reproduction was pretty consistent: if I did 'ssh -v touchdown-IP' from home I could see it stall during key exchange.

I don't know what's going on here, but it seems specific to running the latest Fedora 21 kernel on both ends; I updated my work machine to kernel 3.19.3-200.fc21 a couple of days ago and did not have this problem, but I updated my home machine to 3.19.3-200.fc21 a few hours ago and started seeing this almost immediately (although it took some time and frustration to diagnose just what the problem was).

(I thought I had some evidence from tcpdump output but in retrospect I'm not sure it meant what I thought it meant.)

(I had problems years ago with MTU collapse in the face of recursive GRE tunnel routing, but that was apparently fixed back in 2012 and anyways this is kind of the inverse of that problem, since this is TCP connections flowing outside my GRE tunnel. Still, it feels like a related issue. I did not try various ways of looking at connection MTUs and so on; by the time I realized this was related to IPSec instead of other potential problems it was late enough that I just wanted the whole thing fixed.)

IKEAndIPSecNewIssue written at 03:29:50

