Ubuntu's packaging failure with mcelog in 14.04

September 18, 2014

For vague historical reasons we've had the mcelog package in our standard package set. When we went to build our new 14.04 install setup, this blew up on us; on installation, some of our machines would report more or less the following:

Setting up mcelog (100-1fakesync1) ...
Starting Machine Check Exceptions decoder: CPU is unsupported
invoke-rc.d: initscript mcelog, action "start" failed.
dpkg: error processing package mcelog (--configure):
 subprocess installed post-installation script returned error exit status 1
Errors were encountered while processing:
 mcelog
E: Sub-process /usr/bin/dpkg returned an error code (1)

Here we see a case where a collection of noble intentions has had terrible results.

The first noble intention is a desire to warn people that mcelog doesn't work on all systems. Rather than silently run uselessly or silently exit successfully, mcelog instead reports an error and exits with a failure status.

The second noble intention is the standard Debian noble intention (inherited by Ubuntu) of automatically starting most daemons on installation. You can argue that this is a bad idea for things like database servers, but for basic system monitoring tools like mcelog and SMART monitoring I think most people actually want this; certainly I'd be a bit put out if installing smartd didn't actually enable it for me.

(A small noble intention is that the init script passes mcelog's failure status up, exiting with a failure itself.)

The third noble intention is that it is standard Debian behavior for an init script that fails when it is started in the package's postinstall script to cause the postinstall script itself to exit out with errors (it's in a standard dh_installinit stanza). When the package postinstall script errors out, dpkg itself flags this as a problem (as well it should) and boom, your entire package install step is reporting an error and your auto-install scripts fall down. Or at least ours do.

The really bad thing about this is that server images can change hardware. You can transplant disks from one machine to another for various reasons; you can upgrade the hardware of a machine but preserve the system disks; you can move virtual images around; you can (as we do) have standard machine building procedures that want to install a constant set of packages without having to worry about the exact hardware you're installing on. This mcelog package behavior damages this hardware portability in that you can't safely install mcelog in anything that may change hardware. Even if the initial install succeeds or is forced, any future update to mcelog will likely cause you problems on some of your machines (since a package update will likely fail just like a package install).

(This is a packaging failure, not an mcelog failure; given that mcelog may be unable to work on some of the machines it's installed on, the init script failure should not cause a fatal postinstall script failure. Of course the people who packaged mcelog may well not have known that it had this failure mode on some machines.)
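To make the chain of failures concrete, here is a minimal sketch in plain sh. It does not reproduce the real dh_installinit stanza or the real mcelog init script; fake_start_mcelog, postinst_strict and postinst_tolerant are hypothetical names that only simulate the behavior described above. The strict version passes the start failure up as its own failure, and the tolerant version is one way a package could report the problem without failing the install.

```shell
#!/bin/sh
# Hypothetical stand-in for starting mcelog on a machine where it refuses
# to run: it prints the error and fails, as in the install log above.
fake_start_mcelog() {
    echo "Starting Machine Check Exceptions decoder: CPU is unsupported" >&2
    return 1
}

# Strict stanza (simulated): a failed start becomes the postinstall
# script's own failure, which dpkg then reports as an install error.
postinst_strict() {
    fake_start_mcelog || return $?
    return 0
}

# Tolerant stanza (an assumption, not what the package does): a failed
# start is reported, but the postinstall script still succeeds.
postinst_tolerant() {
    fake_start_mcelog || echo "warning: mcelog did not start, continuing" >&2
    return 0
}

postinst_strict || echo "strict postinst failed with status $?"
postinst_tolerant && echo "tolerant postinst succeeded"
```

The tolerant form trades a loud failure for a warning that is easy to miss, which is probably the right trade for a monitoring tool that simply cannot run on some hardware.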

I'm sort of gratified to report that Debian has a bug for this, although the progress of the bug does not fill me with great optimism, and of course it's probably not important enough to ever make it into Ubuntu 14.04 (although there's also an Ubuntu bug).

PS: since mcelog has never done anything particularly useful for us, we have not been particularly upset over dropping it from our list of standard packages. Running into the issue was a bit irritating, but mcelog seems to be historically good at irritation.

PPS: the actual problem mcelog has is even more stupid than 'I don't support this CPU'; in our case it turns out to be 'I need a special kernel module loaded for this machine but I won't do it for you'. It also syslogs (but does not usefully print) a message that says:

mcelog: AMD Processor family 16: Please load edac_mce_amd module.#012: Success

See eg this Fedora bug and this Debian bug. Note that the message really means 'family 16 and above', not 'family 16 only'.
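If you want to know ahead of time which machines are likely to hit this, a rough check is possible from the shell. This is only a sketch under assumptions: it assumes the usual Linux /proc/cpuinfo layout (vendor_id and cpu family lines, with the family in decimal), it treats 'family 16 and above' as the condition, and it assumes a loaded edac_mce_amd shows up as /sys/module/edac_mce_amd. The function name needs_edac_mce_amd is made up for illustration.

```shell
#!/bin/sh
# needs_edac_mce_amd [CPUINFO_FILE]
# Succeeds when the first CPU listed is an AMD CPU of family 16 or above.
# The file argument defaults to /proc/cpuinfo and exists to allow testing.
needs_edac_mce_amd() {
    cpuinfo=${1:-/proc/cpuinfo}
    [ -r "$cpuinfo" ] || return 1
    # Lines look like "vendor_id<TAB>: AuthenticAMD"; split on the colon.
    vendor=$(awk -F': *' '/^vendor_id/ { print $2; exit }' "$cpuinfo")
    family=$(awk -F': *' '/^cpu family/ { print $2; exit }' "$cpuinfo")
    [ "$vendor" = "AuthenticAMD" ] && [ "${family:-0}" -ge 16 ]
}

# Assumption: a loaded module appears as a directory under /sys/module.
if needs_edac_mce_amd && [ ! -d /sys/module/edac_mce_amd ]; then
    echo "mcelog will probably want edac_mce_amd loaded on this machine"
fi
```

Whether loading the module first (for example with modprobe) is enough to make the package install cleanly is not something this check can tell you.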

Written on 18 September 2014.


Last modified: Thu Sep 18 01:57:09 2014