Wandering Thoughts archives

2010-09-30

Stopping kernel updates on Ubuntu

Suppose that you run production machines, where you don't want to have to reboot things without a bunch of advance planning (or a serious emergency). One of the things you want to do on such a system is block kernel updates. On dpkg-based systems, this is called holding a package.

(One way to do it, the one I use, is 'echo pkgname hold | dpkg --set-selections'. 'dpkg --get-selections | fgrep hold' can then be used to list held packages.)
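
As a rough sketch of the same two commands driven from Python (the function names here are my own invention, not part of dpkg), the version below feeds a 'pkgname hold' line to 'dpkg --set-selections' and parses the two-column output of 'dpkg --get-selections'. It checks that the state column is exactly 'hold', where a plain 'fgrep hold' would also match a package that merely has 'hold' in its name.

```python
import subprocess


def hold_line(pkgname):
    """Build the one-line input that 'dpkg --set-selections' reads on stdin."""
    return f"{pkgname} hold\n"


def held_packages(selections_text):
    """Return the package names whose selection state is exactly 'hold'.

    selections_text is the output of 'dpkg --get-selections': one package
    per line, name and state separated by whitespace.
    """
    held = []
    for line in selections_text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[1] == "hold":
            held.append(fields[0])
    return held


def set_hold(pkgname):
    """Put a package on hold; needs root on a real system."""
    subprocess.run(["dpkg", "--set-selections"],
                   input=hold_line(pkgname), text=True, check=True)


def list_held():
    """Run 'dpkg --get-selections' and return the held package names."""
    result = subprocess.run(["dpkg", "--get-selections"],
                            capture_output=True, text=True, check=True)
    return held_packages(result.stdout)
```

Only set_hold() and list_held() touch the system; the other two functions are pure string handling and can be tried anywhere.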

In order to block all Ubuntu kernel updates, you have to remember that Ubuntu does two sorts of kernel updates:

  • entirely new kernel packages (with the new kernel version in their names).

    As new packages these aren't seen as upgrades to anything already installed on your system, so Ubuntu updates the kernel meta-packages to require the new kernel packages. Holding the meta-packages blocks any chance that these new kernel packages will get pulled in by a routine update.

    In theory 'apt-get -u upgrade' won't install new packages, even dependencies of upgrades of existing packages (you have to use dist-upgrade instead). In practice I'm not sure that I trust that to happen all of the time; holding the meta-packages is harmless and makes sure.

    (From time to time Ubuntu appears to update only the meta-packages, but since a meta-package contains basically nothing, not updating it seems harmless.)

  • 'minor' point releases of existing kernel packages.

    As point releases of an already installed package, these are update candidates on their own (without a meta-package update to go with them), so you have to hold all of the existing kernel packages to block them. This means that you have to remember to apply a hold to any new kernel package that gets installed as a result of updating the meta-packages.

    (If you don't care about older kernel packages, you can either leave them un-held or just remove them.)
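
Putting the two cases together, here is a minimal sketch of working out which installed packages to hold. The name patterns are assumptions on my part (meta-packages such as 'linux-server' or 'linux-image-server', versioned packages such as 'linux-image-2.6.32-24-server'); check them against what 'dpkg --get-selections' actually shows on your machines before trusting them.

```python
import re

# Assumed naming patterns, not taken from any official list:
# meta-packages have no version number in their name ...
META_PACKAGE = re.compile(r"^linux(-image|-headers)?-(generic|server|virtual)$")
# ... while real kernel packages carry the kernel version.
VERSIONED_PACKAGE = re.compile(r"^linux-(image|headers)-\d")


def kernel_packages_to_hold(installed):
    """Pick both the meta-packages and the versioned kernel packages."""
    return sorted(
        name for name in installed
        if META_PACKAGE.match(name) or VERSIONED_PACKAGE.match(name)
    )


def selections_input(packages):
    """Format package names as input for 'dpkg --set-selections'."""
    return "".join(f"{name} hold\n" for name in packages)
```

The result of selections_input() can be piped into 'dpkg --set-selections' in one go. It has to be rerun after any deliberate kernel upgrade, since the newly installed kernel package starts out un-held.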

The way we explicitly upgrade held packages is to use 'apt-get install ...'. There is probably a better command line way, but this one works for us.

(Please do not suggest aptitude. Aptitude's command line interface makes me want to strangle people; it is about five times too clever.)

linux/UbuntuHoldingKernels written at 18:45:05

A lot of my bugs are conceptual oversights

This is in part a war story.

We've written our own system to handle deploying spares on our ZFS fileservers. One of the decisions we made was how much activity to have the system start at once, because we already knew that trying to resilver too many mirrors at once killed performance for everyone. What we decided was that we only wanted one resilver to happen at a time. We also decided that we would abort any ZFS scrubs that were in progress if we needed to activate a spare, because getting a mirror back to redundancy was more important than a precautionary check of a mirror's consistency.

So I wrote two functions, scrubbing_pools and resilvering_pools. Because you need to know the name of the pool (or pools) that are scrubbing in order to abort the scrub, scrubbing_pools returned a list of pools that were scrubbing. resilvering_pools did too because it was a trivial variant of scrubbing_pools, and why not? The code that refused to start a second resilver when there was one already running used the obvious check of 'is the list of resilvering pools empty?'

(The system is written in Python, so returning lists of names is a natural thing to do.)

Today an entire backend went down and rebooted out of the blue, causing ZFS to declare all of its disks bad, and we needed to replace it with a spare backend that we trusted. This meant resilvering every disk currently in use, which we did through the spares system. After a while we decided that resilvering only one disk at a time was too slow and we could probably survive doing three at once, and we wanted the spares system to do the work for us.

So it was time for a quick, obvious, and simple code change: instead of checking for a non-empty list, just count how many entries are in it and see if this is greater than or equal to a 'maximum resilvers' parameter (which defaulted to 1 for backwards compatibility). We tested this, deployed it, watched it work, and left it going. Tonight, as I checked in on the system state, I realized that there was a bug in what I had done.

Can you see it? (You have a big advantage; I've told you that there is one.)

Here is the bug: we want to limit how many disks are resilvering at once, but the code is counting how many pools are resilvering. If a pool has more than one disk that needs resilvering, the code will wind up happily resilvering all of them at once, no matter how many there are.
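
To make the difference concrete, here is a stripped-down illustration. This is not our actual code: the 'status' dictionary (pool name mapped to a list of its resilvering disks), resilvering_disks and the may_start_* functions are all made up for this example, and only the name resilvering_pools comes from the real system.

```python
def resilvering_pools(status):
    """Names of pools that have at least one resilvering disk."""
    return [pool for pool, disks in status.items() if disks]


def resilvering_disks(status):
    """Every disk that is currently resilvering, across all pools."""
    return [disk for disks in status.values() for disk in disks]


def may_start_by_pools(status, max_resilvers=1):
    """The check as deployed: counts pools, not disks."""
    return len(resilvering_pools(status)) < max_resilvers


def may_start_by_disks(status, max_resilvers=1):
    """The check we actually wanted: counts resilvering disks."""
    return len(resilvering_disks(status)) < max_resilvers


# One pool with four disks resilvering, another pool that is idle.
status = {"pool1": ["disk1", "disk2", "disk3", "disk4"], "pool2": []}

print(may_start_by_pools(status, max_resilvers=3))   # True: sees 1 pool
print(may_start_by_disks(status, max_resilvers=3))   # False: sees 4 disks
```

With the default limit of 1 the two checks always agree, since a pool is resilvering exactly when it has at least one resilvering disk. They only come apart once the limit is raised above 1.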

There is no coding bug here, and I would argue that there is not even bad design; the code returned resilvering pools instead of disks for completely sensible reasons, and the difference originally didn't matter (a pool that is resilvering has at least one resilvering disk, which meant that we didn't want to start another). The bug is a conceptual oversight, a mismatch between how I thought of the code and how the code was actually working.

Many of the bugs that I find in my code are conceptual oversights, not straightforward errors or mistakes of implementation. How often I hit conceptual oversights is one of the reasons that I am not as enthused about testing (unit and otherwise) as I could be; I don't think that testing, especially unit testing, is a good way to find them. Conceptual oversights are bugs where you don't think about something at all, and if you don't think about something, how can you write tests that check for it? If your tests turn up a conceptual oversight, it is probably a lucky accident.
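
As a hypothetical illustration of why tests tend to miss this sort of thing, here is the kind of unit test I could plausibly have written for a pool-counting check. The function, the test and the fixture are all invented for this example. The fixture bakes in the same unstated assumption as the code (one resilvering disk per pool), so the test passes and tells me nothing about the case that matters.

```python
def may_start_resilver(status, max_resilvers=1):
    """Pool-counting check; status maps pool name -> resilvering disks."""
    busy_pools = [pool for pool, disks in status.items() if disks]
    return len(busy_pools) < max_resilvers


def test_limit_is_enforced():
    # Fixture written with the same mental model as the code:
    # exactly one resilvering disk in each pool.
    status = {"pool1": ["diskA"], "pool2": ["diskB"], "pool3": ["diskC"]}
    assert may_start_resilver(status, max_resilvers=3) is False
    assert may_start_resilver(status, max_resilvers=4) is True


test_limit_is_enforced()  # passes; the oversight is untouched
```

A fixture with several resilvering disks in a single pool would expose the problem, but I would only think to write one if I had already noticed the pools versus disks distinction.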

(This is one argument for a separate testing group, because they may well not make the same conceptual oversight that you did, especially given that they are not as immersed in the logic of the code as you are.)

programming/ConceptualBugExample written at 02:37:24



This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.