What makes our Ubuntu updates driver program complicated
In response to yesterday's entry on how we sort of automate Ubuntu package updates, which involves a complicated driver program (written in Python) to control a bunch of ssh sessions to our machines, a commenter asked the perfectly sensible and obvious question:
Is there a reason this couldn’t be a bash script that invokes pdsh?
Ultimately the complexity of our driver program is caused by how the Ubuntu package update process is flawed. We might still have a Python program instead of a shell script if the process worked better, but it would at least be a simpler Python program.
There are a number of complicated things that our driver program does (and my list here is somewhat different than my list in my reply comment). The lesser one is that it parses the output of apt-get to determine what packages would be updated or nominally did get updated on machines during an update run. This parsing could theoretically be done in an awk script, but in Python we can take advantage of better data structures to make it clearer and gather more complex data. The obvious thing we do with this complex data is aggregate it by groups of machines that will all apply the same set of package updates; usually this drastically reduces the output down to something that's much easier to follow.
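As a rough illustration of the parsing and grouping (not the driver's actual code), here is a minimal Python sketch. It assumes the scan uses the simulation output of 'apt-get -s upgrade', where each package to be installed appears on an 'Inst' line; the entry doesn't say exactly which apt-get output is parsed, and the names parse_pending and group_hosts are made up for this example.

```python
import re
from collections import defaultdict

# Matches simulated-install lines such as:
#   Inst libssl3 [3.0.2-0ubuntu1.10] (3.0.2-0ubuntu1.12 Ubuntu:22.04/jammy-updates [amd64])
# The bracketed old version is optional (it is absent for newly installed packages).
INST_RE = re.compile(r"^Inst (\S+) (?:\[(\S+)\] )?\((\S+)")


def parse_pending(output):
    """Return {package: (old_version, new_version)} from simulation output."""
    pending = {}
    for line in output.splitlines():
        m = INST_RE.match(line)
        if m:
            pending[m.group(1)] = (m.group(2), m.group(3))
    return pending


def group_hosts(per_host):
    """Group hosts that would apply exactly the same set of package updates.

    per_host maps a host name to the dict returned by parse_pending().
    The result maps a frozenset of (package, new_version) pairs to the
    sorted list of hosts sharing that set.
    """
    groups = defaultdict(list)
    for host, pending in per_host.items():
        key = frozenset((pkg, new) for pkg, (_old, new) in pending.items())
        groups[key].append(host)
    return {key: sorted(hosts) for key, hosts in groups.items()}
```

Using a frozenset as the dictionary key is what makes the aggregation cheap: two machines land in the same group only if their pending updates match exactly, which is the kind of thing that gets awkward in awk.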
(One of the other things we do with this complex data is look for signs of mis-configurations in what Ubuntu packages are held, because sometimes either something goes wrong or a machine was not quite set up correctly. If we spot things like a Samba server package update that would be applied, we print a big warning. This has saved us from awkward problems several times. After the driver's initial scan has finished, we can exclude machines from updates, or we can bail out and hold the packages properly on the machine, then restart the whole process.)
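The held-package check can be sketched along these lines. This is only an assumption about how such a check might look: the expected_holds table (which packages ought to be held on which machine) and the function name are hypothetical, and the real driver may get this information some other way.

```python
def unheld_warnings(pending_by_host, expected_holds):
    """Return warning strings for packages that should be held but would update.

    pending_by_host: {host: iterable of package names with pending updates}
    expected_holds:  {host: set of package names we expect to be held}
                     (a hypothetical local table, not something apt provides)
    """
    warnings = []
    for host in sorted(pending_by_host):
        should_hold = expected_holds.get(host, set())
        leaked = sorted(should_hold & set(pending_by_host[host]))
        if leaked:
            warnings.append(
                f"WARNING: {host} would update packages that should be held: "
                + ", ".join(leaked)
            )
    return warnings
```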
After the initial scan for updates is done, the update driver enters a command loop where it asks what to do next. Typically we tell it to apply updates to everything, but you can also tell it to do a specific machine first, or exclude some machines from what will be updated, and a number of other things. Or you can quit out immediately if you don't actually want to apply updates (perhaps you were just checking what updates were pending). The command loop ends when the update driver thinks it has nothing left to do because all still-eligible machines have had updates applied; at this point the updates driver writes out its final summary and so on.
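The shape of such a command loop might look like the following sketch. The command names ('all', 'only', 'exclude', 'quit') and the DriverState class are invented for illustration; the real driver's commands and internal state are not described in this entry.

```python
from dataclasses import dataclass, field


@dataclass
class DriverState:
    eligible: set                      # hosts that may still be updated
    done: set = field(default_factory=set)
    quit_requested: bool = False

    def remaining(self):
        """Still-eligible hosts that have not had updates applied."""
        return sorted(self.eligible - self.done)


def handle_command(state, line, apply_updates):
    """Apply one operator command; apply_updates(host) does the real work."""
    words = line.split()
    if not words:
        return
    cmd, args = words[0], words[1:]
    if cmd == "quit":
        state.quit_requested = True
    elif cmd == "exclude":
        state.eligible -= set(args)
    elif cmd in ("all", "only"):
        remaining = state.remaining()
        targets = remaining if cmd == "all" else [h for h in args if h in remaining]
        for host in targets:
            apply_updates(host)
            state.done.add(host)
    else:
        raise ValueError(f"unknown command: {cmd}")


def finished(state):
    """The loop ends on quit or when nothing eligible is left to update."""
    return state.quit_requested or not state.remaining()
```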
The most complicated portion of the program and the process is actually applying the updates on each system. When we were basically doing 'ssh host apt-get -y upgrade' in an earlier version of our update automation, we found that it would periodically stall on some host and then we would have a problem; sometimes apt-get wanted to ask us a question, and sometimes it just ran into issues. So our current approach is to run the updates in what 'ssh -t' and apt-get think is an interactive environment, capture all of their output without spewing it over our terminal, and then if things seem to go wrong allow us to step into the session to answer questions, sort things out, or just see where things stalled. Mechanically we use the third party Python pexpect module, which I had some learning experiences with (although I see that the module has been updated since then).
(The driver's current way of detecting problems is if an update produces no output for a sufficiently long time. We can also immediately step in if we want to.)
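To make the "no output for too long" idea concrete, here is a minimal pexpect sketch. It is not the driver's code: the function names, the 300 second default, and the choice to read in 4096-byte chunks are all assumptions for the example. It relies on read_nonblocking() raising pexpect.TIMEOUT when nothing arrives within the timeout, and on interact() to hand the session to the operator's terminal.

```python
import sys


def ssh_argv(host, remote_argv):
    """Build the local command line; 'ssh -t' forces a remote pseudo-terminal."""
    return ["ssh", "-t", host] + list(remote_argv)


def run_update(host, remote_argv, stall_timeout=300, log=None):
    """Run remote_argv on host, stepping in if the session goes quiet.

    log is an optional text file object that receives everything the
    session prints. Returns the ssh exit status (None if ssh was killed
    by a signal).
    """
    # Third-party module; imported here so ssh_argv() is usable without it.
    import pexpect

    argv = ssh_argv(host, remote_argv)
    child = pexpect.spawn(argv[0], argv[1:], encoding="utf-8")
    child.logfile_read = log  # capture output instead of printing it
    while True:
        try:
            # Raises TIMEOUT if there is no output at all for stall_timeout
            # seconds, and EOF once the session has ended.
            child.read_nonblocking(size=4096, timeout=stall_timeout)
        except pexpect.TIMEOUT:
            sys.stderr.write(f"{host}: no output for {stall_timeout}s, stepping in\n")
            child.interact()  # returns on the escape character or at exit
        except pexpect.EOF:
            break
    child.close()
    return child.exitstatus
```

A caller might use it as run_update("somehost", ["apt-get", "upgrade"], log=open("somehost.log", "w")), where "somehost" is a placeholder.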
In theory apt-get and dpkg have settings that should let the update process automatically pick the default answer for any question a package update wants to ask us. In practice, we don't trust the default answer to always be sensible on package upgrades, although we do try to tell dpkg to always pick our own local version of configuration files to cut down on the questions we get asked.
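One way to ask dpkg to keep the locally installed version of configuration files is its --force-confold option, passed through apt-get with '-o Dpkg::Options::=...'. The entry doesn't say which setting is actually used, so treat this sketch (and its function name) as an assumption. It deliberately leaves out --force-confdef and '-y', since the point is not to accept default answers blindly.

```python
def upgrade_argv(keep_local_conffiles=True):
    """Build an 'apt-get upgrade' command line as an argv list.

    With keep_local_conffiles set, dpkg is told to keep the currently
    installed configuration file instead of asking about it.
    """
    argv = ["apt-get"]
    if keep_local_conffiles:
        argv += ["-o", "Dpkg::Options::=--force-confold"]
    argv.append("upgrade")
    return argv
```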
Because Ubuntu package updates and apt-get operations are slow, we want to be able to run package updates in parallel, although we don't always do so. This adds extra complications to stepping into apt-get sessions, as you might expect, and there's a certain amount of code to coordinate all of this. Also, if one session has to be stepped into, we don't want to automatically continue on to do other (serial) updates, in case this is a systemic issue with this set of updates that we want to deal with before we proceed. Similarly if one update session fails outright (with ssh returning an error code), the driver pauses and waits for further directions.
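The "don't automatically continue" rule for serial updates can be sketched as below. The three result strings and the update_one callback are hypothetical stand-ins for whatever the real driver tracks, and the parallel case (which needs more coordination) is not shown.

```python
def run_serially(hosts, update_one):
    """Update hosts one at a time, pausing at the first sign of trouble.

    update_one(host) is assumed to return "ok", "stepped-in" (the operator
    had to enter the session), or "failed" (ssh exited with an error).
    Returns (results, not_yet_done) so the caller can go back to the
    command loop and ask what to do about the hosts that were not reached.
    """
    results = {}
    hosts = list(hosts)
    for i, host in enumerate(hosts):
        results[host] = update_one(host)
        if results[host] != "ok":
            # Possibly a systemic problem with this set of updates: stop here.
            return results, hosts[i + 1:]
    return results, []
```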
(The entire reason the driver exists is so that we don't have to do updates one by one with manual attention. If a particular package update turns out to require manual attention, we will often either hold the package to block the update until we can figure things out, or directly update the affected machines by hand. If we have to interact with an 'apt-get upgrade', running it directly on the machine instead of through the driver is better.)
The updates driver also has a second mode that is used to update held packages. In this mode, we run 'apt-get install <...>' for the specific packages we want to update, instead of the usual 'apt-get upgrade', and the update driver's command loop now has commands for selecting what package or packages should be updated (we don't necessarily want to update all held packages on a machine). This is typically used for things like kernel updates, where we want to mass update all of our machines. Updates of per-machine held packages (like the Samba server) are often done by just logging in to the machine and doing the process by hand (we often want to monitor daemon logs and so on anyway).
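For the held-package mode, the selection step might reduce to something like this sketch, which turns the operator's chosen packages into a per-host 'apt-get install' command. The data layout and function name are assumptions for illustration.

```python
def held_update_commands(pending_held, selected):
    """Build per-host 'apt-get install' argv lists for selected held packages.

    pending_held: {host: set of held packages that have pending updates}
    selected:     set of package names the operator chose to update
    Hosts with none of the selected packages pending are left out.
    """
    commands = {}
    for host, packages in pending_held.items():
        wanted = sorted(packages & selected)
        if wanted:
            commands[host] = ["apt-get", "install"] + wanted
    return commands
```

Only the packages that were explicitly selected end up on the command line, so other held packages on the same machine stay untouched.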
(There are also some ancillary modes of operation, like a dry run mode and a mode to just report on what held packages have pending updates. Additional features let us control which machines it operates on, including trying to update machines that aren't in our normal list of machines to update.)
PS: Probably the updates driver has too many features. Certainly it has features that we don't really use, and some that I'd forgotten about until I re-read its full help text. It's one of those programs where my enthusiasm may have gotten away from me when I wrote it.