2005-09-22
Excluding buggy RPMs from a yum repository
We mirror the Fedora Core 4 updates area and use yum to install
things from it. Red Hat recently released an extremely unsuitable
(for us) update to xorg-x11 (as recounted in MoreFC4Problems), so we
needed to take those buggy RPMs out of circulation and keep them out.
Fortunately this turns out to be pretty easy, because not all RPMs in
a directory have to be in the repository metadata that yum
uses. This let us continue to pull a full mirror from Red Hat (keeping
the mirroring script simple) and just build our own metadata that
excludes the buggy RPMs we don't want.
Modern yum repository metadata is created with the createrepo
command (not yum-arch, despite the latter being packaged with
yum and the former not), which is in the createrepo RPM.
createrepo has a -x option to exclude RPMs from the metadata it
makes; -x even takes a glob pattern, like say
xorg-x11-*6.8.2-37.FC4.48.1* (remember to quote it).
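As a rough sketch of what the metadata rebuild might look like after each mirror run: the mirror path below is a made-up placeholder, and the CREATEREPO variable exists only so the command can be swapped out for a dry run. The -x option and the glob are the ones mentioned above.

```shell
# Sketch only: /data/mirror/fc4-updates is a hypothetical path.
MIRROR_DIR=${MIRROR_DIR:-/data/mirror/fc4-updates}
# Override CREATEREPO (e.g. CREATEREPO=echo) to see the command without running it.
CREATEREPO=${CREATEREPO:-createrepo}

rebuild_metadata() {
    # The glob is quoted so that createrepo, not the shell, expands it.
    "$CREATEREPO" -x 'xorg-x11-*6.8.2-37.FC4.48.1*' "$MIRROR_DIR"
}
```

Calling rebuild_metadata at the end of the mirroring script would regenerate the metadata without the excluded RPMs, while the RPM files themselves stay in the directory.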
There are two drawbacks with this:
- createrepo takes a while to run.
- We're now dependent on having a Fedora Core 4 machine around to run the mirroring on (because we probably need to use a createrepo version that matches the Fedora Core 4 yum, so they agree on the metadata format).
Hopefully someday Red Hat will fix the bugs and we can go back to using all of the current updates and thus to just mirroring the update repository metadata as well as all of the RPMs.
(Because we are only excluding a specific version of xorg-x11,
createrepo will still include the previous, working Fedora Core 4
xorg-x11 update in the metadata. Which is what we want, because we
definitely want that update installed.)
2005-09-21
More Fedora Core 4 problems with X
Important update: the buggy RPMs have been superseded. See the end.
You may remember back to FC4FirstIrritations, where I talked about
several X problems. Well, Red Hat managed to outdo itself recently, as
we discovered the hard way: their latest update to the xorg-x11
packages, 6.8.2-37.FC4.48.1 (released September 16th or so) now
causes almost all of our workstations to explode.
Not literally exploded, but X won't start and more or less crashes the machine (one bug report compared it to what a crashed C-64 looks like, and I have to agree). It turns out that this update appears to be totally broken on many if not all Intel graphics hardware, such as the 865G chipset that our HP Deskpro D530s use.
The Red Hat Bugzilla bugs filed so far are #168919, #168937 (i945 graphics), and #168940 (i865G graphics). They've apparently all been filed very recently, which at least makes me feel somewhat less annoyed.
Unfortunately, we mirror the entire Red Hat updates repository and
use yum, which requires carefully built repository metadata. This
means that clearing this update out of our system is not as simple as
removing the RPMs; we have to fix our mirror scripts, move to manually
building the repository metadata, and then test the result.
Since a reinstall test takes half an hour or more, there went a lot of my day. Fedora Core 4 has really turned into the update that I wish we'd skipped.
Clearly, we need to adopt new rules for Fedora Core 4 updates:
- never apply updates within a week of their release (unless they are urgent security fixes).
- always check Bugzilla for new bugs caused by the updates.
It's a rather sad day when we have to adopt these rules, but that is apparently life with Fedora Core 4.
(You may gather that I am not happy with this state of affairs.)
Update, September 23rd: The good news is that this issue has now
been fixed; today Red Hat released a new xorg-x11 update, version
6.8.2-37.FC4.49.2, that doesn't have the problem. The master Bugzilla
bug for this issue turns out to be
Bugzilla #168752,
which has various bits of information.
I am still somewhat unhappy about this state of affairs, but less unhappy than I was when I first wrote this entry. Part of it is that the problem is gone; part of it is that Red Hat dealt with the issue pretty rapidly, without dragging their heels (as I feel they have with some Fedora Core 4 bugs). I do think that Red Hat needs a procedure for pulling seriously problematic updates from their updates area (and the mirrors, which makes it more challenging), and that they should have used it in this situation.
What vendor updates are pending on your Linux system?
One of those routine system management tasks these days is
checking to make sure that I'm up to date on vendor security releases
and other updates. And when I'm not up to date, I generally want to
know what's out of date; being out of date on diff is one thing,
being out of date on a kernel or glibc is quite another. (For a
start, I probably want to test the kernel a bit more before pushing it
into our local update queue.)
Since I have a lot of systems, I like a concise report; say, one line with all the pending packages for systems that need updates, and total silence from systems that are current (in other words, I'm a Unix geek and I like Unix-style output).
Unfortunately, most package management systems are not quite as eager to accommodate my minimalism as I would like, so it takes some scripting work.
For anything with yum (Fedora Core, current Red Hat Enterprise, and probably others):
yum -d 1 check-update |
awk '{print $1}'
In at least Fedora Core 4, this doesn't give you the literal package
names; you get a bonus architecture glued on the back that you'll have
to slice off if you really care. You also get a bonus blank line at
the start (this may be a yum bug).
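If you do care, something like the following awk filter should take care of both bonuses. It is only a sketch: strip_arch is a made-up name, and it assumes the architecture is always the last dot-separated piece of the first column.

```shell
# strip_arch: hypothetical filter for 'yum -d 1 check-update' output.
strip_arch() {
    # NF skips blank lines; sub() removes the final ".arch" from the first column.
    awk 'NF { sub(/\.[^.]+$/, "", $1); print $1 }'
}

# Usage: yum -d 1 check-update | strip_arch
```

Only the last dot-separated piece is removed, so package names that themselves contain dots keep them.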
Debian makes it a little bit more annoying. I get to do:
apt-get -qq update
apt-get -qq -s -u upgrade |
awk '$1 == "Inst" {print $2}'
(Assuming that Debian does not change apt-get's output someday. They
might, and this is one reason
apt is not my favorite program.)
In theory 'apt-get -qq update' could be dangerous, but I think the
odds are pretty low.
Converting these to produce single-line output is left as an exercise to the reader.
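For what it's worth, here is one possible answer to the exercise, as a sketch. one_line is a hypothetical helper: it reads package names on standard input, takes a host name as its argument, and prints nothing at all when there are no packages.

```shell
# one_line HOST: join package names from stdin into "HOST: pkg1 pkg2 ...".
one_line() {
    awk -v host="$1" '
        NF  { pkgs = pkgs " " $1 }                  # collect non-blank lines
        END { if (pkgs != "") print host ":" pkgs } # silent when up to date
    '
}

# Usage:
#   yum -d 1 check-update | awk '{print $1}' | one_line "$(hostname)"
#   apt-get -qq -s -u upgrade | awk '$1 == "Inst" {print $2}' | one_line "$(hostname)"
```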
2005-09-13
The problem of being overcautious
Today's fire drill was caused by our printing system not printing; since it is the first day of classes, it was not a good time to discover this. After fixing a couple of small problems, the big stumbling block was authentication not working.
Our printing system has a central machine that handles quota management and a per-lab machine that handles the actual print spooling and printing. This requires the labmasters to talk to the quota server to tell it about pages that got printed.
Because I am paranoid, the quota server insists that connections from the labmasters be somewhat authenticated (otherwise a clever student could ruin someone else's day by telling the system they'd just printed 10,000 pages). Because I am lazy, the authentication is done by the RFC 1413 'ident protocol', which gives the nominal owner of one end of a TCP connection. In this case, the quota server only accepts print updates from the user 'lp' on the labmasters.
Examining logs showed that authentication was failing because authd
(the 'ident protocol' daemon) on the labmasters wasn't returning
information about the connection. Worse, this wasn't a general
failure; if I tried it by hand, it worked. Only the quota checking
script run as part of printing provoked the authd failures.
It took careful examination of authd's code and a certain amount of
staring at debugging output and capturing snapshots of system files
like /proc/net/tcp to find the problem: excessive caution.
TCP connections are uniquely identified by the quad of 'source host,
source port, destination host, destination port'. But when it reads
/proc/net/tcp to find the right connection, authd checks more
than that; it also requires that the state of the TCP connection be
'ESTABLISHED'. However, if you tell the kernel that you are done writing
data to the connection and will henceforth only read data, the kernel
moves your connection to the 'FIN_WAIT1' state.
The script uses a program that opens a connection, sends a line to the
other end, and then immediately tells the kernel it's done writing.
By the time the quota server program got around to asking authd who
was making the connection, the kernel had already put the connection
into 'FIN_WAIT1' and authd skipped over it. (When I tried by hand
I wasn't using a program that finished writing immediately, and my
connection stayed in 'ESTABLISHED'.)
I'm sure the author of authd felt he was being careful about the
whole thing by checking the connection state as well as everything
else. However, his caution led to a problem, because his check
wasn't complete.
Every time you check something you have to be accurate and complete. The more things you check, the more work you have to do and the greater the chance that you've gotten something wrong. Thus, more checks can actually mean more bugs, instead of fewer.
Being complete can be difficult. For example, I'm not sure what states the Linux kernel can put a valid TCP connection in, and finding out would probably require a bunch of research. (Which I could make mistakes in.)
Because of this, rather than make authd check for FIN_WAIT1 as
well as ESTABLISHED, I just took the check out entirely.
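To illustrate, here is a minimal sketch (not authd's actual code) of a lookup that never looks at the state column. conn_uid is a made-up name; it reads /proc/net/tcp-style text on standard input and prints the owning uid. For brevity it compares only the two port numbers, where a real ident daemon would also compare the addresses.

```shell
# conn_uid LOCAL_PORT REMOTE_PORT < /proc/net/tcp
conn_uid() {
    # /proc/net/tcp stores ports as four uppercase hex digits after a colon.
    lport=$(printf '%04X' "$1")
    rport=$(printf '%04X' "$2")
    awk -v l=":$lport" -v r=":$rport" '
        # Skip the header line; compare the last five characters (":PPPP")
        # of the local ($2) and remote ($3) address columns.
        NR > 1 && substr($2, length($2) - 4) == l && substr($3, length($3) - 4) == r {
            print $8    # uid column; the "st" column ($4) is deliberately ignored
            exit
        }'
}
```

With this, a connection shows up whether its state column reads 01 (ESTABLISHED) or 04 (FIN_WAIT1).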
2005-09-10
More Fedora Core 4 Anaconda fun
Another day, another Fedora Core 4 Anaconda bug stumbled over. This time it is #160911, where if your system has any bind mounts, Anaconda can't upgrade it and aborts. (Some other systems call bind mounts 'loopback mounts'.) From the error message that I remember, it seems that Anaconda was trying to treat the source directory as a disk device and not getting very far.
Naturally I was in too much of a hurry to actually get our central fileserver upgraded to remember to save any debugging logs that Anaconda might have written. (Suggestion to the Anaconda people: make Anaconda automatically save its logs any time it aborts an installation. Disk space is cheap, you can put a message in about it, and it will save everyone a bunch of effort.)
Workaround: assuming that the bind mount is not for anything vital,
temporarily comment it out of your /etc/fstab. Remember to comment
it back in before you bring the system up, or things may explode.
(If the bind mount is for something vital, I believe you are up the
creek. Watch #160911 for updates.)
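The commenting-out step can be done by hand or with something like this sketch. comment_out_binds and the '#BIND# ' marker are invented for illustration; it reads an fstab on standard input and writes the edited copy to standard output, leaving the real file alone.

```shell
comment_out_binds() {
    awk '
        # Active (uncommented) entries whose options field (column 4) includes "bind".
        $1 !~ /^#/ && $4 ~ /(^|,)bind(,|$)/ { print "#BIND# " $0; next }
        { print }'
}

# Usage: comment_out_binds < /etc/fstab > /tmp/fstab.new
# To undo after the upgrade: sed 's/^#BIND# //' on the upgraded fstab
```

A distinctive marker makes it easy to find and restore exactly the lines that were disabled.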
The thing that really irritates me is that this is a new Anaconda bug; the Fedora Core 2 Anaconda did this right. And I know that because we upgraded this very central fileserver, with this very bind mount, from Red Hat 7.3 to Fedora Core 2 without problems.
It is depressing to think that perhaps the best advice for the future is 'don't do anything on a new Fedora Core without reading the Anaconda buglist front to back'. Or maybe the entire bug list, given the (finally fixed) X bugs mentioned in FC4FirstIrritations.
An update on the unlabeled swap partitions bug
Since I managed to get our central fileserver upgraded, clearly I
worked around the 'Multiple devices are labelled' bug I covered in
AnotherFC4AnacondaBug. Unfortunately, mkswap -L didn't do it,
either with the FC4 source code recompiled on FC2 and run while the
system was up, or in the FC4 installer environment itself.
Eventual workaround: totally zeroing out both swap partitions with
dd from /dev/zero. This meant that our Fedora Core 4 install came
up without swap partitions (since they did not have valid signatures),
but fortunately the server has enough memory that it doesn't need swap
to boot. After it came up I used mkswap -L on both partitions.
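Spelled out as commands, the sequence was roughly as follows. This is a sketch: the run wrapper, the function names, and the device and label names in the usage comment are all made up, and nothing is actually executed unless DO_IT=1 is set, since the dd step is destructive.

```shell
# run: hypothetical wrapper; prints the command, executes it only if DO_IT=1.
run() {
    echo "+ $*"
    if [ "${DO_IT:-0}" = 1 ]; then "$@"; fi
}

# Step 1, before the upgrade: overwrite the whole partition with zeros.
# (On a block device dd ends with "No space left on device"; that is expected.)
wipe_swap() {
    run dd if=/dev/zero of="$1" bs=1M
}

# Step 2, once FC4 is up: recreate the swap signature, this time with a label.
label_swap() {
    run mkswap -L "$2" "$1"
}

# Usage (hypothetical devices and labels):
#   wipe_swap /dev/sdX2;  wipe_swap /dev/sdY2
#   label_swap /dev/sdX2 SWAP-a;  label_swap /dev/sdY2 SWAP-b
```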
Since mkswap -L didn't work, I now have the sinking feeling that
this problem is going to reappear next upgrade.
2005-09-08
Another Fedora Core 4 Anaconda bug
To go with the other FC4 Anaconda bugs (and still no updated installer images), we just ran into another Anaconda bug in the process of attempting to upgrade a server from Fedora Core 2 to FC4. This server, like most of our other servers, uses mirrored system disks, which in turn means it has two swap partitions (one on each drive).
If you have this configuration and are upgrading from a previous Fedora Core or Red Hat version, on some of your systems Anaconda will abort the upgrade with the helpful message:
Duplicate labels
Multiple devices are labelled
(sic; 'labelled' is how the message spells it.)
You will note that nothing is listed as to what the duplicate label is. This is because the 'duplicate' label appears to be some sort of null label that Anaconda is apparently finding in the swap partitions.
(If you are going 'I didn't know swap partitions could be labelled', you're not alone. Presumably this bug goes along nicely with Anaconda reading totally bogus garbage from newly created swap partitions as their alleged 'label', as mentioned in FC4BuggyAnaconda. It appears that swap partition labels are not very robust or reliable, which makes me oh so thrilled that Fedora Core 4 wants to use them.)
There are two Anaconda bugs filed for this in Red Hat's Bugzilla: #160622 and #166820. No resolution is currently available, not even 'fixed in Rawhide'. I say 'appears' up above because, from the bug reports, no one is currently entirely certain of what the bug is; that it is caused by multiple unlabeled swap partitions is so far not an entirely sure bet.
The really irritating thing about this bug is that it only happens some of the time, on some systems. We've already upgraded three servers with the same sort of swap configuration to FC4 without this problem; it figures that we hit it now, when we 'trust' FC4 enough to try to put it on our central fileserver, as our good upgrade window closes.
The suggested workaround is to give your swap devices (different)
labels, using mkswap -L. To complicate life, the Fedora Core 2
mkswap doesn't have a -L argument. The Anaconda image has a
mkswap, but Anaconda may or may not leave device nodes around for
people to run mkswap by hand. The Fedora Core 4 mkswap binary
doesn't run on Fedora Core 2, since it's built against the FC4 glibc.
I don't know how we'll fix this yet, since we spent today finding and
poking at this issue. Probably I'll have to recompile the FC4 mkswap
source code on Fedora Core 2 and try that. Updates later, when I get
something to work.
Updated: see FC4AnacondaAgain for my eventual workaround.