Wandering Thoughts archives

2005-09-22

Excluding buggy RPMs from a yum repository

We mirror the Fedora Core 4 updates area, and we use yum to install things from it, and Red Hat recently released an extremely unsuitable (for us) update to xorg-x11 (as recounted in MoreFC4Problems). So we needed to take those buggy RPMs out of circulation and keep them out.

Fortunately this turns out to be pretty easy, because not all RPMs in a directory have to be in the repository metadata that yum uses. This lets us continue to pull a full mirror from Red Hat (keeping the mirroring script simple) and just build our own metadata that excludes the buggy RPMs we don't want.

Modern yum repository metadata is created with the createrepo command (not yum-arch, despite the latter being packaged with yum and the former not), which is in the createrepo RPM. createrepo has a -x option to exclude RPMs from the metadata it makes; -x even takes a glob pattern, like say xorg-x11-*6.8.2-37.FC4.48.1* (remember to quote it).
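
As a rough sketch of what the rebuild step could look like: the mirror path below is a made-up placeholder, and matches_exclude is only a preview helper that uses the shell's own pattern matching, which may not agree with createrepo in every corner case.

```shell
# Hypothetical mirror location; adjust for your own setup.
MIRROR_DIR=${MIRROR_DIR:-/data/mirror/fedora/updates/4/i386}

# The glob from the text. Single quotes keep the shell from expanding it,
# so createrepo sees the pattern itself.
EXCLUDE_GLOB='xorg-x11-*6.8.2-37.FC4.48.1*'

# Preview helper: succeeds if an RPM file name matches the exclude glob.
matches_exclude() {
    case "$1" in
        $EXCLUDE_GLOB) return 0 ;;
        *) return 1 ;;
    esac
}

# Rebuild the repository metadata, leaving the matching RPMs out of it.
rebuild_metadata() {
    createrepo -x "$EXCLUDE_GLOB" "$MIRROR_DIR"
}
```

Running rebuild_metadata after each mirror pull would regenerate the metadata without the excluded packages, while the RPM files themselves stay in the mirror directory.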

There are two drawbacks with this:

  1. createrepo takes a while to run.
  2. We're now dependent on having a Fedora Core 4 machine around to run the mirroring on (because we probably need to use a createrepo version that matches the Fedora Core 4 yum, so they agree on the metadata format).

Hopefully someday Red Hat will fix the bugs and we can go back to using all of the current updates and thus to just mirroring the update repository metadata as well as all of the RPMs.

(Because we are only excluding a specific version of xorg-x11, createrepo will include the previous, working Fedora Core 4 xorg-x11 update in the metadata. This is what we want, because we definitely want that update installed.)

YumExcludeBadRPMs written at 01:49:35

2005-09-21

More Fedora Core 4 problems with X

Important update: the buggy RPMs have been superseded. See the end.

You may remember back to FC4FirstIrritations, where I talked about several X problems. Well, Red Hat managed to outdo itself recently, as we discovered the hard way: their latest update to the xorg-x11 packages, 6.8.2-37.FC4.48.1 (released September 16th or so) now causes almost all of our workstations to explode.

They don't literally explode, but X won't start and more or less crashes the machine (one bug report compared it to what a crashed C-64 looks like, and I have to agree). It turns out that this update appears to be totally broken on many if not all Intel graphics hardware, such as the 865G chipset that our HP Deskpro D530s use.

The Red Hat Bugzilla bugs filed so far are #168919, #168937 (i945 graphics), and #168940 (i865G graphics). They've apparently all been filed very recently, which at least makes me feel somewhat less annoyed.

Unfortunately, we mirror the entire Red Hat updates repository and use yum, which requires carefully built repository metadata. This means that clearing this update out of our system is not as simple as removing the RPMs; we have to fix our mirror scripts, move to manually building the repository metadata, and then test the result.

Since a reinstall test takes half an hour or more, there went a lot of my day. Fedora Core 4 has really turned into the update that I wish we'd skipped.

Clearly, we need to adopt new rules for Fedora Core 4 updates:

  1. never apply updates within a week of their release (unless they are urgent security fixes).
  2. always check Bugzilla for new bugs caused by the updates.

It's a rather sad day when we have to adopt these rules, but that is apparently life with Fedora Core 4.

(You may gather that I am not happy with this state of affairs.)

Update, September 23rd: The good news is that this issue has now been fixed; today Red Hat released a new xorg-x11 update, version 6.8.2-37.FC4.49.2, that doesn't have the problem. The master Bugzilla bug for this issue turns out to be Bugzilla #168752, which has various bits of information.

I am still somewhat unhappy about this state of affairs, but less unhappy than I was when I first wrote this entry. Part of it is that the problem is gone; part of it is that Red Hat dealt with the issue pretty rapidly, without dragging their heels (as I feel they have with some Fedora Core 4 bugs). I do think that Red Hat needs a procedure for pulling seriously problematic updates from their updates area (and the mirrors, which makes it more challenging), and that they should have used it in this situation.

MoreFC4Problems written at 17:08:53

What vendor updates are pending on your Linux system?

One of those routine system management tasks these days is checking to make sure that I'm up to date on vendor security releases and other updates. And when I'm not up to date, I generally want to know what's out of date; being out of date on diff is one thing, being out of date on a kernel or glibc is quite another. (For a start, I probably want to test the kernel a bit more before pushing it into our local update queue.)

Since I have a lot of systems, I like a concise report; say, one line with all the pending packages for systems that need updates, and total silence from systems that are current (in other words, I'm a Unix geek and I like Unix-style output).

Unfortunately, most package management systems are not quite as eager to accommodate my minimalism as I would like, so it takes some scripting work.

For anything with yum (Fedora Core, current Red Hat Enterprise, and probably others):

yum -d 1 check-update |
  awk '{print $1}'

In at least Fedora Core 4, this doesn't give you the literal package names; you get a bonus architecture glued on the back that you'll have to slice off if you really care. You also get a bonus blank line at the start (this may be a yum bug).
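
If you do care, one possible filter looks like this. It is a sketch that assumes each package line starts with name.arch and that the architecture is whatever follows the last dot; strip_arch is a name made up for illustration.

```shell
# Drop blank lines and cut the trailing ".arch" off the first column.
strip_arch() {
    awk 'NF { sub(/\.[^.]+$/, "", $1); print $1 }'
}
```

It would be used in place of the plain awk above, as in: yum -d 1 check-update | strip_arch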

Debian makes it a little bit more annoying. I get to do:

apt-get -qq update
apt-get -qq -s -u upgrade |
  awk '$1 == "Inst" {print $2}'

(Assuming that Debian does not change apt-get's output someday. They might, and this is one reason apt is not my favorite program.)

In theory 'apt-get -qq update' could be dangerous, but I think the odds are pretty low.

Converting these to produce single-line output is left as an exercise to the reader.
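
For readers who would rather skip the exercise, here is one possible answer, as a sketch. The function name oneline and the optional host prefix are inventions for illustration; it reads package names on standard input, one per line, and stays silent if there are none.

```shell
# Join the first word of every non-blank input line into a single line.
# With an argument, prefix the line with "ARG: ". Prints nothing at all
# when there is no input, so up-to-date systems stay quiet.
oneline() {
    awk -v host="$1" '
        NF { out = out (out == "" ? "" : " ") $1 }
        END {
            if (out != "") {
                line = (host == "" ? "" : host ": ") out
                print line
            }
        }'
}
```

Either of the pipelines above can be fed into it, for example: yum -d 1 check-update | oneline "$(hostname)"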

FindingPendingUpdates written at 01:49:08

2005-09-13

The problem of being overcautious

Today's fire drill was caused by our printing system not printing; since it is the first day of classes, it was not a good time to discover this. After fixing a couple of small problems, the big stumbling block was authentication not working.

Our printing system has a central machine that handles quota management and a per-lab machine that handles the actual print spooling and printing. This requires the labmasters to talk to the quota server to tell it about pages that got printed.

Because I am paranoid, the quota server insists that connections from the labmasters be somewhat authenticated (otherwise a clever student could ruin someone else's day by telling the system they'd just printed 10,000 pages). Because I am lazy, the authentication is done by the RFC 1413 'ident protocol', which gives the nominal owner of one end of a TCP connection. In this case, the quota server only accepts print updates from the user 'lp' on the labmasters.
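
To make the check concrete, here is a small sketch of how a reply line in the RFC 1413 format ("ports : USERID : opsys : user", or "ports : ERROR : reason") could be inspected. This is not the quota server's actual code; the function names are hypothetical, and the parsing is simplified (it trims whitespace and assumes the user name contains no colon).

```shell
# Print the user name from an ident reply line and succeed, or print
# nothing and fail if the reply is an ERROR or is malformed.
ident_user() {
    printf '%s\n' "$1" | tr -d '\r' | awk -F: '
        { for (i = 1; i <= NF; i++) gsub(/^[ \t]+|[ \t]+$/, "", $i) }
        $2 == "USERID" && NF >= 4 { print $4; found = 1 }
        END { exit !found }'
}

# Accept the connection only if the reply names the user lp.
is_lp() {
    [ "$(ident_user "$1")" = "lp" ]
}
```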

Examining logs showed that authentication was failing because authd (the 'ident protocol' daemon) on the labmasters wasn't returning information about the connection. Worse, this wasn't a general failure; if I tried it by hand, it worked. Only the quota checking script run as part of printing provoked the authd failures.

It took careful examination of authd's code and a certain amount of staring at debugging output and capturing snapshots of system files like /proc/net/tcp to find the problem: excessive caution.

TCP connections are uniquely identified by the quad of 'source host, source port, destination host, destination port'. But when it reads /proc/net/tcp to find the right connection, authd checks more than that; it also requires that the state of the TCP connection be 'ESTABLISHED'. However, if you tell the kernel that you are done writing data to the connection and will henceforth only read data, the kernel moves your connection to the 'FIN_WAIT1' state.

The script uses a program that opens a connection, sends a line to the other end, and then immediately tells the kernel it's done writing. By the time the quota server program got around to asking authd who was making the connection, the kernel had already put the connection into 'FIN_WAIT1' and authd skipped over it. (When I tried by hand I wasn't using a program that finished writing immediately, and my connection stayed in 'ESTABLISHED'.)
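
The effect is easy to see with a small lookup over /proc/net/tcp. This is a sketch, not authd's code: it matches on ports only (a real ident daemon would compare addresses too), and the function names are made up. In that file the ports are four hex digits after the colon in the second and third columns, and the fourth column is the state, where 01 is ESTABLISHED and 04 is FIN_WAIT1.

```shell
# Print the hex state code of the connection with the given local and
# remote ports. $3 is the table to read, defaulting to /proc/net/tcp.
find_conn_state() {
    awk -v lp="$(printf '%04X' "$1")" -v rp="$(printf '%04X' "$2")" '
        NR > 1 {
            split($2, l, ":"); split($3, r, ":")
            # Appending "" forces string comparison; otherwise hex ports
            # such as 0E10 and 0E20 would both compare as the number 0.
            if ((l[2] "") == (lp "") && (r[2] "") == (rp "")) {
                print $4
                exit
            }
        }' "${3:-/proc/net/tcp}"
}

# Translate a few of the state codes into names.
state_name() {
    case "$1" in
        01) echo ESTABLISHED ;;
        04) echo FIN_WAIT1 ;;
        05) echo FIN_WAIT2 ;;
        08) echo CLOSE_WAIT ;;
        *)  echo "other($1)" ;;
    esac
}
```

A lookup that insists on state 01 will skip a connection whose writer has already shut down its sending side, even though the four-part address still identifies it uniquely.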

I'm sure the author of authd felt he was being careful about the whole thing by checking the connection state as well as everything else. However, his caution led to a problem, because his check wasn't complete.

Every time you check something you have to be accurate and complete. The more things you check, the more work you have to do and the greater the chance that you've gotten something wrong. Thus, more checks can actually mean more bugs, instead of fewer.

Being complete can be difficult. For example, I'm not sure what states a valid TCP connection can be in on Linux, and finding out would probably require a bunch of research. (Which I could make mistakes in.)

Because of this, rather than make authd check for FIN_WAIT1 as well as ESTABLISHED, I just took the check out entirely.

ProblemOfOvercaution written at 03:51:52

2005-09-10

More Fedora Core 4 Anaconda fun

Another day, another Fedora Core 4 Anaconda bug stumbled over. This time it is #160911, where if your system has any bind mounts, Anaconda can't upgrade it and aborts. (Some other systems call bind mounts 'loopback mounts'.) From the error message that I remember, it seems that Anaconda was trying to treat the source directory as a disk device and not getting very far.

Naturally I was in too much of a hurry to actually get our central fileserver upgraded to remember to save any debugging logs that Anaconda might have written. (Suggestion to the Anaconda people: make Anaconda automatically save its logs any time it aborts an installation. Disk space is cheap, you can put a message in about it, and it will save everyone a bunch of effort.)

Workaround: assuming that the bind mount is not for anything vital, temporarily comment it out of your /etc/fstab. Remember to comment it back in before you bring the system up, or things may explode. (If the bind mount is for something vital, I believe you are up the creek. Watch #160911 for updates.)
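
One way to make the "comment it back in" step harder to forget is to tag the lines you disable so they can be restored mechanically. The following is only a sketch: the marker string and function names are invented, and it assumes that bind mounts are recognizable by a bind entry in the fourth (options) field.

```shell
# Hypothetical marker used to tag the lines we disable.
MARK='#fc4-upgrade# '

# Read an fstab on stdin and comment out active bind-mount entries.
comment_bind_mounts() {
    awk -v mark="$MARK" '
        $0 !~ /^[ \t]*#/ && $4 ~ /(^|,)bind(,|$)/ { print mark $0; next }
        { print }'
}

# Undo comment_bind_mounts by stripping the marker again.
restore_bind_mounts() {
    sed "s/^$MARK//"
}
```

Both are filters, so they would be run against a copy of /etc/fstab and the result inspected before replacing the real file.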

The thing that really irritates me is that this is a new Anaconda bug; the Fedora Core 2 Anaconda did this right. And I know that because we upgraded this very central fileserver, with this very bind mount, from Red Hat 7.3 to Fedora Core 2 without problems.

It is depressing to think that perhaps the best advice for the future is 'don't do anything on a new Fedora Core without reading the Anaconda buglist front to back'. Or maybe the entire bug list, given the (finally fixed) X bugs mentioned in FC4FirstIrritations.

An update on the unlabeled swap partitions bug

Since I managed to get our central fileserver upgraded, clearly I worked around the 'Multiple devices are labelled' bug I covered in AnotherFC4AnacondaBug. Unfortunately, mkswap -L didn't do it, either with the FC4 source code recompiled on FC2 and run while the system was up, or in the FC4 installer environment itself.

Eventual workaround: totally zeroing out both swap partitions with dd from /dev/zero. This meant that our Fedora Core 4 install came up without swap partitions (since they did not have valid signatures), but fortunately the server has enough memory that it doesn't need swap to boot. After it came up I used mkswap -L on both partitions.
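
In outline, the sequence could look like the sketch below. The device names, labels, and the zero_kib helper are all hypothetical, and the dd step is destructive, so triple-check the target before running anything like it.

```shell
# Overwrite the first $2 KiB of $1 with zeros. To wipe a whole partition,
# pass its full size in KiB. conv=notrunc only matters when trying this
# on a scratch file, where it keeps the file at its original length.
zero_kib() {
    dd if=/dev/zero of="$1" bs=1024 count="$2" conv=notrunc 2>/dev/null
}

# Hypothetical devices and labels; substitute your own.
#   zero_kib /dev/sda2 "$size_in_kib"
#   zero_kib /dev/sdb2 "$size_in_kib"
# ...run the upgrade, boot Fedora Core 4 without swap, and then:
#   mkswap -L SWAP-sda2 /dev/sda2
#   mkswap -L SWAP-sdb2 /dev/sdb2
```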

Since mkswap -L didn't work, I now have the sinking feeling that this problem is going to reappear next upgrade.

FC4AnacondaAgain written at 01:18:09

2005-09-08

Another Fedora Core 4 Anaconda bug

To go with the other FC4 Anaconda bugs (and still no updated installer images), we just ran into another Anaconda bug in the process of attempting to upgrade a server from Fedora Core 2 to FC4. This server, like most of our other servers, uses mirrored system disks, which in turn means it has two swap partitions (one on each drive).

If you have this configuration and are upgrading from a previous Fedora Core or Red Hat version, on some of your systems Anaconda will abort the upgrade with the helpful message:

Duplicate labels
Multiple devices are labelled

(sic; 'labelled' is how the message spells it.)

You will note that nothing is listed as to what the duplicate label is. This is because the 'duplicate' label appears to be some sort of null label that Anaconda is apparently finding in the swap partitions.

(If you are going 'I didn't know swap partitions could be labelled', you're not alone. Presumably this bug goes along nicely with Anaconda reading totally bogus garbage from newly created swap partitions as their alleged 'label', as mentioned in FC4BuggyAnaconda. It appears that swap partition labels are not very robust or reliable, which makes me oh so thrilled that Fedora Core 4 wants to use them.)

There are two Anaconda bugs filed for this in Red Hat's Bugzilla: #160622 and #166820. No resolution is currently available, not even 'fixed in Rawhide'. I say 'appears' up above because, from the bug reports, no one is currently entirely certain of what the bug is; that it is caused by multiple unlabeled swap partitions is so far not an entirely sure bet.

The really irritating thing about this bug is that it only happens some of the time, on some systems. We've already upgraded three servers with the same sort of swap configuration to FC4 without this problem; it figures that we hit it now, when we 'trust' FC4 enough to try to put it on our central fileserver, as our good upgrade window closes.

The suggested workaround is to give your swap devices (different) labels, using mkswap -L. To complicate life, the Fedora Core 2 mkswap doesn't have a -L argument. The Anaconda image has a mkswap, but Anaconda may or may not leave device nodes around for people to run mkswap by hand. The Fedora Core 4 mkswap binary doesn't run on Fedora Core 2, since it's built against the FC4 glibc.

I don't know how we'll fix this yet, since we spent today finding and poking at this issue. Probably I'll have to recompile the FC4 mkswap source code on Fedora Core 2 and try that. Updates later, when I get something to work.

Updated: see FC4AnacondaAgain for my eventual workaround.

AnotherFC4AnacondaBug written at 01:42:26


