2011-05-31
How to fail at versioning
Today, Ubuntu released a PAM security update and we applied it. On Ubuntu 10.04, this upgraded the Ubuntu PAM package from version 1.1.1-2ubuntu5 to 1.1.1-2ubuntu5.2 (other Ubuntu LTS releases had similar version numbers); as you can see, this is a very minor version bump. As you'd expect, this did not change, eg, the libpam shared library soname version (not even the minor version).
Our logs promptly exploded with error messages like:
CRON[12926]: PAM unable to dlopen(/lib/security/pam_env.so): /lib/libpam.so.0: version `LIBPAM_MODUTIL_1.1.3' not found (required by /lib/security/pam_env.so)
(This appears to have affected nearly any daemon or persistent process that uses PAM; cron is just the most obvious one.)
You can make an entire list of ways that this was a versioning fail, and in fact it's a fail on several levels. First, what was labeled as a minor packaging update introduced an ABI incompatibility, one that basically forces a system reboot at that. Second, despite having various versioning mechanisms available to it, PAM made no use of them; for example, it did not bump the library soname (not even by a minor number). Finally, PAM claims to have versioning (eg, its library soname is versioned) but it has unversioned components with version dependencies. Here pam_env clearly has versions in practice; a specific build of it is ABI-compatible with only a very narrow and specific version of the PAM library. But there is no way to have two pam_env shared objects for two different versions of the PAM library (even if the PAM library made use of version numbering), because pam_env has no version number itself.
(In light of this last issue, it's kind of unsurprising that the libpam soname version did not change; it probably wouldn't do any good even if it did.)
Note that it's not clear who is responsible for all of the failures here. At a minimum Ubuntu is at fault; 'break your system' ABI incompatibilities should never be shipped as mere minor package updates (not that Ubuntu is all that good at this). If Ubuntu created the ABI incompatibility through one of their patches, they are also at fault for not versioning it properly.
Update: Ubuntu has accepted this as a bug, bug #790538. I suppose the good news is that they consider this a serious issue.
PS: what I can best describe as an extreme reluctance to ever change library soname version despite major ABI changes is not exactly unique to PAM. As far as I can tell it's common behavior for a great many Linux projects, most prominently glibc (which seems to have invented its own additional versioning system because soname versions weren't good enough). I have no idea why people like doing this, although I'm sure there's a reason.
(Possibly changing library soname version numbers on ABI changes was found to not work very well in practice.)
Sidebar: what happens and how to fix it
When this happens to a daemon such as cron or xdm, the daemon basically stops doing much of anything useful; cron did not run cron jobs, and xdm did not let anyone log in. You can cure the problem by restarting the daemon, but note that restarting xdm has the small side effect of immediately terminating the session of everyone who logged in through xdm.
Ultimately you're going to want to reboot the computer. This is kind of troublesome if it is a heavily used login server or a compute server used for long-running jobs. This is still Unix, even though developers seem more and more intent on turning it into 'reboot after doing any changes' Windows.
(Yes, I'm bitter right now.)
2011-05-28
The stickiness of Fedora 8 (despite my better intentions)
[...] - I wanted to bash CKS for [still running] Fedora EIGHT, but then my wife has *XP* on netbook...
Yes, it's true, I'm still running Fedora 8 on my home machine. It's clearly reaching the edges of viability; I live in fear of the day when some sufficiently important precompiled binary package stops working because it needs, eg, a more recent version of the core C++ ABI.
(This has happened with both Firefox 4 and Google Chrome for exactly that reason. Fortunately they are not quite sufficiently important.)
I'd like to say that I have good reasons for staying on Fedora 8, but the truth is that it has just been too much work when the existing solution limps along (and I've had distractions). In the best case I'm looking at a solid day's work to put my machine back together, and I always have something better to do with that time. I'd like to be upgraded, but I dread going through the process of upgrading.
In theory the rational thing to do at this point is to buy a modern machine where all the pieces actually work (however annoying that is), install the current Fedora on it, and copy over all of my data (although this still leaves the work of reapplying all of my machine customizations, which is not something that I'm really looking forward to). In practice, one of the issues is that my Fedora install is a very old one and is thus carrying around a number of old shared library RPMs that aren't in any modern Fedora. I was going to say that I wasn't looking forward to finding out what programs required them and then where I could find the ancient RPMs, but it turns out that I think I have local copies of everything that might matter, and some of the other ancient RPMs are likely not functional any more anyways.
(For example, in news that will probably horrify Pete Zaitcev even more, I discovered that I had an ld.so RPM installed. This is the dynamic linker for libc5-based programs, and mine dates from 2000. While I do have a libc5-based binary or two still lying around, I don't have the actual libc.so.5 shared library so those binaries can't have run for years. Whether they would run even with a libc.so.5 is an open question; I suspect that the answer is 'no'.)
Of course, the next question is what customizations I've done to my home machine's setup. In theory as a wise sysadmin I should have nice notes and instructions. In practice, well, I was in a rush at the time, or at least that's my excuse. Next time for sure.
PS: in the original entry I talked about going straight from Fedora 8 to Fedora 13 via the Fedora 13 CD. As it turns out, that doesn't work; I believe that PreUpgrade was being entirely accurate when it told me that I could only go to Fedora 10. So that would be at least three full upgrade steps (8 to 10, 10 to 12, and then 12 to 14, followed sometime later with a 14 to 15 upgrade via yum).
You know, that crazy upgrade scheme is looking more and more attractive all the time.
2011-05-26
You can get 'stale filehandle' errors for local files on extN filesystems
Here's something interesting that we found out today (when another
sysadmin here had it happen to him): it's possible to get 'stale
filehandle' errors (ie, an ESTALE errno) when you access local
files, under fairly obscure situations and if you're using the right
filesystem. Specifically, if you're using an ext2, ext3, or ext4
filesystem, an inode that is corrupt in just the right way will do it;
the corruption can happen either on disk or on the fly in the path from
the disk to you.
You might wonder how a corrupt inode can result in a 'stale filehandle' error, and there lies a tale.
Suppose that some client has an NFS filehandle for a file (and thus an inode) that
has since been deleted on the fileserver, and it tries to access that
file. Obviously the NFS server needs to reject the access with an
ESTALE result, which means that some part of the filesystem-specific
code involved in turning a NFS filehandle into an inode needs to detect
this and return some sort of error.
It turns out that the extN series of filesystems opts to do this
detection not in code specific to NFS but instead in their generic
'get an inode from disk' code (in ext3, ext3_iget() in
fs/ext3/inode.c). In theory this error path can only be triggered
through the NFS server, since there's no way to access a file by its
inode number from user level code, and so ESTALE is a perfectly
appropriate error to return in this situation.
However, if the inode for a non-deleted file becomes sufficiently
corrupt (either on the disk or in flight as it's read from the disk),
this generic code will think that it is deleted and return an ESTALE
error, and because it's generic code that's called for both local and
remote accesses, this can result in 'stale filehandle' errors for a
local file.
(I think that you can also get the same result if you have a directory get corrupted so that it still has entries for deleted files or has the wrong inode numbers for real files.)
Sidebar: the specifics
The situation changes slightly from ext2 to ext3 to ext4, but in all of them an inode with both a zero link count and a full inode mode of zero (which means that the inode has no information about what type of file it's for) will do it.