Wandering Thoughts archives

2010-01-08

A brief and jaundiced history of Unix packaging systems

In the beginning (in the days of V7 and BSD Unix), Unix systems came as a great big tarball or the equivalent that included everything and you just unpacked it onto your machine. If there were problems, people passed around new versions of various bits of source in various ways; you got some, you put them on your system, you recompiled things, and so on.

Shortly after Unix vendors started selling Unix, they discovered that they needed some actual mechanism to deliver bugfixes and updates to their customers in some form smaller than an entire OS distribution. These updates came to be known as 'patches', and vendors built various programs to handle them. Because this was the old days, there was a very strong desire to make these patches as small as possible.

Shortly after AT&T started selling Unix, they decided that they wanted to make more money by charging extra for various 'optional' bits, like the C compiler or troff. This required a mechanism to split the previously monolithic blob of the OS up into multiple pieces, ie 'packages'. Other Unix vendors soon followed, even if they were selling BSD, especially as Unix systems accreted more and more pieces that fewer and fewer people were interested in.

However peculiar it seems in today's world, Unix vendors never merged their patching systems and their packaging systems, partly because packages were still fairly big and in the late 1980s and early 1990s people still cared a fair bit about updates being small. Significant OS updates (eg going from X.0 to X.1) were delivered as new packages and might well require a system reinstall, but small ones continued to be delivered as patches. Vendors built increasingly complex and baroque systems for doing each job.

Free Unixes and especially Linux distributions started from scratch in the mid to late 1990s in a very different environment, without any of this accreted history. With minimal manpower available, they built packaging systems because they had to and then simply delivered updates by giving people new versions of the entire package (size efficiency be damned). Because updates were delivered as new versions of packages, these packaging systems grew various features like handling package upgrades.

(Disclaimer: jaundiced views of history are not necessarily entirely correct.)

unix/PackagingHistory written at 13:48:50

Interesting things can happen when you scale things up

This is a sysadmin war story.

Once upon a time, there were a bunch of IMAP servers. Since this was long ago, they were running Linux with the 2.4 kernel. They started out storing their mail on locally attached 72 GB SCSI disks, organized simply with one ext2 filesystem per disk, but then they moved the storage to a faster and more sophisticated SAN backend with RAID-10 arrays (still on small fast enterprise disks), giving each server node a single logical array (on a dedicated set of drives) and data filesystem (still ext2).

Not too long after the move to the SAN, the servers started falling over every so often, unpredictably; their load average would climb to the many hundreds (we saw load averages over 700), IMAP response times went into the toilet, and eventually the machine would have to be force-booted. However, nothing obvious was wrong with the system stats (at least nothing that seemed to correlate with the problems). Somewhat through luck, we discovered that the time it took to touch and then remove a file in the data filesystem was closely correlated to the problem; when the time started going up, the system was about to get hammered. In the end, this led us to the answer.
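(As an illustration only, a canary probe of that sort is simple to write; here is a minimal Python sketch, where the probe path and the check interval are made up. The real check was nothing more elaborate than timing a file create and remove.)

    # Toy version of the create-and-remove timing probe; PROBE_PATH is a
    # hypothetical location in the data filesystem.
    import os
    import time

    PROBE_PATH = "/imapdata/.latency-probe"

    def probe_once():
        """Create an empty file, remove it, and return the elapsed time."""
        start = time.time()
        fd = os.open(PROBE_PATH, os.O_CREAT | os.O_WRONLY, 0o600)
        os.close(fd)
        os.unlink(PROBE_PATH)
        return time.time() - start

    if __name__ == "__main__":
        while True:
            print("%s: create+unlink took %.3f seconds" % (time.ctime(), probe_once()))
            time.sleep(10)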

Ext2 keeps track of allocated inodes (and allocated blocks) in bitmap blocks in the filesystem. In Linux 2.4, all changes to these bitmaps for a single filesystem were serialized by a single filesystem-wide kernel mutex, so only one process could be allocating or freeing an inode at a time. In the normal course of events, this is not a problem; most filesystems do not have a lot of inode churn, and if they do the bitmap blocks will all stay cached in system RAM and so getting the mutex, updating the bitmap, and releasing the mutex will normally be fast.

What happened to us was that this broke down. First, we had a lot of inode churn because IMAP was creating (and then deleting) a lot of lockfiles. This was survivable when the system had a lot of separate filesystems, because each of them had a separate lock and not that many bitmap blocks. But when we moved to the SAN we moved to a single big filesystem; this meant both a single lock for all file creation and deletion, and that the filesystem had a lot of bitmap blocks.

(I believe that pretty much the same amount of disk space was in use in both cases; it was just organized differently.)

This could work only as long as either almost all of the bitmap blocks stayed in cache or we didn't have too many processes trying to create and delete files. When we hit a crucial point in general IO load and memory usage on an active system, the bitmap blocks started falling out of cache, more and more inode operations had to read bitmap blocks back in from disk while holding the mutex (which meant they took significant amounts of time), and more and more processes piled up trying to get the mutex (which was the cause of the massive load average). Since this lowered how frequently any particular bitmap block was being used, it made them better and better candidates for eviction from cache and made the situation even worse.
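(To make the pile-up concrete, here is a toy Python simulation, not anything resembling the actual kernel code: a single lock protects all 'bitmap' updates, and some fraction of updates must do a slow synchronous read while holding it. The miss rate, read time, and thread count are invented numbers; the thing to watch is how many threads wind up queued on the lock.)

    # Toy simulation of the mutex pile-up described above.
    import random
    import threading
    import time

    bitmap_lock = threading.Lock()   # stands in for the per-filesystem mutex
    MISS_RATE = 0.3                  # invented: fraction of updates needing a disk read
    READ_TIME = 0.01                 # invented: per-read time; cheap on its own
    waiting = 0                      # threads currently stuck waiting for the lock
    counter_lock = threading.Lock()

    def churn():
        global waiting
        for _ in range(50):          # each 'process' creates and deletes 50 files
            with counter_lock:
                waiting += 1
            with bitmap_lock:
                with counter_lock:
                    waiting -= 1
                if random.random() < MISS_RATE:
                    time.sleep(READ_TIME)   # bitmap block wasn't cached
            time.sleep(0.001)               # the rest of the create/unlink work

    threads = [threading.Thread(target=churn) for _ in range(100)]
    for t in threads:
        t.start()
    for _ in range(10):
        print("threads queued on the bitmap lock: %d" % waiting)
        time.sleep(1)
    for t in threads:
        t.join()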

(Of course, none of this showed up on things like iostat because general file IO to do things like read mailboxes was continuing normally. Even the IO to read bitmap blocks didn't take all that long on a per-block basis; it was just that it was synchronous and a whole lot of processes were effectively waiting on it.)

Fortunately, once we understood the problem we could do a great deal to mitigate it, because the lockfiles that the IMAP server was spending all of that time and effort to create were just backups to its fcntl() based locking. So we just turned them off, and things got significantly better.
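(For reference, fcntl() locking in Python looks something like the following sketch; the mailbox path is made up and this is not the IMAP server's actual code. The relevant difference is that an fcntl() lock never creates or deletes a file, so it doesn't touch the inode bitmaps at all.)

    # Minimal sketch of fcntl()-style locking on a mailbox file; the path
    # is hypothetical. Unlike a dot-lockfile, nothing is created or removed.
    import fcntl

    with open("/imapdata/users/someone/mbox", "r+") as mbox:
        fcntl.lockf(mbox, fcntl.LOCK_EX)    # take an exclusive lock
        try:
            pass                            # read or rewrite the mailbox here
        finally:
            fcntl.lockf(mbox, fcntl.LOCK_UN)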

(The overall serialized locking problem was fixed in the 2.6 kernel as part of work to make ext2 and ext3 more scalable on multiprocessor systems, so you don't have to worry about it today.)

linux/FilesystemScalingProblem written at 00:53:30

