Interesting things can happen when you scale things up

January 8, 2010

This is a sysadmin war story.

Once upon a time, there were a bunch of IMAP servers. Since this was long ago, they were running Linux with the 2.4 kernel. They started out storing their mail on locally attached 72 GB SCSI disks, organized simply with one ext2 filesystem per disk, but then they moved the storage to a faster and more sophisticated SAN backend with RAID-10 arrays (still on small fast enterprise disks), giving each server node a single logical array (on a dedicated set of drives) and data filesystem (still ext2).

Not too long after the move to the SAN, the servers started falling over every so often, unpredictably; their load average would climb to the many hundreds (we saw load averages over 700), IMAP response times went into the toilet, and eventually the machine would have to be force-booted. However, nothing obvious was wrong with the system stats (at least nothing that seemed to correlate with the problems). Somewhat through luck, we discovered that the time it took to touch and then remove a file in the data filesystem was closely correlated to the problem; when the time started going up, the system was about to get hammered. In the end, this led us to the answer.

Ext2 keeps track of allocated inodes (and allocated blocks) in bitmap blocks in the filesystem. In Linux 2.4, all changes to these bitmaps for a single filesystem were serialized by a single filesystem-wide kernel mutex, so only one process could be allocating or freeing an inode at a time. In the normal course of events, this is not a problem; most filesystems do not have a lot of inode churn, and if they do the bitmap blocks will all stay cached in system RAM and so getting the mutex, updating the bitmap, and releasing the mutex will normally be fast.

What had happened with us is that this broke down. First, we had a lot of inode churn because IMAP was creating (and then deleting) a lot of lockfiles. This was survivable when the system had a lot of separate filesystems, because each of them had a separate lock and not that many bitmap blocks. But when we moved to the SAN we moved to a single big filesystem; this meant both a single lock for all file creation and deletion, and that the filesystem had a lot of bitmap blocks.

(I believe that pretty much the same amount of disk space was in use in both cases; it was just organized differently.)

This could work only as long as either almost all of the bitmap blocks stayed in cache or we didn't have too many processes trying to create and delete files. When we hit a crucial point in general IO load and memory usage on an active system, the bitmaps blocks started falling out of cache, more and more inode operations had to read bitmap blocks back in from disk while holding the mutex (which meant they took significant amounts of time), and more and more processes piled up trying to get the mutex (which was the cause of the massive load average). Since this lowered how frequently any particular bitmap block was being used, it made them better and better candidates for eviction from cache and made the situation even worse.

(Of course, none of this showed up on things like iostat because general file IO to do things like read mailboxes was continuing normally. Even the IO to read bitmap blocks didn't take all that long on a per-block basis; it was just that it was synchronous and a whole lot of processes were effectively waiting on it.)

Fortunately, once we understood the problem we could do a great deal to mitigate it, because the lockfiles that the IMAP server was spending all of that time and effort to create were just backups to its fcntl() based locking. So we just turned them off, and things got significantly better.

(The overall serialized locking problem was fixed in the 2.6 kernel as part of work to make ext2 and ext3 more scalable on multiprocessor systems, so you don't have to worry about it today.)

Comments on this page:

From at 2010-01-08 13:38:09:

I love war stories. Thanks for sharing.

Saint Aardvark the Carpeted

From at 2010-01-08 15:24:39:

Thanks for writing this up, Chris. We at CDF are trying to understand a performance issue on our file server (Solaris+UFS); while my colleagues mostly think that the hardware is no longer up to the task, I have a hunch that the issue may be something obscure like what you described. - Arcady

Written on 08 January 2010.
« A theory on why most browsers control their own list of CA roots
A brief and jaundiced history of Unix packaging systems »

Page tools: View Source, View Normal, Add Comment.
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Fri Jan 8 00:53:30 2010
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.