2010-01-31
More vim options it turns out that I want
Much to my displeasure, Ubuntu seems to have been steadily making the
version of vim that they ship more and more superintelligent. I do not
want a superintelligent vi; in fact, superintelligence
is a net negative in vim, because unlike with GNU Emacs it is almost
always wrong. So, unlike the first set of vim options,
these are negative options that I need, things that turn off settings.
So far, I have wound up with:
set formatoptions=l - Turns off automatic line wrapping. Since vi is my sysadmin's editor and sysadmins edit configuration files a lot, automatic line wrapping is an anti-feature.
(I hate it in Emacs too, when it happens.)
let loaded_matchparen = 1 - Turns off blinking matched delimiters,
like () and [] and so on. I find this irritating and distracting.
filetype plugin off - This turns off all sorts of superintelligent automatic formatting that I aggressively don't want.
(At some point I may look into the best way to fix the line ending issue, but I haven't been annoyed enough yet.)
Some reading in the vim help files suggests that 'set paste' will also
do a lot to turn off all of the superintelligence that I so dislike.
Using Ubuntu's 'tiny' version of vim also goes a long way to disabling
various things I don't like, but it has the side effect of making vim
not like the latter two .vimrc settings here (and it's not something
that I can turn on globally on our systems and so have all the time, no
matter what environment I'm in or what UID I am at the moment).
All in all, I really wish vim had a mode where it just settled for
being a better vi instead of trying to be a bad imitation of GNU
Emacs. As before, if I want GNU Emacs, I know where to
find it.
2010-01-18
More on mismatched sectors on Linux software RAID mirrors
Some brief followups from my first entry on this.
First, the mismatch_cnt numbers are reset from scratch every
time you re-run a check (and probably every time there is a mirror
resync). On many current systems, this means that they will be reset
every week. This makes sense and is even implied by the documentation (in the usual Unix
fashion of reading between the lines),
but it would have been nice to have it explicitly documented.
(I'm aware that I'm grumpy about this, but sysadmins really care about having clear and explicit documentation about what error messages mean. Sadly we rarely get either clear error messages or clear documentation about them.)
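(If you want to poke at these numbers yourself, they live in sysfs. Here is a minimal sketch of the sort of thing I mean, assuming the standard /sys/block/mdN/md/ files; writing 'check' to sync_action starts a new check, which is what resets mismatch_cnt.)

    # Illustrative sketch only: print mismatch_cnt for every md array,
    # and optionally start a new check on one of them (which resets the
    # count).  Assumes the usual /sys/block/mdN/md/ sysfs layout.
    import glob, sys

    def report():
        for path in sorted(glob.glob('/sys/block/md*/md/mismatch_cnt')):
            dev = path.split('/')[3]
            with open(path) as f:
                print("%s: %s" % (dev, f.read().strip()))

    def start_check(dev):
        with open('/sys/block/%s/md/sync_action' % dev, 'w') as f:
            f.write('check')

    if __name__ == '__main__':
        report()
        if len(sys.argv) > 1:
            start_check(sys.argv[1])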
Second, I have seen the numbers go up and down from week to week, sometimes significantly, and I've even seen the problem go away for one of my software RAID devices (the smaller one, in both size and error counts) and then come back worse. I can't say that this makes me even more unhappy, because I was pretty unhappy from the start, but it does mean that whatever has started causing my problems with this is an ongoing problem, not a one-time event.
Unfortunately I have no practical alternative to software RAID in Linux at the current time. However, the urge to add some sort of real error logging to the kernel code for this is getting stronger and stronger.
(Please do not suggest hardware RAID; it isn't practical for various reasons, and I still believe that software RAID is better.)
2010-01-11
In praise of crash (and kernel source)
You might wonder how we actually found our ext2 locking issue. That is a war story too, but fortunately it's a relatively short one.
First, we tried to use magic SysRq commands to dump various information. Let me save you some time: this doesn't work in any meaningful way. When you have several thousand processes, asking the kernel for a magic SysRq task dump is functionally equivalent to rebooting the machine, and sometimes it was literally equivalent to it. We extracted essentially no useful information with this, and rebooted several production servers in the process.
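(For reference, the task dump we were asking for is SysRq-T. If sysrq is enabled you can trigger it without a console by writing to /proc/sysrq-trigger, which in Python terms is just the following; as I said, on a machine with thousands of processes you should expect this to be roughly as disruptive as a reboot.)

    # SysRq-T: ask the kernel to dump every task's state to the kernel log.
    # Don't do this on a heavily loaded machine unless you can afford to
    # lose it; see above.
    with open('/proc/sysrq-trigger', 'w') as f:
        f.write('t')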
Instead we turned to the crash program (once we stumbled over it; see
also here
for a usage overview). Crash has many features, but the one that
we cared about is that it has a command ('foreach bt') that will
dump a kernel stack backtrace for every process, and since crash
is just a user mode program, running it didn't blow up the system
the way using magic SysRq did. The output wasn't perfect, since
other processes kept running and could change, appear, and disappear
while crash was trying to gather all of this information, but when
you have several hundred badly frozen processes, most of them aren't
going to change and most of the time it works out. (And when it
didn't, we could just re-run the command.)
The raw crash kernel backtraces were not immediately useful, as there was too much noise and confusion (and a certain number of things that were plain garbled). I wrote a set of programs to post-process and analyze them; these created simplified call traces, trimmed off internal locking routines, aggregated processes with identical call traces to give us counts of common paths, and analyzed what processes seemed to be doing based on the call traces.
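(Those scripts were specific to our traces and our kernel, but the core aggregation idea is simple. Here is a rough sketch of it, assuming crash's usual 'foreach bt' output where each process starts with a 'PID: ...' header line and each stack frame looks roughly like ' #3 [<address>] function_name at <address>'; the list of 'boring' functions to trim is just an example.)

    # Sketch: reduce crash 'foreach bt' output to counts of distinct
    # simplified call traces.  Illustrative only; real output is noisier.
    import re, sys

    FRAME = re.compile(r'^\s*#\d+\s+\[\S+\]\s+(\S+)')   # frame -> function name
    PROC = re.compile(r'^PID:')                         # start of a new process

    # Locking and scheduling internals to trim out of the traces; these
    # particular names are just examples.
    BORING = set(['schedule', 'io_schedule', '__mutex_lock_slowpath'])

    def traces(lines):
        cur = []
        for line in lines:
            if PROC.match(line):
                if cur:
                    yield tuple(cur)
                cur = []
                continue
            m = FRAME.match(line)
            if m and m.group(1) not in BORING:
                cur.append(m.group(1))
        if cur:
            yield tuple(cur)

    counts = {}
    for t in traces(sys.stdin):
        counts[t] = counts.get(t, 0) + 1
    for t, n in sorted(counts.items(), key=lambda kv: -kv[1]):
        print("%5d  %s" % (n, " <- ".join(t)))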
This analysis caused certain things to jump out; there were a lot of processes waiting to acquire mutexes in only a few code paths. Figuring out what mutexes these code paths were trying to acquire took reading the kernel source, and then it took more reading to come up with a comprehensive list of what paths took these mutexes in order to do what, followed by the fun game of 'so, what could cause the holder of this mutex to block?'
All of this wound up pointing to ext2 inode allocation and deallocation blocking on disk IO in order to read bitmap blocks; in the traces we could see a whole lot of processes waiting to get the mutex, and one process that was actually waiting on disk IO to finish.
(Well, if we got somewhat lucky with crash's timing we could see that; there was no guarantee that crash would look at the process doing IO before its IO completed. But one of the great things about using crash was that I could run it over and over and over again to get huge volumes of traces to do lots of analysis on, since it wasn't as if the problem was going away on me.)
Sidebar: what I'd really like (and what I hope exists)
Crash is a nice tool, but what would be better is something that looks directly at the mutex locking state in order to show what processes are waiting on what locks, what process holds each lock and what it's doing, and so on. Such a thing would have pointed pretty much directly at our problem, without the need for post-processing and analysis and so on.
(There's the promising looking lockstat, but it doesn't
look like it's enabled in common kernels; it's CONFIG_LOCK_STAT if
you want to build your own. I don't know what
performance impacts turning it on has, but a slow system may be better
than one that explodes.)
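(If your kernel does have CONFIG_LOCK_STAT, my understanding is that you turn collection on through /proc/sys/kernel/lock_stat and then read the accumulated contention statistics out of /proc/lock_stat. Something like this sketch, which just shows the start of the (long) report:)

    # Sketch: turn on lock statistics and peek at the beginning of the
    # report.  Needs a kernel built with CONFIG_LOCK_STAT, and root.
    with open('/proc/sys/kernel/lock_stat', 'w') as f:
        f.write('1\n')
    # ... let the system run and hit the problem for a while, then:
    with open('/proc/lock_stat') as f:
        for i, line in enumerate(f):
            if i >= 40:
                break
            print(line.rstrip())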
2010-01-08
Interesting things can happen when you scale things up
This is a sysadmin war story.
Once upon a time, there were a bunch of IMAP servers. Since this was long ago, they were running Linux with the 2.4 kernel. They started out storing their mail on locally attached 72 GB SCSI disks, organized simply with one ext2 filesystem per disk, but then they moved the storage to a faster and more sophisticated SAN backend with RAID-10 arrays (still on small fast enterprise disks), giving each server node a single logical array (on a dedicated set of drives) and data filesystem (still ext2).
Not too long after the move to the SAN, the servers started falling over every so often, unpredictably; their load average would climb into the many hundreds (we saw load averages over 700), IMAP response times would go into the toilet, and eventually the machine would have to be force-booted. However, nothing obvious was wrong with the system stats (at least nothing that seemed to correlate with the problems). Partly through luck, we discovered that the time it took to touch and then remove a file in the data filesystem was closely correlated to the problem; when the time started going up, the system was about to get hammered. In the end, this led us to the answer.
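(The check itself was nothing fancy; the sketch below conveys the idea. The path and the threshold are made up for illustration, and all it does is time a file create plus remove on the data filesystem.)

    # Sketch of the 'time a touch and remove' probe; the path and the
    # threshold are invented for illustration.
    import os, time

    SCRATCH = '/imapdata/.probe-file'    # somewhere on the data filesystem
    THRESHOLD = 0.5                      # seconds

    start = time.time()
    fd = os.open(SCRATCH, os.O_CREAT | os.O_WRONLY, 0o600)
    os.close(fd)
    os.unlink(SCRATCH)
    elapsed = time.time() - start

    print("create+unlink took %.3f seconds" % elapsed)
    if elapsed > THRESHOLD:
        print("warning: inode operations are slowing down; trouble may be coming")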
Ext2 keeps track of allocated inodes (and allocated blocks) in bitmap blocks in the filesystem. In Linux 2.4, all changes to these bitmaps for a single filesystem were serialized by a single filesystem-wide kernel mutex, so only one process could be allocating or freeing an inode at a time. In the normal course of events, this is not a problem; most filesystems do not have a lot of inode churn, and if they do the bitmap blocks will all stay cached in system RAM and so getting the mutex, updating the bitmap, and releasing the mutex will normally be fast.
What had happened with us was that this broke down. First, we had a lot of inode churn because IMAP was creating (and then deleting) a lot of lockfiles. This was survivable when the system had a lot of separate filesystems, because each of them had a separate lock and not that many bitmap blocks. But when we moved to the SAN we moved to a single big filesystem; this meant both a single lock for all file creation and deletion, and that the filesystem had a lot of bitmap blocks.
(I believe that pretty much the same amount of disk space was in use in both cases; it was just organized differently.)
This could work only as long as either almost all of the bitmap blocks stayed in cache or we didn't have too many processes trying to create and delete files. When we hit a crucial point in general IO load and memory usage on an active system, the bitmap blocks started falling out of cache, more and more inode operations had to read bitmap blocks back in from disk while holding the mutex (which meant they took significant amounts of time), and more and more processes piled up trying to get the mutex (which was the cause of the massive load average). Since this lowered how frequently any particular bitmap block was being used, it made them better and better candidates for eviction from cache and made the situation even worse.
(Of course, none of this showed up on things like iostat because
general file IO to do things like read mailboxes was continuing
normally. Even the IO to read bitmap blocks didn't take all that long on
a per-block basis; it was just that it was synchronous and a whole lot
of processes were effectively waiting on it.)
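(If you want to see this dynamic in miniature, here is a toy model of it; it is purely illustrative and all the numbers are invented. A crowd of threads serialize on one lock, and the more often the critical section has to do a slow 'disk read', the more of them wind up piled up waiting.)

    # Toy model of the pile-up: many 'IMAP processes' all serializing on a
    # single per-filesystem lock, where the critical section is usually
    # fast (bitmap block cached) but sometimes does a synchronous 'disk
    # read'.  All numbers are invented; only the shape of the result matters.
    import threading, time, random

    LOCK = threading.Lock()
    WAITING = [0]                       # threads currently queued on LOCK
    COUNTER_LOCK = threading.Lock()

    def worker(miss_rate, stop_at):
        while time.time() < stop_at:
            with COUNTER_LOCK:
                WAITING[0] += 1
            with LOCK:
                with COUNTER_LOCK:
                    WAITING[0] -= 1
                if random.random() < miss_rate:
                    time.sleep(0.01)    # 'read a bitmap block from disk'
                # else: bitmap block is cached, effectively instant
            time.sleep(0.001)           # do some other IMAP work

    for miss_rate in (0.0, 0.1, 0.5):
        stop_at = time.time() + 2
        threads = [threading.Thread(target=worker, args=(miss_rate, stop_at))
                   for _ in range(100)]
        for t in threads:
            t.start()
        time.sleep(1.5)
        print("miss rate %.1f: %d of 100 threads waiting on the lock"
              % (miss_rate, WAITING[0]))
        for t in threads:
            t.join()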
Fortunately, once we understood the problem we could do a great deal to
mitigate it, because the lockfiles that the IMAP server was spending all
of that time and effort to create were just backups to its fcntl()-based
locking. So we just turned them off, and things got significantly
better.
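(For anyone who hasn't run into it, 'fcntl()-based locking' here just means the kernel's ordinary advisory file locking, the kind that goes away on its own when the process exits; in Python terms it looks roughly like this, with the mailbox path invented for illustration.)

    # Sketch of fcntl()-based advisory locking; the mailbox path is
    # invented for illustration.
    import fcntl

    with open('/var/mail/someuser', 'r+') as mbox:
        fcntl.lockf(mbox, fcntl.LOCK_EX)    # blocks until we get the lock
        # ... read or update the mailbox ...
        fcntl.lockf(mbox, fcntl.LOCK_UN)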
(The overall serialized locking problem was fixed in the 2.6 kernel as part of work to make ext2 and ext3 more scalable on multiprocessor systems, so you don't have to worry about it today.)