2009-08-31
The IO scheduler improvements I saw
In the spirit of sharing actual numbers and details for things that I left a bit unclear in an earlier entry, here is more:
First, we switched to the deadline IO scheduler (from the default
cfq). I did brief tests with the noop scheduler and found it
basically no different from deadline for my test setup, and deadline
may have some advantages for us with more realistic IO loads.
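(For the record, the scheduler can be checked and changed per-device on the fly through sysfs; a minimal sketch, assuming your backend disk shows up as sdb:

cat /sys/block/sdb/queue/scheduler
echo deadline > /sys/block/sdb/queue/scheduler

The cat shows the available schedulers with the active one in brackets. To make the change stick across reboots you can boot with 'elevator=deadline', which sets the default for all devices.)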
My IO tests were sequential read and write IO, performed directly
on a test fileserver, which uses
a single iSCSI backend. On a ZFS pool that
effectively is a stripe of two mirror pairs, switching the backend to
deadline increased a single sequential read from about 175 MBytes/sec
to about 200 MBytes/sec. Two sequential reads of separate files were
more dramatic; aggregate performance jumped by somewhere around 50
MBytes/sec. In both cases, this was close to saturating both gigabit
connections between the iSCSI backend and the fileserver.
(Since all of these data rates are well over the 115 MBytes/sec or so that NFS clients can get out of our Solaris fileservers, this may not make a significant difference in client performance.)
I measured no speed increase for a single sequential writer, but it was already more or less going at what I believe is the raw disk write speed. (According to the IET mailing list, other people have seen much more dramatic increases in write speeds.)
I didn't try to do systematic tests; for our purposes, it was enough
that deadline IO scheduling had a visible performance effect and
didn't seem to have any downsides. I didn't need to know the specific
contours of all of the improvements we might possibly get before I
could recommend deployment on the production machines.
2009-08-26
What packages SystemTap requires on Ubuntu 8.04 (and others)
For my own future reference, and because the SystemTap wiki is not the clearest thing on this, here are the packages required to use SystemTap on various distributions.
We use the -server kernels on our Ubuntu 8.04 machines, so I needed to do:
apt-get install systemtap linux-image-debug-server linux-headers-server
ln -s /boot/vmlinux-debug-$(uname -r) /lib/modules/$(uname -r)/vmlinux
In the Ubuntu/Debian way, the linux- packages are generic ones that will pull in the necessary kernel version specific package. I believe that this adds about 60 MBytes of disk space usage.
(Mostly researched from here, which believes that you also need some additional bits that turned out to be unnecessary for me.)
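(A quick smoke test to confirm that everything is in place once the packages and symlink are set up:

stap -v -e 'probe begin { println("systemtap works"); exit() }'

If the module compiles, loads, and prints the message, SystemTap is finding the debug kernel and headers.)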
The Red Hat Enterprise 5 directions from the SystemTap wiki are accurate. Note that the necessary packages added up to over 600 MBytes (of installed space) on my x86_64 test system.
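From memory, the wiki's directions boil down to roughly this (treat the package names as approximate and check the wiki itself; debuginfo-install comes from yum-utils and needs the debuginfo repositories enabled):

yum install systemtap kernel-devel yum-utils
debuginfo-install kernel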
Although I haven't tested them, I believe that the Fedora directions are basically
accurate. Current versions of Fedora don't need you to explicitly
install the kernel-devel RPM, as the systemtap RPM already depends
on it. (It does not on RHEL for some reason.)
I find it both unfortunate and a sign of a somewhat broken package system that both Ubuntu and Fedora/RHEL need extra steps to make SystemTap work; in a sensible world, simply installing SystemTap would install all of its dependencies. I understand why both Ubuntu and Fedora/RHEL are broken; I just don't think that they should be.
(Ubuntu is broken mostly because it has no fixed package name that means 'the current kernel, whatever that is', which I think is ultimately because the Debian package format lacks multi-arch support, although I haven't looked deeply into this. Fedora and RHEL are broken because they put the necessary kernel debugging packages into non-default repositories, possibly so that users aren't confused by a profusion of -debuginfo RPMs in various listings of available packages. Since the kernel debuginfo RPMs are huge, it would be nice to have smaller ones in the main repository that just had the data that SystemTap needs.)
2009-08-24
The problem with the CFQ IO scheduler and our iSCSI targets
Current versions of Linux have several different ways to schedule IO activity; one writeup of this is here. The developers of the iSCSI target software that we use have for some time recommended that people switch away from CFQ (the default one) to either the 'deadline' or the 'noop' scheduler to get better performance. Recently I got around to testing this to see if it made a difference in our environment, and it turns out that it does.
(Whether the difference is significant is an open question.)
The specifics of the difference are less interesting than why it happens. According to what I've gathered from the developers, this is what is going on:
CFQ is the 'completely fair queuing' scheduler. Its goal is to fairly schedule IO between multiple contending processes, so that each gets its fair share of disk bandwidth and IO, and so that one highly active process doesn't lead to long delays for IO from other processes. (This problem has a long history in Unixes, especially once the unified buffer cache appeared.)
The problem between CFQ and our iSCSI target software is that the target driver uses multiple threads and randomly assigns incoming IO requests to them. Each thread is seen as a separate context by CFQ, and as part of being fair CFQ won't merge contiguous requests together if they're in different CFQ contexts. So the frontend splits up one big IO operation into multiple iSCSI requests (because of, for example, size limits on a single iSCSI request), these iSCSI requests are dispatched to different threads on the backend, and then they are scheduled and completed separately (and thus slower).
(The other case that I can imagine is several different streams of sequential IO requests, such as readaheads on multiple files at once. The target software may well split the IO requests for a given stream across threads, thereby killing any chance of merging them.)
Because the deadline scheduler doesn't attempt fair scheduling this way, it will merge such requests together, which is what we want. My understanding is that it will also try to make sure that no individual request waits too long, which has various potential benefits in our environment.
(We effectively have three or four different logical disks on each physical disk, so we don't want IO to one logical disk to starve the others. This implies that iSCSI cooperating with CFQ with one CFQ context per logical disk/LUN would probably be ideal, since that would fairly divide the disk bandwidth between the LUNs. The IET developers are talking about doing that at some point in the future.)
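One crude way to watch for this effect on the backend is iostat's request-merging columns during a sequential read, with sdb again standing in for whatever your backing disk actually is:

iostat -x sdb 1

The rrqm/s and wrqm/s columns count merged read and write requests per second; if merging is the issue, they should be visibly higher under deadline than under CFQ.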
Having written all this (partly to get it straight in my head), I now wonder if this issue also affects other sorts of disk and fileservers that use threads. The big one would be NFS servers, since I believe that they have a similar thread pool setup.
(Unfortunately I have no Linux NFS servers to do experiments with.)
2009-08-18
Why user programs mapping page zero is so bad news on x86 hardware
The Linux kernel recently had a significant security issue or two where the root cause was that user programs could map memory at page zero, and this led to kernel-level exploits. If you went through the same sort of undergrad OS course that I did, you might be wondering how on earth a user process memory mapping issue leads to a kernel exploit; after all, as all of those little box diagrams tell us, the user program address space is one thing and the kernel address space is an entirely different thing.
That's the nice theoretical view as presented in undergrad OS courses. The messy reality of actual hardware is that on 32-bit x86 machines, accessing a completely separate address space is really expensive (I remember figures of a 10% to 20% overall performance hit, depending on what your programs do). The result is that no common operating system puts its kernel in a completely separate address space on x86 machines; instead pretty much everyone (not just Linux) embeds the kernel in every user process's address space and relies on page protections to keep it inaccessible to user code.
(There actually have been Linux patches that change this, such as the '4G/4G' split.)
When the system switches into kernel mode, the kernel's pages become accessible. But this is not a switch between address spaces; it's extra permissions, so the current user process's pages stay visible and accessible, although properly written kernel code doesn't ever directly touch them.
Now we get to the problem. Page zero is where NULL pointers point; if the kernel dereferences a NULL pointer in some way, it will try to access something in page zero or shortly above it. Thus if a user program can map a page at page zero and then persuade the kernel to dereference a NULL pointer, this shared and accessible address space means that the kernel is directly getting data from the user program's page without realizing it, and the user program is in control of the result of the NULL dereference. In the most dangerous case, the kernel is dereferencing a function pointer that it will go and jump to; as it happens, an x86 CPU is perfectly happy to jump to a user page and run code there while in kernel mode.
(This is instant game over if it happens, since the kernel is now running arbitrary attack code of the program's choice.)
This is not just a Linux problem; this is an issue for pretty much any x86 operating system that can ever be coaxed into dereferencing NULL pointers in kernel mode. Either you need very good, very foolproof protection against NULL pointer dereferences (and one of the Linux bugs recently showed how hard this is), or you need to make absolutely sure that a user program can never map page zero.
(For safety you should also forbid low memory close to page zero, in case you ever dereference a NULL pointer with a relatively large offset.)
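On Linux, this is what the vm.mmap_min_addr sysctl is for; it sets the lowest address that user processes are allowed to mmap(). Checking it and raising it to a 64 KB floor looks like this:

sysctl vm.mmap_min_addr
sysctl -w vm.mmap_min_addr=65536

(As some of the recent exploits showed, this is a mitigation that can sometimes be bypassed, not an absolute guarantee.)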
2009-08-05
A feature that I wish Linux package managers had
One of the things that I wish Linux package managers like yum and Apt
had was a convenient way of retrieving the original, stock version of
some file and optionally reinstalling it into its original place. I'd
expect (and wouldn't mind) that this would require re-downloading the
original package that the file came from (and thus that it would be
restricted to packages that are still in the package repository and
haven't been obsoleted and removed by some more recent package).
You might wonder what use this is. Well, perhaps you are more carefully organized than we are, but I find myself periodically wondering 'what did we change in file X?', for various varieties of X, and then trying to answer the question. While we have our customized files carefully saved in our reinstall area, we don't necessarily have the original file, and in particular we may not have the original file from the current OS release (if we have, for example, been carrying a customized file forward from OS release to OS release instead of attempting to redo our changes on top of each release's version).
(While careful use of a sufficiently sophisticated version control system can make it possible to keep track of all of this, it takes a bunch of additional work that you have to do all the time, even during emergencies when you're doing quick hacks to get the system going again.)
You can already do this by hand if you want, and I have, but it's a bunch of tedious work that the package manager could automate. Personally I think that it's a common enough thing (either in the 'I want to check the original' form or in the 'whoops, let's revert to stock configuration' form) that it should be a package manager feature.
(Today I found myself using 'apt-get purge lighttpd; apt-get install
lighttpd' on a test system as the easiest way to get the stock
lighttpd.conf file to look at. And you know, it was, and this is not a
slam at Apt; I probably would have done the same thing with yum on an
RPM-based system.)
Sidebar: how to do it by hand
The incantations that I know of for doing this in a yum-based system
or an apt-based one are:
- for yum: use yumdownloader to get a copy of the .rpm for the package into some convenient scratch directory, then use 'rpm2cpio <whatever>.rpm | cpio -di' to unpack it. Fish out the desired file.
- for Apt: get the .deb from somewhere (there is probably a magic apt way to do this that's analogous to yumdownloader; I just go to packages.ubuntu.com, search around, and follow the links to the binary packages), then use 'dpkg-deb -x <whatever>.deb scratchdir' to unpack it. Fish out the desired file.
As I score this, the two sides come out about equal in the end.
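(On a sufficiently modern apt-based system you can script the whole Apt side of this dance; here is an illustrative sketch that assumes your apt has 'apt-get download', which is the 'magic apt way' I was missing above. The function name is my own invention:

origfile() {
    pkg=$(dpkg -S "$1" | cut -d: -f1)    # which package owns this file
    dir=$(mktemp -d)
    (cd "$dir" && apt-get download "$pkg" && dpkg-deb -x "$pkg"_*.deb .)
    echo "$dir$1"                        # where the pristine copy now lives
}

Then 'origfile /etc/lighttpd/lighttpd.conf' prints the path of a freshly unpacked stock copy, without touching the installed one.)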