2009-06-27
Possible limits on our port multiplied ESATA performance
Our iSCSI targets use port multiplied ESATA to talk to their data disks. This, for those who have not had to deal with it yet, is a way of undoing one of the advantages of SATA; instead of having one disk per port, channel and cable, now you have some number of disks per port (I've seen as many as five). The advantage is that if you want lots of disks, you don't have to find room for (say) 12 individual connectors on the back of your enclosure, and to somehow get 12 individual ESATA ports into your system (with the resulting proliferation of PCI cards); instead you can just have three or four.
It's clear that we are running into some sort of overall bandwidth limit when doing streaming reads. The question is where. I can think of a number of possible limits:
- the PCI Express bus bandwidth of the ESATA controller card, since
all 12 disks are being handled on one controller card (which is
absolutely what you want in a 1U server).
(I don't know enough about reading lspci -vv output to know how fast the card claims to be able to go; see the sketch after this list. The theoretical bus bandwidth seems unlikely to be the limiting factor.)
- channel contention or bandwidth limits, since we have four disks on each port. I did some earlier testing that didn't see any evidence of this, but it wasn't exhaustive and it was done on somewhat different hardware.
- kernel performance limits, where either the overall kernel or the
driver for the SiI 3124 based cards we're using can't drive things
at full speed.
(I'm dubious about this; issuing large block-aligned IOs straight to the disks does not seem like it would be challenging.)
- some kind of general (hardware) contention issue, where there is so much going on at once that requests can't be issued and serviced at full speed for all disks.
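For what it's worth, a rough way to poke at the first two possibilities is sketched below. This is only a sketch: the PCI address and the sdX device names are made-up placeholders, and it assumes the controller really is a PCIe device (and that your dd supports iflag=direct).
# what PCIe link the card advertises (LnkCap) and actually negotiated (LnkSta):
lspci -vv -s 03:00.0 | grep -E 'LnkCap|LnkSta'
# aggregate streaming read bandwidth with all disks going at once,
# to compare against a single disk by itself:
for d in sda sdb sdc sdd; do
    dd if=/dev/$d of=/dev/null bs=1M count=4096 iflag=direct &
done
wait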
Fortunately, this performance shortfall is basically irrelevant in our environment; for sequential IO, the iSCSI targets will be limited by total network bandwidth well before they'll run into this limit, and for random IO you care far more about IO operations a second (per disk) than you do about bandwidth.
(We are not rich enough to afford 10G Ethernet. And I have seen an iSCSI target easily do 200 Mbytes/sec of read IO, saturating both of its iSCSI networks.)
Looking into this has shown me that I don't know as much about the current state of PC hardware and its performance characteristics as I'd like to. (Yes, I know, it's sort of like the weather; wait a bit and it will change again.)
2009-06-20
Using GRUB to figure out the mapping of BIOS drive numbers
Suppose you have a piece of hardware where the mapping between BIOS drive numbers and Linux device names is not straightforward, and you want to figure out just what the mapping is (at least for your current disks). If you use GRUB (and you probably do), there's a relatively simple way of doing this.
GRUB has two important attributes for this: it can read Linux filesystems and tell you what's in them, and at boot time it is working with the BIOS drive numbering. So the basic approach is to plant unique flag files in filesystems on each drive, reboot your system, break into the GRUB shell, and use GRUB commands to find out what flag file is on each BIOS disk number.
(Okay, there's an important qualifier to this: this is the BIOS boot order, not the actual labels that the BIOS and the motherboard use. It is possible to have the BIOS perturb the boot order from the label order, so that 'SATA 3' is the boot drive, not 'SATA 1'.)
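As a concrete sketch of planting the flag files, assuming each disk already has a small GRUB-readable filesystem on its first partition (the device names and mount point here are just examples):
for d in sda sdb sdc sdd; do
    mount /dev/${d}1 /mnt && touch /mnt/FLAG-$d && umount /mnt
done
Then reboot and check each disk from the GRUB shell, as described below.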
GRUB doesn't have an explicit ls command or the like, but what it does
have is filename autocompletion. So the basic way to look for the flag
file on each partition is something like the following commands:
root (hdX,Y)
kernel /FLAG-<TAB>
This will helpfully tell you which 'FLAG-*' file is on partition Y of BIOS disk number X (numbering from 0). You can repeat this sequence for each BIOS disk number that you have.
(You will probably have to break out of the usual GRUB menu into GRUB's command mode during booting. As far as I know there's no way back to menu mode, so just reboot the machine afterwards.)
Depending on your configuration, getting a unique flag file onto a
simple filesystem in a simple partition on each disk may be either
trivial or very complex (this is one situation where a mirrored
/boot is a drawback). GRUB needs a whole
filesystem of a type that it can read (ext2, ext3, and some others); it
can't read things inside LVM, software RAID-5, and so on.
If you don't already have such a set of filesystems, here are some suggestions:
- if you have a mirrored /boot, consider deliberately breaking the mirror apart temporarily.
- if you have swap partitions, you can turn off swapping to them and reuse them for filesystems; see the sketch after this list. (If they're mirrored, you can break the mirroring too.)
- if your disks have different numbers of partitions, you may not need a flag file at all; you can use GRUB's root command by itself to see what partitions are defined on each drive. (But then, if you have a different number of partitions on each drive, you probably have single-partition filesystems that GRUB can read.)
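For the swap partition case, the mechanics are roughly the following; /dev/sdb2 is only an example name, and you'll want to put it back to being swap afterwards:
swapoff /dev/sdb2
mke2fs /dev/sdb2
mount /dev/sdb2 /mnt && touch /mnt/FLAG-sdb && umount /mnt
# afterwards: mkswap /dev/sdb2 && swapon /dev/sdb2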
In my case, it turned out that ASUS M4N2-SLI motherboards invert the
order of the first pair of SATA channels relative to the last two;
SATA 1 and 2 were sdc and sdd in Linux, and SATA 3 and 4 were sda and
sdb (at least in Fedora 8; later kernels may have changed this). Then
I had extra fun because the BIOS boot order had gotten perturbed to
something like 2, 3, 1, 4.
2009-06-19
Fedora desperately needs a better upgrade system
I just upgraded my office workstation from Fedora 10 to Fedora 11 via preupgrade, one of the two officially supported ways of doing it. It took four hours. Of course, my machine was effectively down for those four hours, as it was off the network and unusable.
(While my office machine has a lot of RPMs installed (2742 after the upgrade), it is not a slow machine, so I suspect that this is not an unusual thing.)
I know, Fedora doesn't officially like upgrades; they want you to reinstall from scratch every time. Newsflash to Fedora: this is not viable for real people using your distribution for real work.
For me the major problem is those four hours of downtime. I don't care if the upgrade takes four hours, but I do care a lot if I can't use my machine during that time, ideally with some reasonable facsimile of my regular environment. Thus there are two good options that I can see: either really supporting yum-based upgrades or creating a 'live cd' style upgrade environment. I would prefer the former, but the latter is probably easier (although doing a good job is hard, since you want to pull as many settings as possible from the user's regular system).
(It's possible that something horribly slow that shouldn't be happening is being done as part of the upgrade process, as I am pretty sure that installing from scratch would take substantially less time than four hours, and there's only so much that can be blamed on filesystem fragmentation. Alternately, there is or was something quietly but badly broken on my Fedora 10 system that caused the upgrade to be very, very slow.)
(I would like to say that a basic live cd upgrade environment should be easy to put together for Fedora, but I can't actually remember if the Fedora Live CD stays 'live' if you opt to install the system, the way that the Ubuntu Live CD does. If it does, an equivalent version for upgrades ought to be easy since both installs and upgrades use Anaconda.)
2009-06-18
A kernel NFS error message explained
Suppose that your machine is an NFS client, and it periodically logs a kernel error message that looks like:
nfs_stat_to_errno: bad nfs status return value: 16
What does this mean?
The short summary is that your NFS server is violating the NFS specifications. NFS requests can fail, and when they do the server returns an error number to tell the client why. The NFS specs say what the allowed error numbers are; the Linux kernel client code checks that the error number it got was one of the allowed ones. You get this message when this check fails. The number is the (decimal) number of the invalid NFS error code.
(You can find the list of valid NFS v3 error codes in RFC 1813 section 2.5 and the NFS v4 error codes in RFC 3530 section 18. I believe that NFS v2 error codes are a subset of NFS v3 ones, and NFS v4 ones are almost a superset of NFS v3 ones.)
Now some guesses as to what those stray error values actually mean.
If you look at the actual numbers of those defined error values and have
a Solaris machine handy (or just a general knowledge of Unix), something
jumps out at you immediately: many of the NFS 'error codes' actually
have the same numbers as the corresponding SunOS errno value for the
problem. I strongly suspect that the original NFS server and client code
did not have NFS 'error codes' as such; instead it took the server's
kernel errno value generated from trying to do the request, stuffed
it in the NFS reply, and on the client took the NFS error code and set
errno to that.
I further suspect that some NFS servers still do this. Thus, if you get such an error message from your client kernels, your best bet at figuring out what the NFS server is trying to say is to look at what error that value is on your NFS server. Figuring out why some operation seems to be getting this error is up to you.
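A quick way to see what a given errno number means on a particular system is something like the following (the header path is the Linux one; on Solaris the definitions are in /usr/include/sys/errno.h instead):
grep -w 16 /usr/include/asm-generic/errno-base.h
# or, if you have Perl handy:
perl -e '$! = 16; print "$!\n"'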
(In the specific case of this message for us, our NFS servers are
Solaris 10 machines, where 16 is EBUSY. Low errno numbers are
relatively well standardized; high ones are not necessarily so.)
2009-06-04
Another irritation with Gnome's gconf settings system
As if the first irritation wasn't enough, there's another problem with the Gnome settings stuff. It is this: modern Gnome applications keep all of their settings locked up inside gconf, even things that you enter and that in another, simpler era would have been stuck in dotfiles somewhere.
This has a practical issue that I have been running into after learning about convenient ssh in Gnome; it's hard to move settings for something from machine to machine. Consider the mini-commander applet macros, which are stored as gconf settings. I'd like to be able to just copy improvements I make from system to system (instead of having to re-enter them), and to easily install the setup on new machines. With dotfiles, this would be obvious; with gconf, what I have is in practice an opaque blob.
(To its credit, sshmenu does have an actual dotfile for its host information that I can just copy around.)
In theory, one can use tools like gconftool-2 to extract and then
set this information, provided that you know the gconf keys (and you
can use gconftool-2 to trawl for the keys, too). In practice, this
leaves you to reverse-engineer the application's key usage and thus
to hope that you have not missed anything important, that you have
included everything you need to, and that you haven't included something
you don't want to.
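For concreteness, the mechanics look something like this; the application directory, key name, and value here are entirely hypothetical:
# list every key and value under an application's gconf directory:
gconftool-2 --recursive-list /apps/someapp
# read and set a single key once you think you know which one matters:
gconftool-2 --get /apps/someapp/some_setting
gconftool-2 --set --type string /apps/someapp/some_setting 'new value'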
Part of the problem is that, of course, applications are not organizing
their keys for your convenience; in this sort of case, they're using
gconf as a datastore. The mini-commander applet makes a good example;
as far as I can tell, it stores its macros in two keys (both in
/apps/mini-commander); macro_patterns seems to be a list of the
regexp patterns, in order, and macro_commands seems to be a list of
the matching commands, in the same order. I assume that it's stored this way because gconf only has single-level lists, so you can't have lists of lists.
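One way to move such an opaque blob around wholesale is gconftool-2's --dump and --load options; a sketch, with no guarantee that it captures everything the applet cares about:
gconftool-2 --dump /apps/mini-commander > mini-commander.xml
# copy mini-commander.xml to the other machine, then:
gconftool-2 --load mini-commander.xml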
(Disclaimer: it's possible that the mini-commander applet is a bad example and that good Gnome programs don't act this way. I'm not really optimistic, though.)