Wandering Thoughts archives

2009-06-27

Possible limits on our port multiplied ESATA performance

Our iSCSI targets use port multiplied ESATA to talk to their data disks. This, for those who have not had to deal with it yet, is a way of undoing one of the advantages of SATA; instead of having one disk per port, channel and cable, now you have some number of disks per port (I've seen as many as five). The advantage is that if you want lots of disks, you don't have to find room for (say) 12 individual connectors on the back of your enclosure, and to somehow get 12 individual ESATA ports into your system (with the resulting proliferation of PCI cards); instead you can just have three or four.

It's clear that we are running into some sort of overall bandwidth limit when doing streaming reads. The question is where. I can think of a number of possible limits:

  • the PCI Express bus bandwidth of the ESATA controller card, since all 12 disks are being handled on one controller card (which is absolutely what you want in a 1U server).

    (I don't know enough about reading lspci -vv output to know how fast the card claims to be able to go. The theoretical bus bandwidth seems unlikely to be the limiting factor; see the rough numbers sketched after this list.)

  • channel contention or bandwidth limits, since we have four disks on each port. I did some earlier testing that didn't see any evidence of this, but it wasn't exhaustive and it was done on somewhat different hardware.

  • kernel performance limits, where either the overall kernel or the driver for the SiI 3124 based cards we're using can't drive things at full speed.

    (I'm dubious about this; issuing large block-aligned IOs straight to the disks does not seem like it would be challenging.)

  • some kind of general (hardware) contention issue, where there is so much going on at once that requests cannot be issued and serviced at full speed for all disks.
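
For some rough numbers, here is a back-of-the-envelope sketch. Every figure in it is an assumption about typical hardware of this sort (per-disk streaming rate, usable link bandwidths, PCIe lane count and generation), not a measurement of ours:

# Back-of-the-envelope ceilings for 12 disks of streaming reads. Every figure
# here is an assumption (per-disk streaming rate, usable link bandwidths,
# PCIe lane count and generation), not a measurement of our hardware.

DISKS = 12
PER_DISK_MB_S = 80            # assumed streaming read rate of one disk
DISKS_PER_PM_PORT = 4         # four disks behind each port multiplier

aggregate = DISKS * PER_DISK_MB_S            # ~960 MB/s if nothing gets in the way
sata_3g_link = 300                           # ~usable MB/s of one 3 Gbit/s ESATA link
per_disk_shared = sata_3g_link / float(DISKS_PER_PM_PORT)

pcie_x1_gen1 = 250                           # ~usable MB/s of a PCIe 1.x x1 link
pcie_x4_gen1 = 4 * pcie_x1_gen1

two_gige = 2 * 118                           # ~usable MB/s of two gigabit iSCSI networks

print("12 disks streaming:    ~%d MB/s" % aggregate)
print("one 3 Gbit/s port:     ~%d MB/s shared by 4 disks (~%d MB/s each)"
      % (sata_3g_link, per_disk_shared))
print("PCIe x1 / x4 (gen 1):  ~%d / ~%d MB/s" % (pcie_x1_gen1, pcie_x4_gen1))
print("two gigabit networks:  ~%d MB/s" % two_gige)

On those assumed numbers, either one shared 3 Gbit/s channel or a narrow PCIe link would cap 12 disks well below their aggregate streaming rate, while two gigabit networks cap an iSCSI target lower still.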

Fortunately, this performance shortfall is basically irrelevant in our environment; for sequential IO, the iSCSI targets will be limited by total network bandwidth well before they'll run into this limit, and for random IO you care far more about IO operations a second (per disk) than you do about bandwidth.

(We are not rich enough to afford 10G Ethernet. And I have seen an iSCSI target easily do 200 Mbytes/sec of read IO, saturating both of its iSCSI networks.)

Looking into this has shown me that I don't know as much about the current state of PC hardware and its performance characteristics as I'd like to. (Yes, I know, it's sort of like the weather; wait a bit and it will change again.)

PossibleESATALimits written at 02:49:27

2009-06-20

Using GRUB to figure out the mapping of BIOS drive numbers

Suppose you have a piece of hardware where the mapping between BIOS drive numbers and Linux device names is not straightforward, and you want to figure out just what the mapping is (at least for your current disks). If you use GRUB (and you probably do), there's a relatively simple way of doing this.

GRUB has two important attributes for this: it can read Linux filesystems and tell you what's in them, and at boot time it is working with the BIOS drive numbering. So the basic approach is to plant unique flag files in filesystems on each drive, reboot your system, break into the GRUB shell, and use GRUB commands to find out what flag file is on each BIOS disk number.

(Okay, there's an important qualifier to this: this is the BIOS boot order, not the actual labels that the BIOS and the motherboard use. It is possible to have the BIOS perturb the boot order from the label order, so that 'SATA 3' is the boot drive, not 'SATA 1'.)

GRUB doesn't have an explicit ls command or the like, but what it does have is filename autocompletion. So the basic way to look for the flag file on each partition is something like the following commands:

root (hdX,Y)
kernel /FLAG-<TAB>

This will helpfully tell you which 'FLAG-*' file is on partition Y of BIOS disk number X (numbering from 0). You can repeat this sequence for each BIOS disk number that you have.

(You will probably have to break out of the usual GRUB menu into GRUB's command mode during booting. As far as I know there's no way back to menu mode, so just reboot the machine afterwards.)

Depending on your configuration, getting a unique flag file onto a simple filesystem in a simple partition on each disk may be either trivial or very complex (this is one situation where a mirrored /boot is a drawback). GRUB needs a whole filesystem of a type that it can read (ext2, ext3, and some others); it can't read things inside LVM, software RAID-5, and so on.
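
As a concrete illustration of the flag file step, here's a minimal sketch in Python; the mount points and device names are hypothetical, and it assumes you already have a small GRUB-readable filesystem from each disk mounted somewhere:

# Minimal sketch: plant one uniquely named flag file per disk before rebooting.
# The mount points and device names are hypothetical examples; point them at
# wherever you have a GRUB-readable (ext2/ext3) filesystem from each disk.
import os

flag_filesystems = {
    "sda": "/mnt/flag-sda",    # e.g. a temporarily repurposed swap partition
    "sdb": "/mnt/flag-sdb",
    "sdc": "/mnt/flag-sdc",
    "sdd": "/mnt/flag-sdd",
}

for dev, mountpoint in flag_filesystems.items():
    flag = os.path.join(mountpoint, "FLAG-%s" % dev)
    open(flag, "w").close()    # an empty file will do; only the name matters
    print("planted %s" % flag)

Then, at the GRUB prompt, completing kernel /FLAG- on each (hdX,Y) tells you which Linux device that BIOS disk number is.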

If you don't already have such a set of filesystems, here's some suggestions:

  • if you have a mirrored /boot, consider deliberately breaking the mirror apart temporarily.
  • if you have swap partitions, you can turn off swapping to them and reuse them for filesystems. (If they're mirrored, you can break the mirroring too.)

  • if your disks have different numbers of partitions, you may not need a flag file at all; you can use GRUB's root command by itself to see what partitions are defined on each drive.

    (But then, if you have a different number of partitions on each drive you probably have single partition filesystems that GRUB can read.)

In my case, it turned out that ASUS M4N2-SLI motherboards invert the order of the first pair of SATA channels relative to the last two; SATA 1 and 2 were sdc and sdd in Linux, and SATA 3 and 4 were sda and sdb (at least in Fedora 8; later kernels may have changed this). Then I had extra fun because the BIOS boot order had gotten perturbed to something like 2, 3, 1, 4.

GrubBiosMapping written at 00:44:51

2009-06-19

Fedora desperately needs a better upgrade system

I just upgraded my office workstation from Fedora 10 to Fedora 11 via preupgrade, one of the two officially supported ways of doing it. It took four hours. Of course, my machine was effectively down for those four hours, as it was off the network and unusable.

(While my office machine has a lot of RPMs installed (2742 after the upgrade), it is not a slow machine, so I suspect that this is not an unusual thing.)

I know, Fedora doesn't officially like upgrades; they want you to reinstall from scratch every time. Newsflash to Fedora: this is not viable for real people using your distribution for real work.

For me the major problem is that four hours of downtime. I don't care if the upgrade takes four hours, but I do care a lot if I can't use my machine during that time, ideally with some reasonable facsimile of my regular environment. Thus there are two good options that I can see: either really supporting yum-based upgrades or creating a 'live cd' style upgrade environment. I would prefer the former, but the latter is probably easier (although doing a good job is hard, since you want to pull as many settings as possible from the user's regular system).

(It's possible that the upgrade process is doing something horribly slow that it shouldn't be, as I am pretty sure that installing from scratch would take substantially less time than four hours, and there's only so much that can be blamed on filesystem fragmentation. Alternately, there is or was something quietly but badly broken on my Fedora 10 system that made the upgrade very, very slow.)

(I would like to say that a basic live cd upgrade environment should be easy to put together for Fedora, but I can't actually remember if the Fedora Live CD stays 'live' if you opt to install the system, the way that the Ubuntu Live CD does. If it does, an equivalent version for upgrades ought to be easy since both installs and upgrades use Anaconda.)

BetterFedoraUpgrades written at 01:56:40

2009-06-18

A kernel NFS error message explained

Suppose that your machine is an NFS client, and it periodically logs a kernel error message that looks like:

nfs_stat_to_errno: bad nfs status return value: 16

What does this mean?

The short summary is that your NFS server is violating the NFS specifications. NFS requests can fail, and when they do the server returns an error number to tell the client why. The NFS specs say what the allowed error numbers are; the Linux kernel client code checks that the error number it got was one of the allowed ones. You get this message when that check fails, and the number reported is the invalid NFS error code (in decimal).

(You can find the list of valid NFS v3 error codes in RFC 1813 section 2.5 and the NFS v4 error codes in RFC 3530 section 18. I believe that NFS v2 error codes are a subset of NFS v3 ones, and NFS v4 ones are almost a superset of NFS v3 ones.)
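
To make the client-side check concrete, here's a sketch of the same validation in Python. The set of codes is transcribed from RFC 1813 as I remember it, so double-check against the RFC before relying on it:

# Sketch of the check the kernel NFS client is doing: is the status value one
# of the NFS v3 error codes that RFC 1813 allows? (Values transcribed from the
# RFC as I remember them; double-check before relying on this set.)
NFS3_VALID_STATUS = {
    0,                                # NFS3_OK
    1, 2, 5, 6, 13, 17, 18, 19, 20,   # PERM, NOENT, IO, NXIO, ACCES, EXIST, XDEV, NODEV, NOTDIR
    21, 22, 27, 28, 30, 31,           # ISDIR, INVAL, FBIG, NOSPC, ROFS, MLINK
    63, 66, 69, 70, 71,               # NAMETOOLONG, NOTEMPTY, DQUOT, STALE, REMOTE
    10001, 10002, 10003, 10004,       # BADHANDLE, NOT_SYNC, BAD_COOKIE, NOTSUPP
    10005, 10006, 10007, 10008,       # TOOSMALL, SERVERFAULT, BADTYPE, JUKEBOX
}

status = 16                           # the value from the kernel message
if status not in NFS3_VALID_STATUS:
    print("nfs_stat_to_errno: bad nfs status return value: %d" % status)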

Now some guesses as to what those stray error values actually mean.

If you look at the actual numbers of those defined error values and have a Solaris machine handy (or just a general knowledge of Unix), something jumps out at you immediately: many of the NFS 'error codes' actually have the same numbers as the corresponding SunOS errno values for the problem. I strongly suspect that the original NFS server and client code did not have NFS 'error codes' as such; instead the server took the kernel errno value generated by trying to do the request and stuffed it into the NFS reply, and the client took that NFS error code and set errno to it.

I further suspect that some NFS servers still do this. Thus, if you get such an error message from your client kernels, your best bet at figuring out what the NFS server is trying to say is to look up what errno that value corresponds to on your NFS server. Figuring out why some operation seems to be getting this error is up to you.

(In the specific case of this message for us, our NFS servers are Solaris 10 machines, where 16 is EBUSY. Low errno numbers are relatively well standardized; high ones are not necessarily so.)
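
If you have Python on the server (or on any machine with the same errno numbering), the lookup itself is trivial; this is just a convenience, nothing NFS-specific:

# Translate a numeric errno into its symbolic name and message on this machine.
# errno numbering is per-OS, so do this on (or for) the NFS server itself.
import errno, os

value = 16
print(errno.errorcode.get(value, "unknown"))   # 'EBUSY' here
print(os.strerror(value))                      # 'Device or resource busy'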

NFSKernelErrorExplained written at 01:17:44

2009-06-04

Another irritation with Gnome's gconf settings system

As if the first irritation wasn't enough, there's another problem with the Gnome settings stuff. It is this: modern Gnome applications keep all of their settings locked up inside gconf, even things that you enter and that in another, simpler era would have been stuck in dotfiles somewhere.

This creates a practical problem that I have been running into after learning about convenient ssh in Gnome: it's hard to move the settings for something from machine to machine. Consider the mini-commander applet macros, which are stored as gconf settings. I'd like to be able to just copy improvements I make from system to system (instead of having to re-enter them), and to easily install the setup on new machines. With dotfiles, this would be obvious; with gconf, what I have is in practice an opaque blob.

(To its credit, sshmenu does have an actual dotfile for its host information that I can just copy around.)

In theory, one can use tools like gconftool-2 to extract and then set this information, provided that you know the gconf keys (and you can use gconftool-2 to trawl for the keys, too). In practice, this leaves you reverse engineering the application's key usage and hoping that you have not missed anything important, that you have included everything you need, and that you haven't included something you don't want.
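
For what it's worth, here's a sketch of the copy-around approach by shelling out to gconftool-2's --dump and --load (assuming your version has those options); the gconf directory and dump file name are just examples:

# Sketch: copy one application's gconf subtree between machines by shelling out
# to gconftool-2 --dump / --load (assuming your gconftool-2 has those options).
# The gconf directory and dump file name here are just examples.
import subprocess

GCONF_DIR = "/apps/mini-commander"
DUMP_FILE = "mini-commander-settings.xml"

# On the source machine: dump the subtree to an XML file.
with open(DUMP_FILE, "w") as f:
    subprocess.check_call(["gconftool-2", "--dump", GCONF_DIR], stdout=f)

# Copy DUMP_FILE to the target machine (scp or whatever), then on the target:
subprocess.check_call(["gconftool-2", "--load", DUMP_FILE])

Of course this has exactly the problem described above: you're trusting that the application keeps everything you care about under that one gconf directory.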

Part of the problem is that, of course, applications are not organizing their keys for your convenience; in this sort of case, they're using gconf as a datastore. The mini-commander applet makes a good example; as far as I can tell, it stores its macros in two keys (both in /apps/mini-commander); macro_patterns seems to be a list of the regexp patterns, in order, and macro_commands seems to be a list of the matching commands, in the same order. I assume that it being stored this way means that gconf only has single level lists, so you can't have lists of lists.
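
As a small illustration of that storage format, pairing the macros back up is just a matter of zipping the two lists together (the key names are as above; the example values here are made up):

# The two mini-commander keys are parallel lists kept in the same order;
# pairing them back up is a simple zip. These example values are made up.
macro_patterns = ["^lp (.*)$", "^man (.*)$"]
macro_commands = ["lpr \\1", "xterm -e man \\1"]

for pattern, command in zip(macro_patterns, macro_commands):
    print("%-12s -> %s" % (pattern, command))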

(Disclaimer: it's possible that the mini-commander applet is a bad example and that good Gnome programs don't act this way. I'm not really optimistic, though.)

GnomeSettingsIrritationII written at 02:22:31

