Wandering Thoughts archives

2011-11-23

The many names of Linux SATA devices

If you have SATA devices on your Linux system, the kernel gives them no less than four different names in four different namespaces (although arguably one of the namespaces is not really a kernel one). Because I recently had to deal with this, I feel like running down all of them.

The four namespaces are:

  • Traditional sdX names like sdj. Unless udev is playing around with them for you, these are allocated sequentially in the order that the SATA (or SCSI-like) drives are encountered. If you hotswap or hot-add a new drive, it will generally (but not always) get the next highest unused drive name. These names start from sda and go up.

    (If you hotswapped the highest drive, it will sometimes reuse its old name.)

    As I have found out the hard way, there is no particular guarantee that what Linux sees as sda will be either the first BIOS drive or the drive plugged into the SATA connector on your motherboard that is labeled 'SATA 1', and there is equally no guarantee that the first BIOS drive is what is plugged into 'SATA 1'. This gets very fun.

    sdX names also appear in sysfs as the primary name of drives, in /sys/block.
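    For illustration, a quick way to see both views on a live system (the sdj here is just an example name; on older kernels the /sys/block entries may be directories rather than symlinks):

      ls /sys/block                # sda sdb ... sdj ...
      readlink /sys/block/sdj      # the drive's full sysfs device path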

Unlike sdX names, none of the following three naming schemes change if and when you hotswap a drive; they stay constant.

  • What I will call 'SCSI host names' that look like 'sd 4:3:0:0' or 'scsi 4:3:0:0' (kernel messages use both forms). Note that 'host' here does not mean what you think it might mean; it is special SCSI terminology that roughly corresponds to what I would call a channel.

    The first number is the SCSI host (for SATA, a single port) and for SATA the second is the drive on that host. If the port doesn't have a SATA port multiplier, the second number is always 0; if it does, as we have on some machines, it counts up for each disk reached through the port. Note that a single SATA adaptor card can have multiple ports; each port is numbered separately.

    (The actual definition of the numbers can be seen in /proc/scsi/scsi. Note that non-SATA drivers can do numbering quite differently than SATA drivers; on our Dell 2950s using the mptsas driver, there is only one SCSI host, the second number is always 0, and the third number counts up for individual disks.)

    SCSI host numbers start from 0 and are allocated in the order the system encounters disk 'adaptors', broadly construed. They are never reused. Because USB materializes and de-materializes host adaptors as well as the hard drives themselves, on systems where you frequently insert and remove things like USB drives or camera memory cards you can get very high 'sd NN:...' numbers; my home system routinely reaches 'sd 20' (despite the actual memory card generally showing up as sde).

    These days Linux treats almost every disk as a SCSI disk regardless of the physical attachment method, so almost every disk gets a name like this. Although the kernel prints no messages about this, SCSI host numbers are allocated for all SCSI-like disk adaptors regardless of whether there are any disks attached to them at the time they are seen. This means that 'scsi X' numbers for actual disks can be non-contiguous and your first disk is not necessarily 'scsi 0:0:0:0'.

    (We have machines where the disk numbering is 'sd 2:0:0:0', 'sd 3:0:0:0', 'sd 5:0:0:0', 'sd 5:1:0:0', and so on. Yes, this is fun.)

    Whether IDE devices are pulled into this numbering seems to vary based on the kernel version and device drivers involved. On our SunFire X2100s, Ubuntu 8.04's kernel considers them sufficiently SCSI-like to be enumerated this way but a stock 2.6.25.3 kernel does not.

    SCSI hosts more or less appear in sysfs in /sys/class/scsi_host; their disks appear directly in /sys/class/scsi_disk. You can also see the same information in /proc/scsi/scsi.

    Kernel messages that use SCSI host names usually include the sdX device names as well.
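    For illustration, the places this numbering shows up on a live system (the exact output format varies):

      cat /proc/scsi/scsi          # 'Host: scsi4 Channel: 03 Id: 00 Lun: 00' entries
      ls /sys/class/scsi_host      # host0 host1 host2 ...
      ls /sys/class/scsi_disk      # 4:3:0:0 and so on, one entry per disk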

  • (s)ATA port and disk names, which look like 'ata5' and 'ata5.03'. Unlike SCSI host names, these are only allocated to real (s)ATA ports; USB drives and the like do not get them. The two parts of the name are the port number and the drive on the port; again, if you have no port multiplier the drive number will always be '00'.

    Just to confuse you, ATA ports are numbered starting from 1, unlike SCSI hosts (which start from 0). sd 4:3:0:0 is thus ata5.03, assuming that sd 0 through sd 4 are all (s)ATA devices. And yes, the Linux kernel will happily mix both sorts of names in drive error messages.

    SATA port multipliers themselves seem to be called 'ataN.15' in some kernel versions.

    These names do not appear in sysfs at all as far as I can see; they only appear in kernel messages. Unfortunately a drive that's experiencing transient problems is sometimes only (clearly) identified by ATA disk name, especially if you're trying to pick out which drive on a port multiplier port is the one that's having problems.
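    Since these names only show up in kernel messages, about the best you can do to hunt for them is something like this (the exact message formats vary by kernel version):

      dmesg | grep -E 'ata[0-9]+(\.[0-9]+)?:'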

  • Per-PCI-device names that you find in /dev/disk/by-path, which have the form 'pci-0000:03:00.0-scsi-2:3:0:0'. The bit after the -scsi- has the same general meaning as in SCSI host names except that the host number is per PCI device, not global; several different PCI devices can have a '-scsi-0:0:0:0' disk.

    How SATA ports on your motherboard are split between PCI devices is what you could politely call extremely varied. SATA ports on actual PCI cards are usually more predictable; one card is normally one PCI device. Finding the appropriate PCI device identifier for your card is up to you.

    In sysfs, PCI devices hide out in the depths of /sys/devices/pci* but understanding the layout (and finding your specific device) requires more understanding of PCI bus topology issues than I have.
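    For illustration, seeing the by-path names and the sdX names they currently map to:

      ls -l /dev/disk/by-path
      # pci-0000:03:00.0-scsi-2:3:0:0 -> ../../sdj   (and so on)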

Only sdX names and the per-PCI-device names appear in /dev, at least conveniently, and so these are your only real choices for referring to specific drive slots. (Specific physical HDs can be referred to in a number of other ways that appear in /dev/disk.)

In our case, today we had to deal with a drive under all four names: sdj, sd 4:3:0:0, ata5.03, and pci-0000:03:00.0-scsi-2:3:0:0 are (or were) all the same drive. The first three names appeared, intermixed, in various kernel messages about retried errors and then permanent errors; the fourth name we use so that the iSCSI backend software has a stable pathname for the drive and its partitions.

(In the case of the sdj name, when we hot-swapped the physical HD the new version of the drive became sdo. None of the other three names for it changed.)

LinuxSATANames written at 02:40:47

2011-11-21

The likely cause of my IPSec dropped packet mystery

I believe that I've identified the cause of my mysterious dropped GRE tunnel packets that showed up in recent kernels. The short description of the cause is recursive routing leading to a path MTU collapse.

Explaining this is going to take some verbiage. Back when I set up my GRE tunnel, I wrote:

My current trick is routing the subnet that the target of the tunnel is on over the tunnel itself, which makes my head hurt.

Let me make this concrete. The GRE tunnel target is 128.100.3.58, and as part of my dual identity routing I have a route:

ip route add 128.100.3.0/24 dev extun

Let us call the tunnel target T, my machine's inside address I, and my machine's outside address O (because all of these are much shorter and clearer than writing out the IP addresses in full). In an environment with policy based routing it's possible to see how all of this works; because the tunnel is explicitly specified as being from O to T, it is forced to ignore the route to T's subnet that would normally send the GRE-encapsulated traffic to T back over the tunnel. This still works even if you talk directly to T without specifying a source address; your plain TCP connection will be routed over the GRE tunnel (and get the source address of I), and then the encapsulated version will be routed over the regular connection since it now comes from O.

(It's possible that the kernel is smart enough to do this even without policy based routing, but I haven't tested that.)
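As a concrete sketch of the sort of setup involved (this is not my exact configuration; O, T, and I stand in for the real addresses as above):

  ip tunnel add extun mode gre local O remote T ttl 64
  ip addr add I dev extun
  ip link set extun up mtu 1200
  ip route add 128.100.3.0/24 dev extun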

Because the GRE tunnel is an encapsulation over my regular link, it has a lower MTU than the regular link. This means that traffic going from I to T has a lower (path) MTU than traffic going from O to T.

In old kernels, all of this worked fine, and in particular the kernel kept the path MTUs of the two versions of traffic to T separate. In recent kernels, this appears to have changed; it looks like there is only a single path MTU for T, regardless of the path to it. The consequence is that when I start a TCP conversation with T over the GRE tunnel, the path MTU to T almost immediately collapses down to 552 octets (the default minimum path MTU). I assume that this is happening due to a recursive series of path MTU reductions; first the GRE tunnel reduces the normal MTU to T down to the GRE tunnel's MTU, then the encapsulation code notices that the GRE tunnel's MTU doesn't fit in the MTU to T and chops it in turn, and things repeat until the kernel won't let the MTU go any lower.

(There appears to be a minimum MTU for the GRE tunnel that is over 552 octets. Once the MTU to T shrinks too far and I try to talk to anything over the GRE tunnel, I see a series of locally generated rejections of the form 'ICMP 128.100.3.51 unreachable - need to frag (mtu 478), length 556'. Another diagnostic is that the transmit error count shown by 'ip -s -s link show dev extun' keeps counting up.)

One can see some of this by inspecting the routing cache with 'ip route show table cache'. However, flushing the cache (with 'ip route flush table cache') does not help; it seems that in current kernels, this routing cache is not the real fully authoritative source of this information. (I am not up enough on Linux networking to understand what is going on here.)

This problem can be avoided to a certain extent by creating a host route for T that sends traffic for it explicitly over the underlying link, not the GRE tunnel. However you will still provoke this problem if you force traffic for T to go over the GRE tunnel (for example, by specifying a source IP of I so that policy based routing kicks in); this just avoids accidents.
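For illustration, the avoidance measure is just a host route over the real link (ppp0 is an assumed name for the underlying DSL link):

  ip route add 128.100.3.58/32 dev ppp0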

(Much of my understanding of what's going on has been developed through interacting with Eric Dumazet on the Linux kernel netdev mailing list, and in skimming netdev in general. Without Eric's questions in response to my initial bug report, I would never have been able to work out what's going on.)

Sidebar: useful sysctls and other things

There are two potentially useful sysctls in /proc/sys/net/ipv4/route. min_pmtu sets the minimum path MTU and is normally 552. mtu_expires sets how long (in seconds) learned path MTUs stick around and is normally ten minutes; I believe that setting it to a low value does not expire already-learned path MTUs. There is a seductive looking flush sysctl entry in the same directory, but I was unable to get it to do anything useful in testing; whatever it's flushing is not what is grimly holding on to a bad path MTU.
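For illustration, reading and (temporarily) setting these looks like the following; whether changing them actually helps with this problem is another question:

  cat /proc/sys/net/ipv4/route/min_pmtu      # normally 552
  sysctl -w net.ipv4.route.mtu_expires=60    # age learned path MTUs out faster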

IPSecPacketDropProblemII written at 00:14:04

2011-11-19

My Liferea crashes are not Liferea's fault

Back in Fedora15VsMe I said bad things about the current version of Liferea, especially that it was crash-prone. I have been trying to debug these crashes ever since they started, and I now believe that this is not Liferea's fault; instead it is a bug in the underlying GLib IO library that it is using. For the benefit of anyone else hitting this who is doing Internet searches, I will describe the crash and the bug.

All of my crashes have been directly in g_tls_client_connection_gnutls_finish_handshake() (which is part of libgiognutls, a GIO module), in a call chain that eventually runs through g_io_stream_dispose() (all of this is obtained from running the latest git version of Liferea under gdb). The TLS routine crashes because it is trying to dereference a NULL pointer (inout_error in the source code Fedora 15 uses, which is a GError **). This pointer is ultimately NULL because g_io_stream_dispose() calls g_io_stream_close() with a NULL error parameter, which is perfectly legitimate; the documentation for g_io_stream_close() is very clear that the error parameter may be NULL (which means to not try to store any information about any errors that happen).

(It's not clear where inside libgiognutls the bug actually lies. Clearly something is making an assumption that there's always a place to put an error, but it may be a higher level function involved in closing down connections. Oh, and now that I've started doing web searches on the TLS routine's name, I've found Debian bug #628068 and it appears that the glib-networking people fixed this in August.)

Diagnosing all of this was made much easier by how Fedora handles their 'debuginfo' packages. Before I started doing this, I had the vague impression that debuginfo packages simply had the symbol tables for the various libraries and programs, so that gdb could give you symbolic backtraces and so on. As I've discovered, this is not all that you get with debuginfo packages; you also get the full source for the package in /usr/src/debug. Oh, and of course gdb knows where to find the source so you get full source listings in stack backtraces and so on.
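For anyone who hasn't used them, pulling in the debuginfo packages (and thus the source) is simple on Fedora, assuming you have yum-utils installed:

  debuginfo-install liferea glib-networking
  gdb /usr/bin/liferea         # backtraces now come with source listings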

Having full source immediately available (once you find it) is a big boost to tracking down things like this. I'm not sure I would have bothered to dig into this issue very much if I hadn't had the source code there to browse; sure, I could get it by fetching source RPMs and unpacking them and so on, but that's enough extra work that I might well have not bothered.

(On the flipside, had I not taken the system programmer approach I might have immediately Googled the TLS routine's name and found out much of this information.)

Since the crashes happen in the TLS code, they are triggered by feeds fetched over https; hunting down the source of https URLs in my feeds turned out to be somewhat more work than I expected. I had one live feed that had switched to https; since it was just something that I inherited from the default feed list of a very old version of Liferea, I removed it. Then there were a few feeds from very dead sites that were now redirecting to https versions of domain parking sites. It's possible that there will still be crashes if feed entries try to include things like images via HTTPS URLs, but I'll have to see.

LifereaCrashUpdate written at 01:49:16

2011-11-10

Praise for systemd

Over time I've accumulated all sorts of local init.d scripts on both my work and my home machines. With my move to Fedora 15 comes systemd, and I was recently prompted to try migrating my crufty old init.d scripts to today's new modern way of doing things.

(I'm a crazy person so sometimes I like doing things like this, not because I have any real need of it but because it just seems like the thing to do.)

I'll admit that when I first heard about systemd, I was more than a bit uneasy. systemd was yet another init system replacement when we were going through that already with upstart, but my big concern was that it was written by the primary author of PulseAudio. I've had a very rocky experience with PulseAudio, as have lots of people; PulseAudio updates (and associated kernel driver changes) are infamous for breaking working audio configurations and the PulseAudio developers are equally infamous for how they've dealt (or not dealt) with this. So I was not necessarily looking forward to a distribution based around systemd.

I was wrong. Pretty much all of the things that the systemd pages talk about are true, and in particular systemd service descriptions really are short, to the point, and easy to write. I converted pretty much everything I had easily, and it's been a pleasure to not have all of the boilerplate goo that infests init.d scripts. In several cases I could simplify how I ran programs because I didn't need to carefully daemonize them any more or jump through hoops to run them as a specific user.
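As a minimal sketch of what a converted service can look like (the daemon name, path, flag, and user here are all made up):

  # /etc/systemd/system/mydaemon.service
  [Unit]
  Description=My local example daemon

  [Service]
  Type=simple
  User=someuser
  ExecStart=/usr/local/sbin/mydaemon --foreground

  [Install]
  WantedBy=multi-user.target

After that, 'systemctl enable mydaemon.service' and 'systemctl start mydaemon.service' take care of the rest; there's no daemonization or su juggling in sight.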

(And it really seems to work for speeding up boot times.)

This isn't all that systemd offers a sysadmin. For a start, systemd can tell me what service a particular daemon process belongs to, which is invaluable for answering questions like 'okay, what started this gpsd process that thinks my modem's serial port is a GPS unit?' (sadly, that was a real question I had today). There are other potential benefits, although I haven't explored them in depth so far.
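For illustration, the two ways of asking this that I know of (the PID is hypothetical):

  systemctl status 2612        # reports the unit that PID 2612 belongs to
  systemd-cgls                 # shows the whole per-service control group tree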

Overall, my decided impression is that the authors of systemd have really tried to think about what sysadmins want and need to do on their systems, and built mechanisms where they can do this well, in ways that cooperate with the rest of the distribution. For example, that systemd looks in /etc/systemd/system for service descriptions before /lib/systemd/system means not only that you can customize services but that you can do so without running into package manager issues.

(systemd is not without things that I wish were different, but that's another entry.)

SystemdPraise written at 00:02:05

2011-11-09

The disappearance of separate filesystems for /usr and /var

Taken from the list of common Fedora 16 bugs:

  • Attempting to upgrade a system with /var on a different partition or LV to / will fail

Okay, I get it. I give in. Even on Fedora 15 systemd warns you that various things won't work right if /usr is not part of the root filesystem, and now Fedora 16 upgrades fail if /var is a separate filesystem. When you beat me over the head hard enough, I can get the point: the days of separate filesystems for bits of the system are over. No more partitioning things into /, /usr, and /var; now the only sensible split is / and /boot (and then whatever filesystems for user data you want). And I'm not even convinced of the /boot versus / split, not any more.

(Lest you think I'm throwing stones at Fedora, note that Ubuntu was here a while ago. Oh, and just to drive it home for Fedora, people are talking about moving all of the binaries from /bin and /sbin into /usr.)

My home machine is now set up this way, somewhat through coincidence; at the time I did the 'all one filesystem' setup I didn't know about either of these issues. I'm planning to rebuild my office workstation on new disks, and when that happens I'll be merging my current /, /usr, and /var filesystems together to be all one big root filesystem (and I'll switch to grub2 and GPT, I expect).

Almost all of our Ubuntu servers are already set up this way because we're lazy (the single system filesystem approach is less work in setup, and setting up mirrored system disks is already annoying enough). The exceptions are mail machines where /var has special options for extra data durability, and I'm not sure how we're going to handle those when Ubuntu also inevitably gives up on supporting a separate /var filesystem. Maybe we'll just set data=journal for the entire root filesystem and live with the lower write IO speed.
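For illustration, the special setup is just a mount option in /etc/fstab (the device name is hypothetical):

  /dev/md2  /var  ext4  defaults,data=journal  0  2

If /var gets folded into the root filesystem, the same option would have to go on /; since ext3 and ext4 won't change data journaling modes on a remount, that probably means adding rootflags=data=journal to the kernel command line as well.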

(Given everything that Ubuntu has done so far, I do not expect them to spend much effort on preserving /var as a viable separate filesystem. And the writing is very clearly on the wall for what upstream packages expect; clearly essentially no developers are testing software on machines with separate /usr and /var filesystems or these problems wouldn't exist. This matters because what upstream developers work with soon becomes reality for distributions unless the distributions feel like doing a lot of work to push back.)

VanishingSystemFilesystems written at 01:16:59

2011-11-07

An IPSec mystery with dropped packets

I'm going to break one of my normal rules in this entry. To put it one way, I normally write about answers, not questions, but this time I have a mystery instead of a solution. In part I'm writing this entry for myself, so that I have everything written down in one spot for later reference.

My home machine has a long-standing GRE over IPSec tunnel to my work machine; this lets it masquerade as a machine on the local internal network. Since I migrated my home machine to Fedora 15, I've been experiencing relatively frequent problems establishing new SSH connections over the link (recent evidence suggests that the problem may affect new TCP connections in general). When the problem is happening, new SSH connections will get partway through the initial SSH protocol negotiations and then stall. At the same time the problem is not constant; sometimes new SSH connections work fine.

(In fact it looks like any TCP connection that hits the magic circumstances will also stall in a similar way.)

The problem is not a general stall of either network traffic or VPN traffic; during a stall, both external connections and existing VPN connections continue to work without problems. The problem is definitely happening at the IPSec/GRE level; I have captured tcpdump traces of both the GRE tunnel and the underlying DSL PPP link, and I can see TCP packets being transmitted on the GRE tunnel but not being transmitted on the DSL PPP link. And at the same time, ping packets are going through fine. A typical dropped packet is:

IP 128.100.3.52.47123 > 128.100.3.51.ssh: Flags [.], seq 22:522, ack 806, win 103, options [nop,nop,TS val 140991656 ecr 201734236], length 500

In fact I have tcpdump traces from both sides of the GRE tunnel that show that the initial version of this packet is dropped even in a stream of other packets that pass through fine. (And it is not an MTU issue; the link passes larger packets, among other signs.)
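For the record, the two vantage points look something like this (ppp0 is an assumed name for the DSL link; depending on where the capture hooks in you may see the GRE packets before encryption, the ESP packets after it, or both):

  tcpdump -n -i extun tcp and port 22              # inner view: TCP over the tunnel
  tcpdump -n -i ppp0 'ip proto 47 or ip proto 50'  # outer view: GRE (47) or ESP (50)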

So far, every trace I've seen of the problem has involved TCP packets with a reported data length of 500 octets; for example, a trace from a ttcp run shows:

IP 128.100.3.52.46585 > 128.100.3.51.5001: Flags [.], seq 1:501, ack 1, win 91, options [nop,nop,TS val 729200 ecr 979199256], length 500

The ttcp run in fact had a whole block of length-500 packets not get through (this one was the first). But after a while one of the retries of this packet made it through, was ACK'd, and suddenly the conversation was on; 'length 500' packets flowed freely. Also, I don't know why the packets are being restricted to 500 data octets; I would have expected them to use something close to the GRE tunnel's MTU, which is 1200.

Initially I thought that this problem was due to NetworkManager, which is one reason that I forcefully turned it off. However the problem has now happened multiple times without NM running. The problem started in the Fedora 15 kernel; reverting to the Fedora 14 kernel on my home machine makes things work (although it has other issues). Both the Fedora 16 kernel (to be) and the current 3.1.0 git head also have the problem.

(I suppose now I get to write a problem report to the Linux kernel netdev mailing list and see if anything happens.)

Sidebar: the MTUs involved

The DSL PPPoE link has an MTU of 1492. The GRE tunnel has an MTU of 1200 (on both ends). The remote target has the standard Ethernet MTU of 1500. As far as I can see both the PPPoE link and the GRE tunnel will pass maximum-sized packets in either direction.

However, at the same time tracepath reports that the GRE tunnel has a path MTU of 854 only in the home to work direction; for work to home, the tracepath reported path MTU is the full 1200 bytes. I don't know where this limit is coming from. In addition, this doesn't seem to be the case on the Fedora 14 kernel without the problem; on that kernel, the reported path MTU is the full 1200 bytes.

(Okay, 'ip route show table cache' is somewhat helpful here, but I don't know why the kernel has decided to crank down the path MTU. The obvious MTU-related /proc/sys/net/ipv4 settings are the same between the two kernel versions as far as I can see.)
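For illustration, probing the path MTU by hand (an ICMP payload of 1172 octets plus 28 octets of headers makes a 1200-octet packet, the tunnel MTU):

  tracepath 128.100.3.51
  ping -M do -s 1172 128.100.3.51    # -M do sets DF, so oversized packets fail visibly
  ip route show table cache          # inspect what the kernel has cached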

IPSecPacketDropProblem written at 00:12:14

2011-11-06

Ubuntu does system disk mirroring right

When we started installing Ubuntu 10.04 systems with our standard mirrored system disk setup, we noticed that it asked us a new and (in my opinion) very stupid question: did we want the system to boot if only one of the two sides of the mirror was there? Of course we said 'yes', since that's part of why we're mirroring the system disk in the first place. Despite its silliness this question was already an improvement over 8.04, where the installer defaulted to 'no' and you had to edit the grub settings by hand to change this.

What we didn't notice at the time was what else the installer was doing with mirrored system disks. To wit, Ubuntu now installs GRUB on the second drive, as well as on the first one.

This is an important thing to do, because it's what makes your system bootable even if you lose (or pull) your entire primary drive. In the past it was a step that we had to remember to do by hand (with an appropriate peculiar incantation), which meant that it was sometimes forgotten; as a result, some of our systems had mirrored system disks but could not actually reboot if they lost the primary drive. Now all of our Ubuntu 10.04 machines have this handled automatically for us, which is great and also exactly what a system should do if it detects that /boot is mirrored across multiple drives.
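For reference, the manual version of this step is at least simple these days (assuming the second disk is /dev/sdb):

  grub-install /dev/sdb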

Before I started looking, I was going to confidently assert that this was new behavior in Ubuntu 10.04. However, it appears likely that it's also in Ubuntu 8.04 and we just didn't notice; I've checked a few of our 8.04 machines where I'm reasonably certain that we didn't install GRUB on the second disk by hand, and they have GRUB boot blocks.

(Similarly, my just-installed Fedora 15 home machine has a GRUB boot block on the second drive and I'm completely sure I didn't install it by hand, so it looks like Fedora 15 is also smart enough to do this.)

On a side note, it's surprisingly hard to notice changes like this if you don't consciously check for them when you're working out your install procedures for a new distribution release. Our install procedure has always called for installing GRUB by hand on the second drive, so of course we carried that forward into the 8.04 and 10.04 install instructions. Even when this step got accidentally omitted on specific machine installs, we don't normally pull primary drives and do a test boot on the secondary drive. So noticing took a chain of circumstances that caused us to boot a system from the second drive when we didn't think we'd set up GRUB on it, followed by a deliberate test: installing a test system, skipping the manual GRUB step, and trying to boot it from just the second drive.

Sidebar: why the Ubuntu installer's question is stupid

The ostensible reason for having an option to not boot if you have a degraded mirror is because this risks data loss if you don't fix it. However, my personal feeling is that almost everyone who is choosing to mirror system disks is doing so in a situation where they would rather have the system continue to operate even with degraded mirrors; people who care that much about data loss are rare, and even then the Ubuntu question is an incomplete solution to the problem.

(Nothing stops the system if your mirror degrades while the system is running, and I think this is the far more likely case.)

I don't object to there being an option for this behavior, but I don't think this is worth a question during installation. If you find that 90% of your audience answers a question one way, stop asking the question and just let the 10% who need a different answer change it by hand afterwards.

(This suggestion is inapplicable for things that can't be changed afterwards, but this is not one of them.)
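If I remember the mechanics correctly, the installer's answer ends up as a setting you can inspect and change afterwards (the details may vary by release):

  grep BOOT_DEGRADED /etc/initramfs-tools/conf.d/mdadm
  dpkg-reconfigure mdadm       # re-asks the question and rewrites the setting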

UbuntuMirroringRight written at 01:06:14

