Wandering Thoughts archives

2012-07-27

What 32-bit x86 Linux's odd 896 MB kernel memory boundary is about

Back in my entry on how the Linux kernel divides up your RAM I described the somewhat odd split the 32-bit x86 Linux kernel uses. In 32-bit x86 kernels, the Normal zone is only memory up to 896 MB and the HighMem zone is everything above it. The reasons for this are rooted in history and the 32-bit (kernel) memory map.

In the beginning, Linux only ran on (32-bit) x86 machines and the machines had a small amount of memory by modern standards. This led Linux to take some convenient shortcuts, including how it handled the problem of getting access to physical memory.

The entire Linux kernel has a 1 GB address space (embedded at the top of every process's 4 GB address space; see here and here for more discussion). Since even 512 MB of RAM was an exceptional amount of memory back in the early days, Linux took the simple approach of directly mapping physical RAM into the kernel address space. The kernel reserved 128 MB of address space for itself and for mapping PCI devices and the like, and used the rest of the 1 GB for RAM; this allowed it to directly map up to 896 MB of memory.

(I don't know why the specific split was chosen. Possibly it was felt that 128 MB was a good round number for the kernel's own usage.)

After a while it became obvious that direct mapping alone wasn't good enough (partly because of increased PC memory and I think also partly because Linux was being ported to non-x86 machines that couldn't do this). On 32-bit x86, the solution was to create a second zone which would use explicit mappings created on the fly; this is the HighMem zone. Obviously the new zone starts at the point where RAM can't be directly mapped any more, ie at 896 MB.
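You can see this split on a running 32-bit machine without much work. The following is just a sketch, assuming a kernel built with HighMem support; the exact fields and amounts will vary from system to system:

grep -E '^(LowTotal|HighTotal)' /proc/meminfo
# LowTotal tops out at roughly 896 MB; everything above that is HighMem.
grep -E 'zone +(Normal|HighMem)' /proc/zoneinfo
# 'free -l' (lowercase L) reports the same low/high breakdown.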

(More technical details and background are in linux-mm.org's HighMem page.)

If the Linux kernel people were redoing these decisions from scratch I don't know if they'd keep the direct linear mapping of the 32-bit Normal zone, or if they'd simplify life by making all 32-bit memory be HighMem memory. These days many x86 machines that are still running in 32-bit mode have several GB of memory, so most of their RAM is already being mapped in and out of the kernel address space.

(To answer an obvious question: if I'm reading the kernel documentation correctly, the 64-bit x86_64 kernel directly maps all 64 TB of possible physical memory into the kernel address space. See Documentation/x86/x86_64/mm.txt. I suspect that this is a pretty safe decision.)

Linux896MBBoundary written at 01:17:22

2012-07-22

The history of booting Linux with software RAID

One of the broad developments in the Linux kernel's boot process over the past N years has been a steady move from having the kernel do things inside itself to having them done in user level code (which is typically run from an initramfs). The handling of software RAID arrays is no exception to this.

In the beginning, activating software RAID arrays at boot time was handled inside the kernel. At boot time the kernel (specifically the software RAID code) scanned all disk partitions of type fd ('Linux raid autodetect') and automatically assembled and activated any software RAID arrays that it found. Although there were a bunch of corner cases that this didn't handle, it worked great in most normal situations and meant that you could boot a 'root on software RAID' system without an initramfs (well, back then it was an initrd). Since this process happened entirely in the kernel, the contents of any mdadm.conf were irrelevant; all that mattered was that the partitions had the right type (and that they had valid RAID superblocks). In fact back in the old days many systems with software RAID had no mdadm.conf at all.
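To make the mechanics concrete, here is roughly what setting up software RAID looked like in that era. This is a sketch from memory, not a tested recipe; the device names are placeholders, and as I recall in-kernel autodetection also needed the old 0.90 superblock format:

# Mark each component partition as type 'fd' (old sfdisk spells this --change-id):
sfdisk --change-id /dev/sda 1 fd
sfdisk --change-id /dev/sdb 1 fd
# Create the array with a 0.90 superblock so the kernel can autodetect it:
mdadm --create /dev/md0 --metadata=0.90 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
# No mdadm.conf needed; the kernel assembles /dev/md0 itself at boot.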

(I don't remember specific kernel versions any more, but I believe that most or all of the 2.4 kernel series could work this way.)

The first step away from this was to have software RAID arrays assembled in the initrd environment by explicitly running mdadm from the /init script, using a copy of mdadm.conf that was embedded in the initrd image. I believe that the disk partition type no longer mattered (since mdadm would normally probe all devices for RAID superblocks). It was possible to have explosive failures if your mdadm.conf did not completely match the state of critical RAID arrays.
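In this scheme the initrd's /init did something along these lines (a sketch of the general idea, not any particular distribution's actual script):

# Assemble all arrays listed in the embedded copy of mdadm.conf:
mdadm --assemble --scan --config=/etc/mdadm/mdadm.conf
# If this embedded copy doesn't match reality, this is where the explosive
# failures come from.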

(I don't know if this stage would assemble RAID arrays not listed in your mdadm.conf and I no longer have any systems I could use to check this.)

The next stage of moving boot time handling of software RAID out of the kernel is the situation we have today. As I described recently, a modern Linux system does all assembly of software RAID arrays asynchronously through udev (along with a great deal of other device discovery and handling). In order to have all of this magical udev device handling happen in the initramfs environment too, your initramfs starts an instance of udev quite early on and this instance is used to process boot-time device events and so on. This instance uses a subset of the regular rules for processing events, generally covering only the devices considered important for booting your system. As we've seen, this process of assembling software RAID arrays is generally indifferent to whether or not the arrays are listed in mdadm.conf; I believe (but have not tested) that it also doesn't care about the partition type.
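The heart of the udev approach is incremental assembly; as each component device is discovered, a udev rule hands it to mdadm, which slots it into its array and starts the array once the final component appears. In effect udev repeatedly runs something like this (a sketch; the device name is a placeholder):

mdadm --incremental /dev/sdb1
# Run once per discovered component; the array starts when it's complete.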

(I think that the udev process that the initramfs starts is later terminated and replaced by a udev process started during real system boot.)

SoftwareRaidBootHistory written at 22:41:26

A sleazy trick to capture debugging output from an initramfs

Suppose, not entirely hypothetically, that something in your system's initramfs is failing or that you just want to capture some debugging output or state information in general. The traditional way to do this when console output isn't good enough is to just dump the output into a file and read the file later, but this has a problem in the initramfs world; the file you write out will be in the initramfs, which means that it will quietly disappear when the boot process is finished and the initramfs goes away.

So we need two things. We need to preserve the initramfs or at least the bit of it that we care about, and then we need some way to get access to it. There is probably an official way to do this, but here is my sleazy trick.

We can preserve a file from the initramfs by starting a process in the initramfs (and then having it stay running) that has a file descriptor for the file. For example (on Ubuntu 12.04):

(udevadm monitor >/tmp/logfile 2>&1) &

(I believe that even something like 'sleep 16000 >/tmp/logfile &' should do it. You can then have other commands append things to it with '>>/tmp/logfile'.)

There are undoubtedly clever ways to preserve the initramfs or get access to it, but once you have a preserved file descriptor there's a simpler brute force way. Simply look at /proc/<pid>/fd/<N> (<N> is often 1 or 2) and there's your debug file. You can now use whatever tool you like (including a pager like less) to look at it.
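Once the system is up, getting at the file looks something like this (a sketch, assuming the 'udevadm monitor' example above; PID is a placeholder for the pid you find):

ps ax | grep 'udevadm monitor'   # find the pid of the still-running process
ls -l /proc/PID/fd               # see what file descriptors it's holding open
less /proc/PID/fd/1              # fd 1 is the redirected stdout in the example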

CaptureInitramfsDebugging written at 00:53:53

2012-07-20

Ubuntu 12.04 can't reliably boot with software RAID (and why)

Recently one of my co-workers discovered, diagnosed, and worked around a significant issue with software RAID on Ubuntu 12.04. I'm writing it up here partly to get it all straight in my head and partly so we can help out anyone else with the same problem. The quick summary of the situation comes from my tweet:

Ubuntu 12.04 will not reliably boot a system with software RAID arrays due to races in the initramfs scripts.

(As you might guess, I am not happy.)

If you set up Ubuntu 12.04 with one or more software RAID arrays for things other than the root filesystem, you will almost certainly find that some of the time when you reboot your system it will come up with one or more software RAID arrays in a degraded state with one or more component devices not added to the array. If you have set bootdegraded=true as one of your boot options (eg on the kernel command line), your system will boot fully (and you can hot-add the omitted devices back to their arrays); if you haven't, the initramfs will pause briefly to ask you if you want to continue booting anyways, time out on the question, and drop you into an initramfs shell.

This can happen whether or not your root filesystem is on a software RAID array (although it doesn't happen to the root array itself, only to other arrays) and even if you do not have the software RAID arrays configured or used in your system in any way (not listed in /etc/mdadm/mdadm.conf, not used in /etc/fstab and so on); simply having software RAID arrays on a disk attached to your system at boot time is enough to trigger the problem. It doesn't require disks that are slow to respond to the kernel; we've reproduced this in VMware, where the disks aren't even physical and respond to kernel probes basically instantly.

Now let's talk about how this happens.

Like other modern systems Ubuntu 12.04 handles device discovery with udev, even during early boot in the initramfs. Part of udev's device discovery is the assembly of RAID arrays from components. What this means is that software RAID assembly is asynchronous; the initramfs starts the udev daemon, the daemon ends up with a list of events to process, and as it works through them the software RAID arrays start to appear. In the mean time the rest of the initramfs boot process continues on and in short order sets itself up to mount the root filesystem. As part of preparing to mount the root filesystem, the initramfs code then checks to see if all visible arrays are fully assembled and healthy without waiting for udev to have processed all pending events. You know, the events that can include incrementally assembling those arrays.

This is a race. If udev wins the race and fully assembles all visible software RAID arrays before the rest of the initramfs checks them, you win and your system boots. If udev loses the race, you lose; the check for degraded software RAID arrays will see some partially assembled arrays and throw up its hands.

Our brute force solution is to modify the check for degraded software RAID arrays to explicitly wait for the udev event queue to drain by running 'udevadm settle'. This appears to work so far but we haven't extensively tested it; it's possible that there's still a race present but it's now small enough that we haven't managed to hit it yet.
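For the record, 'udevadm settle' simply blocks until the udev event queue is empty (or an optional timeout expires), which is exactly the waiting that the stock scripts skip:

# Wait for all queued udev events to be processed, giving up after 30 seconds:
udevadm settle --timeout=30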

This is unquestionably an Ubuntu bug and I hope that it will be fixed in some future update.

Sidebar: our fix in specific

(For the benefit of anyone with this problem who's doing Internet searches.)

Change /usr/share/initramfs-tools/scripts/mdadm-functions as follows:

 degraded_arrays()
 {
+	udevadm settle
 	mdadm --misc --scan --detail --test >/dev/null 2>&1
 	return $((! $?))
 }

Then rebuild your current initramfs by running 'update-initramfs -u'.

Since I suspect that mdadm-functions is not considered a configuration file, you may want to put a dpkg hold on the Ubuntu mdadm package so that an automatic upgrade doesn't wipe out your change.

(This may not be the best and most Ubuntu-correct solution. It's just what we've done and tested right now.)

Sidebar: where the bits of this are on 12.04

  • /lib/udev/rules.d/85-mdadm.rules: the udev rule to incrementally assemble software RAID arrays as components become available.

Various parts of the initramfs boot process are found (on a running system) in /usr/share/initramfs-tools/scripts:

  • init-top/udev: the scriptlet that starts udev.

  • local-premount/mdadm: the scriptlet that checks for all arrays being good; however, it just runs some functions from the next bit. (All of local-premount is run by the local scriptlet, which is run by the initramfs /init if the system is booting from a local disk.)

  • mdadm-functions: the code that does all the work of checking and 'handling' incomplete software RAID arrays.

Looking at this, I suspect that a better solution is to stick our own script in local-premount, arranged to run before the mdadm script, and have it run the 'udevadm settle'. That would avoid changing any package-supplied scripts.
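Such a scriptlet might look something like the following. This is untested speculation on my part; the 'wait-for-udev' name and location are my own invention, and arranging for it to run before the mdadm scriptlet (initramfs-tools orders scriptlets through their PREREQ declarations) is the part I'm least sure about:

#!/bin/sh
# eg /etc/initramfs-tools/scripts/local-premount/wait-for-udev (a guess at
# the right place to put a local, non-package scriptlet)
PREREQ=""
prereqs()
{
    echo "$PREREQ"
}
case "$1" in
prereqs)
    prereqs
    exit 0
    ;;
esac

# Let udev finish incremental RAID assembly before anything checks for
# degraded arrays.
udevadm settle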

(Testing has shown that creating a local-top/mdadm-settle scriptlet isn't good enough. It gets run, but too early. This probably means that modifying the degraded_arrays function is the most reliable solution since it happens the closest to the actual check, and we just get to live with modifying a package-supplied file and so on.)

Ubuntu1204SoftwareRaidFail written at 23:23:08

2012-07-16

Getting an Ubuntu 12.04 machine to give you boot messages

As part of a slow move towards Ubuntu 12.04, we recently worked on the problem that our 12.04 servers were pretty much not showing boot messages and in particular they weren't showing any kernel messages. Not showing boot messages is a big issue for servers because if anything ever stalls or goes wrong in the boot process you wind up basically up the creek without boot messages; you have a hung server and no clue what's wrong.

(Since I've gone through this with a 12.04 server that was hanging during boot, I can tell you that various bits of magic SysRq are basically no help these days.)

The main changes we needed to make are to /etc/default/grub, which magically controls the behavior of Grub2. There were two of them:

  • change GRUB_CMDLINE_LINUX_DEFAULT to delete 'quiet splash'. On 12.04 servers without a serial console, we leave this blank.
  • uncomment the 'GRUB_TERMINAL=console' line. Without this change the console stays blank for a while and only the later boot messages show.

    (I don't understand why this is necessary; my best understanding of the Grub2 documentation is that 'console' should be the default.)

We've also changed GRUB_TIMEOUT to 5 (seconds) and commented out GRUB_HIDDEN_TIMEOUT and GRUB_HIDDEN_TIMEOUT_QUIET. This causes the Grub2 menu to always show for five seconds, which I find much more useful than the default behavior of having to hold down Shift at exactly the right time in order to get the menu to show.

(I understand why a desktop install wants to hide the Grub menu by default, but this is the wrong behavior for a server.)
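Put together, the relevant bits of our /etc/default/grub wind up looking something like this (a sketch; your file will have other settings too):

# show all boot messages on the regular console:
GRUB_CMDLINE_LINUX_DEFAULT=""
GRUB_TERMINAL=console
# always show the Grub menu for five seconds:
GRUB_TIMEOUT=5
#GRUB_HIDDEN_TIMEOUT=0
#GRUB_HIDDEN_TIMEOUT_QUIET=true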

Remember that after you change /etc/default/grub you have to run update-grub to get the change to take. Forgetting this step can make you very puzzled and frustrated during testing (I speak from sad experience).

(This is where I could insert a rant about the huge mess of complexity that is Grub2. I do not consider having a programming language for Grub menus to exactly be progress, especially not when they become opaque and have to be machine generated.)

The remaining change is to /etc/init/tty1.conf. By default the virtual console logins clear the screen when they start; on tty1, this has the effect of erasing the last screen's worth of boot-time messages. To tell getty not to do this, we add --noclear to the exec line:

exec /sbin/getty --noclear -8 38400 tty1

Unfortunately the result of all of these changes isn't exactly perfect. We get kernel messages and now avoid wiping out what messages Upstart prints about starting user-level servers, but the 12.04 Upstart configuration doesn't print very many messages about that. I believe that only the remaining /etc/init.d scripts really produce boot time messages and there is an ever-decreasing number of them; native /etc/init jobs don't seem to print many messages, if any.

(There are ways to coax Upstart into logging messages about services, but I haven't found one that causes it to print 'starting <blah>' and 'done starting <blah>' on the console during boot.)

Things that don't work to produce more verbose boot messages

I've experimented with a number of options and arguments that seem like they should help but in practice don't. All of these are supplied on the kernel command line:

  • debug=vc (from the initramfs-tools manpage): This prints relatively verbose debugging information from the /init script in the initial ramdisk. Unfortunately our problems have always been after this point, once the initial ramdisk had handed things over to the real Upstart init.

    (It is useful to verify that the Upstart init is being started with your debugging options, though.)

  • --verbose (from the upstart manpage): In theory this makes Upstart be verbose. In practice, I haven't been able to get this to print useful messages to the console so that you can see what services are being started when (so you can, say, identify which service is causing your boot to hang).

  • '--default-console output' (from the upstart manpage combined with init(5)): My memory is that this dumps output (if any) from the actual commands being run to the console but still doesn't tell you which services are starting. If the problem command is hanging silently, you're no better off than before.

(For reasons kind of described in my entry on the kernel command line, --default-console can't be written with an = in the way that the upstart manpage shows it. Fortunately Upstart uses standard GNU argument processing so we can write it with a space instead.)
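One convenient way to pass these while testing is to edit the kernel line from the Grub menu at boot time, or to temporarily put them in /etc/default/grub and re-run update-grub (a sketch):

GRUB_CMDLINE_LINUX_DEFAULT="debug=vc --verbose --default-console output"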

Sidebar: what caused our Ubuntu 12.04 machines to hang on boot

It turns out that our 12.04 servers will stall during boot if a filesystem listed in /etc/fstab is not present. This happens even if the filesystem is marked noauto. It's possible that this stall eventually times out; if so, the timeout is much longer than we're willing to wait.

As best as I can determine, this behavior is not directly caused by anything in /etc/init and thus is not easy for us to change.

No, we are not happy about this. This might be vaguely excusable for regular filesystems; it's inexcusable for noauto filesystems.

Ubuntu1204VerboseBoot written at 22:36:23

2012-07-06

Exploring an ARP mystery: a little Linux surprise

Lately, some of our OpenBSD machines have been periodically logging kernel messages like the following:

arp: attempt to overwrite entry for IPADDR on bge0 by MAC-ADDR on nfe0

What this message means is that the OpenBSD machine had previously acquired an ARP entry for IPADDR on bge0 but now it was seeing the same IP address advertised in an ARP message from MAC-ADDR on nfe0. There are a number of things that can cause this; most of them are alarming.

Now I need to describe the network topology. This OpenBSD machine is obviously dual-homed on bge0 and nfe0; nfe0 is 'net-3', the primary subnet that most of our servers live on, and bge0 is 'net-5', a secondary subnet that we still have some machines on due to history. IPADDR is the net-5 IP address for our Samba server, which is also dual-homed on net-3 and net-5. Due to history again, the official IP address that everyone uses for the Samba server is IPADDR on net-5, not the Samba server's net-3 IP address.

When we fired up tcpdump on the Samba server's net-3 interface, we observed two things. The first was that it was sending TCP replies to net-3 machines out on net-3 with IPADDR (on net-5) as the source IP address. A bit of thought showed that this was the expected behavior of a traditional dual-homed host; given that outgoing traffic is normally routed based purely on the destination IP address, any traffic to a net-3 host would be routed out the machine's net-3 interface even if it was a reply to something that came in on the net-5 interface to a net-5 IP address.

(Such asymmetric routing normally only causes problems if you have a firewall in the way on one path, which isn't the case here.)

Second, we saw the Samba server generating ARP requests on its net-3 interface that looked like:

Request who-has <net-3 IP address> tell IPADDR

This was a bit surprising. Normally you would expect a machine to send ARP messages with the reply IP address set to an IP address that is actually on the interface and the subnet that the ARP request is directed to. In this case you'd expect that the Samba server would ARP listing its net-3 IP address, not its net-5 one.

(We could easily reproduce these ARP messages and show that they caused the OpenBSD kernel messages by deleting the ARP cache entry for a net-3 machine that had a connection to the Samba server. The next time the Samba server needed to send a reply packet to the net-3 machine, ding, out went an ARP message with IPADDR as the reply IP address.)
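Concretely, that reproduction looks roughly like this, run on the Samba server (a sketch of the steps, with eth0 and NET3-IP as placeholders for its net-3 interface and the net-3 machine's IP address):

# drop the Samba server's own ARP cache entry for the net-3 machine:
ip neigh del NET3-IP dev eth0      # or the older 'arp -d NET3-IP'
# then trigger some reply traffic to that machine and watch for the ARP
# request on the net-3 interface:
tcpdump -n -i eth0 arp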

My only theory right now is that under some circumstances, Linux will send out ARP requests using not the address of the interface in question but instead the source IP address of the local IP packet that it wants to send (and thus that caused the ARP request to be generated). This is, in a particular view, a sensible thing to do. But as we can see, it's something that can cause other machines to twitch and I think it's at least a little bit surprising. Okay, quite a lot surprising.

(It'll take another entry to try to justify this as a sensible thing in the right view.)
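If my theory is right, this looks like the behaviour that the arp_announce sysctl (described in the kernel's Documentation/networking/ip-sysctl.txt) exists to control; with the default level of 0, Linux can use any local address in ARP requests, including the source address of the packet it wants to send, while level 2 makes it pick an appropriate address on the outgoing interface's subnet. We haven't tested this on the Samba server, so treat this as a sketch:

# prefer a source address on the outgoing interface's subnet when ARPing
# (eth0 is a placeholder for the net-3 interface):
sysctl -w net.ipv4.conf.all.arp_announce=2
sysctl -w net.ipv4.conf.eth0.arp_announce=2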

OddLinuxArpBehavior written at 02:03:25

