Wandering Thoughts archives

2008-10-30

How Linux initrds used to be a hack

The traditional way that Unix kernels boot is to perform some very basic system setup, declare the running code to be a kernel process with PID 1, initialize all of the various important subsystems (such as networking), initialize drivers, mount the root filesystem from whatever device you told them was the root device, and then just exec() (from inside the kernel) /sbin/init. At least in the somewhat old days, Linux was no exception to this.

The problem for initrds is that their job is to do things (such as loading driver modules) that let the system see the normal root filesystem, and a normal Unix system (Linux included) provides no way to change what the root filesystem is once init has started running. Thus you cannot implement initrds in the obvious way, by having the kernel treat the ramdisk as the root filesystem and exec()ing a program from it as init.

Instead, the original Linux initrd hooked into the boot process just before the 'mount the root filesystem' step. If there was an initrd, the kernel diverted sideways; it mounted the initrd, created a new process, and in the new process exec()'d /linuxrc. When this process exited, the kernel resumed the normal process of booting, expecting that the configured root device was present and so on. (I believe that the initrd could use a magic hack to change what the kernel had set as the root device.)
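
(To make this concrete, here is a minimal sketch of what a 2.4-era /linuxrc might have looked like for a SCSI root disk. The module names are purely illustrative, but writing to /proc/sys/kernel/real-root-dev is, as far as I know, the magic hack I mentioned.)

    #!/bin/sh
    # Sketch of an old-style /linuxrc; module names are illustrative.
    mount -t proc proc /proc
    # Load whatever drivers are needed to see the real root disk.
    insmod /lib/scsi_mod.o
    insmod /lib/sd_mod.o
    insmod /lib/aic7xxx.o
    # The magic hack: tell the kernel what the real root device is by
    # writing its device number (major * 256 + minor). 2049 is
    # /dev/sda1 (major 8, minor 1).
    echo 2049 > /proc/sys/kernel/real-root-dev
    umount /proc

When /linuxrc exits, the kernel carries on with its normal boot and mounts whatever real-root-dev now names as the root filesystem.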

This magic diversion had some odd consequences. For example, since the kernel was PID 1, it had to arrange to reap orphaned processes while it was waiting for the initrd to finish running, because otherwise they would just pile up. Related to this, you couldn't populate your initrd with a normal but minimal system and just run from it, since most versions of init become unhappy if they are not PID 1.

(As you might guess, the fix to make initrds not a hack was to add a way to change what the root filesystem is on a running system.)

InitrdHack written at 00:57:14

2008-10-14

Improving initial ramdisks

To summarize my earlier entry, the problem with initrds is that they are both fragile and opaque. To improve them, one needs to tackle both issues.

The easy way to tackle the fragility is to make initrds both more comprehensive and more dynamic. Since one way initrds break is by not including necessary kernel modules, simply include nearly everything by default (you can exclude modules that are clearly not needed for booting, like sound drivers); the extra disk space and boot-time memory needed are likely to be insignificant on any modern system. To make things more dynamic, put hardware discovery and the related module loading into the initrd itself, at least as a fallback measure.

The trade-off for the initrd dynamically exploring your hardware is that there are good reasons to defer as much hardware initialization as possible until you are running on the real system. Thus the initrd should know the normal path to the default root filesystem and only put its dynamic discovery into action if that fails, since the only alternative at that point is to just give up. I suspect that the normal path is best passed in as a kernel command line variable, so that you can change it without having to modify the initrd.
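
To sketch what I mean, the initrd's script might do something like the following; the /newroot mount point and the discover-hardware helper are hypothetical names invented here, but picking root= out of /proc/cmdline is the obvious way to read a kernel command line variable:

    #!/bin/sh
    # Sketch: mount the root filesystem named on the kernel command
    # line, falling back to dynamic hardware discovery on failure.
    mount -t proc proc /proc
    rootdev=""
    for arg in $(cat /proc/cmdline); do
        case "$arg" in
            root=*) rootdev="${arg#root=}" ;;
        esac
    done
    if ! mount -o ro "$rootdev" /newroot 2>/dev/null; then
        # The normal path failed; discover hardware and load the
        # needed modules, then try again. (discover-hardware is a
        # hypothetical helper, not a real program.)
        /bin/discover-hardware
        mount -o ro "$rootdev" /newroot
    fi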

(A further helpful step would be to put a basic rescue environment into the initrd, one comprehensive enough that you could see what hardware and filesystems the kernel sees, and tell it to set something as the real root and continue the boot.)

Solving the opacity problem is harder. Opacity is created because initrds are ultimately very free-form, so to make them more transparent (in a way that tools can deal with) we need to make them more structured. This ultimately means getting people to agree on a fairly rigid structure for initrds, or at least on common structural elements, such as a fixed configuration file that gives basic information about the initrd.
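
Purely for illustration, such a configuration file could be as simple as a few key=value lines; no standard like this exists, so everything here is invented:

    # /initrd-manifest -- hypothetical format, invented for illustration
    format-version=1
    modules=scsi_mod sd_mod aic7xxx ext3
    root=/dev/sda1
    init=/sbin/init
    special-magic=no

Even this much structure would let a tool list an initrd's modules and check them against the installed kernel without unpacking and reading free-form scripts, and would make the escape clause trivially visible.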

Initrds will probably never be truly transparent, since they're ultimately blobs instead of being spread out in the filesystem, but with structure it would be possible to write tools that inspect, explain, and check initrds, which would help a lot. Such a structure will always need escape clauses, but they should be rare and their presence should be easy to spot, so that at least the normal tools can immediately flag an initrd as 'has special magic'.

(Department of attribution: this entry was inspired by thinking about this.)

ImprovingInitrds written at 02:33:05

2008-10-07

How we set up our iSCSI target servers

The SAN backend for our new fileserver environment is Linux servers acting as iSCSI targets. There are two sides to their setup: the hardware and the software.

Hardware-wise, the biggest problem was that there are essentially no generic servers (especially inexpensive ones) that have enough disks. So we punted; we're using 'generic' 1U servers, specifically SunFire X2100 M2s, with the iSCSI data disks in external eSATA-based enclosures. Our current enclosures give us twelve (data) disks in about 5U (counting both the enclosure and the server) at a reasonable cost.

(I would give a brand name but our current supplier recently told us they'd discontinued the model we had bought, so we're evaluating new enclosures from various people.)

Each server is connected to a single enclosure through a PCI-E card, currently a SiI 3124-based one. This requires a kernel that supports (e)SATA port multipliers, which only really appeared in 2.6.24 (for SiI-based cards) or 2.6.26 (for Marvell-based ones), and means that we have to build our own custom kernels. Software-wise, the servers run Red Hat Enterprise Linux 5 with the custom kernel and IET added as the iSCSI target mode driver, and are configured with mirrored system disks just in case.
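
(For what it's worth, the port multiplier side of the custom kernel configuration boils down to something like the following fragment. The option names assume a recent enough 2.6 kernel, so treat this as a sketch rather than gospel:

    # Illustrative .config fragment for SiI 3124 cards plus (e)SATA
    # port multipliers; names assume a recent enough 2.6 kernel.
    CONFIG_ATA=y
    CONFIG_SATA_PMP=y
    CONFIG_SATA_SIL24=y

)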

(Recent versions of RHEL 5 have some support for SiI-based (e)SATA port multipliers, but 2.6.25 is what we've tested with and its support seems to work better.)

We don't do anything fancy with the iSCSI data disks (no software RAID, no LVM). Each disk is partitioned into even chunks of approximately 250 GBytes apiece and then exported via iSCSI; we make each disk a separate target, and then each partition a LUN on that target. This makes life simpler when managing space allocation and diagnosing problems. (It also makes the IET configuration file contain a lot of canned text.)
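
As an illustration, the ietd.conf stanza for one disk looks roughly like this; the IQN and device names here are made up, and real disks get more or fewer Lun lines depending on their size:

    # One iSCSI target per physical disk, one LUN per partition.
    Target iqn.2008-10.com.example:disk01
            Lun 0 Path=/dev/sdb1,Type=blockio
            Lun 1 Path=/dev/sdb2,Type=blockio
            Lun 2 Path=/dev/sdb3,Type=blockio

Multiply that stanza by twelve disks per enclosure and you can see where all the canned text comes from.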

We've decided to handle iSCSI as if it were a SAN, so we do not run iSCSI traffic over our regular switch fabric and VLANs. Instead, all of the iSCSI servers are connected to two separate iSCSI 'networks', which in this case means a dedicated physical switch and an unrouted private subnet for each; this gives us redundancy in case of switch, cable, or network port failure, and some extra bandwidth. Each server also has a regular 'management' interface on one of our normal internal subnets, so that we can ssh in to them and so on.
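
Concretely, each dedicated iSCSI interface just gets a static address on its private subnet; the RHEL ifcfg file for one would look something like this (the device name and addresses are invented for illustration):

    # /etc/sysconfig/network-scripts/ifcfg-eth1 -- one of the two
    # dedicated iSCSI interfaces; addresses are illustrative.
    DEVICE=eth1
    BOOTPROTO=static
    IPADDR=192.168.100.10
    NETMASK=255.255.255.0
    ONBOOT=yes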

(Since they are X2100s, they also have a dedicated remote management processor on yet another internal subnet.)

LinuxISCSITargets written at 01:33:46

2008-10-01

Another consequence of the Debian OpenSSL security bug

Here is another consequence of the Debian OpenSSL security bug that I did not hear about (or realize) until recently: it lets an attacker steal any of your SSL certificates that were created with the broken (weak) OpenSSL versions. This includes signed certificates, with all that that entails.

How the attacker does it is simple. They get your actual certificate simply by connecting to your SSL-protected service (such as your website); the SSL protocol exchange necessarily sends them a copy of your certificate, complete with the signature of your CA. To get your private key, they just do a brute force search; there are only a few tens of thousands of possibilities. (Or they check to see if someone has already pregenerated a list of the vulnerable private/public key pairs of an appropriate bit length.)
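
To sketch how cheap this is, here is roughly how one could check a certificate against a pregenerated list of weak keys; the host name and the weak-moduli.txt list are placeholders for illustration:

    # Pull the server's certificate and extract its RSA modulus, then
    # look it up in a pregenerated list of weak moduli (a hypothetical
    # file); a match means the private key is trivially recoverable.
    mod=$(openssl s_client -connect www.example.com:443 </dev/null 2>/dev/null |
          openssl x509 -noout -modulus)
    if grep -q "${mod#Modulus=}" weak-moduli.txt; then
        echo "weak Debian key: the private key is public knowledge"
    fi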

Part of the unpleasant nature of this is that it is an entirely passive attack, at least if your SSL website or service is exposed to the open Internet. How paranoid you should be as a result of this if you had a vulnerable, signed SSL certificate is up to you.

(Locally, we missed being deeply affected by a fairly small margin; while we use Ubuntu a lot, only a few research groups had installed anything more recent than Ubuntu 6.06, and 6.06 was just early enough that it wasn't vulnerable.)

As it turns out, this isn't just a theoretical issue; Akamai had a weak key for some of their own servers and some customers, as was discovered shortly after the initial public notice of the bug. (Yes, I'm late to this particular party.)

DebianCertCompromise written at 23:56:40

