Wandering Thoughts

2017-06-21

The oddity of CVE-2014-9940 and the problem of recognizing kernel security patches

Let's start with my tweet:

Do I want to know why a reported and disclosed in 2017 Linux kernel vulnerability has a 2014 CVE number? Probably not.

Today, Ubuntu came out with USN-3335-1, a security advisory about their Ubuntu 14.04 LTS kernel. Among the collection of CVEs fixed was one that caught my eye, CVE-2014-9940. This was simply because of the '2014' bit, which is normally the year of the CVE. At first I thought this might be Ubuntu's usual thing where they sometimes repeat old, long-patched issues in their update announcements, but no; as far as I can tell this is a new issue. Ubuntu's collection of links led to the May Android security bulletin, which says that CVE-2014-9940 was only reported on February 15th, 2017.

(I think that the Android security bulletin is the first report.)

So where does the 2014 come from? That's where I wound up looking more closely at the kernel commit that fixes it:

Author: Seung-Woo Kim
Date: Thu Dec 4 19:17:17 2014 +0900

regulator: core: Fix regualtor_ena_gpio_free not to access pin after freeing

After freeing pin from regulator_ena_gpio_free, loop can access the pin. So this patch fixes not to access pin after freeing.

This fix actually dates from December 2014 and went into the kernel between 3.18-rc1 and 3.19-rc1, so it appears that the CVE number was assigned based on when the fix was made, not when the issue was reported.

The next question is why it took until 2017 for vendors using old kernels to patch them against this issue. Although I don't know for sure, I have a theory, namely that this simply wasn't recognized as a security vulnerability until early this year. Many fixes go into every kernel version, far too many to backport them all, so Linux distributions have to pick and choose. Naturally distributions grab security fixes, but that requires everyone involved to actually recognize that what they've written is a security fix. I rather suspect that back in 2014, no one realized that this use-after-free issue was an (exploitable) vulnerability.

It's interesting that this seems to have been reported around the time of CVE-2017-6074, where I started to hear that use-after-free issues in the kernel were increasingly exploitable. I wonder if people went trawling through kernel changelogs to find 'fixed use-after-free issue' changes like this one, then did some digging to see if the issues could be weaponized into vulnerabilities and if any currently used older kernels (such as Android kernels and old Ubuntu LTSes) had missed picking up the patches.
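(For illustration, that sort of trawling might start with something like the following against a kernel git tree; the search terms here are only my guess at what you would look for, not anything I know was actually used.)

  # fixes that went in between 3.18-rc1 and 3.19-rc1 and mention use-after-free
  git log --oneline --grep='use-after-free' --grep='use after free' v3.18-rc1..v3.19-rc1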

(If people are doing this sort of trawling, I suspect that we can expect a whole series of future CVEs on similar issues.)

If I'm right here, the story of CVE-2014-9940 makes for an excellent example of how not all security fixes are realized to be so at the time (which makes it hard to note them as security fixes in kernel changelogs). As this CVE demonstrates, sometimes it may take more than two years for someone to realize that a particular bugfix closes a security vulnerability. Then everyone with old kernels gets to scramble around to fix them.

(By the way, the answer to this is not 'everyone should run the latest kernel'. Nor is it 'the kernel should stop changing and everyone should focus on removing bugs from it'. The real answer is that there is no solution because this is a hard problem.)

CVE-2014-9940-Speculation written at 01:30:09; Add Comment

2017-06-11

How to see raw USB events on Linux via usbmon

Suppose, not entirely hypothetically, that you're wondering if your USB keyboard actually does anything for certain key presses, such as its Fn button plus a normal letter key like 's'. One way to try to find out is just to type the key combination while sitting at your shell prompt or in your editor or the like, and see what happens. A more elaborate method is to fire up xev and see what it says about X events, since there are a number of things that can happen between the level of X events and what your shell sees. Of course this starts to hint at the broad problem, which is that a modern graphical Linux environment has all sorts of layers that may swallow or distort raw events, so not seeing anything in the shell or in xev only means that things didn't make it that far.

(This is useful knowledge, of course, especially if your ultimate goal is getting characters to your shell or to an X program that should recognize them. If hitting that key produces some event, xev will tell you what it's been turned into.)

Usefully, the Linux kernel gives us a way to bypass all of the (potential) layers of input processing and see the raw USB events being generated (or not generated, as the case may be). Looking at the presence or absence of raw events is pretty definite. If the keyboard is not generating any USB events when you press some keys, well, that's it. You do this with the kernel's usbmon system, as covered in the kernel's usbmon.txt and Ubuntu's page on debugging USB. For quick checks, the text interface in /sys/kernel/debug/usb/usbmon is your most convenient option, but Linux distributions seem to vary as to whether you need to load a usbmon kernel module or not to get it.

(On my Fedora 25 machines, everything is ready to go out of the box, with no kernel modules needing to be loaded; the Fedora kernels are apparently built with CONFIG_USB_MON=y. On a relatively stock Ubuntu 16.04 server, there's a usbmon module and it's not loaded by default.)
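To make this concrete, here is roughly what a quick check with the text interface looks like; the '2' in the bus number is just an example and will vary from machine to machine.

  # load the usbmon module first if your distribution needs it
  sudo modprobe usbmon
  # then watch the raw event stream for USB bus 2 (the 0u file watches all busses)
  sudo cat /sys/kernel/debug/usb/usbmon/2u

Press your keys while this is running; if nothing at all shows up for a particular key combination, the keyboard simply isn't sending anything for it.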

Wireshark can capture and interpret USB bus traffic, per here, and in theory should be a good way to see the details of USB events in a more user-friendly format than the kernel's text dump. In practice I can't seem to persuade the Fedora 25 version to give me useful information here. I find it more helpful to read the output from the text interface, which at least lets me distinguish one sort of event from another (for example, the mouse scrollwheel going in one direction versus the other direction). Possibly I'm missing some magic Wireshark options, especially since I don't use Wireshark very often. Alternately, I'd need to know a non-casual amount about USB message formats and the details of the USB protocol in order to understand what Wireshark is showing me and extract the things I'm interested in.

(There also may be an issue that apparently Wireshark may only do a good job decoding things if it sees you plug in the USB device. This is perhaps sensible behavior for Wireshark, or even necessary, but it's not very useful for checking the details of what my (only) keyboard or mouse generate. I'm not really enthused about unplugging and then replugging them; it has somewhat annoying side effects.)

As a side note, since you can only monitor an entire USB bus (or all busses at once), it's not necessarily very important to be detailed about identifying what USB device is where. For instance, on my home machine, the only USB bus that reports any non-hub devices is bus 2, so even if a plain 'lsusb' doesn't clearly identify the specific USB device that's my keyboard or mouse, it's a pretty good bet that they're on bus 2. 'lsusb -t' can also give strong hints; on my home machine, all 'Class=Human Interface Device' USB devices are on bus 2 even if lsusb doesn't tell me exactly what their (product) names are.
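(The quick way to look is something like this; the details of the output will obviously vary from machine to machine.)

  # the bus/port tree with device classes; keyboards and mice show up as
  # 'Class=Human Interface Device, Driver=usbhid'
  lsusb -t
  # plain lsusb adds vendor and product names, which may or may not help
  lsusb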

(I started writing this entry just to have some information about the usbmon feature recorded for my later use, and wound up learning useful things about Wireshark's USB stuff.)

PS: Because Wireshark is actually using libpcap to get the USB traffic from the kernel, you can also use tcpdump to capture USB traces, for example 'tcpdump -i usbmon2 -s 0 -w /tmp/usbmon.pcap'. This may be handy if you have a server seeing USB oddities; at least around here, we're far more likely to have basic tcpdump installed on random machines than we are to put Wireshark on them. As usual, you can use tcpdump to capture the packet trace, transfer it to your workstation, and run Wireshark on the capture to decode and analyze it and so on.
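A sketch of that whole workflow, with a made-up hostname, looks something like:

  # on the machine with the USB oddity: capture bus 2 with full-sized packets
  tcpdump -i usbmon2 -s 0 -w /tmp/usbmon.pcap
  # back on your workstation: fetch the capture and decode it at leisure
  scp server:/tmp/usbmon.pcap .
  wireshark usbmon.pcap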

(This is also how I avoid running Wireshark as root even if I'm capturing the traffic on my own machine.)

USBMonSeeingUSBEvents written at 00:52:36; Add Comment

2017-05-25

Using Linux's Magic Sysrq on modern keyboards without a dedicated SysRq key

In the old days, more or less all PC keyboards had a dedicated 'Print Screen/Sysrq' key and using Linux's Magic Sysrq was easy: you held down Alt, PrintScrn/Sysrq, and an appropriate key at the same time. However, this is becoming less and less common all the time, and in particular my current keyboard doesn't have a dedicated Sysrq key. Instead SysRq is overloaded on F9, and you get it by holding the keyboard's Fn key down when you push F9/PrintScrn. This presents a small difficulty in hitting Sysrq key combos, because holding that Fn key down while you hit regular keys often produces absolutely nothing. So if you just press, say, Fn + Alt + F9 + s to try to force a sync, nothing happens.

After some flailing around and unpredictable, intermittent successes, I think I have finally figured out how to use magic SysRq reliably on my keyboard and on similar keyboards without dedicated SysRq keys. The trick is this:

  1. press and hold Fn + Alt + your SysRq key
  2. release your SysRq key and Fn, while still holding Alt
  3. press your desired SysRq action key, such as s.

The entire sequence has proven to be important. Releasing Fn while still holding SysRq/F9 down has had a tendency to make the kernel see an Alt+F9 sequence (which switches virtual consoles and in the process wipes away a bunch of the messages I want to keep seeing, and which obviously aborts entering magic Sysrq stuff). Releasing all three keys ends the whole magic SysRq sequence, which means my s does nothing.

This turns out to be the suggested approach in the kernel.org guide to magic Sysrq, although their advice is about keyboards that don't like having too many keys down at once. My keyboard specifically appears to do nothing even with just Fn + s, so I don't think it's an issue of the number of keys held down at once. And yes, I used usbmon to verify that my keyboard sends no USB events for Fn + s.
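(As a related sanity check that doesn't involve the keyboard at all, you can ask the kernel directly whether SysRq actions are enabled and trigger one by hand; this is the generic sysrq sysctl interface, nothing specific to my keyboard.)

  # 1 means all SysRq functions are allowed; other values are a bitmask
  cat /proc/sys/kernel/sysrq
  # trigger the 'sync' action directly, bypassing the keyboard entirely
  echo s | sudo tee /proc/sysrq-trigger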

(This is perhaps trivial but I want to document it for my own future use, because I'm sure I'm going to forget it at some point.)

MagicSysrqOnModernKeyboards written at 16:14:45; Add Comment

2017-05-07

A mistake I made when setting up my ZFS SSD pool on my home machine

I recently actually started using a pair of SSDs on my home machine, and as part of that I set up a ZFS pool for my $HOME and other data that I want to be fast (as planned). Unfortunately, I recently realized that when I set that pool up I made a mistake of omission. Some people who know ZFS can guess my mistake: I didn't force an ashift setting, unlike what I did with my work ZFS pools.

(I half-fumbled several aspects of my ZFS pool setup, actually; for example I forgot to turn on relatime and set compression to on at the pool level. But those other things I could fix after the fact, although sometimes with a bit of pain.)

Unlike spinning rust hard disks, SSDs don't really have a straightforward physical sector size, and certainly not one that's very useful for most filesystems (the SSD erase block size is generally too large). So in practice their reported logical and physical sector sizes are arbitrary, and some drives are even switchable. As arbitrary numbers, SSDs report whatever their manufacturer considers convenient. In my case, Crucial apparently decided to make their MX300 750 GB SSDs report that they had 512 byte physical sectors. Then ZFS followed its defaults and created my pool with an ashift of 9, which means that I could run into problems if I have to replace SSDs.

(I'm actually a bit surprised that Crucial SSDs are set up this way; I expected them to report as 4K advanced format drives, since HDs have gone this way and some SSDs switched very abruptly. It's possible that SSD vendors have decided that reporting 512 byte sectors is the easiest or most compatible way forward, at least for consumer SSDs, given that the sizes are arbitrary anyway.)
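If you want to check what your own drives claim and what ashift your pool actually got, something like the following works; 'ssdpool' and /dev/sda are stand-ins for your own pool and disks, and the exact zdb incantation may vary a bit with your ZFS version.

  # the logical and physical sector sizes the drive reports
  lsblk -o NAME,LOG-SEC,PHY-SEC /dev/sda
  # the ashift that ZFS actually used for the pool's vdevs
  sudo zdb -C ssdpool | grep ashift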

Unfortunately the only fix for this issue is to destroy the pool and then recreate it (setting an explicit ashift this time around), which means copying all data out of it and then back into it. The amount of work and hassle involved in this creates the temptation to not do anything to the pool and just leave things as they are.

On the one hand, it's not guaranteed that I'll have problems in the future. My SSDs might never break and need to be replaced, and if a SSD does need to be replaced it might be that future consumer SSDs will continue to report 512 byte physical sectors and so be perfectly compatible with my current pool. On the other hand, this seems like a risky bet to make, especially since based on my past history this ZFS pool is likely to live a quite long time. My main LVM setup on my current machine is now more than ten years old; I set it up in 2006 and have carried it forward ever since, complete with its ext3 filesystems; I see no reason why this ZFS pool won't be equally durable. In ten years all of the SSDs may well report themselves as 4K physical sector drives simply because that's what all of the (remaining) HDs will report and so that's what all of the software expects.

Now is also my last good opportunity to fix this, because I haven't put much data in my SSD pool yet and I still have the old pair of 500 GB system HDs in my machine. The 500 GB HDs could easily hold the data from my SSD ZFS pool, so I could repartition them, set up a temporary ZFS pool on them, reliably and efficiently copy everything over to the scratch pool with 'zfs send' (which is generally easier than rsync or the like), then copy it all back later. If I delay, well, I should pull the old 500 GB disks out and put the SSDs in their proper place (partly so they get some real airflow to keep their temperatures down), and then things get more difficult and annoying.
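A rough sketch of that shuffle, using made-up pool, snapshot, and device names (a real version would also want the 'zpool create' options and properties to match what I actually use), might look like:

  # temporary pool on the repartitioned 500 GB HDs
  zpool create scratch mirror /dev/sdc1 /dev/sdd1
  # copy everything over with a recursive snapshot and send/receive
  zfs snapshot -r ssdpool@migrate
  zfs send -R ssdpool@migrate | zfs recv -F -d scratch
  # recreate the SSD pool, this time forcing an ashift of 12 (4K sectors)
  zpool destroy ssdpool
  zpool create -o ashift=12 ssdpool mirror /dev/sda1 /dev/sdb1
  # and then copy the data back the same way
  zfs snapshot -r scratch@back
  zfs send -R scratch@back | zfs recv -F -d ssdpool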

(I'm partly writing this entry to motivate myself into actually doing all of this. It's the right thing to do, I just have to get around to it.)

ZFSSSDPoolSetupMistake written at 02:31:24; Add Comment

2017-05-04

My views on using LVM for your system disk and root filesystem

In a comment on my entry about perhaps standardizing the size of our server root filesystems, Goozbach asked a good question:

Any reason not to put LVM on top of raid for OS partitions? (it's saved my bacon more than once both resizing and moving disks)

First, let's be clear what we're talking about here. This is the choice between putting your root filesystem directly into a software RAID array (such as /dev/md0) or creating a LVM volume group on top of the software RAID array and then having your root filesystem be a logical volume in it. In a root-on-LVM-on-MD setup, I'm assuming that the root filesystem would still use up all of the disk space in the LVM volume group (for most of the same reasons outlined for the non-LVM case in the original entry).

For us, the answer is that there is basically no payoff for routinely doing this, because in order to need LVM for this we'd need a number of unusual things to be true all at once:

  • we can't just use space in the root filesystem; for some reason, it has to be an actual separate filesystem.
  • but this separate filesystem has to use space from the system disks, not from any additional disks that we might add to the server.
  • and there needs to be some reason why we can't just reinstall the server from scratch with the correct partitioning and must instead go through the work of shrinking the root filesystem and root LVM logical volume in order to make up enough spare space for the new filesystem.

Probably an important part of this is that our practice is to reinstall servers from scratch when we repurpose them, using our install system that makes this relatively easy. When we do this we get the option to redo the partitioning (although it's generally easier to keep things the same, since that means we don't even have to repartition, just tell the installer to use the existing software RAIDs). If we had such a special need for a separate filesystem, it's probably a sufficiently unique and important server that we would want to start it over from scratch, rather than awkwardly retrofitting an existing server into shape.

(One problem with a retrofitted server is that you can't be entirely sure you can reinstall it from scratch if you need to, for example because the hardware fails. Installing a new server from scratch does help a great deal to assure that you can reinstall it too.)

We do have servers with unusual local storage needs. But those servers mostly use additional disks or unusual disks to start with, especially now that we've started moving to small SSDs for our system disks. With small SSDs there just isn't much space left over for a second filesystem, especially if you want to leave a reasonable amount of space free on both it and the root filesystem in case of various contingencies (including just 'more logs got generated than we expected').

I also can't think of many things that would need a separate filesystem instead of just being part of the root filesystem and using up space there. If we're worried about this whatever-it-is running the root filesystem out of space, we almost certainly want to put in big, non-standard system disks in the first place rather than try to wedge it into whatever small disks the system already has. Leaving all the free space in a single (root) filesystem that everything uses has the same space flexibility as ZFS, and we're lazy enough to like that. It's possible that I'm missing some reasonably common special case here because we just don't do whatever it is that really needs a separate local filesystem.

(We used to have some servers that needed additional system filesystems because they needed or at least appeared to want special mount options. Those needs quietly went away over the years for various reasons.)

Sidebar: LVM plus a fixed-size root filesystem

One possible option to advance here is a hybrid approach between a fixed size root partition and a LVM setup: you make the underlying software RAID and LVM volume group as big as possible, but then you assign only a fixed and limited amount of that space to the root filesystem. The remaining space is left as uncommitted free space, and then is either allocated to the root if it needs to grow or used for additional filesystems if you need them.
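Concretely, such a setup would look something like this (the volume group name and the sizes are arbitrary examples, not anything we actually use):

  # LVM on top of the maximum-sized software RAID mirror
  pvcreate /dev/md0
  vgcreate sysvg /dev/md0
  # a fixed, limited amount of space for the root filesystem; the rest stays free
  lvcreate -L 80G -n root sysvg
  mkfs.ext4 /dev/sysvg/root
  # later, if root really does need to grow (-r also grows the filesystem)
  lvextend -r -L +20G sysvg/root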

I don't see much advantage to this setup, though. Since the software RAID array is maximum-sized, you still have the disk replacement problems that motivated my initial question. You add the chance of the root filesystem running out of space if you don't keep an eye on it and make the time to grow it as needed, and in order for this setup to pay off you still have to need the space in a separate filesystem for some reason, instead of as part of the root filesystem. What you save is the hassle of shrinking the root filesystem if you ever need to make that additional filesystem with its own space.

LVMForRootViews written at 00:18:43; Add Comment

2017-04-30

Do we want to standardize the size of our root filesystems on servers?

We install many of our Linux servers with mirrored system disks, and at the moment our standard partitioning is to have a 1 GB swap partition and then give the rest of the space to the root filesystem. In light of the complexity of shrinking even a RAID-1 swap partition, whose contents I could casually destroy, an obvious question came to me: did we want to switch to having our root filesystems normally being a standard size, say 80 GB, with the rest of the disk space left unused?

The argument for doing this is that it makes replacing dead system disks much less of a hassle than it is today, because almost any SATA disk we have lying around would do. Today, if a system disk breaks we need to find a disk of the same size or larger to replace it with, and we may not have a same-sized disk (we recycle random disks a lot), so we may wind up with weird mismatched disks with odd partitioning. An 80 GB root filesystem is good enough for basically any of our Linux servers; even with lots of packages and so on installed, they just don't need much space (we don't seem to have any that are using over about 45 GB of space, and that's including a bunch of saved syslog logs and so on).

The main argument against doing this is that this hasn't been a problem so far and there are some potential uses for having lots of spare space in the root filesystem. I admit that this may not sound too persuasive now that I write it down, but honestly 'this is not a real problem for us' is a valid argument. If we were going to pick a standard root filesystem size we'd have to figure out what it should be, monitor the growth of our as-installed root filesystems over time (and over new Ubuntu versions), maybe reconsider the size every so often, and so on. We'd probably want to actually calculate what minimum disk size we're going to get in the future and set the root filesystem size based on that, which implies doing some research and discussion. All of this adds up to kind of a hassle (and having people spend time on this does cost money, at least theoretically).

Given that it's not impossible to shrink an extN filesystem if we have to and that we usually default to using the smallest size of disks in our collection for new system disks, leaving our practices as they are is both pretty safe and what I expect we'll do.

(We also seem to only rarely lose (mirrored) system disks, even when they're relatively old disks. That may change in the future, or maybe not as we theoretically migrate to SSDs for system disks. Our practical migration is, well, not that far along, for reasons beyond the scope of this entry.)

FixedRootFSSizeQuestion written at 01:32:16; Add Comment

2017-04-24

Corebird and coming to a healthier relationship with Twitter

About two months ago I wrote about my then views on the Corebird Twitter client. In that entry I said that Corebird was a great client for checking in on Twitter and skimming through it, but wasn't my preference for actively following Twitter; for that I still wanted Choqok for various reasons. You know what? It turns out that I was wrong. I now feel that Corebird is both a better Linux Twitter client in general and that it's a better Twitter client for me in specific. Unsurprisingly, it's become the dominant Twitter client that I use.

Corebird is mostly a better Twitter client in general because it has much better support for modern Twitter features, even if it's not perfect and there are things from Choqok that I wish it did (even as options). It has niceties like displaying quoted tweets inline and letting me easily and rapidly look at attached media (pictures, animations, etc), and it's just more fluid in general (even if it has some awkward and missing bits, like frankly odd scrolling via the keyboard). Corebird has fast, smooth updates of new tweets more or less any time you want, and it can transparently pull in older tweets as you scroll backwards to a relatively impressive level. Going back to Choqok now actually feels clunky and limited, even though it has features that I theoretically rather want (apart from the bit where I know that several of those features are actually bad for me).

(Corebird's ability to display more things inline makes a surprising difference when skimming Twitter, because I can see more without having to click on links and spawn things in my browser and so on. I also worked out how to make Corebird open up multiple accounts on startup; it's hiding in the per-account settings.)

Corebird is a better Twitter client for me in specific because it clearly encourages me to have a healthier approach to Twitter, the approach I knew I needed a year ago. It's not actually good for me to have a Twitter client open all the time and to try to read everything, and it turns out that Corebird's lack of some features actively encourages me to not try to do this. There's no visible unread count to prod me to pay attention, there is no marker of read versus unread to push me to trying to read all of the unread Tweets one by one, and so on. That Corebird starts fast and lets me skim easily (and doesn't hide itself away in the system tray) also encourages me to close it and not pay attention to Twitter for a while. If I do keep Corebird running and peek in periodically, its combination of features make it easy and natural to skim, rapidly scan, or outright skip the new tweets, so I'm pretty sure I spend less time catching up than I did in Choqok.

(Fast starts matter because I know I can always come back easily if I really want to. As I have it configured, Choqok took quite a while to start up and there were side effects of closing it down with unread messages. In Corebird, startup is basically instant and I know that I can scroll backwards through my timeline to where I was, if I care enough. Mostly I don't, because I'm looking at Twitter to skim it for a bit, not to carefully read everything.)

The net result is that Corebird has turned checking Twitter into what is clearly a diversion, instead of something to actively follow. I call up Corebird when I want to spend some time on Twitter, and then if things get busy there is nothing to push me to get back to it and maybe I can quit out of it in order to make Twitter be even further away (sometimes Corebird helps out here by quietly crashing). This is not quite the 'stop fooling yourself you're not multitasking here' experience that using Twitter on my phone is, but it feels closer to it than Choqok did. Using Corebird has definitely been part of converting Twitter from a 'try to read it all' experience to a 'dip in and see what's going on' one, and the latter is much better for me.

(It turns out that I was right and wrong when I wrote about how UI details mattered for my Twitter experience. Back then I said that a significantly different client from Choqok would mean that my Twitter usage would have to change drastically. As you can see, I was right about that; my Twitter usage has changed drastically. I was just wrong about that necessarily being a bad thing.)

CorebirdViewsII written at 00:29:40; Add Comment

2017-04-21

A surprising reason grep may think a file is a binary file

Recently, 'fgrep THING FILE' for me has started to periodically report 'Binary file FILE matches' for files that are not in fact binary files. At first I thought a stray binary character might have snuck into one file this was happening to, because it's a log file that accumulates data partly from the Internet, but then it happened to a file that is only hand-edited and that definitely shouldn't contain any binary data. I spent a chunk of time tonight trying to find the binary characters or mis-encoded UTF-8 or whatever it might be in the file, before I did the system programmer thing and just fetched the Fedora debuginfo package for GNU grep so that I could read the source and set breakpoints.

(I was encouraged into this course of action by this Stackexchange question and answers, which quoted some of grep's source code and in the process gave me a starting point.)

As this answer notes, there are two cases where grep thinks your file is binary: if there's an encoding error detected, or if it detects some NUL bytes. Both of these sound at least conceptually simple, but it turns out that grep tries to be clever about detecting NULs. Not only does it scan the buffers that it reads for NULs, but it also attempts to see if it can determine that a file must have NULs in the remaining data, in a function helpfully called file_must_have_nulls.

You might wonder how grep or anything can tell if a file has NULs in the remaining data. Let me answer that with a comment from the source code:

/* If the file has holes, it must contain a null byte somewhere. */

Reasonably modern versions of Linux (since kernel 3.1) have some special additional lseek() options, per the manpage. One of them is SEEK_HOLE, which seeks to the nearest 'hole' in the file. Holes are unwritten data and Unix mandates that they read as NUL bytes, so if a file has holes, it's got NULs and so grep will call it a binary file.
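You can see the NUL detection side of this for yourself with a sparse file (the exact wording of grep's report may vary by version):

  # a bit of text, then extend the file with a hole full of NUL bytes
  printf 'hello world\n' >/tmp/holetest
  truncate -s 1M /tmp/holetest
  # grep sees the NULs (via SEEK_HOLE or just by reading them) and gives up:
  grep hello /tmp/holetest
  # -> Binary file /tmp/holetest matches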

SEEK_HOLE is not implemented on all filesystems. More to the point, the implementation of SEEK_HOLE may not be error-free on all filesystems all of the time. In my particular case, the files that are being unexpectedly reported as binary are on ZFS on Linux, and it appears that under some mysterious circumstances the latest development version of ZoL can report that there are holes in a file when there aren't. It appears to be a timing issue, but strace gave me a clear smoking gun and I managed to reproduce it in a simple test program that gives a clean trace:

open("testfile", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0600, st_size=33005, ...}) = 0
read(3, "aaaaaaaaa"..., 32768) = 32768
lseek(3, 32768, SEEK_HOLE)              = 32768

The file doesn't have any holes, yet sometimes it's being reported as having one at the exact current offset (and yes, the read() is apparently important to reproduce the issue).

(Interested parties can see more weirdness in the ZFS on Linux issue.)

GrepBinaryFileReason written at 00:57:12; Add Comment

2017-04-20

The big motivation for a separate /boot partition

In a recent comment on my entry on how there's no point in multiple system filesystems any more, I was asked:

What about old(er) computers stuck with BIOS booting? Wasn't the whole seperate /boot/ partially there to appease those systems?

There have been two historical motivations for a separate /boot filesystem in Linux. The lesser motivation is mismatches between what GRUB understands versus what your system does; with a separate /boot you can still have a root filesystem that is, say, the latest BTRFS format, without requiring a bootloader that understands the latest BTRFS. Instead you make a small /boot that uses whatever basic filesystem your bootloader is happy with, possibly all the way down to ext2.

The bigger motivation has been machines where the BIOS couldn't read data from the entire hard disk. All the stages of the bootloader read data using BIOS services, so all of the data they need had to be within a portion of the disk that the BIOS could reach; in fact, they all had to be within the area reachable by whatever (old and basic) BIOS service the bootloader was using. The first stage of the bootloader is at the start of the disk, so that's no problem, and the second stage is usually embedded shortly after it, which is also no problem. The real problem is things that fully live in the filesystem, like the GRUB grub.cfg menu and especially the kernel and initramfs that the bootloader needed to load into memory in order to boot.

(There have been various BIOS limits over the years (see also), and some of the early ones are rather small.)

If your /boot was part of the root filesystem, you had to make sure that your entire root filesystem was inside the area that the BIOS could read. On old machines with limited BIOSes, this could drastically constrain both the size and position of your entire root filesystem. If you had a (small) separate /boot filesystem, only the /boot filesystem had to be within this limited area of BIOS readable disk space; your root filesystem could spill outside of it without problems. You could make / as big as you wanted and put it wherever you wanted.

(If you care about this, it's not enough to have a separate /boot and to make it small; you need to put it as close to the start of the disk as possible, and it's been traditional to make it a primary partition instead of an extended one. Linux installers may or may not do this for you if you tell them to make a separate /boot filesystem.)

Today this concern is mostly obsolete and has been for some time. Even BIOS MBR only machines can generally boot from anywhere on the disk, or at least anywhere on the disk that the MBR partitions can address (which is anything up to 2 TB). In theory you could get into trouble if you had a HD larger than 2 TB, used GPT partitioning, put your root filesystem partly or completely after the 2 TB boundary, and your bootloader and BIOS didn't use LBA48 sector addressing. However I think that even this is relatively unlikely, given that LBA48 is pretty old by now.

(This was once common knowledge in the Linux world, but that was back in the days when it was actually necessary to know this because you might run into such a machine. Those days are probably at least half a decade ago, and probably more than that.)

WhySeparateBootFS written at 00:19:49; Add Comment

2017-04-17

Shrinking the partitions of a software RAID-1 swap partition

A few days ago I optimistically talked about my plans for a disk shuffle on my office workstation, by replacing my current 1 TB pair of drives (one of which had failed) with a 1.5 TB pair. Unfortunately when I started putting things into action this morning, one of the 1.5 TB drives failed immediately. We don't have any more spare 1.5 TB drives (at least none that I trust), but we did have what I believe is a trustworthy 1 TB drive, so I pressed that into service and changed my plans around to be somewhat less ambitious and more lazy. Rather than make a whole new set of RAID arrays on the new disks (and go through the effort of adding them to /etc/mdadm.conf and so on), I opted to just move most of the existing RAID arrays over to the new drives by attaching and detaching mirrors.

This presented a little bit of a problem for my mirrored swap partition, which I wanted to shrink from 4 GB to 1 GB. Fortunately it turns out that it's actually possible to shrink a software RAID-1 array these days. After some research, my process went like this:

  • Create the new 1 GB partitions for swap on the new disks as part of partitioning them. We can't directly add these to the existing swap array, /dev/md14, because they're too small.

  • Stop using the swap partition because we're about to drop 3/4ths of it. This is just 'swapoff -a'.

  • Shrink the amount of space to use on each drive of the RAID-1 array down to an amount of space that's smaller than the new partitions:

    mdadm --grow -z 960M /dev/md14
    

    I first tried using -Z (aka --array-size) to shrink the array size non-persistently, but mdadm still rejected adding a too-small new array component. I suppose I can't blame it.

  • Add in the new 1 GB partitions and pull out the old 4 GB partition:

    mdadm --add /dev/md14 /dev/sdc3
    # (wait for the resync to finish)
    mdadm --add /dev/md14 /dev/sdd3
    mdadm --replace /dev/md14 /dev/sde4
    # (wait for the resync to finish)
    mdadm -r /dev/md14 /dev/sde4
    

  • Tell software RAID to use all of the space on the new partitions:

    mdadm --grow -z max /dev/md14

At this point I almost just swapon'd the newly resized swap partition. Then it occurred to me that it probably still had a swap label that claimed it was a 4 GB swap area, and the kernel would probably be a little bit unhappy with me if I didn't fix that with 'mkswap /dev/md14'. Indeed mkswap reported that it was replacing an old swap label with a new one.

My understanding is that the same broad approach can be used to shift a software RAID-1 array for a filesystem to smaller partitions as well. For a filesystem that you want to keep intact, you first need to shrink the filesystem safely below the size you'll shrink the RAID array to, then at the end grow the filesystem back up. All things considered I hope that I never have to shrink or reshape the RAID array for a live filesystem this way; there are just too many places where I could blow my foot off.
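For the record, a sketch of that filesystem version for ext4 might look like the following, with made-up device and mount point names and sizes borrowed from the swap example above:

  # ext4 can only be shrunk unmounted, after a forced fsck
  umount /data
  e2fsck -f /dev/md15
  resize2fs /dev/md15 900M
  # shrink the array to less than the size of the new partitions
  mdadm --grow -z 960M /dev/md15
  # (swap in the new partitions and remove the old ones, as with the swap array)
  # then grow the array and the filesystem back to full size
  mdadm --grow -z max /dev/md15
  resize2fs /dev/md15
  mount /data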

(Life is easier if the filesystem is expendable and you're going to mkfs a new one on top of it later.)

You might ask why it's worth going through all of this instead of just making a new software RAID-1 array. That's a good question, and for me it comes down to how much of a pain it often is to set up a new array. These days I prefer to change /etc/mdadm.conf, /etc/fstab and so on as little as possible, which means that I really want to preserve the name and MD UUID of existing arrays when feasible instead of starting over from scratch.

This is also where I have an awkward admission: for some reason, I thought that you couldn't use 'mdadm --detail --scan' on a single RAID array, to conveniently generate the new line you need for mdadm.conf when you create a new array. This is wrong; you definitely can, so you can just do things like 'mdadm --detail --scan /dev/mdNN >>/etc/mdadm.conf' to set it up. Of course you may then have to regenerate your initramfs in order to make life happy.

(I hope I never need to do this sort of thing again, but if I do I want to have some notes about it. Sadly someday we may need to use a smaller replacement disk in a software RAID mirror in an emergency situation and I may get to call on this experience.)

ShrinkingSoftwareRAIDSwap written at 23:18:30; Add Comment
