Wandering Thoughts

2017-05-07

A mistake I made when setting up my ZFS SSD pool on my home machine

I recently actually started using a pair of SSDs on my home machine, and as part of that I set up a ZFS pool for my $HOME and other data that I want to be fast (as planned). Unfortunately, I've since realized that when I set that pool up I made a mistake of omission. Some people who know ZFS can guess my mistake: I didn't force an ashift setting, unlike what I did with my work ZFS pools.

(I half-fumbled several aspects of my ZFS pool setup, actually; for example I forgot to turn on relatime and set compression to on at the pool level. But those other things I could fix after the fact, although sometimes with a bit of pain.)

Unlike spinning rust hard disks, SSDs don't really have a straightforward physical sector size, and certainly not one that's very useful for most filesystems (the SSD erase block size is generally too large). So in practice their reported logical and physical sector sizes are arbitrary, and some drives are even switchable. Since the numbers are arbitrary, SSDs report whatever their manufacturer considers convenient. In my case, Crucial apparently decided to make their MX300 750 GB SSDs report that they had 512 byte physical sectors. Then ZFS followed its defaults and created my pool with an ashift of 9, which means that I could run into problems if I have to replace SSDs.
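For the record, checking what the drives claim and what ashift a pool actually got is straightforward, and forcing the ashift is just one option at pool creation time. Here's a sketch with made-up device names and a made-up pool name (ashift=12, ie 4 KB blocks, is the usual value to force):

    # what the drive reports (sda here is hypothetical)
    cat /sys/block/sda/queue/logical_block_size /sys/block/sda/queue/physical_block_size

    # what ashift an existing pool wound up with (reads the cached pool config)
    zdb -C ssdpool | grep ashift

    # what I should have done when creating the pool
    zpool create -o ashift=12 ssdpool mirror /dev/disk/by-id/SSD-A /dev/disk/by-id/SSD-B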

(I'm actually a bit surprised that Crucial SSDs are set up this way; I expected them to report as 4K advanced format drives, since HDs have gone this way and some SSDs switched very abruptly. It's possible that SSD vendors have decided that reporting 512 byte sectors is the easiest or most compatible way forward, at least for consumer SSDs, given that the sizes are arbitrary anyway.)

Unfortunately the only fix for this issue is to destroy the pool and then recreate it (setting an explicit ashift this time around), which means copying all data out of it and then back into it. The amount of work and hassle involved in this creates the temptation to not do anything to the pool and just leave things as they are.

On the one hand, it's not guaranteed that I'll have problems in the future. My SSDs might never break and need to be replaced, and if an SSD does need to be replaced it might be that future consumer SSDs will continue to report 512 byte physical sectors and so be perfectly compatible with my current pool. On the other hand, this seems like a risky bet to make, especially since based on my past history this ZFS pool is likely to live quite a long time. My main LVM setup on my current machine is now more than ten years old; I set it up in 2006 and have carried it forward ever since, complete with its ext3 filesystems; I see no reason why this ZFS pool won't be equally durable. In ten years all of the SSDs may well report themselves as 4K physical sector drives simply because that's what all of the (remaining) HDs will report and so that's what all of the software expects.

Now is also my last good opportunity to fix this, because I haven't put much data in my SSD pool yet and I still have the old pair of 500 GB system HDs in my machine. The 500 GB HDs could easily hold the data from my SSD ZFS pool, so I could repartition them, set up a temporary ZFS pool on them, reliably and efficiently copy everything over to the scratch pool with 'zfs send' (which is generally easier than rsync or the like), then copy it all back later. If I delay, well, I should pull the old 500 GB disks out and put the SSDs in their proper place (partly so they get some real airflow to keep their temperatures down), and then things get more difficult and annoying.
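A sketch of that migration with zfs send, using made-up pool names (ssdpool for the real pool, scratch for the temporary pool on the 500 GB HDs):

    zfs snapshot -r ssdpool@migrate
    zfs send -R ssdpool@migrate | zfs receive -Fdu scratch

    # destroy ssdpool, recreate it with an explicit '-o ashift=12', then reverse:
    zfs snapshot -r scratch@back
    zfs send -R scratch@back | zfs receive -Fdu ssdpool

(The -R send and -Fdu receive flags are what make this a whole-pool replication rather than a dataset-by-dataset copy.)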

(I'm partly writing this entry to motivate myself into actually doing all of this. It's the right thing to do, I just have to get around to it.)

ZFSSSDPoolSetupMistake written at 02:31:24; Add Comment

2017-05-04

My views on using LVM for your system disk and root filesystem

In a comment on my entry about perhaps standardizing the size of our server root filesystems, Goozbach asked a good question:

Any reason not to put LVM on top of raid for OS partitions? (it's saved my bacon more than once both resizing and moving disks)

First, let's be clear what we're talking about here. This is the choice between putting your root filesystem directly into a software RAID array (such as /dev/md0) or creating an LVM volume group on top of the software RAID array and then having your root filesystem be a logical volume in it. In a root-on-LVM-on-MD setup, I'm assuming that the root filesystem would still use up all of the disk space in the LVM volume group (for most of the same reasons outlined for the non-LVM case in the original entry).
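For concreteness, setting up root-on-LVM-on-MD looks roughly like this (the device and volume group names are made up):

    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda2 /dev/sdb2
    pvcreate /dev/md0
    vgcreate rootvg /dev/md0
    lvcreate -l 100%FREE -n root rootvg    # the root LV takes all the space, as above
    mkfs.ext4 /dev/rootvg/root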

For us, the answer is that there is basically no payoff for routinely doing this, because in order to need LVM for this we'd need a number of unusual things to be true all at once:

  • we can't just use space in the root filesystem; for some reason, it has to be an actual separate filesystem.
  • but this separate filesystem has to use space from the system disks, not from any additional disks that we might add to the server.
  • and there needs to be some reason why we can't just reinstall the server from scratch with the correct partitioning and must instead go through the work of shrinking the root filesystem and root LVM logical volume in order to make up enough spare space for the new filesystem.

Probably an important part of this is that our practice is to reinstall servers from scratch when we repurpose them, using our install system that makes this relatively easy. When we do this we get the option to redo the partitioning (although it's generally easier to keep things the same, since that means we don't even have to repartition, just tell the installer to use the existing software RAIDs). If we had such a special need for a separate filesystem, it's probably a sufficiently unique and important server that we would want to start it over from scratch, rather than awkwardly retrofitting an existing server into shape.

(One problem with a retrofitted server is that you can't be entirely sure you can reinstall it from scratch if you need to, for example because the hardware fails. Installing a new server from scratch does help a great deal to assure that you can reinstall it too.)

We do have servers with unusual local storage needs. But those servers mostly use additional disks or unusual disks to start with, especially now that we've started moving to small SSDs for our system disks. With small SSDs there just isn't much space left over for a second filesystem, especially if you want to leave a reasonable amount of space free on both it and the root filesystem in case of various contingencies (including just 'more logs got generated than we expected').

I also can't think of many things that would need a separate filesystem instead of just being part of the root filesystem and using up space there. If we're worried about this whatever-it-is running the root filesystem out of space, we almost certainly want to put in big, non-standard system disks in the first place rather than try to wedge it into whatever small disks the system already has. Leaving all the free space in a single (root) filesystem that everything uses has the same space flexibility as ZFS, and we're lazy enough to like that. It's possible that I'm missing some reasonably common special case here because we just don't do whatever it is that really needs a separate local filesystem.

(We used to have some servers that needed additional system filesystems because they needed or at least appeared to want special mount options. Those needs quietly went away over the years for various reasons.)

Sidebar: LVM plus a fixed-size root filesystem

One possible option to put forward here is a hybrid approach between a fixed-size root partition and an LVM setup: you make the underlying software RAID and LVM volume group as big as possible, but then you assign only a fixed and limited amount of that space to the root filesystem. The remaining space is left as uncommitted free space, and is then either allocated to the root if it needs to grow or used for additional filesystems if you need them.
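Continuing with the made-up names from the earlier sketch, the hybrid version looks something like:

    lvcreate -L 80G -n root rootvg      # fixed-size root; the rest of the VG stays free
    mkfs.ext4 /dev/rootvg/root

    # later, if the root filesystem turns out to need more space:
    lvextend -L +40G rootvg/root
    resize2fs /dev/rootvg/root

    # or, if you really do need that separate filesystem:
    lvcreate -L 100G -n data rootvg
    mkfs.ext4 /dev/rootvg/data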

I don't see much advantage to this setup, though. Since the software RAID array is maximum-sized, you still have the disk replacement problems that motivated my initial question. You add the chance of the root filesystem running out of space if you don't keep an eye on it and make the time to grow it as needed, and in order for this setup to pay off you still have to need the space in a separate filesystem for some reason, instead of as part of the root filesystem. What you save is the hassle of shrinking the root filesystem if you ever need to make that additional filesystem with its own space.

LVMForRootViews written at 00:18:43; Add Comment

2017-04-30

Do we want to standardize the size of our root filesystems on servers?

We install many of our Linux servers with mirrored system disks, and at the moment our standard partitioning is to have a 1 GB swap partition and then give the rest of the space to the root filesystem. In light of the complexity of shrinking even a RAID-1 swap partition, whose contents I could casually destroy, an obvious question came to me: did we want to switch to having our root filesystems normally being a standard size, say 80 GB, with the rest of the disk space left unused?

The argument for doing this is that it makes replacing dead system disks much less of a hassle than it is today, because almost any SATA disk we have lying around would do. Today, if a system disk breaks we need to find a disk of the same size or larger to replace it with, and we may not have a same-sized disk (we recycle random disks a lot), so we may wind up with weird mismatched disks with odd partitioning. An 80 GB root filesystem is good enough for basically any of our Linux servers; even with lots of packages and so on installed, they just don't need much space (we don't seem to have any that are using over about 45 GB of space, and that's including a bunch of saved syslog logs and so on).

The main argument against doing this is that this hasn't been a problem so far and there are some potential uses for having lots of spare space in the root filesystem. I admit that this may not sound too persuasive now that I write it down, but honestly 'this is not a real problem for us' is a valid argument. If we were going to pick a standard root filesystem size we'd have to figure out what it should be, monitor the growth of our as-installed root filesystems over time (and over new Ubuntu versions), maybe reconsider the size every so often, and so on. We'd probably want to actually calculate what minimum disk size we're going to get in the future and set the root filesystem size based on that, which implies doing some research and discussion. All of this adds up to kind of a hassle (and having people spend time on this does cost money, at least theoretically).

Given that it's not impossible to shrink an extN filesystem if we have to and that we usually default to using the smallest size of disks in our collection for new system disks, leaving our practices as they are is both pretty safe and what I expect we'll do.

(We also seem to only rarely lose (mirrored) system disks, even when they're relatively old disks. That may change in the future, or maybe not as we theoretically migrate to SSDs for system disks. Our practical migration is, well, not that far along, for reasons beyond the scope of this entry.)

FixedRootFSSizeQuestion written at 01:32:16; Add Comment

2017-04-24

Corebird and coming to a healthier relationship with Twitter

About two months ago I wrote about my then views on the Corebird Twitter client. In that entry I said that Corebird was a great client for checking in on Twitter and skimming through it, but wasn't my preference for actively following Twitter; for that I still wanted Choqok for various reasons. You know what? It turns out that I was wrong. I now feel that Corebird is both a better Linux Twitter client in general and that it's a better Twitter client for me in specific. Unsurprisingly, it's become the dominant Twitter client that I use.

Corebird is mostly a better Twitter client in general because it has much better support for modern Twitter features, even if it's not perfect and there are things from Choqok that I wish it did (even as options). It has niceties like displaying quoted tweets inline and letting me easily and rapidly look at attached media (pictures, animations, etc), and it's just more fluid in general (even if it has some awkward and missing bits, like frankly odd scrolling via the keyboard). Corebird has fast, smooth updates of new tweets more or less any time you want, and it can transparently pull in older tweets as you scroll backwards to a relatively impressive level. Going back to Choqok now actually feels clunky and limited, even though it has features that I theoretically rather want (apart from the bit where I know that several of those features are actually bad for me).

(Corebird's ability to display more things inline makes a surprising difference when skimming Twitter, because I can see more without having to click on links and spawn things in my browser and so on. I also worked out how to make Corebird open up multiple accounts on startup; it's hiding in the per-account settings.)

Corebird is a better Twitter client for me in specific because it clearly encourages me to have a healthier approach to Twitter, the approach I knew I needed a year ago. It's not actually good for me to have a Twitter client open all the time and to try to read everything, and it turns out that Corebird's lack of some features actively encourages me to not try to do this. There's no visible unread count to prod me to pay attention, there is no marker of read versus unread to push me to trying to read all of the unread Tweets one by one, and so on. That Corebird starts fast and lets me skim easily (and doesn't hide itself away in the system tray) also encourages me to close it and not pay attention to Twitter for a while. If I do keep Corebird running and peek in periodically, its combination of features make it easy and natural to skim, rapidly scan, or outright skip the new tweets, so I'm pretty sure I spend less time catching up than I did in Choqok.

(Fast starts matter because I know I can always come back easily if I really want to. As I have it configured, Choqok took quite a while to start up and there were side effects of closing it down with unread messages. In Corebird, startup is basically instant and I know that I can scroll backwards through my timeline to where I was, if I care enough. Mostly I don't, because I'm looking at Twitter to skim it for a bit, not to carefully read everything.)

The net result is that Corebird has turned checking Twitter into what is clearly a diversion, instead of something to actively follow. I call up Corebird when I want to spend some time on Twitter, and then if things get busy there is nothing to push me to get back to it and maybe I can quit out of it in order to make Twitter be even further away (sometimes Corebird helps out here by quietly crashing). This is not quite the 'stop fooling yourself you're not multitasking here' experience that using Twitter on my phone is, but it feels closer to it than Choqok did. Using Corebird has definitely been part of converting Twitter from a 'try to read it all' experience to a 'dip in and see what's going on' one, and the latter is much better for me.

(It turns out that I was right and wrong when I wrote about how UI details mattered for my Twitter experience. Back then I said that a significantly different client from Choqok would mean that my Twitter usage would have to change drastically. As you can see, I was right about that; my Twitter usage has changed drastically. I was just wrong about that necessarily being a bad thing.)

CorebirdViewsII written at 00:29:40; Add Comment

2017-04-21

A surprising reason grep may think a file is a binary file

Recently, 'fgrep THING FILE' for me has started to periodically report 'Binary file FILE matches' for files that are not in fact binary files. At first I thought a stray binary character might have snuck into one file this was happening to, because it's a log file that accumulates data partly from the Internet, but then it happened to a file that is only hand-edited and that definitely shouldn't contain any binary data. I spent a chunk of time tonight trying to find the binary characters or mis-encoded UTF-8 or whatever it might be in the file, before I did the system programmer thing and just fetched the Fedora debuginfo package for GNU grep so that I could read the source and set breakpoints.

(I was encouraged into this course of action by this Stackexchange question and answers, which quoted some of grep's source code and in the process gave me a starting point.)

As this answer notes, there are two cases where grep thinks your file is binary: if there's an encoding error detected, or if it detects some NUL bytes. Both of these sound at least conceptually simple, but it turns out that grep tries to be clever about detecting NULs. Not only does it scan the buffers that it reads for NULs, but it also attempts to see if it can determine that a file must have NULs in the remaining data, in a function helpfully called file_must_have_nulls.

You might wonder how grep or anything can tell if a file has NULs in the remaining data. Let me answer that with a comment from the source code:

/* If the file has holes, it must contain a null byte somewhere. */

Reasonably modern versions of Linux (since kernel 3.1) have some special additional lseek() options, per the manpage. One of them is SEEK_HOLE, which seeks to the nearest 'hole' in the file. Holes are unwritten data and Unix mandates that they read as NUL bytes, so if a file has holes, it's got NULs and so grep will call it a binary file.
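You can demonstrate this to yourself with a deliberately sparse file; this is just an illustration of the NUL heuristic, not what was happening to my files:

    truncate -s 1M sparsefile     # all hole, so it reads back as NUL bytes
    echo hello >> sparsefile
    grep hello sparsefile         # reports 'Binary file sparsefile matches'
    grep -a hello sparsefile      # -a forces text mode and prints the matching line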

SEEK_HOLE is not implemented on all filesystems. More to the point, the implementation of SEEK_HOLE may not be error-free on all filesystems all of the time. In my particular case, the files which are being unexpectedly reported as binary are on ZFS on Linux, and it appears that under some mysterious circumstances the latest development version of ZoL can report that there are holes in a file when there aren't. There appears to be a timing issue, but strace gave me a clear smoking gun and I managed to reproduce the problem in a simple test program whose trace makes it obvious:

open("testfile", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0600, st_size=33005, ...}) = 0
read(3, "aaaaaaaaa"..., 32768) = 32768
lseek(3, 32768, SEEK_HOLE)              = 32768

The file doesn't have any holes, yet sometimes it's being reported as having one at the exact current offset (and yes, the read() is apparently important to reproduce the issue).

(Interested parties can see more weirdness in the ZFS on Linux issue.)

GrepBinaryFileReason written at 00:57:12; Add Comment

2017-04-20

The big motivation for a separate /boot partition

In a recent comment on my entry on how there's no point in multiple system filesystems any more, I was asked:

What about old(er) computers stuck with BIOS booting? Wasn't the whole seperate /boot/ partially there to appease those systems?

There have been two historical motivations for a separate /boot filesystem in Linux. The lesser motivation is a mismatch between the filesystems GRUB understands and the filesystem you want to use for the rest of your system; with a separate /boot you can still have a root filesystem that is, say, the latest BTRFS format, without requiring a bootloader that understands the latest BTRFS. Instead you make a small /boot that uses whatever basic filesystem your bootloader is happy with, possibly all the way down to ext2.

The bigger motivation has been machines where the BIOS couldn't read data from the entire hard disk. All the stages of the bootloader read data using BIOS services, so all of the data they need has to be within a portion of the disk that the BIOS can reach; in fact, it all has to be within the area reachable by whatever (old and basic) BIOS service the bootloader is using. The first stage of the bootloader is at the start of the disk, so that's no problem, and the second stage is usually embedded shortly after it, which is also no problem. The real problem is things that fully live in the filesystem, like the GRUB grub.cfg menu and especially the kernel and initramfs that the bootloader needs to load into memory in order to boot.

(There have been various BIOS limits over the years (see also), and some of the early ones are rather small.)

If your /boot was part of the root filesystem, you had to make sure that your entire root filesystem was inside the area that the BIOS could read. On old machines with limited BIOSes, this could drastically constrain both the size and position of your entire root filesystem. If you had a (small) separate /boot filesystem, only the /boot filesystem had to be within this limited area of BIOS readable disk space; your root filesystem could spill outside of it without problems. You could make / as big as you wanted and put it wherever you wanted.

(If you care about this, it's not enough to have a separate /boot and to make it small; you need to put it as close to the start of the disk as possible, and it's been traditional to make it a primary partition instead of an extended one. Linux installers may or may not do this for you if you tell them to make a separate /boot filesystem.)

Today this concern is mostly obsolete and has been for some time. Even BIOS MBR-only machines can generally boot from anywhere on the disk, or at least anywhere on the disk that the MBR partitions can address (which is anything up to 2 TB). In theory you could get into trouble if you had an HD larger than 2 TB, used GPT partitioning, put your root filesystem partly or completely after the 2 TB boundary, and your bootloader and BIOS didn't use LBA48 sector addressing. However I think that even this is relatively unlikely, given that LBA48 is pretty old by now.

(This was once common knowledge in the Linux world, but that was back in the days when it was actually necessary to know this because you might run into such a machine. Those days are probably at least half a decade ago, and probably more than that.)

WhySeparateBootFS written at 00:19:49; Add Comment

2017-04-17

Shrinking the partitions of a software RAID-1 swap partition

A few days ago I optimistically talked about my plans for a disk shuffle on my office workstation, by replacing my current 1 TB pair of drives (one of which had failed) with a 1.5 TB pair. Unfortunately when I started putting things into action this morning, one of the 1.5 TB drives failed immediately. We don't have any more spare 1.5 TB drives (at least none that I trust), but we did have what I believe is a trustworthy 1 TB drive, so I pressed that into service and changed my plans around to be somewhat less ambitious and more lazy. Rather than make a whole new set of RAID arrays on the new disks (and go through the effort of adding them to /etc/mdadm.conf and so on), I opted to just move most of the existing RAID arrays over to the new drives by attaching and detaching mirrors.

This presented a little bit of a problem for my mirrored swap partition, which I wanted to shrink from 4 GB to 1 GB. Fortunately it turns out that it's actually possible to shrink a software RAID-1 array these days. After some research, my process went like this:

  • Create the new 1 GB partitions for swap on the new disks as part of partitioning them. We can't directly add these to the existing swap array, /dev/md14, because they're too small.

  • Stop using the swap partition because we're about to drop 3/4ths of it. This is just 'swapoff -a'.

  • Shrink the amount of space to use on each drive of the RAID-1 array down to an amount of space that's smaller than the new partitions:

    mdadm --grow -z 960M /dev/md14
    

    I first tried using -Z (aka --array-size) to shrink the array size non-persistently, but mdadm still rejected adding a too-small new array component. I suppose I can't blame it.

  • Add in the new 1 GB partitions and pull out the old 4 GB partition:

    mdadm --add /dev/md14 /dev/sdc3
    # (wait for the resync to finish)
    mdadm --add /dev/md14 /dev/sdd3
    mdadm --replace /dev/md14 /dev/sde4
    # (wait for the resync to finish)
    mdadm -r /dev/md14 /dev/sde4
    

  • Tell software RAID to use all of the space on the new partitions:

    mdadm --grow -z max /dev/md14

At this point I almost just swapon'd the newly resized swap partition. Then it occurred to me that it probably still had a swap label that claimed it was a 4 GB swap area, and the kernel would probably be a little bit unhappy with me if I didn't fix that with 'mkswap /dev/md14'. Indeed mkswap reported that it was replacing an old swap label with a new one.
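So the tail end of the process is just:

    mkswap /dev/md14    # writes a new swap signature sized for the shrunken array
    swapon -a

(One caution if you do this elsewhere: mkswap generates a new UUID each time, so if your /etc/fstab refers to the swap area by UUID= you'll need to update it, or hand the old UUID back with mkswap -U.)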

My understanding is that the same broad approach can be used to shift a software RAID-1 array for a filesystem to smaller partitions as well. For a filesystem that you want to keep intact, you first need to shrink the filesystem safely below the size you'll shrink the RAID array to, then at the end grow the filesystem back up. All things considered I hope that I never have to shrink or reshape the RAID array for a live filesystem this way; there are just too many places where I could blow my foot off.

(Life is easier if the filesystem is expendable and you're going to mkfs a new one on top of it later.)

You might ask why it's worth going through all of this instead of just making a new software RAID-1 array. That's a good question, and for me it comes down to how much of a pain it often is to set up a new array. These days I prefer to change /etc/mdadm.conf, /etc/fstab and so on as little as possible, which means that I really want to preserve the name and MD UUID of existing arrays when feasible instead of starting over from scratch.

This is also where I have an awkward admission: for some reason, I thought that you couldn't use 'mdadm --detail --scan' on a single RAID array, to conveniently generate the new line you need for mdadm.conf when you create a new array. This is wrong; you definitely can, so you can just do things like 'mdadm --detail --scan /dev/mdNN >>/etc/mdadm.conf' to set it up. Of course you may then have to regenerate your initramfs in order to make life happy.
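In concrete terms that's something like the following; the initramfs regeneration depends on the distribution:

    mdadm --detail --scan /dev/mdNN >>/etc/mdadm.conf
    dracut --force            # Fedora / RHEL style
    update-initramfs -u       # Debian / Ubuntu style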

(I hope I never need to do this sort of thing again, but if I do I want to have some notes about it. Sadly someday we may need to use a smaller replacement disk in a software RAID mirror in an emergency situation and I may get to call on this experience.)

ShrinkingSoftwareRAIDSwap written at 23:18:30; Add Comment

2017-04-16

Migrating a separate /boot filesystem into the root filesystem

In my notes on moving a software RAID-1 root filesystem around, I mentioned that I still had a separate /boot partition on my home and office machines and I wanted to merge it into the root filesystem at some point as I no longer believe in a separate /boot under most circumstances. Having now done this both in a test Fedora 25 virtual machine and on my office workstation, it's time to write up the procedure. For a slight extra complexity, my /boot and root filesystems are both in Linux software RAID-1 arrays.

First, make sure you're actually booting off your root drives, which may require BIOS fiddling. I probably still have to do this at home (on my new SSDs), but it was already done on my office machine a while back. Then my process is:

  • Make the directory that will be the new mount point of your old separate /boot filesystem. I use /boot2.
  • Change /etc/fstab to have your separate /boot filesystem mounted there.
  • Unmount /boot and remount it on /boot2. With fstab updated, this is just 'umount /boot; mount /boot2'.

  • Copy what is now /boot2 to /boot. I reflexively do this with dump:
    cd /boot && dump -0f - /boot2 | restore -xf -
    

    (We don't need to remake /boot, because unmounting the filesystem from it leaves behind an empty directory that previously was the mountpoint.)

  • You need to know the UUIDs of the root filesystem and your root software RAID device. You can get these from /etc/fstab and /etc/mdadm.conf, or you can use commands such as 'lsblk --fs /dev/mdNN' and 'mdadm -D /dev/mdNN'.

    The software RAID UUID will have :'s in it, for example '35d6ec50:bd4d1f53:7401540f:6f971527'. You're going to need a version of the UUID without them, ie just one long string of hex digits.

  • Also find the filesystem UUID of what is now /boot2 and the software RAID UUID of its array. One convenient way to do this is to look in your current /boot/grub2/grub.cfg for lines of the form:

    set root='mduuid/05a2f6e13830b3a102f7636b98d651f3'
    

    and

    search --no-floppy --fs-uuid --set=root 91aed8b1-c145-4673-8ece-119e19f7038f
    

    Those are the MD UUID and the filesystem UUID respectively. As you may have guessed, you need a version of the MD UUID without colons, because that's what Grub2 wants.

  • Now you need to edit /boot/grub2/grub.cfg to make two sets of changes. The first one is to change the old /boot2 MD UUID and filesystem UUID to the MD UUID and filesystem UUID of your root filesystem, so that Grub2 is looking in the right filesystem for everything. This should be easy with something like vim search and replace, but remember to make sure your command changes all occurrences on a line, not just the first one (eg, vim's g modifier on the s command). A rough sed version of both sets of changes is sketched after this list.

    (Note that the MD UUIDs used in grub.cfg must be the version without :'s in it. Grub2 probably won't like you too much if you use the other form, although I haven't tested it. Yes, it is a bit annoying that Grub wants a different form of MD UUID than the tools normally produce.)

    However, this isn't sufficient by itself, because grub.cfg contains filesystem-relative paths for the Linux kernel and its initial ramdisks. In your old /boot2 these were in the root of the filesystem, but now they are /boot/vmlinuz-*, because /boot is now a filesystem subdirectory. So you need to edit every set of lines that are like:

    linux  /vmlinuz-[...]
    initrd /initramfs-[...]
    

    Change them to add a /boot at the front, like so:

    linux  /boot/vmlinuz-[...]
    initrd /boot/initramfs-[...]
    

    Some versions of grub.cfg may use 'linux16' and 'initrd16' as the directives instead.

  • Finally, run grub2-install to reinstall the Grub boot blocks onto your boot drives, eg 'grub2-install /dev/sda'. This updates the embedded Grub so it knows about your new location for /boot; if you skip this step, your Grub boot blocks will probably quietly keep on using what is now /boot2.
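For what it's worth, the UUID extraction and the grub.cfg editing can be sketched roughly like this. The UUIDs in the sed command are the example ones from above, /dev/md0 stands in for your root array, and NEWMDUUID and NEWFSUUID are placeholders you fill in by hand; compare the result against the backup copy before you trust it:

    # the root array's MD UUID, with the colons that Grub2 doesn't want stripped out
    mdadm --detail /dev/md0 | awk '/UUID/ {print $3}' | tr -d :
    # and the root filesystem's UUID
    blkid -o value -s UUID /dev/md0

    cp /boot/grub2/grub.cfg /boot/grub2/grub.cfg.bak
    sed -i -e 's/05a2f6e13830b3a102f7636b98d651f3/NEWMDUUID/g' \
           -e 's/91aed8b1-c145-4673-8ece-119e19f7038f/NEWFSUUID/g' \
           -e 's;\(linux\(16\)\?[[:space:]]\+\)/vmlinuz;\1/boot/vmlinuz;' \
           -e 's;\(initrd\(16\)\?[[:space:]]\+\)/initramfs;\1/boot/initramfs;' \
           /boot/grub2/grub.cfg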

To test that this is working, reboot your machine and go into the Grub menu during boot. When you look at the text for a specific boot entry, you should see that it's using 'linux /boot/...' and so on; the presence of the /boot bit is your proof that Grub is actually reading your updated /boot/grub2/grub.cfg instead of the old, un-updated grub.cfg in /boot2.

(As a final test you can erase everything in /boot2 and then fill it up with data from /dev/zero or /dev/urandom to overwrite all of the disk blocks. If the system still boots happily, you are definitely not using anything from the now thoroughly destroyed /boot2. I feel that going this far is only appropriate in a test environment; I did it with my Fedora 25 test virtual machine, but not on my office workstation.)

At some point in the future when you decommission the /boot2 filesystem, you will want to edit the kernel command lines in /boot/grub2/grub.cfg to take out the 'rd.md.uuid=<...>' argument that is telling dracut to try to bring up that software RAID array.

(You might wonder why we didn't have to add or change a rd.md.uuid kernel command line argument for the /boot migration. The answer is that since /boot is now on the root filesystem, the kernel's command line already has a rd.md.uuid argument to bring it up on boot. Or at least it should.)

Sidebar: Why I haven't done this to my home machine yet

One version of why is that rebooting my home machine usually feels like more of a pain than rebooting my office machine, because I'm usually doing something with my home machine that I don't want to interrupt. So I drag my heels on anything and everything that can call for a reboot, such as Fedora kernel upgrades.

A more complete version is that not only do I have to reboot my machine, but I should really also open it up to take out my two old 500 GB system HDs, which won't really be in use after this shuffle (on my home machine, I put my small swap partition on my SSDs). And the reason I want to do this is that right now my pair of SSDs aren't actually mounted in drive bays, because I don't have enough. Instead they're just resting loose in a 5.25" drive bay, and the problem with that is the 5.25" drive bays get clearly less air circulation than the 3.5" drive bays.

I should really just do all of this shuffling, but I may have recently mentioned that I'm lazy. I really do want to do it soon, though. Hopefully having written this will motivate me to do it tomorrow.

MigratingBootFSIntoRootFS written at 03:06:09; Add Comment

2017-04-14

Planning out a disk shuffle for my office workstation

As I mentioned yesterday, my office workstation lost one of its remaining HDs the other day. This HD is one side of a pair of 1 TB drives that's used for various mirrored partitions, so I haven't lost anything (unless the other side also fails before Monday, so let's hope not), but now I need to replace it with one of the ready spares I have sitting around for exactly this eventuality.

The obvious way to replace the failed disk is to do literally that: put in the new disk, partition it up to basically match the existing setup, and attach appropriate partitions to software RAID devices and my ZFS pool. That would certainly work, but it's not necessarily the right answer because my office workstation's current disk setup has mutated over time and the actual current setup of much of it is not what I would build from scratch.

Specifically, right now I have five drives with the following odd split up:

  • two paired SSDs, used for / and a SSD ZFS pool where my home directory and other important things live. (That my home directory and so on is on the SSDs is one reason I am not too worried about the current disk failure.)

  • two paired 1 TB HDs that used to be my only two drives. Because of that they still have partitions for my old HD-based root filesystem (in two copies), the /boot partition I just recently migrated into the SSD /, and finally my HD-based ZFS pool, which used to hold everything but now mostly holds less important data. Oh, and it turns out they also have my swap partition.

  • One 500 GB HD that I used as overflow for unimportant virtual machines, back in the days when I thought I needed to conserve space on the mirrored 1 TB HDs (in fact it may date from the days when these were actually 750 GB HDs).

My replacements for the 1 TB drives are 1.5 TB, so this already gives me a space expansion, and there's now any number of things on those two 1 TB drives I don't need any more, and also one thing I would like to have. So my current plan is to replace both 1 TB drives (the dead and the still alive one) and set up the space on the new pair of 1.5 TB drives as follows:

  • a small 1 GB swap partition, because it still seems like a good idea to give the Linux kernel some swap space to make it happy.
  • a 200 GB or so ext4 partition, which will be used for an assortment of things that aren't important enough to go in the SSD / but that I don't want to have in ZFS, partly because they're things I may need to get back access to my ZFS pools if package upgrades go wrong.

  • a 100 GB backup / partition. As is my current habit, before major events like Fedora upgrades I'll copy the SSD / to here so that I can kludge together a reversion to a good environment if something goes badly wrong.

  • all the remaining space goes to my HD-based ZFS pool, which should expand the space a decent amount.

(All partitions and ZFS and so on will be mirrored between the two drives.)
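For concreteness, creating that layout from scratch would look roughly like the following, with made-up device names and MD numbers (in practice I may well just attach the new partitions to my existing arrays and pool so that their names and UUIDs carry over):

    # per drive: ~1 GB, ~200 GB, 100 GB, and the rest
    mdadm --create /dev/md10 --level=1 --raid-devices=2 /dev/sdc1 /dev/sdd1
    mkswap /dev/md10
    mdadm --create /dev/md11 --level=1 --raid-devices=2 /dev/sdc2 /dev/sdd2
    mkfs.ext4 /dev/md11
    mdadm --create /dev/md12 --level=1 --raid-devices=2 /dev/sdc3 /dev/sdd3
    # the remaining space becomes the mirrored HD-based ZFS pool
    zpool create hdpool mirror /dev/sdc4 /dev/sdd4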

Given the likely space expansion in my HD-based ZFS pool, I think I'll also get rid of that overflow 500 GB HD by folding its 120 GB or so of actual space usage into the main HD-based ZFS pool. It was always sort of a hack and a bit awkward (on top of having no redundancy). Plus this will simplify everything, and I can always put the drive (or a bigger replacement drive) back in and redo this if I turn out to actually need a bunch more space for virtual machines.

In a way, I'm grateful that my 1 TB HD finally died and also that it happened under the circumstances it did, where I couldn't immediately rush into replacing it in the most obvious way possible and instead had some enforced time to sit back and think about whether the obvious path was the right one. I'm probably going to wind up with a nicer, more sensibly set up system as a result of this disk failure, and I probably never would have done this rearrangement without being pushed.

OfficeWorkstationDiskShuffle written at 22:53:11; Add Comment

2017-03-31

What I know about process virtual size versus RSS on Linux

Up until very recently, I would have confidently told you that a Linux process's 'virtual size' was always at least as large as its resident set size. After all, how could it be otherwise? Your 'virtual size' was the total amount of mapped address space you had, the resident set size was how many pages you had in memory, and you could hardly have pages in memory without having them as part of your mapped address space. As Julia Evans has discovered, this is apparently not the case; in top terminology, it's possible to have processes with RES (ie RSS) and SHR that are larger than VIRT. So here is what I know about this.

To start with, top extracts this information from /proc/PID/statm, and this information is the same as what you can find as VmSize and VmRSS in /proc/PID/status. Top doesn't manipulate or postprocess these numbers (apart from converting them all from pages to Kb or other size units), so what you see it display is a faithful reproduction of what the kernel is actually reporting.

However, these two groups of numbers are maintained by different subsystems in the kernel's memory management system; there is nothing that directly ties them together or forces them to always be in sync. VmSize, VmPeak, VmData, and several other numbers come from per-mm_struct counters such as mm->total_vm; per Rick Branson these numbers are mostly maintained through vm_stat_account in mm/mmap.c. These numbers change when you make system calls like mmap() and mremap() (or when the kernel does similar things internally). Meanwhile, VmRSS, VmSwap, top's SHR, and RssAnon, RssFile, and RssShmem all come from page tracking, which mostly involves calling things like inc_mm_counter and add_mm_counter in places like mm/memory.c; these numbers change when pages are materialized and de-materialized in various ways.

(You can see where all of the memory stats in status come from in task_mem in fs/proc/task_mmu.c.)

I don't have anywhere near enough knowledge about the Linux kernel memory system to know if there's any way for a process to acquire a page through a path where it isn't accounted for in VmSize. One would think not, but clearly something funny is going on. On the other hand, this doesn't appear to be a common thing, because I wrote a simple brute-force checker script that compared every process's VmSize to its VmRSS, and I couldn't find any such odd process on any of our systems (a mixture of Ubuntu 12.04, 14.04, and 16.04, Fedora 25, and CentOS 6 and 7). It's quite possible that this requires a very unusual setup; Julia Evans' case is (or was) an active Chrome process and Chrome is known to play all sorts of weird games with its collection of processes that very few other programs do.
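A brute-force check along these lines only takes a few lines of shell; this sketch captures the idea:

    #!/bin/sh
    # report any process whose VmRSS is larger than its VmSize
    for f in /proc/[0-9]*/status; do
        awk '/^VmSize:/ {size = $2} /^VmRSS:/ {rss = $2}
             END {if (rss > size) print FILENAME ": VmSize " size " kB, VmRSS " rss " kB"}' "$f" 2>/dev/null
    done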

(If you find such a case it would be quite interesting to collect /proc/PID/smaps, which might show which specific mappings are doing this.)

PS: The one area of this that makes me wonder is how RSS is tracked over fork(), because there seem to be at least some oddities there. Or perhaps the child does not get PTEs and thus RSS for the mappings it shares with the parent until it touches them in some way.

VirtualSizeVersusRSS written at 02:01:19; Add Comment
