2019-12-20
My new Linux office workstation disk partitioning for the end of 2019
I've just had the rare opportunity to replace all of my office machine's disks at once, without having to carry over any of the previous generation the way I've usually had to. As part of replacing everything I got the chance to redo the partitioning and setup of all of my disks, again all at once without the need to integrate a mix of the future and the past. For various reasons, I want to write down the partitioning and filesystem setup I decided on.
My office machine's new set of disks is a pair of 500 GB NVMe drives and a pair of 2 TB SATA SSDs. I'm using GPT partitioning on all four drives for various reasons. All four drives start with my standard two little partitions, a 256 MB EFI System Partition (ESP, gdisk code EF00) and a 1 MB BIOS boot partition (gdisk code EF02). I don't currently use either of them (my past attempt to switch from MBR booting to UEFI was a failure), but they're cheap insurance for the future. Similarly, putting these partitions on all four drives instead of just my 'system' drives is more cheap insurance.
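(As an illustration, creating these two leading partitions with sgdisk might look something like the following; the device name is just an example and the rest of each drive's partitioning differs, so this isn't my actual command history.)

sgdisk -n 1:0:+256M -t 1:EF00 -c 1:"EFI system partition" /dev/nvme0n1
sgdisk -n 2:0:+1M -t 2:EF02 -c 2:"BIOS boot partition" /dev/nvme0n1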
(Writing this down has made me realize that I didn't format the ESPs. Although I don't use UEFI for booting, I have in the past put updated BIOS firmware images there in order to update the BIOS.)
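(If I ever do want to use the new ESPs for UEFI booting or firmware updates, formatting one is just making the usual FAT32 filesystem on it; the partition name here is illustrative.)

mkfs.vfat -F 32 /dev/nvme0n1p1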
The two NVMe drives are my 'system' drives. They have three additional partitions: a 70 GB partition used for a Linux software RAID mirror of the root filesystem (including /usr and /var, since I put all of the system into one filesystem), a 1 GB partition that is a Linux software RAID mirrored swap partition, and the remaining 394.5 GB as a mirrored ZFS pool that holds filesystems that I want to be as fast as possible and that I can be confident won't grow to be too large. Right now that's my home directory filesystem and the filesystem that holds source code (where I build Firefox, Go, and ZFS on Linux, for example).
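(If you were creating this layout from scratch, it might look roughly like the sketch below. The device, array, and pool names are all illustrative, and as I describe later I actually migrated existing arrays and pools onto these partitions rather than making new ones.)

# root filesystem mirror, version 1.0 superblock (see below on formats)
mdadm --create /dev/md0 --level=1 --raid-devices=2 --metadata=1.0 /dev/nvme0n1p3 /dev/nvme1n1p3
mkfs.ext4 /dev/md0
# mirrored swap, version 1.2 superblock
mdadm --create /dev/md1 --level=1 --raid-devices=2 --metadata=1.2 /dev/nvme0n1p4 /dev/nvme1n1p4
mkswap /dev/md1
# the fast mirrored ZFS pool on the remaining space
# (my real pools use /dev/disk/by-id names, as covered below)
zpool create fastpool mirror /dev/nvme0n1p5 /dev/nvme1n1p5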
The two SATA SSDs are my 'data' drives, holding various larger but less important things. They have two 70 GB partitions that are Linux software RAID mirrors and the remaining space is in a single partition for another mirrored ZFS pool. One of the two 70 GB partitions is so that I can make backup copies of my root filesystem before upgrading Fedora (if I bother to do so); the other is essentially an 'overflow' filesystem for some data that I want on an ext4 filesystem instead of in a ZFS pool (including a backup copy of all recent versions of ZFS on Linux that I've installed on my machine, so that if I update and the very latest version has a problem, I can immediately reinstall a previous one). The ZFS pool on the SSDs contains larger and generally less important things like my VMWare virtual machine images and the ISOs I use to install them, and archived data.
Both ZFS pools are set up following my historical ZFS on Linux practice, where they use the /dev/disk/by-id names for my disks instead of the sdX and nvme... names. Both pools are actually relatively old; I didn't create new pools for this and migrate my data, but instead just attached new mirrors to the old pools and then detached the old drives (more or less). The root filesystem was similarly migrated from my old SSDs by attaching and removing software RAID mirrors; the other Linux software RAID filesystems are newly made and copied through ext4 dump and restore (and the new software RAID arrays were added to /etc/mdadm.conf more or less by hand).
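(In outline, the ZFS side of the migration and the dump and restore copies went roughly like this; the pool name, device names, and paths here are stand-ins, since the real commands used the long /dev/disk/by-id names.)

# attach a new mirror device to the existing pool, wait for the
# resilver to finish, then drop the old device
zpool attach fastpool /dev/old-disk-part5 /dev/new-disk-part5
zpool detach fastpool /dev/old-disk-part5
# copy an existing ext4 filesystem into a freshly made one
dump -0 -f - /some/old/filesystem | (cd /some/new/filesystem && restore -r -f -)
# then record the new arrays, before hand-editing the result
mdadm --detail --scan >> /etc/mdadm.conf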
(Since I just looked it up, the ZFS pool on the SATA SSDs was created in August of 2014, originally on HDs, and the pool on the NVMe drives was created in January of 2016, originally on my first pair of (smaller) SSDs.)
Following my old guide to RAID superblock formats, I continued to use the version 1.0 format for everything except the new swap partition, where I used the version 1.2 format. By this point using 1.0 is probably superstition; if I have serious problems (for example), I'm likely to just boot from a Fedora USB live image instead of trying anything more complicated.
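(If you want to check what superblock format an existing array is using, mdadm will tell you:)

# the 'Version :' line reports the superblock format
mdadm --detail /dev/md0 | grep Version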
All of this feels very straightforward and predictable by now. I've moved away from complex partitioning schemes over time and almost all of the complexity left is simply that I have two different sets of disks with different characteristics, and I want some filesystems to be fast more than others. I would like all of my filesystems to be on NVMe drives, but I'm not likely to have NVMe drives that big for years to come.
(The most tangled bit is the 70 GB software RAID array reserved for a backup copy of my root filesystem during major upgrades, but in practice it's been quite a while since I bothered to use it. Still, having it available is cheap insurance in case I decide I want to do that someday during an especially risky Fedora upgrade.)
Splitting a mirrored ZFS pool in ZFS on Linux
Suppose, not hypothetically, that you're replacing a pair of old disks with a pair of new disks in a ZFS pool that uses mirrors. If you're a cautious person and you worry about issues like infant mortality in your new drives, you don't necessarily want to immediately switch from the old disks to the new ones; you want to run them in parallel for at least a bit of time. ZFS makes this very easy, since it supports up to four-way mirrors and you can just attach devices to add extra mirrors (and then detach devices later). Eventually it will come time to stop using the old disks, and at this point you have a choice of what to do.
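(Concretely, with the pool and device names I'll use in the example below, going from a two-way mirror to a four-way one is just a couple of attach operations:)

zpool attach maindata oldA newC
zpool attach maindata oldA newD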
The straightforward thing is to drop the old disks out of the ZFS mirror vdev with 'zpool detach', which cleanly removes them (and they won't come back later, unlike with Linux software RAID).
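(With the same names, that is just:)

zpool detach maindata oldA
zpool detach maindata oldB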
However, this is a little bit wasteful, in a sense. Those old disks have a perfectly good backup copy of your ZFS pool on them, but when you detach them you lose any real possibility of using that copy. Perhaps you would like to keep that data as an actual backup copy, just in case. Modern versions of ZFS can do this through splitting the pool with 'zpool split'.
To quote the manpage here:
Splits devices off pool creating newpool. All vdevs in pool must be mirrors and the pool must not be in the process of resilvering. At the time of the split, newpool will be a replica of pool. [...]
In theory the manpage's description suggests that you can split a four-way mirror vdev in half, pulling off two devices at once in a 'zpool split' operation. In practice it appears that the current 0.8.x version of ZFS on Linux can only split off a single device from each mirror vdev. This meant that I needed to split my pool in a multi-step operation.
Let's start with a pool, maindata, with four disks in a single mirrored vdev: oldA, oldB, newC, and newD. We want to split maindata so that there is a new pool with oldA and oldB.
First, we split one old device out of the pool:
zpool split -R /mnt maindata maindata-hds oldA
Normally the newly split off pool is not imported (as far as I know), and certainly you don't want it imported if your filesystems have explicit 'mountpoint' settings (because then filesystems from the original and the split-off pool will fight over who gets to be mounted there). However, you can't add devices to exported pools and we need to add oldB, so we have to import the new pool in an altroot. I use /mnt here out of tradition but you can use any convenient empty directory.
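(If you did wind up with the new pool not imported, importing it by hand under an altroot is just:)

zpool import -R /mnt maindata-hds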
With the pool split off, we need to detach oldB from the regular pool and attach it to oldA in the new pool to make the new pool actually be mirrored:
zpool detach maindata oldB
zpool attach maindata-hds oldA oldB
This will then resilver the new maindata-hds pool onto oldB (even though oldB has an almost exact copy already). Once the resilver is done, you can export the pool:
zpool export maindata-hds
You now have your mirrored backup copy sitting around with relatively little work on your part.
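(When you eventually want to look at the backup copy, you can import it under an altroot again, probably read-only:)

zpool import -o readonly=on -R /mnt maindata-hds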
All of this appears to have worked completely fine for me. I scrubbed my maindata pool before splitting it, just in case, but I don't think I bothered to scrub the new maindata-hds pool after the resilver. It's only an emergency backup pool anyway (and it gets less and less useful over time, since there are more divergences between it and the live pool).
PS: I don't know if you can make snapshots, split a pool, and then do incremental ZFS sends from filesystems in one copy of the pool to the other to keep your backup copy more or less up to date. I wouldn't be surprised if it worked, but I also wouldn't be surprised if it didn't.
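(If it does work, I'd guess the incantation would look roughly like the following sketch, with made-up snapshot names; I haven't actually tried it.)

# taken before the split, so both pools wind up with this snapshot
zfs snapshot -r maindata@base
# ... split the pool and re-mirror it as described above ...
# later, to catch the backup pool up with recent changes:
zfs snapshot -r maindata@sync1
zpool import -R /mnt maindata-hds
zfs send -R -i @base maindata@sync1 | zfs receive -d -u -F maindata-hds
zpool export maindata-hds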