Wandering Thoughts archives

2014-12-29

How I have partitioning et al set up for ZFS On Linux

This is a deeper look into how I have my office workstation configured with ZFS On Linux for all of my user data, because I figure that this may be of interest to people.

My office workstation's primary disks are a pair of 1 TB SATA drives. Each drive is partitioned identically into five partitions. The first four of those partitions (well, pairs of those partitions) are used for software RAID mirrors for swap, /, /boot, and a backup copy of / that I use when I do yum upgrades from one version of Fedora to another. If I was redoing this partitioning today I would not use a separate /boot partition, but this partitioning predates my enlightenment on that.

(Actually, because I'm using GPT partitioning there are a few more partitions sitting around for UEFI stuff; I have extra 'EFI System' and 'BIOS boot partition' partitions. I've ignored them for as long as this system has been set up.)

Altogether these partitions use up about 170 GB of the disks (mostly in the two root filesystem partitions). The rest of the disk is in the final large partition, and this partition (on both disks) is what ZFS uses for the maindata pool that holds all my filesystems. The pool is of course set up with a single mirror vdev that uses both partitions. Following more or less the ZoL recommendations that I found, I set it up using the /dev/disk/by-id/ 'wwn-....-part7' names for the two partitions in question (and I set it up with an explicit 'ashift=12' option as future-proofing, although these disks are not 4K sector disks themselves).
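
(For concreteness, the pool creation looked something like the following. The 'maindata' name, the mirror vdev, the ashift, and the -part7 suffix are all as described above, but the wwn-* disk IDs here are made-up placeholders:)

    # the wwn values here are placeholders, not my real disks
    zpool create -o ashift=12 maindata mirror \
        /dev/disk/by-id/wwn-0x50014ee2b0c31111-part7 \
        /dev/disk/by-id/wwn-0x50014ee2b0c32222-part7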

I later added an SSD as an L2ARC because we had a spare SSD lying around and I had the spare chassis space. Because I had nothing else on the SSD at all, I added it with the bare /dev/disk/by-id wwn-* name and let ZoL partition the disk itself (and I didn't attempt to force an ashift for the L2ARC). As I believe is standard ZoL behavior, ZoL partitioned it as a GPT disk with an 8 MB spacer partition at the end, and it set the GPT partition type to 'zfs' (BF01).
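
(Again roughly, and again with a made-up wwn, adding the cache device was something like:)

    # placeholder wwn for the SSD
    zpool add maindata cache /dev/disk/by-id/wwn-0x50026b723c0d3333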

(ZoL doesn't seem to require specific GPT partition types if you give it explicit partitions; my maindata partitions are still labeled as 'Linux LVM'.)

Basically all of the filesystems from my maindata pool are set up with explicit mountpoint= settings that put them where the past LVM versions went; for most of them this is various names in / (eg /homes, /data, /vmware, and /archive). I have ZoL set up to mount these in the normal ZFS way, ie as the ZFS pools and services are brought up (instead of attempting to do something through /etc/fstab). I also have a collection of bind mounts that materialize bits of these filesystems in other places, mostly because I'm a bit crazy. Since all of the mount points and bind targets are in the root filesystem, I don't have to worry about mount order dependencies; if the system is up enough to be bringing up ZFS, the root filesystem is there to be mounted on.
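
(A sketch of what I mean; the dataset name and the bind mount paths here are plausible but made up, and the bind mounts are actually arranged at boot rather than run by hand:)

    # (hypothetical dataset) put a dataset at a fixed spot in /, mounted by ZFS itself
    zfs set mountpoint=/homes maindata/homes
    # (hypothetical paths) re-expose a bit of it somewhere else with a bind mount
    mount --bind /homes/cks /u/cks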

Sidebar: on using wwn-* names here

I normally prefer physical location based names like /dev/sda and so on. However, the ZoL people recommend using stable /dev/disk/by-* names in general, and they prefer the by-id names that are tied to the physical disk instead of to what's plugged in where. When I was setting up ZFS I decided that this was okay by me because, after all, the ZFS pool itself is tied to these specific disks unless I go crazy and do something like dd the ZoL data from one disk to another. Really, it's no different from Linux's software RAID automatically finding its component disks regardless of what they're called today, and I'm perfectly fine with that.

(And tying ZoL to specific disks will save me if someday I'm shuffling SATA cables around and what was /dev/sdb accidentally winds up as /dev/sdd.)

Of course I'd be happier if ZoL would just go look at the disks and find the ZFS metadata and assemble things automatically the way that Linux software RAID does. But ZoL has apparently kept most of the irritating Solaris bits around zpool.cache and how the system does relatively crazy things while bringing pools up on boot. I can't really blame them for not wanting to rewrite all of that code and make things more or less gratuitously different from the Illumos ZFS codebase.

ZFSOnLinuxDiskSetup written at 01:45:32

2014-12-27

My ZFS On Linux memory problem: competition from the page cache

When I moved my data to ZFS on Linux on my office workstation, I didn't move the entire system to ZoL for various reasons. My old setup had my data on ext4 in LVM on a software RAID mirror, with the root filesystem and swap in separate software RAID mirrors outside of LVM. When I moved to ZoL, I converted the ext4 in LVM on MD portion to a ZFS pool with the various data filesystems (my home directory, virtual machine images, and so on), but I left the root filesystem (and swap) alone. The net effect is that I was left with a relatively small ext4 root filesystem and a relatively large ZFS pool that had all of my real data.

ZFS on Linux does its best to integrate into the Linux kernel memory management system, and these days that seems to be pretty good. But it doesn't take part in the kernel's generic filesystem page cache; instead it has its own system for this, called the ARC. In effect, on a system with both conventional filesystems (such as my ext4 root filesystem) and ZFS filesystems, you have two page caches: one used by your conventional filesystems and a separate one used by your ZFS filesystems. What I found on my machine was that the overall system was bad at balancing memory usage between these two. In particular, the ZFS ARC didn't seem to compete strongly enough with the kernel page cache.

If everything had been going well, what I'd have expected was relatively little kernel page cache and a relatively large ARC, because ext4 (the page cache user) held much less of my actively used data than ZFS did. Certainly I ran some things from the root filesystem (such as compilers), but I rather thought not all that much compared to what I was doing from my ZFS filesystems. In real life, things seemed to go the other way; I would wind up with a relatively large page cache and a quite tiny ARC that was caching relatively little data. As far as I could tell, over time ext4 was simply out-competing ZFS for memory despite having much less actual filesystem data.

I assume that this is due to the page cache and the ZFS ARC being separate from each other, so that there just isn't any way of having some sort of global view of disk buffer usage that would let the ZFS ARC push directly on the page cache (and vice versa if necessary). As far as I know there's no way to limit page cache usage or push more aggressively only on the page cache in a way that won't hit ZFS at least as hard. So a mixed system with ZFS and something else just gets to live with this.

(The crude fix is to periodically empty the page cache with 'echo 1 >/proc/sys/vm/drop_caches'. Note that you don't want to use '2' or '3' here, because that will also drop ZFS ARC caches and other ZFS memory.)
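
(If you want to watch the tug of war yourself, the numbers are visible in /proc; this assumes the standard ZoL arcstats interface is present:)

    # how much memory the kernel page cache is using
    grep '^Cached:' /proc/meminfo
    # the ARC's current size ('size') and its target size ('c')
    grep -wE '^(size|c)' /proc/spl/kstat/zfs/arcstats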

The build-up of page cache is not immediate when the system is in use. Instead it seems to come over time, possibly as things like system backups run and temporarily pull large amounts of the root filesystem into cache. I believe it generally took a day or two for page cache usage to grow and start strangling the ARC after I explicitly dropped the caches, and in the meantime I could do relatively root filesystem intensive things like compiling Firefox from source without making the page cache hulk up.

(The Firefox source code itself was in a ZFS filesystem, but the C++ compiler and system headers and system libraries and so on are all in the root filesystem.)

Moving from 16 GB of RAM to 32 GB of RAM hasn't eliminated this problem for me as such, but what it did do was allow the ZFS ARC to use enough memory despite this that it's reasonably effective anyways. With 16 GB I could see ext4 using 4 GB of page cache or more while the ZFS ARC was squeezed to 4 GB or less and suffering for it. With 32 GB the page cache may wind up at 10 GB, but this still leaves room for the ZFS ARC itself to be 9 GB with a reasonable amount of data cached.

(And 32 GB is more useful than 16 GB anyways these days, what with OSes in virtual machines that want 2 GB or 4 GB of RAM to run reasonably happily and so on. And I'm crazy enough to spin up multiple virtual machines at once for testing certain sorts of environments.)

Presumably this is completely solved if you have no non-ZFS filesystems on the machine, but that requires a ZFS root filesystem and that's currently far too alarming for me to even consider.

ZFSOnLinuxPageCacheProblem written at 02:25:07

2014-12-26

My experience with ZFS on Linux: it's worked pretty well for me

A while back I wrote an entry about the temptation to switch my office workstation to having my data in ZFS on Linux. Not that long after I wrote the entry, in early August according to 'zpool history', I gave in to the temptation and did just that. I've been running with everything except the root filesystem and swap in a ZoL pool ever since, and the short summary is that everything has worked smoothly.

(Because I'm crazy, I did some wrestling with systemd and bind mounts instead of just using symlinks. That was about all of the hassles required to get things going.)

I'm running the development version of ZFS, hot from the git repository and generally the latest version, mostly because I wanted all of the bugfixes and improvements since the official release. At this point, I've repeatedly upgraded kernels (including over several significant kernel bumps) without any problems with the ZFS modules being able to rebuild themselves under DKMS. There was one unfortunate bit where the ZFS modules got broken by a DKMS update because they'd been doing something that was actually wrong, but that would have been much less alarming (ie, less of a surprise) if I'd actually paid attention to various messages and failures during my 'yum upgrade' run.

(The issue was also promptly fixed by the ZFS on Linux people, and I fixed it by reverting to the previous version of the Fedora DKMS package.)

My moderately big worry beforehand was memory issues, but I didn't really have any problems. I ran into one interesting issue with kernel memory usage that I'm going to write up as another entry, but it wasn't a problem as such; the short summary is that ZFS was using too little memory instead of too much. However my system did get happier with life when I increased its memory from an already relatively large 16 GB all the way up to 32 GB, and I don't know how ZFS on Linux would do on a machine with only, say, 4 GB of RAM.

The major benefit of using ZoL instead of my previous approach of ext4 over LVM over MD has been the great reassurance value of ZFS checksums and 'zpool scrub'. I took advantage of ZFS's support for L2ARC to safely add a caching SSD to the pool (which is not currently possible with Fedora's version of LVM), but I don't know if it's doing me much good. And of course I like the flexible space usage that ZFS gives me, as I no longer have to preallocate space to each filesystem and then watch them all to expand them as needed. I haven't attempted any sort of performance measurements, so all I can say is that my machine doesn't feel noticeably slower or faster than before.

(I've taken advantage of ZFS snapshots a couple of times in little ways, but so far that's basically just been playing around with them on Linux.)

Thus, on the whole I would say that if you're tempted by the idea of ZoL, you should try it out, and you should definitely try out the latest development version. I watch the ZoL git changelogs carefully and the changes that get committed are almost entirely fixes for bugs, with periodic improvements and things from other ZFS implementations. Building RPMs for a DKMS install from the source tree is just as simple as the ZoL site describes it.
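
(From memory and the ZoL site's directions of the time, the build is roughly the following, done first in the spl source tree and then in the zfs one; treat this as a sketch rather than the exact incantation:)

    ./autogen.sh
    ./configure
    make rpm-utils rpm-dkms
    # then install the resulting RPMs, eg with 'yum localinstall *.rpm'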

(I was hoping to be able to report that my ZoL setup had also survived a Fedora 20 to Fedora 21 upgrade without problems, but I haven't been able to try that for other reasons. I honestly don't expect any problems because the F20 and F21 kernel versions are just about the same. I did do a test upgrade on a virtual machine with the ZoL DKMS packages installed and a ZFS pool configured, and had no problems with it (apart from the systemd crash, which happens with or without ZoL).)

After this positive experience (and after reading about various btrfs things), I expect that my next home machine build will also use ZFS on Linux. ZFS checksums and pool scrubs are enough of a reason to do it all on their own, never mind the other advantages. But that's probably several years in the future at this point, as my current home machine isn't that old by my standards.

ZFSOnLinuxExperience written at 02:20:44

2014-12-12

The bad side of systemd: two recent systemd failures

In the past I've written a number of favorable entries about systemd. In the interests of balance, among other things, I now feel that I should rake it over the coals for today's bad experiences that I ran into in the course of trying to do a yum upgrade of one system from Fedora 20 to Fedora 21, which did not go well.

The first and worst failure is that I've consistently had systemd's master process (ie, PID 1, the true init) segfault during the upgrade process on this particular machine. I can say it's a consistent thing because this is a virtual machine and I snapshotted the disk image before starting the upgrade; I've rolled it back and retried the upgrade with variations several times and it's always segfaulted. This issue is apparently Fedora bug #1167044 (and I know of at least one other person it's happened to). Needless to say this has put somewhat of a cramp in my plans to upgrade my office and home machines to Fedora 21.

(Note that this is a real segfault and not an assertion failure. In fact this looks like a fairly bad code bug somewhere, with some form of memory scrambling involved.)

The slightly good news is that PID 1 segfaulting does not reboot the machine on the spot. I'm not sure if PID 1 is completely stopped afterwards or if it's just badly damaged, but the bad news is that a remarkably large number of things stop working after this happens. Everything trying to talk to systemd fails and usually times out after a long wait, for example attempts to do 'systemctl daemon-reload' from postinstall scripts. Attempts to log in or to su to root from an existing login either fail or hang. A plain reboot will try to talk to systemd and thus fails, although you can force a reboot in various ways (including 'reboot -f').

The merely bad experience is that as a result of this I had occasion to use journalctl (I normally don't). More specifically, I had occasion to use 'journalctl -l', because of course if you're going to make a bug report you want to give full messages. Unfortunately, 'journalctl -l' does not actually show you the full message. Not if you just run it by itself. Oh, the full message is available, all right, but journalctl specifically and deliberately invokes the pager in a mode where you have to scroll sideways to see long lines. Under no circumstance is all of a long line visible on screen at once so that you may, for example, copy it into a bug report.

This is not a useful decision. In fact it is a screamingly frustrating decision, one that is about the complete reverse of what I think most people would expect -l to do. In the grand systemd tradition, there is no option to control this; all you can do is force journalctl to not use a pager or work out how to change things inside the pager to not do this.

(Oh, and journalctl goes out of its way to set up this behavior. Not by passing command line arguments to less, because that would be too obvious (you might spot it in a ps listing, for example); instead it mangles $LESS to effectively add the '-S' option, among other things.)
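
(Concretely, forcing the pager out of the picture looks something like this:)

    # skip the pager entirely and take the last chunk of output
    journalctl -l --no-pager | tail -n 50
    # or substitute a pager that doesn't chop lines
    SYSTEMD_PAGER=cat journalctl -l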

While I'm here, let me mention that journalctl's default behavior of 'show all messages since the beginning of time in forward chronological order' is about the most useless default I can imagine. Doing it is robot logic, not human logic. Unfortunately the systemd journal is unlikely to change its course in any significant way so I expect we'll get to live with this for years.
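(You can at least beat the defaults into something more useful with options; as far as I know all of these exist in the journalctl of this era:)

    # just the current boot, newest entries first
    journalctl -b -r
    # or jump straight to the end of the journal
    journalctl -e
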

(I suppose what I need to do next is find out where abrt et al put core dumps from root processes so that I can run gdb on my systemd core to poke around. Oh wait, I think it's in the systemd journal now. This is my unhappy face, especially since I am having to deal with a crash in systemd itself.)
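
(If the core really is in the journal, the tool to fish it back out should be coredumpctl, or systemd-coredumpctl on slightly older systemd versions; I haven't verified this against this particular crash, so consider it a guess:)

    coredumpctl list
    # load the most recent core for the systemd binary straight into gdb
    coredumpctl gdb /usr/lib/systemd/systemd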

SystemdCrashAndMore written at 01:51:53

2014-12-10

What good kernel messages should be about and be like

Linux is unfortunately a haven of terrible kernel messages and terrible kernel message handling, as I have brought up before. In a spirit of shouting at the sea, today I feel like writing down my principles of good kernel messages.

The first and most important rule of kernel messages is that any kernel message that is emitted by default should be aimed at system administrators, not kernel developers. There are very few kernel developers and they do not look at very many systems, so it's pretty much guaranteed that most kernel messages are read by sysadmins. If a kernel message is for developers, it's useless for almost everyone reading it (and potentially confusing). Ergo it should not be generated by default settings; developers who need it for debugging can turn it on in various ways (including kernel command line parameters). This core rule guides basically all of the rest of my rules.

The direct consequence of this is that all messages should be clear, without in-jokes or cleverness that is only really comprehensible to kernel developers (or sometimes only to a particular subsystem's developers). In other words, no yama-style messages. If sysadmins looking at your message have no idea what it might refer to, no lead on what kernel subsystem it came from, and no clue where to look for further information, your message is bad.

Comprehensible messages are only half of the issue, though; the other half is only emitting useful messages. To be useful, my view is that a kernel message should be one of two things: it should either be what they call actionable or it should be necessary in order to reconstruct system state (one example is hardware appearing or disappearing, another is log messages that explain why memory allocations failed). An actionable message should cause sysadmins to do something and really it should mean that sysadmins need to do something.

It follows that generally other systems should not be able to cause the kernel to log messages by throwing outside traffic at it (these days that generally means network traffic), because outsiders should not be able to harm your kernel to the degree where you need to do anything about it; and if they can't, such messages are not actionable for the sysadmin of the local machine. And yes, I bang on this particular drum a fair bit; that's because it keeps happening.

Finally, almost all messages should be strongly ratelimited. Unfortunately I've come around to the view that this is essentially impossible to do at a purely textual level (at least with acceptable impact for kernel code), so it needs to be considered everywhere kernel code can generate a message. This very definitely includes things like messages about hardware coming and going, because sooner or later someone is going to have a flaky USB adapter or SATA HD that starts disappearing and then reappearing once or twice a second.

To say this more compactly, everything in your kernel messages should be important to you. Kernel messages should not be a random swamp that you go wading in after problems happen in order to see if you can spot any clues amidst the mud; they should be something that you can watch live to see if there are problems emerging.

GoodKernelMessages written at 22:52:03

2014-12-07

How we install Ubuntu machines here

We have a standard install system for our Ubuntu machines (which are the majority of machines that we build). I wouldn't call it an automated install system (in the past I've used the term 'scripted'), but it is mostly automated with only a relatively modest amount of human intervention. The choice of partially automated installs may seem odd to people, but in our case it meets our needs and is easy to work with.

Our install process runs in three stages. First we have a customized Ubuntu server install image that is set up with a preseed file and a few other things on the CD image. The preseed file pre-selects a basic package set, answers a bunch of the installer questions for static things like time zones, sets a standard initial root password, and drops some files in /root on the installed system, most importantly a postinstall script.

After the system is installed and reboots, we log into it (generally over the network) and run the pre-dropped postinstall script. This grinds through a bunch of standard setup things (including making sure that the machine's time is synchronized to our NTP servers) but its most important job is bootstrapping the system far enough that it can do NFS mounts in our NFS mount authentication environment. Among other things this requires setting up the machine's canonical SSH host keys, which involves a manual authentication step to fetch them and thus demands a human there to type the access password. After getting the system to a state where it can do NFS mounts, it mounts our central administrative filesystem.

The third step is a general postinstall script that lives on this central administrative filesystem. This script asks a few questions about how the machine will be set up and what sort of package set it should have, then grinds through all the work of installing many packages (some of them from non-default repositories), setting up various local configuration files from the master versions, and applying various other customizations. After this process finishes, a standard login or compute server is basically done, and in general the machine is fully enrolled in various automated management systems for things like password propagation and NFS mount management.

(Machines that aren't standard login or compute servers generally then need some additional steps following our documented build procedures.)

In general our approach here has been to standardize and script everything that is either easy to do, very tricky to do by hand, or something we do a lot. We haven't tried to go all the way to almost fully automated installs, partly because it seems too much work for the reward given the modest number of (re)installs we do and partly because there are some steps in this process that intrinsically require human involvement. Our current system works very well; we can spin up standard new systems roughly as fast as the disks can unpack packages and with minimal human involvement, and the whole system is easy to develop and manage.

Also, let me be blunt about one reason I prefer the human in the loop approach here: unfortunately Debian and Ubuntu packages have a really annoying habit of pausing to give you quizzes every so often. These quizzes basically function as land mines if you're trying to do a fully automated install, because you can never be sure if you've pre-answered all of them and if force-ignoring one is going to blow up in your face. Having a human in the loop to say 'augh no I need to pick THIS answer' is a lot safer.

(I can't say that humans are better at picking up on problems if something comes up, because the Ubuntu package install process spews out so much text that it's basically impossible to read. In practice we all tune out while it flows by.)

UbuntuOurInstallSystem written at 01:10:43

