Wandering Thoughts archives

2015-01-05

Today on Linux, ZFS is your only real choice for an advanced filesystem

Yesterday I wrote about what I consider advanced filesystems to be in general, namely filesystems with the minimum feature of checksums so you know when your data has been damaged, and ideally with some ability to use redundancy to repair that damage. As far as I know, today on Linux there are only two filesystems that are advanced in this way: btrfs and ZFS, via ZFS on Linux.

(If you don't care about disk checksums, you have lots of choice among perfectly good filesystems. I would just run ext4 unless you have a good reason to believe that eg XFS is a better choice in your particular environment; it's what I do and what most people do, so ext4 gets a lot of exercise and attention.)

In theory, you might choose either and you might even default to btrfs as the in-kernel solution. In practice, I believe that you only have one real choice today and that choice is ZFS on Linux. This is not because ZFS might be better than btrfs on a technical level (although I believe it is); it is simply because people keep having problems with btrfs (the latest example I was exposed to was this one). Far too many things I read about btrfs wind up saying stuff like 'it's been stable for a few months since the last problem' or 'I had a problem recently but it wasn't too bad' or the like. Btrfs does not appear to be stable yet and it doesn't appear likely to be stable any time soon; everything I wrote in 2013 about why not to consider btrfs yet still applies.

Btrfs will hopefully someday be one of the filesystems of the future. But it is not the filesystem of today unless you feel very daring. If you want an advanced filesystem today on Linux, your only real option is ZFS on Linux.

Now, ZoL is not perfect. People do still report problems with it from time to time, including kernel memory issues, and you will want to test it in your environment to make sure it works okay. But from all the reports I've read there are plenty of people running it in production in various ways (in more demanding circumstances than mine) and it isn't blowing up in their faces.

In short, ZFS on Linux is something that you can reasonably consider today, and in practice things will probably work fine. I think that considering btrfs today is demonstrably relatively crazy.

(I'm aware that Facebook is using btrfs internally to some degree. Facebook also has Chris Mason working for them to find and fix their btrfs problems and likely a team that immediately packages those changes up into custom Facebook kernels. See also.)

ZFSOnLinuxvsBtrfsToday written at 02:25:21

2014-12-29

How I have partitioning et al set up for ZFS On Linux

This is a deeper look into how I have my office workstation configured with ZFS On Linux for all of my user data, because I figure that this may be of interest to people.

My office workstation's primary disks are a pair of 1 TB SATA drives. Each drive is partitioned identically into five partitions. The first four of those partitions (well, pairs of those partitions) are used for software RAID mirrors for swap, /, /boot, and a backup copy of / that I use when I do yum upgrades from one version of Fedora to another. If I was redoing this partitioning today I would not use a separate /boot partition, but this partitioning predates my enlightenment on that.

(Actually, because I'm using GPT partitioning there are a few more partitions sitting around for UEFI stuff; I have extra 'EFI System' and 'BIOS boot partition' partitions. I've ignored them as long as this system has been set up.)

Altogether these partitions use up about 170 GB of the disks (mostly in the two root filesystem partitions). The rest of the disk is in the final large partition, and this partition (on both disks) is what ZFS uses for the maindata pool that holds all my filesystems. The pool is of course set up with a single mirror vdev that uses both partitions. Following more or less the ZoL recommendations that I found, I set it up using the /dev/disk/by-id/ 'wwn-....-part7' names for the two partitions in question (and I set it up with an explicit 'ashift=12' option as future-proofing, although these disks are not 4K disks themselves).
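
For concreteness, creating a pool this way comes down to something like the following (the wwn names here are placeholders, not my actual disks):

zpool create -o ashift=12 maindata \
    mirror /dev/disk/by-id/wwn-0x5000000000000001-part7 \
           /dev/disk/by-id/wwn-0x5000000000000002-part7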

I later added an SSD as an L2ARC because we had a spare SSD lying around and I had the spare chassis space. Because I had nothing else on the SSD at all, I added it with the bare /dev/disk/by-id wwn-* name and let ZoL partition the disk itself (and I didn't attempt to force an ashift for the L2ARC). As I believe is standard ZoL behavior, ZoL partitioned it as a GPT disk with an 8 MB spacer partition at the end. ZoL set the GPT partition type to 'zfs' (BF01).

(ZoL doesn't seem to require specific GPT partition types if you give it explicit partitions; my maindata partitions are still labeled as 'Linux LVM'.)
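
Adding the L2ARC was similarly a one-liner, along these lines (again with a placeholder wwn name):

zpool add maindata cache /dev/disk/by-id/wwn-0x5000000000000003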

Basically all of the filesystems in my maindata pool are set up with explicit mountpoint= settings that put them where the past LVM versions went; for most of them this is various names in / (eg /homes, /data, /vmware, and /archive). I have ZoL set up to mount these in the normal ZFS way, ie as the ZFS pools and services are brought up (instead of attempting to do something through /etc/fstab). I also have a collection of bind mounts that materialize bits of these filesystems in other places, mostly because I'm a bit crazy. Since all of the mount points and bind targets are in the root filesystem, I don't have to worry about mount order dependencies; if the system is up enough to be bringing up ZFS, the root filesystem is there to be mounted on.
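
As an illustration of the general pattern (the dataset names and bind mount targets here are made up, not my real ones), each filesystem gets its place when it's created and the bind mounts are ordinary ones layered on top:

zfs create -o mountpoint=/homes maindata/homes
zfs create -o mountpoint=/vmware maindata/vmware
mount --bind /homes/<me> /u/<me>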

Sidebar: on using wwn-* names here

I normally prefer physical location based names like /dev/sda and so on. However ZoL people recommend using stable /dev/disk/by-* names in general and they prefer the by-id names that are tied to the physical disk instead of what's plugged in where. When I was setting up ZFS I decided that this was okay by me because, after all, the ZFS pool itself is tied to these specific disks unless I go crazy and do something like dd the ZoL data from one disk to another. Really, it's no different from Linux's software RAID automatically finding its component disks regardless of what they're called today and I'm perfectly fine with that.

(And tying ZoL to specific disks will save me if someday I'm shuffling SATA cables around and what was /dev/sdb accidentally winds up as /dev/sdd.)

Of course I'd be happier if ZoL would just go look at the disks and find the ZFS metadata and assemble things automatically the way that Linux software RAID does. But ZoL has apparently kept most of the irritating Solaris bits around zpool.cache and how the system does relatively crazy things while bringing pools up on boot. I can't really blame them for not wanting to rewrite all of that code and make things more or less gratuitously different from the Illumos ZFS codebase.

ZFSOnLinuxDiskSetup written at 01:45:32

2014-12-27

My ZFS On Linux memory problem: competition from the page cache

When I moved my data to ZFS on Linux on my office workstation, I didn't move the entire system to ZoL for various reasons. My old setup had my data on ext4 in LVM on a software RAID mirror, with the root filesystem and swap in separate software RAID mirrors outside of LVM. When I moved to ZoL, I converted the ext4 in LVM on MD portion to a ZFS pool with the various data filesystems (my home directory, virtual machine images, and so on), but I left the root filesystem (and swap) alone. The net effect is that I was left with a relatively small ext4 root filesystem and a relatively large ZFS pool that had all of my real data.

ZFS on Linux does its best to integrate into the Linux kernel memory management system, and these days that seems to be pretty good. But it doesn't take part in the kernel's generic filesystem page cache; instead it has its own system for this, called ARC. In effect, in a system with both conventional filesystems (such as my ext4 root filesystem) and ZFS filesystems, you have two page caches, one used by your conventional filesystems and the separate one used by your ZFS filesystems. What I found on my machine was that the overall system was bad at balancing memory usage between these two. In particular, the ZFS ARC didn't seem to compete strongly enough with the kernel page cache.

If everything was going well, what I'd have expected was for there to be relatively little kernel page cache and a relatively large ARC, because ext4 (the page cache user) held much less of my actively used data than ZFS did. Certainly I ran some things from the root filesystem (such as compilers), but I rather thought not necessarily all that much compared to what I was doing from my ZFS filesystems. In real life, things seemed to go the other way; I would wind up with a relatively large page cache and a quite tiny ARC that was caching relatively little data. As far as I could tell, over time ext4 was simply out-competing ZFS for memory despite having much less actual filesystem data.
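
(A rough way to watch this imbalance is to compare what the ARC says its size is against the kernel's page cache usage:

grep '^size ' /proc/spl/kstat/zfs/arcstats
grep '^Cached:' /proc/meminfo

The first number is in bytes, the second in kB; neither is a precise accounting, but the trend over time is obvious enough.)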

I assume that this is due to the page cache and the ZFS ARC being separate from each other, so that there just isn't any way of having some sort of global view of disk buffer usage that would let the ZFS ARC push directly on the page cache (and vice versa if necessary). As far as I know there's no way to limit page cache usage or push more aggressively only on the page cache in a way that won't hit ZFS at least as hard. So a mixed system with ZFS and something else just gets to live with this.

(The crude fix is to periodically empty the page cache with 'echo 1 >/proc/sys/vm/drop_caches'. Note that you don't want to use '2' or '3' here, because that will also drop ZFS ARC caches and other ZFS memory.)

The build-up of page cache is not immediate when the system is in use. Instead it seems to come over time, possibly as things like system backups run and temporarily pull large amounts of the root filesystem into cache. I believe it generally took a day or two for page cache usage to grow and start strangling the ARC after I explicitly dropped the caches, and in the meantime I could do relatively root filesystem intensive things like compiling Firefox from source without making the page cache hulk up.

(The Firefox source code itself was in a ZFS filesystem, but the C++ compiler and system headers and system libraries and so on are all in the root filesystem.)

Moving from 16 GB of RAM to 32 GB of RAM hasn't eliminated this problem for me as such, but what it did do was allow the ZFS ARC to use enough memory despite this that it's reasonably effective anyways. With 16 GB I could see ext4 using 4 GB of page cache or more while the ZFS ARC was squeezed to 4 GB or less and suffering for it. With 32 GB the page cache may wind up at 10 GB, but this still leaves room for the ZFS ARC itself to be 9 GB with a reasonable amount of data cached.

(And 32 GB is more useful than 16 GB anyways these days, what with OSes in virtual machines that want 2 GB or 4 GB of RAM to run reasonably happily and so on. And I'm crazy enough to spin up multiple virtual machines at once for testing certain sorts of environments.)

Presumably this is completely solved if you have no non-ZFS filesystems on the machine, but that requires a ZFS root filesystem and that's currently far too alarming for me to even consider.

ZFSOnLinuxPageCacheProblem written at 02:25:07

2014-12-26

My experience with ZFS on Linux: it's worked pretty well for me

A while back I wrote an entry about the temptation to switch my office workstation to having my data in ZFS on Linux. Not that long after I wrote the entry, in early August according to 'zpool history', I gave in to the temptation and did just that. I've been running with everything except the root filesystem and swap in a ZoL pool ever since, and the short summary is everything has worked smoothly.

(Because I'm crazy, I did some wrestling with systemd and bind mounts instead of just using symlinks. That was about all of the hassles required to get things going.)

I'm running the development version of ZFS, hot from the git repository and generally the latest version, mostly because I wanted all of the bugfixes and improvements from the official release. At this point, I've repeatedly upgraded kernels (including over several significant kernel bumps) without any problems with the ZFS modules being able to rebuild themselves under DKMS. There was one unfortunate bit where the ZFS modules got broken by a DKMS update because they'd been doing something that was actually wrong, but that would have been much less alarming (ie, less of a surprise) if I'd actually paid attention to various messages and failures during my 'yum upgrade' run.

(The issue was also promptly fixed by the ZFS on Linux people, and I fixed it by reverting to the previous version of the Fedora DKMS package.)
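
Checking that the modules are actually in place after a kernel update is just ordinary DKMS and module poking, eg:

dkms status
lsmod | grep -w zfs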

My moderately big worry beforehand was memory issues, but I didn't really have any problems. I ran into one interesting issue with kernel memory usage that I'm going to write up as another entry, but it wasn't a problem as such; the short summary is that ZFS was using too little memory instead of too much. However my system did get happier with life when I increased its memory from an already relatively large 16 GB all the way up to 32 GB, and I don't know how ZFS on Linux would do on a machine with only, say, 4 GB of RAM.

The major benefit of using ZoL instead of my previous approach of ext4 over LVM over MD has been the great reassurance value of ZFS checksums and 'zpool scrub'. I took advantage of ZFS's support for L2ARC to safely add a caching SSD to the pool (which is not currently possible with Fedora's version of LVM), but I don't know if it's doing me much good. And of course I like the flexible space usage that ZFS gives me, as I no longer have to preallocate space to each filesystem and then watch them all to expand them as needed. I haven't attempted any sort of performance measurements, so all I say is that my machine doesn't feel noticeably slower or faster than before.
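
The scrubs themselves are nothing special, just the standard ZFS commands run against my pool:

zpool scrub maindata
zpool status maindata

The status output is where any checksum errors would show up if a scrub ever turned some up.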

(I've taken advantage of ZFS snapshots a couple of times in little ways, but so far that's basically just been playing around with them on Linux.)

Thus, on the whole I would say that if you're tempted by the idea of ZoL, you should try it out, and you should definitely try it out from the latest development version. I watch the ZoL git changelogs carefully and the changes that get committed are almost entirely fixes for bugs, with periodic improvements and things from other ZFS implementations. Building RPMs for a DKMS install from the source tree is just as simple as the ZoL site describes it.

(I was hoping to be able to report that my ZoL setup had also survived a Fedora 20 to Fedora 21 upgrade without problems, but I haven't been able to try that for other reasons. I honestly don't expect any problems because the F20 and F21 kernel versions are just about the same. I did do a test upgrade on a virtual machine with the ZoL DKMS packages installed and a ZFS pool configured, and had no problems with it (apart from the systemd crash, which happens with or without ZoL).)

After this positive experience (and after reading about various btrfs things), I expect that my next home machine build will also use ZFS on Linux. ZFS checksums and pool scrubs are enough of a reason to do it all on their own, never mind the other advantages. But that's probably several years in the future at this point, as my current home machine isn't that old by my standards.

ZFSOnLinuxExperience written at 02:20:44

2014-12-12

The bad side of systemd: two recent systemd failures

In the past I've written a number of favorable entries about systemd. In the interests of balance, among other things, I now feel that I should rake it over the coals for today's bad experiences that I ran into in the course of trying to do a yum upgrade of one system from Fedora 20 to Fedora 21, which did not go well.

The first and worst failure is that I've consistently had systemd's master process (ie, PID 1, the true init) segfault during the upgrade process on this particular machine. I can say it's a consistent thing because this is a virtual machine and I snapshotted the disk image before starting the upgrade; I've rolled it back and retried the upgrade with variations several times and it's always segfaulted. This issue is apparently Fedora bug #1167044 (and I know of at least one other person it's happened to). Needless to say this has put somewhat of a cramp in my plans to upgrade my office and home machines to Fedora 21.

(Note that this is a real segfault and not an assertion failure. In fact this looks like a fairly bad code bug somewhere, with some form of memory scrambling involved.)

The slightly good news is that PID 1 segfaulting does not reboot the machine on the spot. I'm not sure if PID 1 is completely stopped afterwards or if it's just badly damaged, but the bad news is that a remarkably large number of things stop working after this happens. Everything trying to talk to systemd fails and usually times out after a long wait, for example attempts to do 'systemctl daemon-reload' from postinstall scripts. Attempts to log in or to su to root from an existing login either fail or hang. A plain reboot will try to talk to systemd and thus fails, although you can force a reboot in various ways (including 'reboot -f').

The merely bad experience is that as a result of this I had occasion to use journalctl (I normally don't). More specifically, I had occasion to use 'journalctl -l', because of course if you're going to make a bug report you want to give full messages. Unfortunately, 'journalctl -l' does not actually show you the full message. Not if you just run it by itself. Oh, the full message is available, all right, but journalctl specifically and deliberately invokes the pager in a mode where you have to scroll sideways to see long lines. Under no circumstance is all of a long line visible on screen at once so that you may, for example, copy it into a bug report.

This is not a useful decision. In fact it is a screamingly frustrating decision, one that is about the complete reverse of what I think most people would expect -l to do. In the grand systemd tradition, there is no option to control this; all you can do is force journalctl to not use a pager or work out how to change things inside the pager to not do this.

(Oh, and journalctl goes out of its way to set up this behavior. Not by passing command line arguments to less, because that would be too obvious (you might spot it in a ps listing, for example); instead it mangles $LESS to effectively add the '-S' option, among other things.)
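
Concretely, the ways I know of to force the pager out of the picture are either of:

journalctl -l --no-pager
journalctl -l | cat

Neither is a substitute for '-l' simply doing the obvious thing in the first place.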

While I'm here, let me mention that journalctl's default behavior of 'show all messages since the beginning of time in forward chronological order' is about the most useless default I can imagine. Doing it is robot logic, not human logic. Unfortunately the systemd journal is unlikely to change its course in any significant way so I expect we'll get to live with this for years.

(I suppose what I need to do next is find out where abrt et al put core dumps from root processes so that I can run gdb on my systemd core to poke around. Oh wait, I think it's in the systemd journal now. This is my unhappy face, especially since I am having to deal with a crash in systemd itself.)

SystemdCrashAndMore written at 01:51:53

2014-12-10

What good kernel messages should be about and be like

Linux is unfortunately a haven of terrible kernel messages and terrible kernel message handling, as I have brought up before. In a spirit of shouting at the sea, today I feel like writing down my principles of good kernel messages.

The first and most important rule of kernel messages is that any kernel message that is emitted by default should be aimed at system administrators, not kernel developers. There are very few kernel developers and they do not look at very many systems, so it's pretty much guaranteed that most kernel messages are read by sysadmins. If a kernel message is for developers, it's useless for almost everyone reading it (and potentially confusing). Ergo it should not be generated by default settings; developers who need it for debugging can turn it on in various ways (including kernel command line parameters). This core rule guides basically all of the rest of my rules.

The direct consequence of this is that all messages should be clear, without in-jokes or cleverness that is only really comprehensible to kernel developers (especially only subsystem developers). In other words, no yama-style messages. If sysadmins looking at your message have no idea what it might refer to, no lead on what kernel subsystem it came from, and no clue where to look for further information, your message is bad.

Comprehensible messages are only half of the issue, though; the other half is only emitting useful messages. To be useful, my view is that a kernel message should be one of two things: it should either be what they call actionable or it should be necessary in order to reconstruct system state (one example is hardware appearing or disappearing, another is log messages that explain why memory allocations failed). An actionable message should cause sysadmins to do something and really it should mean that sysadmins need to do something.

It follows that generally other systems should not be able to cause the kernel to log messages by throwing outside traffic at it (these days that generally means network traffic), because outsiders should not be able to harm your kernel to the degree where you need to do anything; since they can't, such messages are not actionable for the sysadmin of the local machine. And yes, I bang on this particular drum a fair bit; that's because it keeps happening.

Finally, almost all messages should be strongly ratelimited. Unfortunately I've come around to the view that this is essentially impossible to do at a purely textual level (at least with acceptable impact for kernel code), so it needs to be considered everywhere kernel code can generate a message. This very definitely includes things like messages about hardware coming and going, because sooner or later someone is going to have a flaky USB adapter or SATA HD that starts disappearing and then reappearing once or twice a second.

To say this more compactly, everything in your kernel messages should be important to you. Kernel messages should not be a random swamp that you go wading in after problems happen in order to see if you can spot any clues amidst the mud; they should be something that you can watch live to see if there are problems emerging.

GoodKernelMessages written at 22:52:03

2014-12-07

How we install Ubuntu machines here

We have a standard install system for our Ubuntu machines (which are the majority of machines that we build). I wouldn't call it an automated install system (in the past I've used the term 'scripted'), but it is mostly automated with only a relatively modest amount of human intervention. The choice of partially automated installs may seem odd to people, but in our case it meets our needs and is easy to work with.

Our install process runs in three stages. First we have a customized Ubuntu server install image that is set up with a preseed file and a few other things on the CD image. The preseed file pre-selects a basic package set, answers a bunch of the installer questions for static things like time zones, sets a standard initial root password, and drops some files in /root on the installed system, most importantly a postinstall script.
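
To give a sense of it, the preseed file is a collection of debconf answers along these lines (these are illustrative keys and values, not our actual file):

d-i time/zone string America/Toronto
d-i clock-setup/utc boolean true
d-i passwd/root-password password <standard initial password>
d-i passwd/root-password-again password <standard initial password>
d-i pkgsel/include string openssh-server
d-i preseed/late_command string cp /cdrom/postinstall.sh /target/root/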

After the system is installed and reboots, we log into it (generally over the network) and run the pre-dropped postinstall script. This grinds through a bunch of standard setup things (including making sure that the machine's time is synchronized to our NTP servers) but its most important job is bootstrapping the system far enough that it can do NFS mounts in our NFS mount authentication environment. Among other things this requires setting up the machine's canonical SSH host keys, which involves a manual authentication step to fetch them and thus demands a human there to type the access password. After getting the system to a state where it can do NFS mounts, it mounts our central administrative filesystem.

The third step is a general postinstall script that lives on this central administrative filesystem. This script asks a few questions about how the machine will be set up and what sort of package set it should have, then grinds through all the work of installing many packages (some of them from non-default repositories), setting up various local configuration files from the master versions, and applying various other customizations. After this process finishes, a standard login or compute server is basically done and in general the machine is fully enrolled in various automated management systems for things like password propagation and NFS mount management.

(Machines that aren't standard login or compute servers generally then need some additional steps following our documented build procedures.)

In general our approach here has been to standardize and script everything that is either easy to do, very tricky to do by hand, or something we do a lot. We haven't tried to go all the way to almost fully automated installs, partly because it seems like too much work for the reward given the modest number of (re)installs we do and partly because there are some steps in this process that intrinsically require human involvement. Our current system works very well; we can spin up standard new systems roughly as fast as the disks can unpack packages and with minimal human involvement, and the whole system is easy to develop and manage.

Also, let me be blunt about one reason I prefer the human in the loop approach here: unfortunately Debian and Ubuntu packages have a really annoying habit of pausing to give you quizzes every so often. These quizzes basically function as land mines if you're trying to do a fully automated install, because you can never be sure if you've pre-answered all of them and if force-ignoring one is going to blow up in your face. Having a human in the loop to say 'augh no I need to pick THIS answer' is a lot safer.

(I can't say that humans are better at picking up on problems if something comes up, because the Ubuntu package install process spews out so much text that it's basically impossible to read. In practice we all tune out while it flows by.)

UbuntuOurInstallSystem written at 01:10:43

2014-11-28

How I made IPSec IKE work for a point to point GRE tunnel on Fedora 20

The basic overview of my IPSec needs is that I want to make my home machine (with an outside address) appear as an inside IP address on the same subnet as my work machine is on. Because of Linux proxy ARP limitations, the core mechanics of this involve a GRE tunnel, which must be encrypted and authenticated by IPSec. Previously I was doing this with a static IPSec configuration created by direct use of setkey, which had the drawback that it didn't automatically change encryption keys or notice if something went wrong with the IPSec stuff. The normal solution to these drawbacks is to use an IKE daemon to automatically negotiate IPSec (and time it out if the other end stops), but unfortunately this is not a configuration that IKE daemons such as Fedora 20's Pluto support directly. I can't really blame them; anything involving proxy ARP is at least reasonably peculiar and most sane people either use routing on subnets or NAT the remote machines.

My first step to a working configuration came about after I fixed my configuration to block unprotected GRE traffic. Afterwards I realized this meant that I could completely ignore managing GRE in my IKE configuration and only have it deal with IPSec stuff; I'd just leave the GRE tunnel up all the time and if IPSec was down, the iptables rules would stop traffic. After I gritted my teeth and read through the libreswan ipsec.conf manpage, this turned out to be a reasonably simple configuration. The core of it is this:

conn cksgre
    left=<work IP alias>
    leftsourceip=<work IP alias>
    right=<home public IP>
    ikev2=insist
    # what you want for always-up IPSec
    auto=start
    # I only want to use IPSec on GRE traffic
    leftprotoport=gre
    rightprotoport=gre

    # authentication is:
    authby=rsasig
    rightrsasigkey=[...]
    leftrsasigkey=[...]

The two IP addresses used here are the two endpoints of my GRE tunnel (the 'remote' and 'local' addresses in 'ip tunnel <...>'). Note that this configuration has absolutely no reference to the local and peer IP addresses that you set on the inside of the tunnel; in my setup IPSec is completely indifferent to them.
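
For context, the GRE tunnel itself is set up with the usual 'ip tunnel' commands, roughly like this (the tunnel name and the inside addresses are illustrative):

ip tunnel add cksgre0 mode gre local <work IP alias> remote <home public IP> ttl 255
ip addr add <work inside IP> peer <home inside IP> dev cksgre0
ip link set cksgre0 up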

I initially attempted to do authentication via PSK aka a (pre) shared secret. This caused my setup of the Fedora 20 version of Pluto to dump core with an assertion failure (for what seems to be somewhat coincidental reasons), which turned out to be lucky because there's a better way. Pluto supports what it calls 'RSA signature authentication', which people who use SSH also know as 'public key authentication'; just as with SSH, you give each end its own keypair and then list the public key(s) in your configuration and you're done. How to create the necessary RSA keypairs and set everything up is not well documented in the Fedora 20 manpages; in fact, I didn't even realize it was possible. Fortunately I stumbled over this invaluable blog entry on setting up a basic IPSec connection which covered the magic required.

This got the basic setup working, but after a while the novelty wore off and my urge to fiddle with things got the better of me so I decided to link the GRE tunnel to the IKE connection, so it would be torn down if the connection died (and brought up when the connection was made). You get your commands run on such connection events through the leftupdown="..." or rightupdown="..." configuration setting; your command gets information about what's going on through a pile of environment variables (which are documented in the ipsec_pluto manpage). For me this is a script that inspects $PLUTO_VERB to find out what's going on and runs one of my existing scripts to set up or tear down things on up-host and down-host actions. As far as I can tell, my configuration does not need to run the default 'ipsec _updown' command.

(My existing scripts used to do both GRE setup and IPSec setup, but of course now they only do the GRE setup and the IPSec stuff is commented out.)
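
A minimal sketch of the sort of dispatcher script I mean (the helper script names here are invented):

#!/bin/sh
# invoked by Pluto via leftupdown=; Pluto says what's happening in $PLUTO_VERB
case "$PLUTO_VERB" in
    up-host)    /root/bin/gre-setup ;;
    down-host)  /root/bin/gre-teardown ;;
    *)          : ;;    # ignore all the other events
esac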

This left IPSec connection initiation (and termination) itself. On my home machine I used to bring up and tear down the entire IPSec and GRE stuff when my PPPoE DSL link came up or went down. In theory one could now leave this up to a running Pluto based on its normal keying retries and timeouts; in practice this doesn't really work well and I wound up needing to do manual steps. Manual control of Pluto is done through 'ipsec whack' and if everything is running smoothly doing the following on DSL link up or down is enough:

ipsec whack --initiate|--terminate --name cksgre >/dev/null 2>&1

Unfortunately this is not always sufficient. Pluto does not notice dynamically appearing and disappearing network links and addresses, so if it's (re)started while my DSL link is down (for example on boot) it can't find either IP address associated with the cksgre connection and then refuses to try to do anything even if you explicitly ask it to initiate the connection. To make Pluto re-check the system's IP addresses and thus become willing to activate the IPSec connection, I need to do:

ipsec whack --listen

Even though the IPSec connection is set to autostart, Pluto does not actually autostart it when --listen causes it to notice that the necessary IP address now exists; instead I have to explicitly initiate it with 'ipsec whack --initiate --name cksgre'. My current setup wraps this all up in a script and runs it from /etc/ppp/ip-up.local and ip-down.local (in the same place where I previously invoked my own IPSec and GRE setup and stop scripts).

So far merely poking Pluto with --listen has been sufficient to get it to behave, but I haven't extensively tested this. My script currently has a fallback that will do a 'systemctl restart ipsec' if nothing else works.
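
In simplified form, the link-up side of that script is not much more than:

#!/bin/sh
# run from /etc/ppp/ip-up.local when the DSL link comes up
# make Pluto re-check the system's IP addresses, then explicitly start the connection
ipsec whack --listen
ipsec whack --initiate --name cksgre
# (the real script has the 'systemctl restart ipsec' fallback mentioned above)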

PS: Note that taking down the GRE tunnel on IPSec failure has some potential security implications in my environment. I think I'm okay with them, but that's really something for another entry.

Sidebar: What ipsec _updown is and does

On Fedora 20 this is /usr/libexec/ipsec/_updown, which runs one of the _updown.* scripts in that directory depending on what the kernel protocol is; on most Linux machines (and certainly on Fedora 20) this is NETKEY, so _updown.netkey is what gets run in the end. What these scripts can do for you, and may actually be doing for you, is neither clear nor documented, and they make me nervous. They certainly seem to have the potential to do any number of things, some of them interesting and some of them alarming.

Having now scanned _updown.netkey, it appears that the only thing it might possibly be doing for me is mangling my /etc/resolv.conf. So, uh, no thanks.

IKEForPointToPointGRE written at 03:01:37

2014-11-26

Using iptables to block traffic that's not protected by IPSec

When I talk about my IPSec setup, I often say that I use GRE over IPSec (or 'an IPSec based GRE tunnel'). However, this is not really what is going on; a more accurate but more opaque description is that I have a GRE tunnel that is encrypted and protected by IPSec. The problem, and the reason that the difference matters, is that there is nothing that intrinsically ties the two pieces together, unlike something where you are genuinely running X over Y such as 'forwarding X11 over SSH'. In the X11 over SSH case, if SSH is not working you do not get anything. But in my case if IPSec isn't there for some reason my GRE tunnel will cheerfully continue working, just without any protection against either eavesdropping or impersonation.

In theory this is undoubtedly not supposed to happen, since you (I) designed your GRE setup to work in conjunction with IPSec. Unfortunately, in practice there are any number of ways for IPSec to go away on you, possibly without destroying the GRE tunnel in the process. Your IPSec IKE daemon probably removes the IPSec security policies that reject unencrypted traffic when it shuts down, for example, and if you're manually configuring IPSec with setkey you can do all sorts of fun things like accidentally leaving a 'spdflush;' command in a control file that only (re)loads keys and is no longer used to set up the security policies.

The obvious safety method is to add some iptables rules that block unencrypted GRE traffic. If you are like me, you'll start out by writing the obvious iptables ruleset:

iptables -A INPUT -p esp -j ACCEPT
iptables -A INPUT -p gre -j DROP

This doesn't work. As far as I can tell, the Linux IPSec system effectively re-injects the decrypted packets into the IP stack, where they will be seen in their unencrypted state by iptables rules (as well as by tcpdump, which can be both confusing and alarming). The result is that after the re-injection the iptables rules see a plain GRE packet and drop it.

Courtesy of this netfilter mailing list message, it turns out that what you need is to match packets that will be or have been processed by IPSec. This is done with a policy match:

iptables -A INPUT -m policy --dir in --pol ipsec -j ACCEPT
iptables -A INPUT -p gre -j DROP

# and for outgoing packets:
iptables -A OUTPUT -m policy --dir out --pol ipsec -j ACCEPT
iptables -A OUTPUT -p gre -j DROP

Reading the iptables-extensions manpage suggests that I should add at least '--proto esp' to the policy match for extra paranoia.

I've tested these rules and they work. They pass GRE traffic that is protected by IPSec, but if I remove the IPSec security policies that force IPSec for my GRE traffic these iptables rules block the unprotected GRE traffic as I want.
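
With that extra bit of paranoia added, the ACCEPT rules would presumably become the following (I haven't actually tested this variant):

iptables -A INPUT  -m policy --dir in  --pol ipsec --proto esp -j ACCEPT
iptables -A OUTPUT -m policy --dir out --pol ipsec --proto esp -j ACCEPT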

(Extension to non-GRE traffic is left as an exercise to the reader. I have a simple IPSec story in that I'm only using it to protect GRE and I never want GRE traffic to flow without IPSec to any destination under any circumstances. Note that there are potentially tricky rule ordering issues here and you probably want to always put this set of rules at the end of your processing.)

IptablesBlockNonIpsec written at 23:16:09

2014-11-25

My Linux IPSec/VPN setup and requirements

In response to my entry mentioning perhaps writing my own daemon to rekey my IPSec tunnel, a number of people made suggestions in comments. Rather than write a long response, I've decided to write up how my current IPSec tunnel works and what my requirements are for it or any equivalent. As far as I know these requirements rule out most VPN software, at least in its normal setup.

My IPSec based GRE tunnel runs between my home machine and my work machine and its fundamental purpose is to cause my home machine to appear on the work network as just another distinct host with its own IP address. Importantly this IP address is publicly visible, not just an internal one. My home machine routes some but not all of its traffic over the IPSec tunnel and for various reasons I need full dual identity routing for it; traffic to or from the internal IP must flow over the IPSec tunnel while traffic to or from the external IP must not. My work machine also has additional interfaces that I need to isolate, which can get a bit complicated.

(The actual setup of this turns out to be kind of tangled, with some side effects.)

This tunnel is normally up all of the time, although under special circumstances it needs to be pulled down locally on my work machine (and independently on my home machine). Both home and work machines have static IPs. All of this works today; the only thing that my IPSec setup lacks is periodic automatic rekeying of the IPSec symmetric keys used for encryption and authentication.

Most VPN software that I've seen wants to either masquerade your traffic as coming from the VPN IP itself or to make clients appear on a (virtual) subnet behind the VPN server with explicit routing. Neither is what I want. Some VPNs will bridge networks together; this is not appropriate either because I have no desire to funnel all of the broadcast traffic running around on the work subnet over my DSL PPPoE link. Nor can I use pure IPSec alone, due to a Linux proxy ARP limitation (unless this has been fixed since then).

I suspect that there is no way to tell IKE daemons 'I don't need you to set things up, just to rekey this periodically'; this would be the minimally intrusive change. There is probably a way to configure a pair of IKE daemons to do everything, so that they fully control the whole IPSec and GRE tunnel setup; there is probably even a way to tell them to kick off the setup of policy based routing when a connection is negotiated. However for obvious reasons my motivation for learning enough about IKE configuration to recreate my whole setup is somewhat low, as much of the work is pure overhead that's required just to get me to where I already am now. On the other hand, if a working IKE based configuration for all of this fell out of the sky I would probably be perfectly happy to use it; I'm not intrinsically opposed to IKE, just far from convinced that investing a bunch of effort into decoding how I need to set it up will get me much or be interesting.

(It would be different if documentation for IKE daemons was clear and easy to follow, but so far I haven't found any that is. Any time I skim any of it I can see a great deal of annoyance in my future.)

PS: It's possible that my hardcoded IPSec setup is not the most modern in terms of security, since it dates from many years ago. Switching to a fully IKE-mediated setup would in theory give me a free ride on future best practices for IPSec algorithm choices so I don't have to worry about this.

Sidebar: why I feel that writing my own rekeying daemon is safe

The short version is that the daemon would not be involved in setting up the secure tunnel itself, just in getting new keys from /dev/urandom, telling the authenticated other end about them, writing them to a setkey script file, and running the necessary commands to (re)load them. I'd completely agree with everyone who is telling me to use IKE if I was attempting to actively negotiate a full IPSec setup, but I'm not. The IPSec setup is very firmly fixed; the only thing that varies is the keys. There are ways to lose badly here, but they're almost entirely covered by using a transport protocol with strong encryption and authentication and then insisting on fixed IP addresses on top of it.

(Note that I won't be negotiating keys with the other end as such. Whichever end initiates a rekey will contact the other end to say more or less 'here are my new keys, now use them'. And I don't intend to give the daemon the ability to report on the current keys. If I need to know them I can go look outside of the daemon. If the keys are out of sync or broken, well, the easy thing is to force an immediate rekey to fix it, not to read out current keys to try to resync each end.)

MyIPSecRequirements written at 00:25:56

