2015-01-05
Today on Linux, ZFS is your only real choice for an advanced filesystem
Yesterday I wrote about what I consider advanced filesystems to be in general, namely filesystems with at least checksums so that you know when your data has been damaged, and ideally with some ability to use redundancy to repair that damage. As far as I know, today on Linux there are only two filesystems that are advanced in this way: btrfs and ZFS, via ZFS on Linux.
(If you don't care about disk checksums, you have lots of choice among perfectly good filesystems. I would just run ext4 unless you have a good reason to believe that eg XFS is a better choice in your particular environment; ext4 is what I do and what most people do, so it gets a lot of exercise and attention.)
In theory, you might choose either and you might even default to btrfs as the in-kernel solution. In practice, I believe that you only have one real choice today and that choice is ZFS on Linux. This is not because ZFS might be better than btrfs on a technical level (although I believe it is); it is simply because people keep having problems with btrfs (the latest example I was exposed to was this one). Far too many things I read about btrfs wind up saying stuff like 'it's been stable for a few months since the last problem' or 'I had a problem recently but it wasn't too bad' or the like. Btrfs does not appear to be stable yet and it doesn't appear likely to be stable any time soon; everything I wrote in 2013 about why not to consider btrfs yet still applies.
Btrfs will hopefully someday be one of the filesystems of the future. But it is not the filesystem of today unless you feel very daring. If you want an advanced filesystem today on Linux, your only real option is ZFS on Linux.
Now, ZoL is not perfect. People do still report problems with it from time to time, including kernel memory issues, and you will want to test it in your environment to make sure it works okay. But from all the reports I've read there are plenty of people running it in production in various ways (in more demanding circumstances than mine) and it isn't blowing up in their faces.
In short, ZFS on Linux is something that you can reasonably consider today, and in practice things will probably work fine. I think that considering btrfs today is demonstrably relatively crazy.
(I'm aware that Facebook is using btrfs internally to some degree. Facebook also has Chris Mason working for them to find and fix their btrfs problems and likely a team that immediately packages those changes up into custom Facebook kernels. See also.)
2014-12-29
How I have partitioning et al set up for ZFS On Linux
This is a deeper look into how I have my office workstation configured with ZFS On Linux for all of my user data, because I figure that this may be of interest for people.
My office workstation's primary disks are a pair of 1 TB SATA drives.
Each drive is partitioned identically into five partitions. The first
four of those partitions (well, pairs of those partitions) are used
for software RAID mirrors for swap, /, /boot, and a backup copy
of / that I use when I do yum upgrades from one version of Fedora
to another. If I was redoing this partitioning today I would not use
a separate /boot partition, but this partitioning predates my
enlightenment on that.
(Actually, because I'm using GPT partitioning there are a few more partitions sitting around for UEFI stuff; I have extra 'EFI System' and 'BIOS boot partition' partitions. I've ignored them for as long as this system has been set up.)
All together these partitions use up about 170 GB of the disks (mostly
in the two root filesystem partitions). The rest of the disk is in the
final large partition, and this partition (on both disks) is what ZFS
uses for the maindata pool that holds all my filesystems. The pool is
of course set up with a single mirror vdev that uses both partitions.
Following more or less the ZoL recommendations that I found, I set it
up using the /dev/disk/by-id/ 'wwn-....-part7' names for the two
partitions in question (and I set it up with an explicit 'ashift=12'
option as future-proofing, although
these disks are not 4K disks themselves).
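Concretely, the pool creation was something along these lines (the wwn names here are placeholders standing in for the real disks):

zpool create -o ashift=12 maindata mirror \
    /dev/disk/by-id/wwn-0x50014ee2aaaaaaaa-part7 \
    /dev/disk/by-id/wwn-0x50014ee2bbbbbbbb-part7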
I later added an SSD as an L2ARC because we had a spare SSD lying around and I had the spare chassis space. Because I had nothing else on the SSD at all, I added it with the bare /dev/disk/by-id wwn-* name and let ZoL partition the disk itself (and I didn't attempt to force an ashift for the L2ARC). As I believe is standard ZoL behavior, ZoL partitioned it as a GPT disk with an 8 MB spacer partition at the end. ZoL set the GPT partition type to 'zfs' (BF01).
(ZoL doesn't seem to require specific GPT partition types if you give
it explicit partitions; my maindata partitions are still labeled as
'Linux LVM'.)
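Adding the SSD boiled down to handing the whole-disk name to 'zpool add' (again, the wwn here is a placeholder):

zpool add maindata cache /dev/disk/by-id/wwn-0x50025388cccccccc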
Basically all of the filesystems from my maindata pool are set up
with explicit mountpoint= settings that put them where the past
LVM versions went; for most of them this is various names in /
(eg /homes, /data, /vmware, and /archive). I have ZoL set
up to mount these in the normal ZFS way, ie as the ZFS pools and
services are brought up (instead of attempting to do something through
/etc/fstab). I also have a collection of bind mounts that
materialize bits of these filesystems in other places, mostly because
I'm a bit crazy. Since all of the mount points
and bind targets are in the root filesystem, I don't have to worry about
mount order dependencies; if the system is up enough to be bringing up
ZFS, the root filesystem is there to be mounted on.
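To illustrate the pattern (with made-up dataset names), each filesystem simply gets an explicit mountpoint when it's created:

zfs create -o mountpoint=/homes maindata/homes
zfs create -o mountpoint=/data maindata/data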
Sidebar: on using wwn-* names here
I normally prefer physical location based names like /dev/sda and so
on. However ZoL people recommend using stable /dev/disk/by-* names
in general and they prefer the by-id names that are tied to the
physical disk instead of what's plugged in where. When I was setting
up ZFS I decided that this was okay by me because, after all, the ZFS
pool itself is tied to these specific disks unless I go crazy and
do something like dd the ZoL data from one disk to another. Really,
it's no different from Linux's software RAID automatically finding
its component disks regardless of what they're called today and I'm
perfectly fine with that.
(And tying ZoL to specific disks will save me if someday I'm shuffling
SATA cables around and what was /dev/sdb accidentally winds up as
/dev/sdd.)
Of course I'd be happier if ZoL would just go look at the disks and
find the ZFS metadata and assemble things automatically the way that
Linux software RAID does. But ZoL has apparently kept most of the
irritating Solaris bits around zpool.cache and how the system does
relatively crazy things while bringing pools up on boot. I can't really
blame them for not wanting to rewrite all of that code and make things
more or less gratuitously different from the Illumos ZFS codebase.
2014-12-27
My ZFS On Linux memory problem: competition from the page cache
When I moved my data to ZFS on Linux on my office workstation, I didn't move the entire system to ZoL for various reasons. My old setup had my data on ext4 in LVM on a software RAID mirror, with the root filesystem and swap in separate software RAID mirrors outside of LVM. When I moved to ZoL, I converted the ext4 in LVM on MD portion to a ZFS pool with the various data filesystems (my home directory, virtual machine images, and so on), but I left the root filesystem (and swap) alone. The net effect is that I was left with a relatively small ext4 root filesystem and a relatively large ZFS pool that had all of my real data.
ZFS on Linux does its best to integrate into the Linux kernel memory management system, and these days that seems to be pretty good. But it doesn't take part in the kernel's generic filesystem page cache; instead it has its own system for this, called ARC. In effect, in a system with both conventional filesystems (such as my ext4 root filesystem) and ZFS filesystems, you have two page caches, one used by your conventional filesystems and the separate one used by your ZFS filesystems. What I found on my machine was that the overall system was bad at balancing memory usage between these two. In particular, the ZFS ARC didn't seem to compete strongly enough with the kernel page cache.
If everything was going well, what I'd have expected was for there to be relatively little kernel page cache and a relatively large ARC, because ext4 (the page cache user) held much less of my actively used data than ZFS did. Certainly I ran some things from the root filesystem (such as compilers), but I rather thought not necessarily all that much compared to what I was doing from my ZFS filesystems. In real life, things seemed to go the other way; I would wind up with a relatively large page cache and a quite tiny ARC that was caching relatively little data. As far as I could tell, over time ext4 was simply out-competing ZFS for memory despite having much less actual filesystem data.
I assume that this is due to the page cache and the ZFS ARC being separate from each other, so that there just isn't any way of having some sort of global view of disk buffer usage that would let the ZFS ARC push directly on the page cache (and vice versa if necessary). As far as I know there's no way to limit page cache usage or push more aggressively only on the page cache in a way that won't hit ZFS at least as hard. So a mixed system with ZFS and something else just gets to live with this.
(The crude fix is to periodically empty the page cache with
'echo 1 >/proc/sys/vm/drop_caches'. Note that you don't want
to use '2' or '3' here, because that will also drop ZFS ARC caches
and other ZFS memory.)
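For what it's worth, you can watch the two caches fight by looking at the ARC's current size and targets in /proc/spl/kstat/zfs/arcstats, and in theory you can give the ARC a floor through the zfs_arc_min module parameter (the 8 GB here is just an illustrative value):

# the ARC's current size and its grow/shrink targets
awk '$1 == "size" || $1 == "c" || $1 == "c_min" || $1 == "c_max" {print $1, $3}' /proc/spl/kstat/zfs/arcstats
# illustrative: insist that the ARC never shrinks below 8 GB
echo $((8 * 1024 * 1024 * 1024)) >/sys/module/zfs/parameters/zfs_arc_min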
The build up of page cache is not immediate when the system is in use. Instead it seems to come over time, possibly as things like system backups run and temporarily pull large amounts of the root filesystem into cache. I believe it generally took a day or two for page cache usage to grow and start strangling the ARC after I explicitly dropped the caches, and in the mean time I could do relatively root filesystem intensive things like compiling Firefox from source without making the page cache hulk up.
(The Firefox source code itself was in a ZFS filesystem, but the C++ compiler and system headers and system libraries and so on are all in the root filesystem.)
Moving from 16 GB of RAM to 32 GB of RAM hasn't eliminated this problem for me as such, but what it did do was allow the ZFS ARC to use enough memory despite this that it's reasonably effective anyways. With 16 GB I could see ext4 using 4 GB of page cache or more while the ZFS ARC was squeezed to 4 GB or less and suffering for it. With 32 GB the page cache may wind up at 10 GB, but this still leaves room for the ZFS ARC itself to be 9 GB with a reasonable amount of data cached.
(And 32 GB is more useful than 16 GB anyways these days, what with OSes in virtual machines that want 2 GB or 4 GB of RAM to run reasonably happily and so on. And I'm crazy enough to spin up multiple virtual machines at once for testing certain sorts of environments.)
Presumably this is completely solved if you have no non-ZFS filesystems on the machine, but that requires a ZFS root filesystem and that's currently far too alarming for me to even consider.
2014-12-26
My experience with ZFS on Linux: it's worked pretty well for me
A while back I wrote an entry about the temptation to switch my office
workstation to having my data in ZFS on Linux.
Not that long after I wrote the entry, in early August according to
'zpool history', I gave in to the temptation and did just that. I've
been running with everything except the root filesystem and swap in a
ZoL pool ever since, and the short summary is everything has worked
smoothly.
(Because I'm crazy, I did some wrestling with systemd and bind mounts instead of just using symlinks. That was about all of the hassles required to get things going.)
I'm running the development version of ZFS, hot from the git
repository and generally the latest version, mostly because I wanted
all of the bugfixes and improvements from the official release. At
this point, I've repeatedly upgraded kernels (including over several
significant kernel bumps) without any problems with the ZFS modules
being able to rebuild themselves under DKMS. There was one unfortunate
bit where the ZFS modules got broken by a DKMS update because they'd
been doing something that was actually wrong, but that would have
been much less alarming (ie, less of a surprise) if I'd actually
paid attention to various messages and failures during my 'yum
upgrade' run.
(The issue was also promptly fixed by the ZFS on Linux people, and I fixed it by reverting to the previous version of the Fedora DKMS package.)
My moderately big worry beforehand was memory issues, but I didn't really have any problems. I ran into one interesting issue with kernel memory usage that I'm going to write up as another entry, but it wasn't a problem as such; the short summary is that ZFS was using too little memory instead of too much. However my system did get happier with life when I increased its memory from an already relatively large 16 GB all the way up to 32 GB, and I don't know how ZFS on Linux would do on a machine with only, say, 4 GB of RAM.
The major benefit of using ZoL instead of my previous approach of
ext4 over LVM over MD has been the great reassurance value of ZFS
checksums and 'zpool scrub'. I took advantage of ZFS's support
for L2ARC to safely add a caching SSD to the pool (which is not
currently possible with Fedora's version of LVM),
but I don't know if it's doing me much good. And of course I like
the flexible space usage that ZFS gives me, as I no longer have to
preallocate space to each filesystem and then watch them all to
expand them as needed. I haven't attempted any sort of performance
measurements, so all I can say is that my machine doesn't feel noticeably
slower or faster than before.
(I've taken advantage of ZFS snapshots a couple of times in little ways, but so far that's basically just been playing around with them on Linux.)
Thus, on the whole I would say that if you're tempted by the idea of ZoL you should try it out, and you should definitely try it out from the latest development version. I watch the ZoL git changelogs carefully and the changes that get committed are almost entirely fixes for bugs, with periodic improvements and things from other ZFS implementations. Building RPMs for a DKMS install from the source tree is just as simple as the ZoL site describes it.
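Roughly, the build goes like this from checkouts of the spl and zfs source trees (a sketch from memory; the ZoL site has the authoritative steps):

cd spl && ./autogen.sh && ./configure && make rpm-utils rpm-dkms
yum localinstall spl-*.rpm
cd ../zfs && ./autogen.sh && ./configure && make rpm-utils rpm-dkms
yum localinstall zfs-*.rpm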
(I was hoping to be able to report that my ZoL setup had also survived a Fedora 20 to Fedora 21 upgrade without problems, but I haven't been able to try that for other reasons. I honestly don't expect any problems because the F20 and F21 kernel versions are just about the same. I did do a test upgrade on a virtual machine with the ZoL DKMS packages installed and a ZFS pool configured, and had no problems with it (apart from the systemd crash, which happens with or without ZoL).)
After this positive experience (and after reading about various btrfs things), I expect that my next home machine build will also use ZFS on Linux. ZFS checksums and pool scrubs are enough of a reason to do it all on their own, never mind the other advantages. But that's probably several years in the future at this point, as my current home machine isn't that old by my standards.
2014-12-12
The bad side of systemd: two recent systemd failures
In the past I've written a number of favorable entries about systemd.
In the interests of balance, among other things, I now feel that I
should rake it over the coals for today's bad experiences that I
ran into in the course of trying to do a yum upgrade of one system
from Fedora 20 to Fedora 21, which did not go well.
The first and worst failure is that I've consistently had systemd's master process (ie, PID 1, the true init) segfault during the upgrade process on this particular machine. I can say it's a consistent thing because this is a virtual machine and I snapshotted the disk image before starting the upgrade; I've rolled it back and retried the upgrade with variations several times and it's always segfaulted. This issue is apparently Fedora bug #1167044 (and I know of at least one other person it's happened to). Needless to say this has put somewhat of a cramp in my plans to upgrade my office and home machines to Fedora 21.
(Note that this is a real segfault and not an assertion failure. In fact this looks like a fairly bad code bug somewhere, with some form of memory scrambling involved.)
The slightly good news is that PID 1 segfaulting does not reboot
the machine on the spot. I'm not sure if PID 1 is completely stopped
afterwards or if it's just badly damaged, but the bad news is that
a remarkably large number of things stop working after this happens.
Everything trying to talk to systemd fails and usually times out
after a long wait, for example attempts to do 'systemctl daemon-reload'
from postinstall scripts. Attempts to log in or to su to root from
an existing login either fail or hang. A plain reboot will try to
talk to systemd and thus fails, although you can force a reboot in
various ways (including 'reboot -f').
The merely bad experience is that as a result of this I had occasion
to use journalctl (I normally don't). More specifically, I had
occasion to use 'journalctl -l', because of course if you're going
to make a bug report you want to give full messages. Unfortunately,
'journalctl -l' does not actually show you the full message.
Not if you just run it by itself. Oh, the full message is available,
all right, but journalctl specifically and deliberately invokes
the pager in a mode where you have to scroll sideways to see long
lines. Under no circumstance is all of a long line visible on screen
at once so that you may, for example, copy it into a bug report.
This is not a useful decision. In fact it is a screamingly frustrating
decision, one that is about the complete reverse of what I think most
people would expect -l to do. In the grand systemd tradition, there is
no option to control this; all you can do is force journalctl to not
use a pager or work out how to change things inside the pager to not do
this.
(Oh, and journalctl goes out of its way to set up this behavior. Not
by passing command line arguments to less, because that would be too
obvious (you might spot it in a ps listing, for example); instead
it mangles $LESS to effectively add the '-S' option, among other
things.)
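The workarounds are thus either bypassing the pager entirely or overriding what journalctl feeds to it; for example (the $SYSTEMD_LESS knob only exists in sufficiently recent systemd versions):

journalctl -l --no-pager | less      # bypass journalctl's pager handling entirely
SYSTEMD_LESS=FRXMK journalctl -l     # newer systemd: omit the -S so less wraps long lines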
While I'm here, let me mention that journalctl's default behavior
of 'show all messages since the beginning of time in forward
chronological order' is about the most useless default I can imagine.
Doing it is robot logic, not human logic.
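If you just want to look at recent messages the way a person would, you have to ask for it explicitly, for example:

journalctl -b -e     # this boot only, jumping to the end
journalctl -b -r     # this boot only, newest first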
Unfortunately the systemd journal is unlikely to change its course
in any significant way so I expect we'll get to live
with this for years.
(I suppose what I need to do next is find out where abrt et
al puts core dumps from root processes so that I can run gdb on
my systemd core to poke around. Oh wait, I think it's in the
systemd journal now.
This is my unhappy face, especially since I am having to deal with
a crash in systemd itself.)
2014-12-10
What good kernel messages should be about and be like
Linux is unfortunately a haven of terrible kernel messages and terrible kernel message handling, as I have brought up before. In a spirit of shouting at the sea, today I feel like writing down my principles of good kernel messages.
The first and most important rule of kernel messages is that any kernel message that is emitted by default should be aimed at system administrators, not kernel developers. There are very few kernel developers and they do not look at very many systems, so it's pretty much guaranteed that most kernel messages are read by sysadmins. If a kernel message is for developers, it's useless for almost everyone reading it (and potentially confusing). Ergo it should not be generated by default settings; developers who need it for debugging can turn it on in various ways (including kernel command line parameters). This core rule guides basically all of the rest of my rules.
The direct consequence of this is that all messages should be clear, without in-jokes or cleverness that is only really comprehensible to kernel developers (especially only subsystem developers). In other words, no yama-style messages. If sysadmins looking at your message have no idea what it might refer to, no lead on what kernel subsystem it came from, and no clue where to look for further information, your message is bad.
Comprehensible messages are only half of the issue, though; the other half is only emitting useful messages. To be useful, my view is that a kernel message should be one of two things: it should either be what they call actionable or it should be necessary in order to reconstruct system state (one example is hardware appearing or disappearing, another is log messages that explain why memory allocations failed). An actionable message should cause sysadmins to do something and really it should mean that sysadmins need to do something.
It follows that generally other systems should not be able to cause the kernel to log messages by throwing outside traffic at it (these days that generally means network traffic), because outsiders should not be able to harm your kernel to the degree where you need to do anything; if this is the case, they are not actionable for the sysadmin of the local machine. And yes, I bang on this particular drum a fair bit; that's because it keeps happening.
Finally, almost all messages should be strongly ratelimited. Unfortunately I've come around to the view that this is essentially impossible to do at a purely textual level (at least with acceptable impact for kernel code), so it needs to be considered everywhere kernel code can generate a message. This very definitely includes things like messages about hardware coming and going, because sooner or later someone is going to have a flaky USB adapter or SATA HD that starts disappearing and then reappearing once or twice a second.
To say this more compactly, everything in your kernel messages should be important to you. Kernel messages should not be a random swamp that you go wading in after problems happen in order to see if you can spot any clues amidst the mud; they should be something that you can watch live to see if there are problems emerging.
2014-12-07
How we install Ubuntu machines here
We have a standard install system for our Ubuntu machines (which are the majority of machines that we build). I wouldn't call it an automated install system (in the past I've used the term 'scripted'), but it is mostly automated with only a relatively modest amount of human intervention. The choice of partially automated installs may seem odd to people, but in our case it meets our needs and is easy to work with.
Our install process runs in three stages. First we have a customized
Ubuntu server install image that is set up with a preseed file and a
few other things on the CD image. The preseed file pre-selects a basic
package set, answers a bunch of the installer questions for static
things like time zones, sets a standard initial root password, and
drops some files in /root on the installed system, most importantly a
postinstall script.
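To give a flavour of it, the preseed file contains lines along these lines (all of the specific values here are invented for illustration):

d-i time/zone string America/Toronto
d-i passwd/root-password password our-initial-root-pw
d-i passwd/root-password-again password our-initial-root-pw
tasksel tasksel/first multiselect standard
d-i pkgsel/include string openssh-server
d-i preseed/late_command string cp /cdrom/postinstall /target/root/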
After the system is installed and reboots, we log into it (generally over the network) and run the pre-dropped postinstall script. This grinds through a bunch of standard setup things (including making sure that the machine's time is synchronized to our NTP servers) but its most important job is bootstrapping the system far enough that it can do NFS mounts in our NFS mount authentication environment. Among other things this requires setting up the machine's canonical SSH host keys, which involves a manual authentication step to fetch them and thus demands a human there to type the access password. After getting the system to a state where it can do NFS mounts, it mounts our central administrative filesystem.
The third step is a general postinstall script that lives on this central administrative filesystem. This script asks a few questions about how the machine will be set up and what sort of package set it should have, then grinds through all the work of installing many packages (some of them from non-default repositories), setting up various local configuration files from the master versions, and applying various other customizations. After this process finishes, a standard login or compute server is basically done and in general the machine is fully enrolled in various automated management systems for things like password propagation and NFS mount management.
(Machines that aren't standard login or compute servers generally then need some additional steps following our documented build procedures.)
In general our approach here has been to standardize and script everything that is either easy to do, that's very tricky to do by hand, or that's something we do a lot. We haven't tried to go all the way to almost fully automated installs, partly because it seems too much work for the reward given the modest amount of (re)installs we do and partly because there's some steps in this process that intrinsically require human involvement. Our current system works very well; we can spin up standard new systems roughly as fast as the disks can unpack packages and with minimal human involvement, and the whole system is easy to develop and manage.
Also, let me be blunt about one reason I prefer the human in the loop approach here: unfortunately Debian and Ubuntu packages have a really annoying habit of pausing to give you quizzes every so often. These quizzes basically function as land mines if you're trying to do a fully automated install, because you can never be sure if you've pre-answered all of them and if force-ignoring one is going to blow up in your face. Having a human in the loop to say 'augh no I need to pick THIS answer' is a lot safer.
(I can't say that humans are better at picking up on problems if something comes up, because the Ubuntu package install process spews out so much text that it's basically impossible to read. In practice we all tune out while it flows by.)
2014-11-28
How I made IPSec IKE work for a point to point GRE tunnel on Fedora 20
The basic overview of my IPSec needs is that
I want to make my home machine (with an outside address) appear as
an inside IP address on the same subnet as my work machine is on.
Because of Linux proxy ARP limitations, the core mechanics of this
involve a GRE tunnel, which must be encrypted and authenticated by
IPSec. Previously I was doing this with a static IPSec configuration
created by direct use of setkey, which had the drawback that it didn't
automatically change encryption keys or notice if something went wrong
with the IPSec stuff. The normal solution to these drawbacks is to use
an IKE daemon to
automatically negotiate IPSec (and time it out if the other end stops),
but unfortunately this is not a configuration that IKE daemons such as
Fedora 20's Pluto support directly. I can't really blame them; anything
involving proxy ARP is at least reasonably peculiar and most sane people
either use routing on subnets or NAT the remote machines.
My first step to a working configuration came about after I fixed my
configuration to block unprotected GRE traffic.
Afterwards I realized this meant that I could completely ignore managing
GRE in my IKE configuration and only have it deal with IPSec stuff; I'd
just leave the GRE tunnel up all the time and if IPSec was down, the
iptables rules would stop traffic. After I gritted my teeth and read
through the libreswan ipsec.conf manpage,
this turned out to be a reasonably simple configuration. The core of it
is this:
conn cksgre
    left=<work IP alias>
    leftsourceip=<work IP alias>
    right=<home public IP>
    ikev2=insist
    # what you want for always-up IPSec
    auto=start
    # I only want to use IPSec on GRE traffic
    leftprotoport=gre
    rightprotoport=gre
    # authentication is:
    authby=rsasig
    rightrsasigkey=[...]
    leftrsasigkey=[...]
The two IP addresses used here are the two endpoints of my GRE tunnel
(the 'remote' and 'local' addresses in 'ip tunnel <...>'). Note that
this configuration has absolutely no reference to the local and peer IP
addresses that you set on the inside of the tunnel; in my setup IPSec is
completely indifferent to them.
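For context, the GRE tunnel itself is created with plain 'ip' commands, something like this on the work machine (the device name and the inside addresses here are placeholders):

ip tunnel add cksgre0 mode gre local <work IP alias> remote <home public IP> ttl 64
ip addr add <inside work IP> peer <inside home IP> dev cksgre0
ip link set cksgre0 up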
I initially attempted to do authentication via PSK aka a (pre) shared secret. This caused my setup of the Fedora 20 version of Pluto to dump core with an assertion failure (for what seems to be somewhat coincidental reasons), which turned out to be lucky because there's a better way. Pluto supports what it calls 'RSA signature authentication', which people who use SSH also know as 'public key authentication'; just as with SSH, you give each end its own keypair and then list the public key(s) in your configuration and you're done. How to create the necessary RSA keypairs and set everything up is not well documented in the Fedora 20 manpages; in fact, I didn't even realize it was possible. Fortunately I stumbled over this invaluable blog entry on setting up a basic IPSec connection which covered the magic required.
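With libreswan, generating and extracting the keys goes roughly like this on each end (a sketch; the exact flags vary between libreswan versions):

ipsec newhostkey --output /etc/ipsec.secrets
ipsec showhostkey --left      # prints a leftrsasigkey=... line to paste into ipsec.conf
# (use --right on the other end)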
This got the basic setup working, but after a while the novelty wore
off and my urge to fiddle with things got the better of me so I decided
to link the GRE tunnel to the IKE connection, so it would be torn
down if the connection died (and brought up when the connection was
made). You get your commands run on such connection events through the
leftupdown="..." or rightupdown="..." configuration setting; your
command gets information about what's going on through a pile of
environment variables (which are documented in the ipsec_pluto
manpage). For me this is a script that inspects $PLUTO_VERB to
find out what's going on and runs one of my existing scripts to set up
or tear down things on up-host and down-host actions. As far as I
can tell, my configuration does not need to run the default 'ipsec
_updown' command.
(My existing scripts used to do both GRE setup and IPSec setup, but of course now they only do the GRE setup and the IPSec stuff is commented out.)
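A minimal sketch of such a leftupdown= script (with placeholder paths standing in for my real GRE scripts) is:

#!/bin/sh
# Pluto tells us what is happening through $PLUTO_VERB (see the ipsec_pluto manpage)
case "$PLUTO_VERB" in
    up-host)    exec /some/where/gre-up ;;
    down-host)  exec /some/where/gre-down ;;
    *)          : ;;   # ignore all of the other events
esac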
This left IPSec connection initiation (and termination) itself. On
my home machine I used to bring up and tear down the entire IPSec and
GRE stuff when my PPPoE DSL link came up or went down. In theory one
could now leave this up to a running Pluto based on its normal keying
retries and timeouts; in practice this doesn't really work well and I
wound up needing to do manual steps. Manual control of Pluto is done
through 'ipsec whack' and if everything is running smoothly doing the
following on DSL link up or down is enough:
ipsec whack --initiate|--terminate --name cksgre >/dev/null 2>&1
Unfortunately this is not always sufficient. Pluto does not notice
dynamically appearing and disappearing network links and addresses, so
if it's (re)started while my DSL link is down (for example on boot) it
can't find either IP address associated with the cksgre connection
and then refuses to try to do anything even if you explicitly ask it
to initiate the connection. To make Pluto re-check the system's IP
addresses and thus become willing to activate the IPSec connection,
I need to do:
ipsec whack --listen
Even though the IPSec connection is set to autostart, Pluto does not
actually autostart it when --listen causes it to notice that the
necessary IP address now exists; instead I have to explicitly initiate
it with 'ipsec whack --initiate --name cksgre'. My current setup
wraps this all up in a script and runs it from /etc/ppp/ip-up.local
and ip-down.local (in the same place where I previously invoked my own
IPSec and GRE setup and stop scripts).
So far merely poking Pluto with --listen has been sufficient to get it
to behave, but I haven't extensively tested this. My script currently
has a fallback that will do a 'systemctl restart ipsec' if nothing
else works.
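The logic of that script is roughly this (a sketch, not the real thing):

#!/bin/sh
# run from /etc/ppp/ip-up.local; make Pluto re-check addresses, then start the connection
ipsec whack --listen >/dev/null 2>&1
if ! ipsec whack --initiate --name cksgre >/dev/null 2>&1; then
    # the big hammer fallback if Pluto still won't cooperate
    systemctl restart ipsec
fi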
PS: Note that taking down the GRE tunnel on IPSec failure has some potential security implications in my environment. I think I'm okay with them, but that's really something for another entry.
Sidebar: What ipsec _updown is and does
On Fedora 20 this is /usr/libexec/ipsec/_updown, which runs one of
the _updown.* scripts in that directory depending on what the kernel
protocol is; on most Linux machines (and certainly on Fedora 20) this is
NETKEY, so _updown.netkey is what gets run in the end. What these
scripts can do for you and maybe do do for you is neither clear nor
documented and they make me nervous. They certainly seem to have the
potential to do any number of things, some of them interesting and
some of them alarming.
Having now scanned _updown.netkey, it appears that the only thing
it might possibly be doing for me is mangling my /etc/resolv.conf.
So, uh, no thanks.
2014-11-26
Using iptables to block traffic that's not protected by IPSec
When I talk about my IPSec setup, I often say that I use GRE over IPSec (or 'an IPSec based GRE tunnel'). However, this is not really what is going on; a more accurate but more opaque description is that I have a GRE tunnel that is encrypted and protected by IPSec. The problem, and the reason that the difference matters, is that there is nothing that intrinsically ties the two pieces together, unlike something where you are genuinely running X over Y such as 'forwarding X11 over SSH'. In the X11 over SSH case, if SSH is not working you do not get anything. But in my case if IPSec isn't there for some reason my GRE tunnel will cheerfully continue working, just without any protection against either eavesdropping or impersonation.
In theory this is undoubtedly not supposed to happen, since
you (I) designed your GRE setup to work in conjunction with
IPSec. Unfortunately, in practice there are any number of
ways for IPSec to go away on you, possibly without destroying the GRE
tunnel in the process. Your IPSec IKE daemon probably removes the IPSec
security policies that reject unencrypted traffic when it shuts down,
for example, and if you're manually configuring IPSec with setkey you
can do all sorts of fun things like accidentally leaving a 'spdflush;'
command in a control file that only (re)loads keys and is no longer used
to set up the security policies.
The obvious safety method is to add some iptables rules that block
unencrypted GRE traffic. If you are like me, you'll start out by writing
the obvious iptables ruleset:
iptables -A INPUT -p esp -j ACCEPT
iptables -A INPUT -p gre -j DROP
This doesn't work. As far as I can tell, the Linux IPSec system
effectively re-injects the decrypted packets into the IP stack,
where they will be seen in their unencrypted state by iptables rules
(as well as by tcpdump, which can be both confusing and alarming).
The result is that after the re-injection the iptables rules see
a plain GRE packet and drop it.
Courtesy of this netfilter mailing list message, it turns out that what you need is to match packets that will be or have been processed by IPSec. This is done with a policy match:
iptables -A INPUT -m policy --dir in --pol ipsec -j ACCEPT
iptables -A INPUT -p gre -j DROP
# and for outgoing packets:
iptables -A OUTPUT -m policy --dir out --pol ipsec -j ACCEPT
iptables -A OUTPUT -p gre -j DROP
Reading the iptables-extensions manpage suggests that I should add
at least '--proto esp' to the policy match for extra paranoia.
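That more paranoid variant of the ACCEPT rules would look something like:

iptables -A INPUT -m policy --dir in --pol ipsec --proto esp -j ACCEPT
iptables -A OUTPUT -m policy --dir out --pol ipsec --proto esp -j ACCEPT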
I've tested the rules above (without the extra '--proto esp' match) and they work. They pass GRE traffic that is protected by IPSec, but if I remove the IPSec security policies that force IPSec for my GRE traffic, these iptables rules block the unprotected GRE traffic as I want.
(Extension to non-GRE traffic is left as an exercise to the reader. I have a simple IPSec story in that I'm only using it to protect GRE and I never want GRE traffic to flow without IPSec to any destination under any circumstances. Note that there are potentially tricky rule ordering issues here and you probably want to always put this set of rules at the end of your processing.)
2014-11-25
My Linux IPSec/VPN setup and requirements
In response to my entry mentioning perhaps writing my own daemon to rekey my IPSec tunnel, a number of people made suggestions in comments. Rather than write a long response, I've decided to write up how my current IPSec tunnel works and what my requirements are for it or any equivalent. As far as I know these requirements rule out most VPN software, at least in its normal setup.
My IPSec based GRE tunnel runs between my home machine and my work machine and its fundamental purpose is to cause my home machine to appear on the work network as just another distinct host with its own IP address. Importantly this IP address is publicly visible, not just an internal one. My home machine routes some but not all of its traffic over the IPSec tunnel and for various reasons I need full dual identity routing for it; traffic to or from the internal IP must flow over the IPSec tunnel while traffic to or from the external IP must not. My work machine also has additional interfaces that I need to isolate, which can get a bit complicated.
(The actual setup of this turns out to be kind of tangled, with some side effects.)
This tunnel is normally up all of the time, although under special circumstances it needs to be pulled down locally on my work machine (and independently on my home machine). Both home and work machines have static IPs. All of this works today; the only thing that my IPSec setup lacks is periodic automatic rekeying of the IPSec symmetric keys used for encryption and authentication.
Most VPN software that I've seen wants to either masquerade your traffic as coming from the VPN IP itself or to make clients appear on a (virtual) subnet behind the VPN server with explicit routing. Neither is what I want. Some VPNs will bridge networks together; this is not appropriate either because I have no desire to funnel all of the broadcast traffic running around on the work subnet over my DSL PPPoE link. Nor can I use pure IPSec alone, due to a Linux proxy ARP limitation (unless this has been fixed since then).
I suspect that there is no way to tell IKE daemons 'I don't need you to set things up, just to rekey this periodically'; this would be the minimally intrusive change. There is probably a way to configure a pair of IKE daemons to do everything, so that they fully control the whole IPSec and GRE tunnel setup; there is probably even a way to tell them to kick off the setup of policy based routing when a connection is negotiated. However for obvious reasons my motivation for learning enough about IKE configuration to recreate my whole setup is somewhat low, as much of the work is pure overhead that's required just to get me to where I already am now. On the other hand, if a working IKE based configuration for all of this fell out of the sky I would probably be perfectly happy to use it; I'm not intrinsically opposed to IKE, just far from convinced that investing a bunch of effort into decoding how I need to set it up will get me much or be interesting.
(It would be different if documentation for IKE daemons was clear and easy to follow, but so far I haven't found any that is. Any time I skim any of it I can see a great deal of annoyance in my future.)
PS: It's possible that my hardcoded IPSec setup is not the most modern in terms of security, since it dates from many years ago. Switching to a fully IKE-mediated setup would in theory give me a free ride on future best practices for IPSec algorithm choices so I don't have to worry about this.
Sidebar: why I feel that writing my own rekeying daemon is safe
The short version is that the daemon would not be involved in setting up
the secure tunnel itself, just getting new keys from /dev/urandom,
telling the authenticated other end about them, writing them to a
setkey script file, and running
the necessary commands to (re)load them.
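For concreteness, the setkey script file involved is the usual manual SA setup, roughly like this (the SPIs, algorithms, and keys are all illustrative placeholders):

# reload the SAs with new keys; note that this is flush, not spdflush
flush;
add <work IP> <home IP> esp 0x1001 -E aes-cbc 0x<hex key> -A hmac-sha256 0x<hex key>;
add <home IP> <work IP> esp 0x1002 -E aes-cbc 0x<hex key> -A hmac-sha256 0x<hex key>;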
I'd completely agree with everyone who is telling me to use IKE if
I was attempting to actively negotiate a full IPSec setup, but I'm
not. The IPSec setup is very firmly fixed; the only thing that varies
is the keys. There are ways to lose badly here, but they're almost
entirely covered by using a transport protocol with strong encryption
and authentication and then insisting on fixed IP addresses on top of
it.
(Note that I won't be negotiating keys with the other end as such. Whichever end initiates a rekey will contact the other end to say more or less 'here are my new keys, now use them'. And I don't intend to give the daemon the ability to report on the current keys. If I need to know them I can go look outside of the daemon. If the keys are out of sync or broken, well, the easy thing is to force an immediate rekey to fix it, not to read out current keys to try to resync each end.)