Why my home backup situation is currently a bit awkward
In this recent entry I mentioned that my home backup strategy is an awkward subject. Today I want to talk about why that is so, which has two or perhaps three sides: the annoyances of hardware, that disks are slow, and that software doesn't just do what I want, partly because I want contradictory things.
In theory, the way to good backups is straightforward. You buy an external disk drive enclosure and a disk for it, connect it to your machine periodically, and 'do a backup' (whatever that is). Ideally you will be disciplined about how frequently you do this. And indeed, relatively early on I set myself up to do this, except that back then I made a mistake; rather than get an external enclosure with both USB and eSATA, I got one with just USB because I had (on my machine at the time) no eSATA ports. To be more precise I got an enclosure with USB 2.0, because that's what was available at the time.
If you know USB 2.0 disk performance, you are now wincing. USB 2.0 disks are dog slow, at least on Linux (I believe I once got a benchmark result on the order of 15 MBytes/sec), and they also usually hammer the responsiveness of your machine into the ground. On top of that I didn't really trust the heat dissipation of the external drive case, which meant that I was nervous about leaving the drive powered on and running overnight or the like. So I didn't do too many backups to that external enclosure and drive. It was just too much of a pain for too long.
With my second external drive case and drive, I learned better (at least in theory); I bought a case with USB and eSATA. Unfortunately the USB was still only USB 2.0, and then something in the combination of the eSATA port on my new machine and the case didn't work really reliably. I've been able to sort of work around that, but the workaround doesn't make me really happy to have the drive connected, there's still a performance impact from backups, and the heat concerns haven't gone away.
(My replacement for the eSATA port is to patch a regular SATA port through the case. This works but makes me nervous and I think I've seen it have some side effects on the machine when the drive connects or disconnects. In general, eSATA is probably not the right technology here.)
This brings me to slow disks. I can't remember how fast my last backup run went, but between the overheads of actually making backups (in walking the filesystem and reading files and so on) and the overheads of writing them out, I'd be surprised if they ran faster than 50 MBytes/sec (and I suspect they went somewhat slower). At that rate, it takes an hour to back up only 175 GB. With current disks and hardware, backups of lots of data are just going to be multi-hour things, which does not encourage me to do them regularly at the best of times.
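As a rough check on those numbers, here is a minimal sketch of the throughput arithmetic. It assumes binary units (MiB and GiB) and a perfectly sustained rate, which real backups won't achieve; the function names and the 1 TiB size are purely illustrative.

```python
def gib_per_hour(rate_mib_per_sec: float) -> float:
    """Data moved in one hour at a sustained rate, in GiB."""
    return rate_mib_per_sec * 3600 / 1024


def backup_hours(size_gib: float, rate_mib_per_sec: float) -> float:
    """Hours needed to move size_gib at a sustained rate."""
    return size_gib * 1024 / rate_mib_per_sec / 3600


if __name__ == "__main__":
    # 50 MiB/s is the optimistic rate guessed at above.
    print(f"{gib_per_hour(50):.1f} GiB per hour at 50 MiB/s")
    # 15 MiB/s is the remembered USB 2.0 benchmark figure.
    print(f"{gib_per_hour(15):.1f} GiB per hour at 15 MiB/s")
    # 1 TiB is a hypothetical amount of data, not a figure from this entry.
    print(f"{backup_hours(1024, 50):.1f} hours for 1 TiB at 50 MiB/s")
```

At 50 MiB/s this works out to just under 176 GiB per hour, which is where the 'only 175 GB' figure comes from.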
(Life would be different if I could happily leave the backups to run when I wasn't present, but I don't trust the heat dissipation of the external drive case that much, or for that matter the 'eSATA' connection. Right now I feel I have to actively watch the whole process.)
As I wrote up in more detail here, my ideal backup software would basically let me incrementally make full backups. Lacking something to do that, the low effort system I've wound up with for most things uses dump. Dump captures exact full backups of extN filesystems and can be compressed (and I can keep multiple copies), but it's not something you can do incrementally. dump against a filesystem is an all or nothing affair; either you let it run for as many hours as it winds up taking, or you abort it and get nothing. Using dump also requires manually managing the process, including keeping track of old filesystem backups and removing some of them to make space for new ones.
(Life would be somewhat different if my external backup disk was much larger than my system disk, but as it happens it isn't.)
This is far from an ideal situation. In theory I could have regular, good backups; in practice there is enough friction in all of the various pieces that I have de facto bad ones, generally only made when something makes me alarmed. Since I'm a sysadmin and I preach the gospel of backups in general, this feels especially embarrassing (and awkward).
(I think I see what I want my situation to look like moving forwards, but this entry is long enough without trying to get into that.)
Low level issues can have quite odd high level symptoms (again)
Let's start with my tweet from yesterday:
So the recently released Fedora 22 libreswan update appears to have broken IPSec tunnels for me. Good going. Debugging this will be hell.
This was quite consistent: if I installed the latest Fedora 22 update to libreswan, my IPSec based point to point tunnel stopped working. More specifically, my home end (running Fedora 22) could not do an IKE negotiation with my office machine. If I reverted back to the older libreswan version, everything worked. This is exactly the sort of thing that is hell to debug and hell to get as a bug report.
(Fedora's libreswan update jumped several point versions, from 3.13 to 3.16. There could be a lot of changes in there.)
Today I put some time into trying to narrow down the symptoms and what the new libreswan was doing. It was an odd problem, because tcpdump was claiming that the initial ISAKMP packets were going out from my home machine, but I didn't see them on my office machine or even on our exterior firewall. Given prior experiences I suspected that the new version of libreswan was setting up IPSec security associations that were blocking traffic and making tcpdump mislead me about whether packets were really getting out. But I couldn't see any sign of errant SPD entries and using tcpdump at the PPPoE level suggested very strongly that my ISAKMP packets really were being transmitted. But at the same time I could flip back and forth between libreswan versions, with one working and the other not. So in the end I did the obvious thing: I grabbed tcpdump output from a working session and a non-working session and started staring at them to see if anything looked different.
Reading the packet dumps, my eyes settled on this (non-working first, then working):
PPPoE [ses 0xdf7] IP (tos 0x0, ttl 64, id 10253, offset 0, flags [DF], proto UDP (17), length 1464) X.X.X.X.isakmp > Y.Y.Y.Y.isakmp: isakmp 2.0 msgid 00000000: parent_sa ikev2_init[I]: [...]
PPPoE [ses 0xdf7] IP (tos 0x0, ttl 64, id 32119, offset 0, flags [DF], proto UDP (17), length 1168) X.X.X.X.isakmp > Y.Y.Y.Y.isakmp: isakmp 2.0 msgid 00000000: parent_sa ikev2_init[I]: [...]
I noticed that the packet length was different. The working packet was significantly shorter and the non-working one was not too far from the 1492 byte MTU of the PPP link itself. A little light turned on in my head, and some quick tests with ping later I had my answer: my PPPoE PPP MTU was too high, and as a result something in the path between me and the outside world was dropping any too-big packets that my machine generated.
(It's probably the DSL modem and DSL hop, based on some tests with traceroute.)
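For anyone wanting to repeat this sort of ping testing less manually, here is a minimal sketch of one way to do it: a binary search over packet sizes with the DF bit set. This is not what I actually ran. It assumes Linux iputils ping (its -M do option forbids fragmentation), IPv4, and that everything up to some size passes and everything above it fails; the function names and the host are hypothetical.

```python
import subprocess
from typing import Callable


def largest_passing_size(probe: Callable[[int], bool], low: int, high: int) -> int:
    """Binary search for the largest size in [low, high] where probe(size) is True.

    Assumes probe(low) is True and that results are monotonic in size.
    """
    while low < high:
        mid = (low + high + 1) // 2  # round up so the loop always makes progress
        if probe(mid):
            low = mid
        else:
            high = mid - 1
    return low


def ping_probe(host: str) -> Callable[[int], bool]:
    """Build a probe that sends one DF ping whose IPv4 packet is `size` bytes."""
    def probe(size: int) -> bool:
        payload = size - 28  # 20 byte IPv4 header plus 8 byte ICMP header
        result = subprocess.run(
            ["ping", "-M", "do", "-c", "1", "-W", "2", "-s", str(payload), host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        return result.returncode == 0
    return probe
```

Something like largest_passing_size(ping_probe("host.example.org"), 576, 1492) would then report the largest packet that made it to that host and back. A single lost ping would be misread as 'too big', so a more careful version would retry each size a few times.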
The reason things broke with the newer libreswan was that the newer version added several more cipher choices, which pushed the size of the initial ISAKMP packet over the actual working MTU. With the DF bit set on the IP packet carrying the UDP datagram, there was basically no chance of the packet getting fragmented when it hit wherever the block was; instead it was just summarily dropped.
(I think I never saw issues with TCP connections because I'd long ago set a PPPoE option to clamp the MSS to 1412 bytes. So only UDP traffic would be affected, and of course I don't really do anything that generates large UDP packets. On the other hand, maybe this was a factor in an earlier mysterious network problem, which I eventually made go away by disabling SPDY in Firefox.)
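The sizes involved can be lined up with a little arithmetic. This is a sketch that assumes IPv4 with 20 byte IP and TCP headers, and that the 'length' tcpdump printed is the total IP packet size; the names are made up for illustration.

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
TCP_HEADER = 20   # bytes, the fixed TCP header

LINK_MTU = 1492     # configured PPP MTU
CLAMPED_MSS = 1412  # the PPPoE MSS clamp


def tcp_packet_size(mss: int) -> int:
    """Largest IPv4 packet a TCP connection with this MSS will send."""
    return mss + IPV4_HEADER + TCP_HEADER


def describe(length: int) -> str:
    """Compare an IP packet length against the clamped TCP size and the link MTU."""
    if length > LINK_MTU:
        return "over the link MTU"
    if length > tcp_packet_size(CLAMPED_MSS):
        return "fits the link MTU but is bigger than any clamped TCP packet"
    return "no bigger than a clamped TCP packet"


if __name__ == "__main__":
    print("largest clamped TCP packet:", tcp_packet_size(CLAMPED_MSS))
    for label, length in (("non-working ISAKMP", 1464), ("working ISAKMP", 1168)):
        print(f"{label} ({length} bytes): {describe(length)}")
```

The clamp holds TCP packets to 1452 bytes, while the non-working ISAKMP packet was 1464 bytes. That is consistent with the real limit sitting somewhere in between, although it only brackets the limit and doesn't pin it down.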
What this illustrates for me, once again, is that I simply can't predict what the high level symptoms are going to be for a low level network problem. Or, more usefully, given a high level problem I can't even be sure if it's actually due to some low level network issue or if it has a high level cause of its own (like 'code changes between 3.13 and 3.16').
Sidebar: what happens with my office machine's ISAKMP packets
My office machine is running libreswan 3.16 too, so I immediately wondered if its initial ISAKMP packets were also getting dropped because of this (which would mean that my IPSec tunnel would only come up when my home machine initiated it). Looking into this revealed something weird: while my office machine is sending out large UDP ISAKMP packets with the DF bit set, something is stripping DF off and then fragmenting those UDP packets before they get to my home machine. Based on some experimentation, the largest inbound UDP packet I can receive un-fragmented is 1436 bytes. The DF bit gets stripped regardless of the packet size.
(I suspect that my ISP's DSL PPP touchdown points are doing this. It's an obvious thing to do, really. Interestingly, the 1436 byte size restriction is smaller than the outbound MTU I can use.)