Low level issues can have quite odd high level symptoms (again)
Let's start with my tweet from yesterday:
So the recently released Fedora 22 libreswan update appears to have broken IPSec tunnels for me. Good going. Debugging this will be hell.
This was quite consistent: if I installed the latest Fedora 22 update to libreswan, my IPSec based point to point tunnel stopped working. More specifically, my home end (running Fedora 22) could not do an IKE negotiation with my office machine. If I reverted back to the older libreswan version, everything worked. This is exactly the sort of thing that is hell to debug and hell to get as a bug report.
(Fedora's libreswan update jumped several point versions, from 3.13 to 3.16. There could be a lot of changes in there.)
Today I put some time into trying to narrow down the symptoms and what the new libreswan was doing. It was an odd problem, because tcpdump was claiming that the initial ISAKMP packets were going out from my home machine, but I didn't see them on my office machine or even on our exterior firewall. Given prior experiences I suspected that the new version of libreswan was setting up IPSec security associations that were blocking traffic and making tcpdump mislead me about whether packets were really getting out. But I couldn't see any sign of errant SPD entries and using tcpdump at the PPPoE level suggested very strongly that my ISAKMP packets really were being transmitted. But at the same time I could flip back and forth between libreswan versions, with one working and the other not. So in the end I did the obvious thing: I grabbed tcpdump output from a working session and a non-working session and started staring at them to see if anything looked different.
Reading the packet dumps, my eyes settled on this (non-working first, then working):
    PPPoE [ses 0xdf7] IP (tos 0x0, ttl 64, id 10253, offset 0, flags [DF], proto UDP (17), length 1464)
        X.X.X.X.isakmp > Y.Y.Y.Y.isakmp: isakmp 2.0 msgid 00000000: parent_sa ikev2_init[I]: [...]

    PPPoE [ses 0xdf7] IP (tos 0x0, ttl 64, id 32119, offset 0, flags [DF], proto UDP (17), length 1168)
        X.X.X.X.isakmp > Y.Y.Y.Y.isakmp: isakmp 2.0 msgid 00000000: parent_sa ikev2_init[I]: [...]
I noticed that the packet length was different. The working packet was significantly shorter (1168 bytes) and the non-working one (1464 bytes) was not too far from the 1492 byte MTU of the PPP link itself. A little light turned on in my head, and some quick tests with ping later I had my answer: my PPPoE PPP MTU was too high, and as a result something in the path between me and the outside world was dropping any too-big packets that my machine generated.
(It's probably the DSL modem and DSL hop, based on some tests with traceroute.)
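For anyone who wants to repeat this sort of ping test, here is a rough sketch of how the search for the largest working packet size could be automated. It is not what I actually ran; it assumes a Linux iputils `ping` (where `-M do` sets DF and `-s` gives the ICMP payload size), IPv4 without options, and that once one size fails every larger size fails too. The function names are made up for illustration.

```python
import subprocess
from typing import Callable

IP_HEADER = 20    # IPv4 header, assuming no IP options
ICMP_HEADER = 8   # ICMP echo header


def ping_fits(host: str, ip_length: int) -> bool:
    """Send one DF-flagged ping with the given total IP length; True if it is answered."""
    payload = ip_length - IP_HEADER - ICMP_HEADER
    result = subprocess.run(
        ["ping", "-M", "do", "-c", "1", "-W", "2", "-s", str(payload), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def largest_working_size(fits: Callable[[int], bool], low: int, high: int) -> int:
    """Binary search for the largest size in [low, high] for which fits(size) is True.

    Assumes fits(low) is True and that sizes above the limit always fail.
    """
    while low < high:
        mid = (low + high + 1) // 2   # round up so the loop always makes progress
        if fits(mid):
            low = mid
        else:
            high = mid - 1
    return low
```

Something like `largest_working_size(lambda n: ping_fits("Y.Y.Y.Y", n), 576, 1492)` would then probe between a size that is assumed to work and the nominal PPP MTU. A single lost ping will throw the search off, so any result is worth re-checking by hand.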
The reason things broke with the newer libreswan was that the newer version added several more cipher choices, which pushed the size of the initial ISAKMP packet over the actual working MTU. With the DF bit set on the IP packet carrying the UDP datagram, there was basically no chance of the packet getting fragmented when it hit wherever the block was; instead it was just summarily dropped.
(I think I never saw issues with TCP connections because I'd long ago set a PPPoE option to clamp the MSS to 1412 bytes. So only UDP traffic would be affected, and of course I don't really do anything that generates large UDP packets. On the other hand, maybe this was a factor in an earlier mysterious network problem, which I eventually made go away by disabling SPDY in Firefox.)
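The MSS clamp arithmetic can be sketched in a few lines. This assumes plain IPv4 with no IP options, where the largest TCP packet is the MSS plus 20 bytes of IP header and 20 bytes of TCP header; the helper name is hypothetical. The packet lengths are the ones from the tcpdump output above.

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
TCP_HEADER = 20   # bytes, base TCP header (options eat into the MSS, not on top of it)


def ip_length_for_mss(mss: int) -> int:
    """Largest IPv4 packet a TCP connection will send with this MSS."""
    return mss + IPV4_HEADER + TCP_HEADER


clamped = ip_length_for_mss(1412)
for label, length in [("non-working ISAKMP", 1464), ("working ISAKMP", 1168)]:
    relation = "larger than" if length > clamped else "no larger than"
    print(f"{label} packet ({length} bytes) is {relation} "
          f"the biggest clamped TCP packet ({clamped} bytes)")
```

So my clamped TCP traffic never sent anything as large as the failing ISAKMP packet. That is consistent with the real limit sitting somewhere at or above the clamped size and below 1464 bytes, although this arithmetic alone doesn't establish where.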
What this illustrates for me, once again, is that I simply can't predict what the high level symptoms are going to be for a low level network problem. Or, more usefully, given a high level problem I can't even be sure if it's actually due to some low level network issue or if it has a high level cause of its own (like 'code changes between 3.13 and 3.16').
Sidebar: what happens with my office machine's ISAKMP packets
My office machine is running libreswan 3.16 too, so I immediately wondered if its initial ISAKMP packets were also getting dropped because of this (which would mean that my IPSec tunnel would only come up when my home machine initiated it). Looking into this revealed something weird: while my office machine is sending out large UDP ISAKMP packets with the DF bit set, something is stripping DF off and then fragmenting those UDP packets before they get to my home machine. Based on some experimentation, the largest inbound UDP packet I can receive un-fragmented is 1436 bytes. The DF bit gets stripped regardless of the packet size.
(I suspect that my ISP's DSL PPP touchdown points are doing this. It's an obvious thing to do, really. Interestingly, the 1436 byte size restriction is smaller than the outbound MTU I can use.)