Low level issues can have quite odd high level symptoms (again)

January 26, 2016

Let's start with my tweet from yesterday:

So the recently released Fedora 22 libreswan update appears to have broken IPSec tunnels for me. Good going. Debugging this will be hell.

This was quite consistent: if I installed the latest Fedora 22 update to libreswan, my IPSec based point to point tunnel stopped working. More specifically, my home end (running Fedora 22) could not do an IKE negotiation with my office machine. If I reverted to the older libreswan version, everything worked. This is exactly the sort of thing that is hell to debug and hell to get as a bug report.

(Fedora's libreswan update jumped several point versions, from 3.13 to 3.16. There could be a lot of changes in there.)

Today I put some time into trying to narrow down the symptoms and what the new libreswan was doing. It was an odd problem, because tcpdump was claiming that the initial ISAKMP packets were going out from my home machine, but I didn't see them on my office machine or even on our exterior firewall. Given prior experiences I suspected that the new version of libreswan was setting up IPSec security associations that were blocking traffic and making tcpdump mislead me about whether packets were really getting out. But I couldn't see any sign of errant SPD entries and using tcpdump at the PPPoE level suggested very strongly that my ISAKMP packets really were being transmitted. But at the same time I could flip back and forth between libreswan versions, with one working and the other not. So in the end I did the obvious thing: I grabbed tcpdump output from a working session and a non-working session and started staring at them to see if anything looked different.

Reading the packet dumps, my eyes settled on this (non-working first, then working):

PPPoE [ses 0xdf7] IP (tos 0x0, ttl 64, id 10253, offset 0, flags [DF], proto UDP (17), length 1464)
   X.X.X.X.isakmp > Y.Y.Y.Y.isakmp: isakmp 2.0 msgid 00000000: parent_sa ikev2_init[I]:

PPPoE [ses 0xdf7] IP (tos 0x0, ttl 64, id 32119, offset 0, flags [DF], proto UDP (17), length 1168)
   X.X.X.X.isakmp > Y.Y.Y.Y.isakmp: isakmp 2.0 msgid 00000000: parent_sa ikev2_init[I]:

I noticed that the packet length was different. The working packet was significantly shorter and the non-working one was not too far from the 1492 byte MTU of the PPP link itself. A little light turned on in my head, and some quick tests with ping later I had my answer: my PPPoE PPP MTU was too high, and as a result something in the path between me and the outside world was dropping any too-big packets that my machine generated.

(It's probably the DSL modem and DSL hop, based on some tests with traceroute.)
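
A rough sketch of how this sort of ping testing could be automated, assuming a Linux iputils `ping` (where `-M do` sets DF and `-s` gives the ICMP payload size). The helper names `ping_probe` and `largest_working_size` are made up for illustration, and the search assumes that every size up to some limit works and everything above it fails:

```python
import subprocess
from typing import Callable

IP_HEADER = 20    # IPv4 header without options
ICMP_HEADER = 8   # ICMP echo header


def ping_probe(host: str, packet_len: int) -> bool:
    """Send one DF-flagged ping whose total IPv4 length is packet_len.

    Assumes Linux iputils ping: '-M do' forbids fragmentation,
    '-s' is the ICMP payload size, '-W' is the reply timeout in seconds.
    """
    payload = packet_len - IP_HEADER - ICMP_HEADER
    result = subprocess.run(
        ["ping", "-M", "do", "-c", "1", "-W", "2", "-s", str(payload), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def largest_working_size(probe: Callable[[int], bool], low: int, high: int) -> int:
    """Binary search for the largest packet length in [low, high] that gets through.

    Assumes `low` is known to work and that sizes above the answer all fail.
    """
    while low < high:
        mid = (low + high + 1) // 2
        if probe(mid):
            low = mid          # mid got through, so the limit is at least mid
        else:
            high = mid - 1     # mid was dropped, so the limit is below mid
    return low
```

Something like `largest_working_size(lambda n: ping_probe("Y.Y.Y.Y", n), 1168, 1464)` would bracket the search between the working and non-working packet lengths from the dumps. A single lost ping would mislead the search, so a real version would probably want to retry each size a few times.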

The reason things broke with the newer libreswan was that the newer version added several more cipher choices, which pushed the size of the initial ISAKMP packet over the actual working MTU. With the DF bit set in the IP header of the UDP packet, there was basically no chance of the packet getting fragmented when it hit wherever the block was; instead it was just summarily dropped.

(I think I never saw issues with TCP connections because I'd long ago set a PPPoE option to clamp the MSS to 1412 bytes. So only UDP traffic would be affected, and of course I don't really do anything that generates large UDP packets. On the other hand, maybe this was a factor in an earlier mysterious network problem, which I eventually made go away by disabling SPDY in Firefox.)
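
The arithmetic behind that guess can be sketched as follows, assuming plain IPv4 with 20-byte IP and TCP headers (TCP options come out of the segment's data allowance rather than adding to the packet size). It only shows that clamped TCP packets stay below the size that was observed to fail; it does not say where the real limit is:

```python
IPV4_HEADER = 20  # bytes, no IP options
TCP_HEADER = 20   # bytes, base TCP header


def max_tcp_packet(mss: int) -> int:
    """Largest IPv4 packet a TCP connection should send for a given MSS."""
    return mss + IPV4_HEADER + TCP_HEADER


clamped = max_tcp_packet(1412)   # the MSS clamp mentioned above
failing_isakmp = 1464            # IP length of the dropped ISAKMP packet
working_isakmp = 1168            # IP length of the ISAKMP packet that got through

print(f"largest clamped TCP packet: {clamped} bytes")
print(f"below the known-bad size:   {clamped < failing_isakmp}")
```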

What this illustrates for me, once again, is that I simply can't predict what the high level symptoms are going to be for a low level network problem. Or, more usefully, given a high level problem I can't even be sure if it's actually due to some low level network issue or if it has a high level cause of its own (like 'code changes between 3.13 and 3.16').

Sidebar: what happens with my office machine's ISAKMP packets

My office machine is running libreswan 3.16 too, so I immediately wondered if its initial ISAKMP packets were also getting dropped because of this (which would mean that my IPSec tunnel would only come up when my home machine initiated it). Looking into this revealed something weird: while my office machine is sending out large UDP ISAKMP packets with the DF bit set, something is stripping DF off and then fragmenting those UDP packets before they get to my home machine. Based on some experimentation, the largest inbound UDP packet I can receive un-fragmented is 1436 bytes. The DF bit gets stripped regardless of the packet size.

(I suspect that my ISP's DSL PPP touchdown points are doing this. It's an obvious thing to do, really. Interestingly, the 1436 byte size restriction is smaller than the outbound MTU I can use.)
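
To make the fragmentation concrete, here is a minimal sketch of how an IPv4 packet gets split once DF has been cleared. It treats 1436 bytes as the total IPv4 length limit on the inbound path, which is an assumption (it could equally be a measure of the UDP payload), and the 1464-byte input is only an illustrative size borrowed from the outbound dump. The function name `fragment_lengths` is hypothetical:

```python
def fragment_lengths(total_len: int, mtu: int, header: int = 20) -> list[int]:
    """Return the IPv4 total length of each fragment of a packet.

    Every fragment carries its own IP header, and every fragment except
    the last must carry a multiple of 8 bytes of data, because fragment
    offsets are expressed in 8-byte units.
    """
    if total_len <= mtu:
        return [total_len]                    # fits, no fragmentation needed
    payload = total_len - header              # data bytes to spread out
    per_fragment = (mtu - header) // 8 * 8    # data bytes per full fragment
    lengths = []
    while payload > per_fragment:
        lengths.append(header + per_fragment)
        payload -= per_fragment
    lengths.append(header + payload)          # last, possibly short, fragment
    return lengths


print(fragment_lengths(1464, 1436))
print(fragment_lengths(1168, 1436))
```

Under these assumptions a 1464-byte packet arrives as one full-sized fragment plus a small trailing one, while a 1168-byte packet passes through untouched.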

Written on 26 January 2016.