Low level issues can have quite odd high level symptoms (again)

January 26, 2016

Let's start with my tweet from yesterday:

So the recently released Fedora 22 libreswan update appears to have broken IPSec tunnels for me. Good going. Debugging this will be hell.

This was quite consistent: if I installed the latest Fedora 22 update to libreswan, my IPSec based point to point tunnel stopped working. More specifically, my home end (running Fedora 22) could not do an IKE negotiation with my office machine. If I reverted back to the older libreswan version, everything worked. This is exactly the sort of thing that is hell to debug and hell to get as a bug report.

(Fedora's libreswan update jumped several point versions, from 3.13 to 3.16. There could be a lot of changes in there.)

Today I put some time into trying to narrow down the symptoms and what the new libreswan was doing. It was an odd problem, because tcpdump was claiming that the initial ISAKMP packets were going out from my home machine, but I didn't see them on my office machine or even on our exterior firewall. Given prior experiences I suspected that the new version of libreswan was setting up IPSec security associations that were blocking traffic and making tcpdump mislead me about whether packets were really getting out. But I couldn't see any sign of errant SPD entries and using tcpdump at the PPPoE level suggested very strongly that my ISAKMP packets really were being transmitted. But at the same time I could flip back and forth between libreswan versions, with one working and the other not. So in the end I did the obvious thing: I grabbed tcpdump output from a working session and a non-working session and started staring at them to see if anything looked different.
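
(By "signs of errant SPD entries" and "tcpdump at the PPPoE level" I mean checks along the following lines; the interface names here are illustrative assumptions, not necessarily what my machine actually calls things, and port 500 is the standard ISAKMP/IKE port.)

   ip xfrm policy                                # list the kernel's IPSec security policies (the SPD)
   ip xfrm state                                 # list the IPSec security associations themselves
   tcpdump -n -i ppp0 'udp port 500'             # what the PPP interface claims is going out
   tcpdump -n -i eth1 'pppoes and udp port 500'  # the same traffic seen as PPPoE frames on the underlying Ethernet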

Reading the packet dumps, my eyes settled on this (non-working first, then working):

PPPoE [ses 0xdf7] IP (tos 0x0, ttl 64, id 10253, offset 0, flags [DF], proto UDP (17), length 1464)
   X.X.X.X.isakmp > Y.Y.Y.Y.isakmp: isakmp 2.0 msgid 00000000: parent_sa ikev2_init[I]:
   [...]

PPPoE [ses 0xdf7] IP (tos 0x0, ttl 64, id 32119, offset 0, flags [DF], proto UDP (17), length 1168)
   X.X.X.X.isakmp > Y.Y.Y.Y.isakmp: isakmp 2.0 msgid 00000000: parent_sa ikev2_init[I]:
   [...]

I noticed that the packet length was different. The working packet was significantly shorter and the non-working one was not too far from the 1492 byte MTU of the PPP link itself. A little light turned on in my head, and some quick tests with ping later, I had my answer: my PPPoE PPP MTU was too high, and as a result something in the path between me and the outside world was dropping any too-big packets that my machine generated.

(It's probably the DSL modem and DSL hop, based on some tests with traceroute.)
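
(For what it's worth, the quick ping tests were along these lines. On Linux the -s payload size plus 28 bytes of ICMP and IPv4 headers gives the on-the-wire size, so these two probe sizes roughly match the failing and working ISAKMP packets above; the exact numbers are illustrative, and Y.Y.Y.Y is the office machine's IP as in the packet dumps.)

   ping -c 3 -M do -s 1436 Y.Y.Y.Y   # a 1464-byte DF packet, the size of the failing ISAKMP packet; silently lost
   ping -c 3 -M do -s 1140 Y.Y.Y.Y   # an 1168-byte DF packet, the size of the working one; gets replies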

The reason things broke with the newer libreswan was that the newer version added several more cipher choices, which pushed the size of the initial ISAKMP packet over the actual working MTU. With the DF bit set in the UDP packet, there was basically no chance of the packet getting fragmented when it hit wherever the block was; instead it was just summarily dropped.

(I think I never saw issues with TCP connections because I'd long ago set a PPPoE option to clamp the MSS to 1412 bytes. So only UDP traffic would be affected, and of course I don't really do anything that generates large UDP packets. On the other hand, maybe this was a factor in an earlier mysterious network problem, which I eventually made go away by disabling SPDY in Firefox.)
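
(For completeness: PPPoE-specific options aside, the generic way to clamp TCP MSS on Linux is an iptables rule that rewrites the MSS option on outgoing SYNs. A sketch, assuming the PPP link is called ppp0 and using the same 1412 byte value:)

   # rewrite the TCP MSS on SYN packets leaving via the PPP link
   iptables -t mangle -A POSTROUTING -o ppp0 -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss 1412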

What this illustrates for me, once again, is that I simply can't predict what the high level symptoms are going to be for a low level network problem. Or, more usefully, given a high level problem I can't even be sure if it's actually due to some low level network issue or if it has a high level cause of its own (like 'code changes between 3.13 and 3.16').

Sidebar: what happens with my office machine's ISAKMP packets

My office machine is running libreswan 3.16 too, so I immediately wondered if its initial ISAKMP packets were also getting dropped because of this (which would mean that my IPSec tunnel would only come up when my home machine initiated it). Looking into this revealed something weird: while my office machine is sending out large UDP ISAKMP packets with the DF bit set, something is stripping DF off and then fragmenting those UDP packets before they get to my home machine. Based on some experimentation, the largest inbound UDP packet I can receive un-fragmented is 1436 bytes. The DF bit gets stripped regardless of the packet size.

(I suspect that my ISP's DSL PPP touchdown points are doing this. It's an obvious thing to do, really. Interestingly, the 1436 byte size restriction is smaller than the outbound MTU I can use.)
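
(One way to watch for this sort of thing is plain tcpdump on the PPP interface; -v prints the IP flags, so a stripped DF shows up directly, and there's a classic filter expression for fragments. As before, ppp0 is an assumed interface name.)

   tcpdump -n -v -i ppp0 'udp port 500'        # -v shows the IP flags, eg [DF] or [none], on inbound ISAKMP packets
   tcpdump -n -i ppp0 'ip[6:2] & 0x3fff != 0'  # the classic filter for IP fragments (MF bit set or nonzero offset)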


Comments on this page:

I wonder if there isn’t a way to write some kind of test suite that checks for known problems like this, so you could run it and get a sanity check that your setup works at least within certain parameters. There must be a better way to discover problems than waiting until you do something that exercises some functionality.

By Ewen McNeill at 2016-01-26 14:48:49:

MTU issues are a common (these days) cause of "works, until I try to do a bit more" -- it's particularly obvious with TCP, as (without MSS clamping) the TCP connection will hang as soon as something happens to cause non-trivial data to flow. And while MSS clamping will work around the TCP issue, UDP/ICMP/etc are still vulnerable to the problems. (As presumably is IPv6 if you've clamped only IPv4 TCP MSS....) I've encountered it enough over the last few years with tunnels involved that I typically explicitly test for it when debugging "leaves here, doesn't get there" issues that seem to only affect non-trivially sized frames...

The usual way to debug path MTU issues is with a ping packet with "DF" (don't fragment) set, and then gradually ramping up the size. This is somewhat complicated by various ping implementations having both different syntax and different ideas of how to interpret the "size" parameter -- so it's worth looking at the resulting sizes in a packet capture. But basically if:

   ping -M do DEST    # default size, usually 56 byte payload; 64 inc ICMP

works, and:

   ping -M do -s NN DEST

works, but:

   ping -M do -s (NN+1) DEST

does not, then "NN plus headers added by ping" (eg, ICMP/IPv4) is the actual path MTU (typically most easily read out of a packet capture looking at the largest size where you're still getting return ping echos).

If the actual path MTU found this way is lower than what you expect (eg, lower than your local interface MTU), you have one or more tunnels/encapsulations/lower MTU hops that you haven't taken into account... One fix, if there's a tunnel endpoint at/near at least one end (ideally both) that all the traffic goes into, is to ensure that the tunnel MTU is small enough that the resulting encapsulated packets fit through the path MTU. And if the issue is that it breaks some tunnel you're trying to create, the usual idea is to make the MTU of that tunnel small enough to "fit with the overhead added". (Which of course doesn't help if it's the tunnel setup negotiation that is failing.... :-) )
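
(Concretely, shrinking a tunnel's MTU is usually just something like the following, with "mytun" standing in for whatever the tunnel interface is actually called and 1400 picked to leave room for the encapsulation overhead.)

   ip link set dev mytun mtu 1400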

BTW, the above "ping" examples are in the Linux syntax -- "-M do" means "do set the DF bit", which is a poor choice of option syntax -- and the size in Linux is the "ping packet padding" size, so there's an 8-octet ICMP header added, plus a (usually) 20-octet IP(v4) header added; on some devices, eg, routers, it's the total including the ICMP header and/or the total including the IP(v4) header as well.
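
(As a concrete worked example of that arithmetic on Linux:)

   ping -M do -s 1464 DEST    # 1464 + 8 (ICMP) + 20 (IPv4) = a 1492-byte packet, the standard PPPoE MTU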

Also FWIW, Linux used to do some fairly detailed tracking of discovered path MTUs in older kernels (tracked in the route cache), which could help with some of this (by keeping a cache of path MTU for a longer time) if the path MTU discovery ICMP came back; but the route cache was removed in Linux 3.6, so IIRC this is now rediscovered per-connection. (The rationale was apparently that "many destinations" sites -- like Google -- saw poor route cache efficiency; but I think it probably helped "few destinations" hosts with special cases like lower path MTU quite a bit.)

Ewen

PS: It's fairly common for "middle of transport provider" encapsulation to be configured to strip or ignore "DF" bits, especially if it's operating in the form of a layer 2 bridge (and thus can't meaningfully send ICMP "too big" to make path MTU discovery work). One tries to avoid such "smaller links in the middle" in general, but sometimes there's little choice. Especially where the outer links are 1500 byte MTU (ie, standard Ethernet) and the middle link is some encapsulation over standard 1500-byte Ethernet.... and thus ends up being smaller.

I know the route cache was removed. Except Facebook also just optimized it, upstream, about 6 months ago. Most confusing.

Facebook's optimization is supposed to only create routing cache entries when there's PMTU or other exceptions to record. And there's a stackoverflow answer that interprets the route cache removal as doing the same thing...
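
(As far as I can tell, those per-destination exceptions are still visible from user space on a modern kernel; if a path MTU exception has been recorded for a destination, something like the following shows it. Y.Y.Y.Y is again a stand-in destination.)

   ip route get Y.Y.Y.Y    # a recorded PMTU exception shows up as 'mtu NNNN' in the cache attributes of the output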

Written on 26 January 2016.