The likely cause of my IPSec dropped packet mystery
I believe that I've identified the cause of my mysterious dropped GRE tunnel packets that showed up in recent kernels. The short description of the cause is recursive routing leading to a path MTU collapse.
Explaining this is going to take some verbiage. Back when I set up my GRE tunnel, I wrote:
My current trick is routing the subnet that the target of the tunnel is on over the tunnel itself, which makes my head hurt.
Let me make this concrete. The GRE tunnel target is 220.127.116.11, and as part of my dual identity routing I have a route:
ip route add 18.104.22.168/24 dev extun
Let us call the tunnel target T, my machine's inside address I, and my machine's outside address O (because all of these are much shorter and clearer than writing out the IP addresses in full). In an environment with policy based routing it's possible to see how all of this works; because the tunnel is explicitly specified as being from O to T, it is forced to ignore the route to T's subnet that would normally send the GRE-encapsulated traffic to T back over the tunnel. This still works even if you talk directly to T without specifying a source address; your plain TCP connection will be routed over the GRE tunnel (and get the source address of I), and then the encapsulated version will be routed over the regular connection since it now comes from O.
(It's possible that the kernel is smart enough to do this even without policy based routing, but I haven't tested that.)
Because the GRE tunnel is an encapsulation over my regular link, it has a lower MTU than the regular link. This means that traffic going from I to T has a lower (path) MTU than traffic going from O to T.
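As a rough sketch of where the lower MTU comes from: plain GRE over IPv4 adds a 20-byte outer IP header and a 4-byte GRE header, and each optional GRE field (checksum, key, sequence number) adds 4 more bytes. The 1500-byte link MTU and the function name below are assumptions for illustration, not values taken from my actual setup.

```shell
#!/bin/sh
# Illustrative arithmetic only. gre_tunnel_mtu is a hypothetical helper name.
# $1: MTU of the underlying link (1500 is an assumed Ethernet-style value)
# $2: number of optional 4-byte GRE fields in use (default 0)
gre_tunnel_mtu() {
    link_mtu=$1
    optional_fields=${2:-0}
    # 20 bytes outer IPv4 header + 4 bytes basic GRE header
    echo $((link_mtu - 20 - 4 - 4 * optional_fields))
}

gre_tunnel_mtu 1500      # prints 1476
gre_tunnel_mtu 1500 1    # prints 1472 (for example, with a GRE key)
```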
In old kernels, all of this worked fine, and in particular the kernel kept the path MTUs of the two versions of traffic to T separate. In recent kernels, this appears to have changed; it looks like there is only a single path MTU for T, regardless of the path to it. The consequence is that when I start a TCP conversation with T over the GRE tunnel, the path MTU to T almost immediately collapses down to 552 octets (the default minimum path MTU). I assume that this is happening due to a recursive series of path MTU reductions; first the GRE tunnel reduces the normal MTU to T down to the GRE tunnel's MTU, then the encapsulation code notices that the GRE tunnel's MTU doesn't fit in the MTU to T and chops it in turn, and things repeat until the kernel won't let the MTU go any lower.
(There appears to be a minimum MTU for the GRE tunnel that is over 552 octets. Once the MTU to T shrinks too far and I try to talk to anything over the GRE tunnel, I see a series of locally generated rejections of the form 'ICMP 22.214.171.124 unreachable - need to frag (mtu 478), length 556'. Another diagnostic is that the transmit error count shown by 'ip -s -s link show dev extun' keeps counting up.)
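The suspected feedback loop can be illustrated with a toy model. This is only a sketch of the reasoning above, not the kernel's actual logic: it assumes a single shared path MTU for T, a starting value of 1500, a fixed 24 bytes of GRE-over-IPv4 overhead, and the default floor of 552. The function name is hypothetical.

```shell
#!/bin/sh
# Toy model: every round, the tunnel lowers the shared path MTU to T by its
# encapsulation overhead, until the value is clamped at the minimum path MTU.
# $1: starting path MTU, $2: per-round overhead (must be > 0), $3: floor
collapse_pmtu() {
    pmtu=$1
    overhead=$2
    floor=$3
    rounds=0
    while [ "$pmtu" -gt "$floor" ]; do
        pmtu=$((pmtu - overhead))
        # The kernel will not let the path MTU drop below the floor.
        if [ "$pmtu" -lt "$floor" ]; then
            pmtu=$floor
        fi
        rounds=$((rounds + 1))
    done
    # Print the final path MTU and how many reductions it took.
    echo "$pmtu $rounds"
}

collapse_pmtu 1500 24 552    # prints "552 40"
```

Under these assumptions the shared value hits the floor after a few dozen reductions, which fits the collapse looking almost immediate in practice.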
One can see some of this by inspecting the routing cache with 'ip route show table cache'. However, flushing the cache ('ip route flush table cache') does not help; it seems that in current kernels, this routing cache is not the real fully authoritative source of this information. (I am not up enough on Linux networking to understand what is going on here.)
This problem can be avoided to a certain extent by creating a host route for T that sends traffic for it explicitly over the underlying link, not the GRE tunnel. However, you will still provoke the problem if you force traffic for T to go over the GRE tunnel (for example, by specifying a source IP of I so that policy based routing kicks in); the host route only protects against accidents.
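A minimal sketch of such a host route follows. The gateway address (192.0.2.1) and interface name (eth0) are placeholders I have made up for illustration, and the helper only prints the command so that it can be checked before being run as root.

```shell
#!/bin/sh
# host_route_cmd is a hypothetical helper: it prints (does not run) an
# 'ip route add' command for a /32 host route over the underlying link.
# $1: tunnel target T, $2: next-hop gateway, $3: underlying interface
host_route_cmd() {
    printf 'ip route add %s/32 via %s dev %s\n' "$1" "$2" "$3"
}

# Placeholder gateway and device; substitute the real ones for your link.
host_route_cmd 220.127.116.11 192.0.2.1 eth0
```

Once the route is in place, 'ip route get' for T's address should report the underlying interface rather than extun, at least for traffic without an explicit source address.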
(Much of my understanding of what's going on has been developed through interacting with Eric Dumazet on the Linux kernel netdev mailing list, and in skimming netdev in general. Without Eric's questions in response to my initial bug report, I would never have been able to work out what's going on.)
Sidebar: useful sysctls and other things
There are two potentially useful sysctls in /proc/sys/net/ipv4/route. min_pmtu sets the minimum path MTU, and is normally 552. mtu_expires sets how long (in seconds) learned path MTUs stick around, and is normally ten minutes; I believe that setting it to a low value does not expire already-learned path MTUs. There is a seductive looking flush sysctl entry in the same directory, but I was unable to get it to do anything useful in testing; whatever it's flushing is not what is grimly holding on to a bad path MTU.
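For convenience, here is a small sketch that prints the current values of the two sysctls. The function name is hypothetical, and the optional directory argument exists only so the function can be pointed somewhere other than the real /proc directory.

```shell
#!/bin/sh
# show_pmtu_sysctls is a hypothetical helper that prints min_pmtu and
# mtu_expires from the given directory (default: the real sysctl directory).
show_pmtu_sysctls() {
    dir=${1:-/proc/sys/net/ipv4/route}
    for name in min_pmtu mtu_expires; do
        if [ -r "$dir/$name" ]; then
            printf '%s = %s\n' "$name" "$(cat "$dir/$name")"
        else
            printf '%s = (not readable)\n' "$name"
        fi
    done
}

show_pmtu_sysctls
```

Reading these values needs no special privileges; changing them (by writing to the same files as root) is a separate step that this sketch deliberately leaves out.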