I failed to notice when my network performance became terrible
I've written about how I didn't notice how comparatively slow one of my machines became over time. I have now run into another excellent and uncomfortable illustration of this phenomenon; this time around, part of my home network's performance quietly became rather terrible.
My current home Internet is generally around 15000 Kbps down and 7500 Kbps up; its speed is stable and solid. I also have a little GRE over IPSec VPN tunnel between my home and work Linux machines, and somewhat over a year ago I used it to do some graphics-intensive remote X work, which quite impressed me at the time. Unfortunately, at some point since then the performance of that GRE-over-IPSec VPN tunnel fell off a cliff. Today, my office machine can send my home machine data over it at only about 120 KB/sec; another machine on campus that's several network hops away from my office machine can manage only 4.7 KB/sec. Talking directly to my home machine without the GRE-over-IPSec tunnel, both can manage around 1800 KB/sec.
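To put those numbers on one scale, here is a small arithmetic sketch. It assumes that 'Kbps' means 1000 bits per second and 'KB' means 1000 bytes, which the text doesn't actually specify; the function and variable names are made up for illustration.

```python
# Assumption: "Kbps" is 1000 bits/sec and "KB" is 1000 bytes.
LINE_RATE_KBPS = 15000  # downstream figure quoted in the text


def kbps_to_kbytes_per_sec(kbps):
    """Convert kilobits per second to kilobytes per second."""
    return kbps / 8


def percent_of_line_rate(kbytes_per_sec, line_rate):
    """Express a measured rate as a percentage of the line rate."""
    return 100 * kbytes_per_sec / line_rate


line_rate = kbps_to_kbytes_per_sec(LINE_RATE_KBPS)  # 1875.0 KB/sec

# KB/sec figures quoted in the text.
measurements = {
    "direct, no tunnel": 1800,
    "tunnel, from office machine": 120,
    "tunnel, from other campus machine": 4.7,
}

for name, rate in measurements.items():
    pct = percent_of_line_rate(rate, line_rate)
    print(f"{name}: {rate} KB/sec is {pct:.2f}% of line rate")
```

Under those assumptions the direct path runs at about 96% of the nominal downstream rate, while the tunnel manages roughly 6% from the office machine and well under 1% from the other campus machine.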
In one version of this story, I would now tell you how I didn't notice the decrease in performance. Looking back, that isn't what happened; instead, I noticed signs of the decrease but I casually blamed them on other causes. For instance, when rsyncing a backup copy of Wandering Thoughts to my home machine started being visibly slow, I thought 'oh, my disks must be slow'. When merely refreshing the front page of Wandering Thoughts involved a visible lurch due to the browser redoing layout as more of the page showed up, I assumed that either my browser was slow or the web server was slow (or both). In reality all of these had a single root cause, that being that I can only get 5 KB/sec of streaming TCP bandwidth from the web server.
(It was actually the slow rsyncs that caused me to start digging recently. Not only did things reach the point where they were actively irritating, but my office workstation did its own rsync at blazing speed, and I was now using SSDs at home anyway, so they shouldn't be slow. If I had a slow IO problem at home, I had real problems either with my SSDs or with ZFS, so I decided I'd better try to figure out what was going on. Eventually this got me to check network bandwidth just in case, since I was increasingly ruling out everything I could think of, like disk IO or network latency.)
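A streaming TCP bandwidth check of this sort doesn't need much. The following is a minimal sketch in Python, not the tool that was actually used here. To stay self-contained it runs both the sender and the receiver on the loopback address; for a real measurement the receiving half would have to run on the machine at the far end of the path being tested. All of the names in it are made up for illustration.

```python
import socket
import threading
import time

CHUNK = 64 * 1024  # bytes per send/recv call


def receive_all(server_sock, result):
    """Accept one connection and count bytes until the sender closes it."""
    conn, _ = server_sock.accept()
    total = 0
    with conn:
        while True:
            data = conn.recv(CHUNK)
            if not data:
                break
            total += len(data)
    result["bytes"] = total


def measure_throughput(host="127.0.0.1", total_bytes=8 * 1024 * 1024):
    """Send total_bytes over one TCP connection.

    Returns (bytes received, approximate KB/sec), with KB = 1000 bytes.
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind((host, 0))  # port 0: let the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]

    result = {}
    receiver = threading.Thread(target=receive_all, args=(server, result))
    receiver.start()

    payload = b"\0" * CHUNK
    sent = 0
    start = time.monotonic()
    with socket.create_connection((host, port)) as client:
        while sent < total_bytes:
            n = min(CHUNK, total_bytes - sent)
            client.sendall(payload[:n])
            sent += n
    receiver.join()  # wait until the receiver has seen every byte
    elapsed = max(time.monotonic() - start, 1e-9)  # avoid division by zero
    server.close()
    return result["bytes"], result["bytes"] / 1000 / elapsed


if __name__ == "__main__":
    received, rate = measure_throughput()
    print(f"received {received} bytes at about {rate:.0f} KB/sec")
```

Timing stops only after the receiver has counted everything, so the figure reflects data that actually arrived and not just data handed to the local kernel's socket buffers. On loopback the result says nothing about a real network; the point is only the shape of the measurement.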
What interests me most is the psychology of all of this. I'm pretty sure that when problems started, I just assumed that they were inevitable and more or less beyond my control. Since I thought there was nothing that could be done, I didn't pay any real attention to things and I certainly didn't investigate. All of this is the result of sensible human decision-making heuristics, but these heuristics misfire every so often.
(And now I'm irritated with myself for not investigating much earlier, when I might have been able to file a bug report that gave a specific 'it was good here then became bad here' set of software versions. This is as irrational as ever, but humans are not rational creatures even if we like to pretend that we are.)
Sidebar: What I know about the situation so far
My home and office machines are both running Fedora 26, but this problem was present in Fedora 25 as well and perhaps in earlier Fedora versions. I'm pretty sure that it can't have been present when I did my trick with remote X, but that's more than a year ago, with Fedora 23.
The problem is specific to the combination of GRE over IPSec. I've tested IPSec alone and GRE alone (in their actual operating configuration), and both get the full 1800 KB/sec down that I'd expect. Only when I encapsulate my GRE tunnel in IPSec do things go wrong. Conveniently (or inconveniently), this means that the problem is entirely in the Linux kernel, so diagnosing this will probably be what they call 'fun'.
(My GRE tunnel has the same relatively low MTU whether it's inside or outside IPSec, and it's had that MTU for a very long time now.)