Network loops can have weird effects (at least sometimes)
Today we had a weird network problem on one of our most crucial networks, our port-isolated user machine network; this is the wired network used to connect laptops, most machines in people's offices, and so on. The only failure we could really see was that when the gateway firewall sent out a (broadcast) ARP request for a given IP, it would not see the (unicast) ARP reply from your machine. If your machine did something that caused the gateway to pick up its MAC, everything worked. Manually delete the ARP entry on the gateway, and the problem would be back. And rarely (often taking many minutes) an ARP reply would make it to the gateway and poof, everything would work again for your machine for a while until your ARP entry fell out of the gateway's ARP table.
There were several oddities about this. The biggest is that only
ARP replies were affected; you could, for example, ping back and
forth between your machine and elsewhere as long as the gateway had
you in its ARP table. Nor did we see any unusual network traffic
during this. We've seen our networks melt down on occasion (including
this one), with things like traffic floods, rogue DHCP servers, and
packet echoes, but nothing odd showed up in
tcpdump from multiple
vantage points. If anything maybe there was less extraneous broadcast
babbling than usual.
Given 'some packets are vanishing', we initially suspected malfunctioning switches; we've seen various downright weird things when this happens. So we swapped in spares for core top level switches (they) were basically the only common point in the switch fabric between all of the machines that were seeing problems) and of course nothing happened. It wasn't the gateway, because we could reproduce the problem with a number of other machines in the same top level network position (such as the DHCP server). We scratched our heads a lot, or at least I did, and eventually resorted to brute force instead of trying to come up with theories about what had broken how: as I mentioned on Twitter, we started systematically disconnecting bits of the network from the top down to see what had to be connected to make things go wrong.
As you already know from the title of this entry, the problem turned out to be a network loop. At the very periphery of the network (in one of the department's office areas), someone had plugged a little 5-port switch into two network drops at once, thereby creating a loop between two ports on one of our wiring closet leaf switches. This simple single-switch cross-connect was the root cause of all of our network problems.
Looking back at it after the fact, I can construct a theory about how this cross-connect caused the observed problems (although I have no idea if it's correct). But at the time I wouldn't have at all expected to see these symptoms from a network loop. So my moral for today is that the symptoms of network loops can be quite weird and not what I expect at all.
(For reasons beyond the scope of this entry, we do not have STP enabled on our switches. Under normal circumstances it's unnecessary, as all of our networks are strict (acyclic) trees.)
Things I learned from OpenSSH about reading very sensitive files
You may have heard that OpenSSH had an exploitable issue with some bad client code (which is actually two CVEs, CVE-2016-0777 and CVE-2016-0778). The issue was reported by Qualys Security, who released a fascinating and very detailed writeup on the issues. While the direct problem is basically the same as in Heartbleed, namely trusting an attacker-supplied length parameter and then sending back whatever happened to be sitting in memory, Qualys Security identified several issues that allowed private keys to leak through this issue despite OpenSSH's attempts to handle them securely. The specific issues are also fascinating in how they show just how hard it is to securely read sensitive files.
So here is what I have learned from OpenSSH about this:
- Do not use any sort of library level buffered IO. OpenSSH read
private keys with stdio, which left copies of them behind in stdio
buffers that were later
free()'d. If your data is sensitive enough that you are going to explicitly erase it later, you must insure that it never passes through buffers that you do not control (and then you zero the buffers afterwards).
(I suspect that I have Go code that fumbles this, although doing this in Go in general is at least a bit tricky.)
- Do not use any convenient form of memory or buffer handling that
magically reallocates a new buffer and copies data when you ask
a buffer to grow. The other way OpenSSH leaked private keys into
memory was through
realloc(), which may of course free the buffer you handed it and give you another one.
There are all sorts of convenient auto-growing buffers and objects in all sorts of languages that are going to be vulnerable to this. You need to avoid them all. Once again, explicitly handling all storage yourself is required (and explicitly erasing all of it).
- Do not use general but over-powerful facilities in security
sensitive code. OpenSSH apparently used a general 'load keypair'
function that read the private key too even though it only needed
the public key, which resulted in OpenSSH bringing into memory
(and exposing) the private keys for all public keys it checked,
not just the private key of the public key it was going to use
with the server.
It's easy to see how this happened, and strictly from a programming perspective it's the right answer. We have to load keypairs some times, so rather than have a 'load keypair' function and a 'load public key' function and so on, you just have a fully general 'load keypair' function and throw away the parts of results you don't need. But the result is that we loaded key data we didn't need into memory and then it leaked.
(Admittedly the issue in OpenSSH is somewhat complex, since you can remove the
.pubfile and force it to decrypt your private key file to recover the public key (see the comments for this entry).)
(There is also an additional issue where overly clever C compilers
can eliminate 'unnecessary'
memset() operations that are supposed
to erase the key data.)
The thing that scares me is that all of these are really easy mistakes to make, or rather really easy issues to overlook. I could have written code that did all three of them without even thinking about it, and I might have done so even if I was also writing code to carefully erase key data later.
Part of how they are so easy to overlook is that we are trained to
think in terms of abstractions, not the details underneath them.
Stdio gives you efficient IO (and maybe you remember that it involves
realloc() just magically grows the allocated space
(and oh yeah, sometimes it gives you a new area of memory, which
means the old one got freed), and so on. And in fact all of these
would have been harmless or mostly harmless in the absence of the
core 'server can get a copy of random memory' bug.