2015-03-13
The puzzle of packets to your host that your host doesn't respond to
Today we tried to replace an old machine with a newly built version of it. We built the new version using a temporary name and IP address, then at the transition time shut down the old version, reconfigured the new version to use the real name, primary IP address, and IP aliases, and rebooted it so it would come up with the new configuration. Unfortunately, when it came up it had a very weird problem: machines on the local network could talk to all of its IP addresses, but machines on other networks could only talk to its primary IP address, not any of the IP aliases. From other networks, the IP aliases simply didn't respond to packets.
To make it more mysterious, during the troubleshooting attempt my coworker ran tcpdump on the server itself and actually saw his pings to an IP alias coming in but not being answered:
www.cs # tcpdump -i eth0 "host workstation.cs"
08:16:22.929594 IP workstation.cs -> support.cs: ICMP echo request, id 18707, seq 33, length 64
08:16:23.929546 IP workstation.cs -> support.cs: ICMP echo request, id 18707, seq 33, length 64
Then after a while (but not a short while) the problem went away; you could ping and otherwise talk to the IP aliases from machines on other networks. Oh, and we could reproduce this (we did it when failing back to the old version, which made us very alarmed).
(There's no firewall involved here, just to cover that.)
What's going on here is the inverse of something I've seen before with outgoing traffic:
You can't tell if packets are really going to your machine without checking the destination Ethernet address. The destination IP alone is not good enough.
Sure, these packets look like they're going to our server. But
actually they aren't; they're being sent to the Ethernet address
of the old version of the server, not the Ethernet address of the
current one. The new version of the server is seeing them for two
reasons. First, the switches on the network have aged out the
Ethernet address to port association for the old Ethernet address,
so the switches have to flood these packets to all ports. Second,
tcpdump is running in its default promiscuous mode so it's picking
up this flooded traffic (and displaying only the IP level information).
The kernel knows better and is quietly ignoring these packets just
like it ignores all sorts of other random crud that shows up on the
network port.
(If we weren't running tcpdump with the interface in promiscuous mode, the packets probably would be ignored at the hardware level and not even reach the kernel.)
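To make the check concrete, here is a minimal Python sketch of the decision the kernel (or the network hardware) is effectively making: look at the first six bytes of the Ethernet frame, the destination address, and compare them to our own. The MAC addresses and the helper names (parse_mac, classify_frame, make_frame) are made up for illustration and are not from our actual setup. On a live system, tcpdump's -e option prints the link-level header, which is the quick way to see the destination Ethernet address of what you're capturing.

```python
BROADCAST = b"\xff" * 6


def parse_mac(text):
    """Turn 'aa:bb:cc:dd:ee:ff' into six raw bytes."""
    parts = text.split(":")
    if len(parts) != 6:
        raise ValueError(f"not a MAC address: {text!r}")
    return bytes(int(part, 16) for part in parts)


def classify_frame(frame, our_mac):
    """Say whether a raw Ethernet frame is addressed to our_mac.

    The destination address is the first six bytes of the frame;
    the IP header further in plays no part in this decision.
    """
    if len(frame) < 14:
        raise ValueError("too short to be an Ethernet frame")
    dst = frame[:6]
    if dst == our_mac:
        return "ours"
    if dst == BROADCAST:
        return "broadcast"
    if dst[0] & 0x01:  # low bit of the first byte marks a group address
        return "multicast"
    return "not ours"


# Hypothetical addresses, purely for illustration.
OLD_SERVER = parse_mac("02:00:00:00:00:01")
NEW_SERVER = parse_mac("02:00:00:00:00:02")
ROUTER = parse_mac("02:00:00:00:00:fe")


def make_frame(dst, src, payload=b""):
    """Build a bare Ethernet header with EtherType 0x0800 (IPv4)."""
    return dst + src + b"\x08\x00" + payload


# The router still has the old server's Ethernet address cached.
stale = make_frame(OLD_SERVER, ROUTER)
fresh = make_frame(NEW_SERVER, ROUTER)

print(classify_frame(stale, NEW_SERVER))  # not ours
print(classify_frame(fresh, NEW_SERVER))  # ours
```

Both frames could carry an IP packet with exactly the same destination IP; only the Ethernet header tells them apart, which is why the default tcpdump output was so misleading here.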
The reason that the packets had the old Ethernet address is that our top level router was caching the IP to Ethernet address association for a surprisingly long time. Hosts on the local network were directly re-ARPing for the IP aliases and getting the new server's Ethernet address, so they could talk to it, but packets from other networks went through the router and the router just used the old Ethernet address it had cached. As for traffic to the server's primary IP working, we think that the Ethernet address for the server's primary IP was getting updated on the router because the server generates outgoing traffic from that IP address, forcing the router to update. The problem went away after a while because the router timed out its cached Ethernet address information, re-ARPed, and finally had the correct new Ethernet addresses.
(Once we went searching on the Internet, we discovered that this is known behavior of our particular make of router. Fortunately there's a way to forcefully purge such a cached entry; unfortunately we're going to have to remember to do this on any migration or manual failover of any machine that has IP aliases. And it's a good thing we're not trying to do automated failover of IP aliases between machines.)
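For what it's worth, the behavior we think we saw can be sketched as a toy model of a router's cache of IP to Ethernet address associations. This is only an illustration under stated assumptions: the ArpCache class, the addresses, and the four hour timeout are all invented (I'm not quoting our router's real timeout), and the idea that the primary IP's entry gets refreshed by the server's outgoing traffic is our guess, not something we verified.

```python
class ArpCache:
    """Toy IP -> Ethernet address cache with a fixed entry lifetime."""

    def __init__(self, timeout):
        self.timeout = timeout
        self._entries = {}  # ip -> (mac, time the entry was learned)

    def learn(self, ip, mac, now):
        """Record (or overwrite) the Ethernet address for an IP."""
        self._entries[ip] = (mac, now)

    def lookup(self, ip, now):
        """Return the cached address, or None if absent or expired."""
        entry = self._entries.get(ip)
        if entry is None:
            return None
        mac, learned_at = entry
        if now - learned_at >= self.timeout:
            del self._entries[ip]  # expired; a real router would re-ARP
            return None
        return mac

    def purge(self, ip):
        """Forcefully drop an entry, as an administrator might."""
        self._entries.pop(ip, None)


# Hypothetical addresses and timeout, purely for illustration.
OLD_MAC = "02:00:00:00:00:01"
NEW_MAC = "02:00:00:00:00:02"
PRIMARY_IP = "192.0.2.10"
ALIAS_IP = "192.0.2.11"

router = ArpCache(timeout=4 * 3600)
router.learn(PRIMARY_IP, OLD_MAC, now=0)
router.learn(ALIAS_IP, OLD_MAC, now=0)

# The new server comes up and sends traffic from its primary IP only.
# Assumption: the router updates that one entry from what it sees.
router.learn(PRIMARY_IP, NEW_MAC, now=60)

print(router.lookup(PRIMARY_IP, now=120))       # 02:00:00:00:00:02
print(router.lookup(ALIAS_IP, now=120))         # 02:00:00:00:00:01 (stale)
print(router.lookup(ALIAS_IP, now=4 * 3600))    # None (timed out)
```

In this model, calling router.purge(ALIAS_IP) right after the migration has the same effect as waiting out the timeout: the next lookup returns None, and a real router would then have to re-ARP and learn the new server's Ethernet address.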