Today I learned that a syslog server can be very silent on the network
Let's start with my tweet:
TIL that a UDP based syslog server can be so quiet that it falls out of switch MAC tables, causing syslog msgs to it to flood everywhere.
We have a central syslog server that people running 'sandbox' machines point their servers at to aggregate all of their syslog data at a single point (and off individual servers). Mostly because we're old fashioned, it uses the plain old UDP based method of syslog forwarding where client machines simply fire UDP packets in its direction.
Today, due to various recent events and questions I was running tcpdump on one of the machines here that's on the same network as the syslog server, partly to see what kind of crud I would discover swirling around on the network (there is always some). To my surprise I saw a whole burst of syslog traffic that was going to the machine (and it wasn't broadcast, either); the traffic was coming from a bunch of machines behind one firewall. I scratched my head for a bit until the penny dropped that the machine had fallen out of switch MAC to port tables.
The direct reason this could happen is that a UDP based syslog server doesn't naturally send out any packets. Unlike TCP streams, where it would at least be sending out ACKs and so refreshing the switch MAC tables, receiving UDP streams is entirely passive. The machine does do some things that generate packets periodically (such as NTP), but apparently it was doing them so infrequently that at least some switches timed its MAC entry out and started flooding traffic far enough to get to my observer machine. At the same time, the gateway for the sandbox network that was sending the syslog traffic didn't time out its ARP entry, so it never re-ARPed and thus provoked the syslog server into generating some packets to re-prime the switches.
(Or perhaps the outgoing packets it did generate didn't flow over enough of the switches involved.)
There's two lessons I draw from this. First, MAC table timeouts may vary significantly across different machines. They can vary not only in how long they are but also in what keeps an MAC table entry active. Either the switches timed out their MAC entries much faster than the gateway or the gateway's entries stayed alive when they were used for outgoing traffic while the switches didn't do that.
(I can come up with at least a justification for why a switch should be fairly aggressive about aging out MACs that it hasn't seen traffic from. Incorrect switch MAC tables can do significant damage, so better safe than sorry if something is silent.)
The second is that it may take only a single switch losing a MAC entry to cause significant flooding. If a machine doesn't generate broadcast traffic, many switches may not have MAC entries for it in the first place (if traffic to or from it never transits them). If a top level switch loses the MAC entry, it will flood the traffic to all of its ports and thus to many of those switches, who then flood it down to all of their ports and so on. The narrower the normal traffic flow is (for example, if it's mostly between one gateway and the machine), the fewer switches there are that have the MAC association in the first place and thus are in a position to stop such a flood.
(There are probably all sorts of interesting dynamics in this
situation in terms of where outgoing traffic from the 'mostly silent'
machine goes, what switches it passes through, and thus whether or
not it will cause all of the relevant switches to pick up MAC entries
again. The moral here is that nothing beats forcing the machine to
generate some broadcast traffic in some way. There's direct traffic
generation with eg
arping, or there's just pinging a nonexistent
IP every few minutes to force an ARP broadcast.)