The case of the mysteriously failing connections
One of the strange networking mysteries around here is that every so often, one of our login servers will report that outgoing mail was delayed because it could not connect to the mail server's SMTP port. There are several things that make this puzzling:
- the connection is failing with 'host not reachable' errors, not 'connection refused' or the like
- the mail server is up, running fine, and not loaded at all
- the login servers and the mail server are on the same subnet, although they are not connected to the same switch.
This happens very infrequently, and every time we've seen it happen it's gone away when the mailer retried a bit later (which is one reason we haven't worried about it more).
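The distinction between 'host not reachable' and 'connection refused' matters here, because the two come from different layers: a refusal means the remote machine answered, while an unreachable error means the connection attempt never got that far. A minimal sketch of how a test client could tell them apart is below. The function names are made up for illustration, and it assumes the failures show up as the usual errno values on the `OSError` that Python raises.

```python
import errno
import socket


def classify_errno(err):
    """Map an errno value from a failed connect() to a short label."""
    if err == errno.EHOSTUNREACH:
        # Reported locally or via ICMP; the remote TCP stack never answered.
        return "host unreachable"
    if err == errno.ECONNREFUSED:
        # The remote host answered with a reset: it is up, the port is closed.
        return "connection refused"
    if err == errno.ETIMEDOUT:
        return "timed out"
    return "other"


def classify_connect_failure(host, port, timeout=5.0):
    """Try one TCP connection and return a label for the outcome."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"
    except socket.timeout:
        # Python's own timeout; errno is usually not set on this exception.
        return "timed out"
    except OSError as exc:
        return classify_errno(exc.errno)
```

Running something like `classify_connect_failure("mailserver.example.com", 25)` (a placeholder hostname) from a login server during a busy period would show which kind of failure the mailer is actually hitting.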
Like the last mystery, I don't have any answers, but I do have a theory. First, the background: our login servers are all on a single switch, along with our compute servers. We know that during periods of high activity the switch is sending 'stop transmitting' Ethernet flow control frames to the login servers. We believe that the switch's uplink is saturated, since it has only a gigabit uplink and connects eight or nine actively used machines that get all of their important filesystems over NFS.
(We actually split the machines between two switches moderately recently; I don't know if we've seen the problem since then.)
So my theory is that during periods of high network activity when the switch is choked, the login server's ARP requests for the mail server's Ethernet address are getting dropped (either by the switch or by the login server's network driver). Linux does report 'host unreachable' if there's no answer to its ARP queries, and people send email from the login servers sufficiently infrequently that the necessary information could drop out of the local ARP cache.
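One way to gather evidence for or against this theory would be to look at the login server's ARP cache when a failure happens and see whether the mail server's entry is present, missing, or stuck incomplete. The sketch below parses the text format of Linux's `/proc/net/arp`. The helper names and the sample address are hypothetical, and it assumes the usual six-column layout where the `Flags` column has the `0x2` bit (`ATF_COM`) set for a resolved entry.

```python
ATF_COM = 0x02  # "completed entry" flag, as defined in <linux/if_arp.h>


def parse_arp_table(text):
    """Parse /proc/net/arp-style text into a dict keyed by IP address.

    If an IP appears on several devices, the last line wins; that is a
    simplification that is fine for a single-interface login server.
    """
    entries = {}
    for line in text.splitlines()[1:]:  # first line is the column header
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        ip, _hw_type, flags, mac, _mask, device = fields[:6]
        entries[ip] = {
            "mac": mac,
            "device": device,
            "complete": bool(int(flags, 16) & ATF_COM),
        }
    return entries


def arp_state(ip, text):
    """Return 'complete', 'incomplete', or 'absent' for one IP address."""
    entry = parse_arp_table(text).get(ip)
    if entry is None:
        return "absent"
    return "complete" if entry["complete"] else "incomplete"


def read_arp_table(path="/proc/net/arp"):
    """Read the live ARP table on a Linux machine."""
    with open(path) as f:
        return f.read()
```

Logging `arp_state("192.0.2.25", read_arp_table())` (with the mail server's real address in place of the documentation address) every few seconds alongside the mailer's logs would show whether the failures line up with an 'incomplete' or 'absent' entry, which is what the theory predicts.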