The case of the mysteriously failing connections

July 11, 2008

One of the strange networking mysteries around here is that every so often, one of our login servers will report that outgoing mail was delayed because it could not connect to the mail server's SMTP port. There are several things that make this puzzling:

  • the connection is failing with 'host not reachable' errors, not 'connection refused' or the like
  • the mail server is up, running fine, and not loaded at all
  • the login servers and the mail server are on the same subnet, although they are not connected to the same switch.
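(As an illustration of why the first point matters, here is a rough Python sketch of how a probe could tell these failure modes apart by errno. The function names and the idea of running a probe at all are my assumptions for illustration; this is not something we actually run.)

```python
import errno
import socket


def classify_connect_error(err):
    """Map an OSError from a failed TCP connect to a short label."""
    # A connect that hits the socket timeout carries no errno of its own.
    if isinstance(err, socket.timeout):
        return "timed out"
    labels = {
        errno.EHOSTUNREACH: "host unreachable",      # e.g. no ARP answer on the local subnet
        errno.ENETUNREACH: "network unreachable",
        errno.ECONNREFUSED: "connection refused",    # host is up, nothing listening
        errno.ETIMEDOUT: "timed out",
    }
    return labels.get(err.errno, "other: %s" % err)


def probe(host, port=25, timeout=10.0):
    """Try one TCP connection; return "ok" or a failure label."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except OSError as err:
        return classify_connect_error(err)
```

A 'connection refused' result would point at the mail server itself, while 'host unreachable' between two machines on the same subnet points at something lower down, such as address resolution.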

This happens very infrequently, and every time we've seen it happen it's gone away when the mailer retried a bit later (which is one reason we haven't worried about it more).

Like the last mystery, I don't have any answers, but I do have a theory. First, the background: our login servers are all on a single switch, along with our compute servers. We know that during periods of high activity the switch is sending 'stop transmitting' Ethernet flow control frames to the login servers; we believe that the switch's uplink is saturated, since it's only got a gigabit uplink and is connecting eight or nine actively used machines that get all the important filesystems over NFS.

(We actually split the machines between two switches moderately recently; I don't know if we've seen the problem since then.)

So my theory is that during periods of high network activity when the switch is choked, the login server's ARP requests for the mail server's Ethernet address are getting dropped (either by the switch or by the login server's network driver). Linux does report 'host unreachable' if there's no answer to its ARP queries, and people send email from the login servers sufficiently infrequently that the necessary information could drop out of the local ARP cache.
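(One way to test the ARP cache part of this theory would be to check, at the moment of a failure, whether the login server has a completed ARP entry for the mail server. Below is a minimal Python sketch that reads Linux's /proc/net/arp. The function names are hypothetical and the IP address in the usage note is a placeholder, not our real mail server.)

```python
def parse_arp_table(text):
    """Parse the contents of /proc/net/arp into a dict keyed by IP address."""
    entries = {}
    # The first line is the column header:
    # IP address, HW type, Flags, HW address, Mask, Device
    for line in text.splitlines()[1:]:
        fields = line.split()
        if len(fields) < 6:
            continue
        ip, _hw_type, flags, mac, _mask, device = fields[:6]
        entries[ip] = {
            "mac": mac,
            "device": device,
            # Flag bit 0x2 (ATF_COM) marks a completed (resolved) entry.
            "complete": bool(int(flags, 16) & 0x2),
        }
    return entries


def arp_state(ip, path="/proc/net/arp"):
    """Return "complete", "incomplete", or "absent" for one IP address."""
    with open(path) as f:
        entries = parse_arp_table(f.read())
    entry = entries.get(ip)
    if entry is None:
        return "absent"
    return "complete" if entry["complete"] else "incomplete"
```

Calling something like arp_state("192.0.2.10") right after a failed delivery and getting "incomplete" or "absent" would be consistent with the theory, although it would not say whether the switch or the driver dropped the ARP traffic.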


Comments on this page:

From 93.80.166.248 at 2008-07-12 00:20:12:

Did you check the number of errors on the interface? netstat -in should be helpful in such situations. Maybe you only need to replace a cable...

By cks at 2008-07-12 15:28:08:

The interfaces show no errors, and the machines are heavily enough used that a problematic cable would have much more obvious symptoms.

(And it turns out that splitting the machines between two switches has not eliminated the problem; I happened to notice a case of it happening the other day.)

From 93.80.179.93 at 2008-07-14 12:45:00:

how did you check your server's load?

did you try to reproduce this problem without the network? (if you connect to localhost on the mail server?)

i don't know which MTA software you use, but how big is its backlog parameter in listen(2)?

how many ip addresses do you have on your network interfaces?

did you look at tcpdump to see whether your login server tries to request the mail server's MAC address again or not?

what are the values of "sysctl -a | grep arp" on both your hosts?

A lot of questions, but there is not enough information to say for certain that the switches are at fault. It may also be both the servers and the software...

By cks at 2008-07-14 21:51:26:

The problem is fairly rare and doesn't last for very long; essentially by the time we can notice it, it is too late to do anything like run tcpdump. The mail server is lightly loaded as reported by uptime (and because it doesn't process much mail; it is used only for outgoing user-written email), arp settings are the default values, and all of the network interfaces involved have no aliases.

Written on 11 July 2008.

Last modified: Fri Jul 11 00:43:46 2008