Ethernet network cables can go bad over time, with odd symptoms

June 26, 2021

Last week we got around to updating the kernels on all of our Ubuntu servers, including our Prometheus metrics server, which is directly connected to four networks. When the metrics server rebooted, one of those network interfaces flapped down and up for a bit, then suddenly had a lot of intermittent ping failures to machines on that subnet. At first I thought that this might be a network driver bug in the new kernel, but when I rebooted the server again the network interface came up at 100 Mbit/sec instead of 1 Gbit/sec and suddenly we had no more ping problems. When we replaced the network cable yesterday, that interface returned to the 1G it was supposed to be at and pinging things on that network may now be more reliable than before.

The first thing I take away from this is that network cables don't just fail cleanly, and when they do have problems your systems may or may not notice. Last week, the network port hardware on both our metrics server and the switch it was connected to spent hours thinking that the cable was fine at 1G when it manifestly wasn't.
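For anyone who wants to spot this sort of silent downshift by hand, the negotiated link speed is visible under /sys/class/net on Linux. Here is a minimal sketch that walks every interface; it assumes nothing about your interface names:

```shell
#!/bin/sh
# Print the negotiated speed (in Mbit/s) of every network interface.
# The 'speed' file reads -1, or errors out, for interfaces that are
# down or have no meaningful speed (such as the loopback device).
for iface in /sys/class/net/*; do
    speed=$(cat "$iface/speed" 2>/dev/null || echo "n/a")
    printf '%s: %s Mbit/s\n' "$(basename "$iface")" "$speed"
done
```

On a healthy 1G link you would expect to see 1000 here; the failure mode in this entry shows up as a quiet 100.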

For various reasons I wound up investigating how long this had been going on, using both old kernel logs on our syslog server and the network interface speed information captured by the Prometheus host agent. This revealed that the problem most likely started sometime between June and August of 2019, when the network link speed dropped to 100 Mbit/sec and stayed there, other than briefly after some reboots. Over all that time we didn't notice that the network interface was running at one step down from its expected speed, partly because we weren't doing anything performance-sensitive over it.
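As an illustration of what that history looks like, the Prometheus host agent (node_exporter) exposes each interface's speed as the node_network_speed_bytes metric, in bytes per second, so the record can be pulled up with a query along these lines (the instance and device labels here are placeholders, not our real names):

```promql
node_network_speed_bytes{instance="metrics-server:9100", device="eno1"}
# 1 Gbit/s shows up as 125000000; 100 Mbit/s as 12500000.
```

Graphed over a long enough range, the moment the link downshifted is a visible step in this metric.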

(We now have alerts for this, just in case it ever happens again.)
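A sketch of what such an alert rule could look like using node_exporter metrics; the threshold, duration, and annotation wording here are illustrative assumptions, not our actual configuration:

```yaml
groups:
  - name: network-link-speed
    rules:
      - alert: LinkSpeedDegraded
        # node_network_speed_bytes is in bytes/second: 1 Gbit/s = 125000000.
        # The chained comparison filters out down interfaces, whose speed
        # reads as -1.
        expr: 0 < node_network_speed_bytes < 125000000
        for: 15m
        annotations:
          summary: '{{ $labels.instance }} {{ $labels.device }} is below 1 Gbit/s'
```

In practice you would likely scope this with label matchers to the interfaces that are supposed to be at 1G, since some hardware legitimately runs slower.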

The second thing I take away from this is that network cables can fail in place even after they've been plugged in and working for months. This network cable wasn't necessarily completely undisturbed in our machine room, but at most it would have gotten brushed and moved around in the rack cable runs as we added and removed other network cables. But the cable still failed over time, either entirely on its own or with quite mild mechanical stress. It's possible that the cable was always flawed to some degree, but if so the flaws got worse, causing the cable to decay from a reliable 1G link down to 100M.

I don't think there's anything we can really do about this except to keep it in mind as a potential cause of otherwise odd or mysterious problems. We're definitely not going to recable everything with fresh cables just in case, and we're probably not even going to use freshly made or bought cables when we rack new machines.

(Over time we'll turn over our cable stock as we move to 10G, but it's going to be a long time before we have all of the machines there.)

Comments on this page:

I've found that it's not always the patch cable that's bad, sometimes it's the gold-plated spring fingers inside the ethernet jacks themselves. Multiple patch cable changes do not always fix dirty fingers.

A toothbrush can help. Keep in mind that ethernet jack fingers have variable sliding contact with the pins of the ethernet cable's plug, it's not just 'one spot' that is always used for electrical contact. This is why you can jiggle ethernet cables around a mm or so and the link (hopefully) doesn't go down.

Reminds me of a MySQL issue we once had... we had a high-traffic HA MySQL setup and one of the replicas was constantly falling behind, with replication lag steadily rising.

This particular server was not slower than the other replicas in any regard. CPU and disks were on par with the other replica servers, and there was basically no reason why it was behaving the way it was...

We first thought there was an issue with the SSD drives exhausting their write capacity or being close to full, but we soon observed the eth interface being capped at 100BASE-T. Switching the cable for a new one resolved all of the issues, of course, and the server caught up in a matter of 2-3 minutes.

I'm interested: how did you set up monitoring and alerting for this? Using Alertmanager/Prometheus or something else?

By Greg Ruben at 2021-06-29 04:33:54:

Thanks for the tip... I will bring up 'bonding' on all important servers (at least until home office/COVID).

By cks at 2021-06-29 23:59:51:

Ivan, we use our existing Prometheus and Alertmanager setup, mostly because it was already there and so easy. I wrote up a description of the metrics (and alerts) in PrometheusCheckingNetworkInterfaces.

Although I didn't fully emphasize it in the entry, checking the network interface state can't substitute for an end-to-end check through Blackbox or whatever. There are system problems that just can't be picked up by the existing interface metrics. But there are also problems that end-to-end checks probably won't detect (as we saw with our cable going bad and dropping down to 100M).

Throwback to years ago: I went into the data centre for something and saw five or six other staff all gathered at one end, so I went over to see what was up. In one mixed lab, the PCs were DHCPing just fine but the Macs were not. They had checked and/or replaced nearly everything but the uplink cable from the lab's switch to the router. Why not that cable? Well, "it's always been working, we've unplugged it and plugged it back in again, can't be the issue, but we've been here hours." So I replaced it and everything started working fine again. Don't know why, didn't care why.

Ever since, I try to do the "it's stupid, but it's easy" sort of fixes before I start digging into the real time-consuming stuff. It's almost never the easy stuff, but when it is, it sure feels good to have spent a couple of minutes on the "it's stupid but" fix rather than hours before finally ending up on that one.


