Ethernet network cables can go bad over time, with odd symptoms
Last week we got around to updating the kernels on all of our Ubuntu servers, including our Prometheus metrics server, which is directly connected to four networks. When the metrics server rebooted, one of those network interfaces flapped down and up for a bit, then suddenly had a lot of intermittent ping failures to machines on that subnet. At first I thought that this might be a network driver bug in the new kernel, but when I rebooted the server again the network interface came up at 100 Mbit/sec instead of 1 Gbit/sec and suddenly we had no more ping problems. When we replaced the network cable yesterday, that interface returned to the 1G it was supposed to be at and pinging things on that network may now be more reliable than before.
The first thing I take away from this is that network cables don't just fail cleanly, and when they do have problems your systems may or may not notice. Last week, the network port hardware on both our metrics server and the switch it was connected to spent hours thinking that the cable was fine at 1G when it manifestly wasn't.
For various reasons I wound up investigating how long this had been going on, using both old kernel logs on our syslog server and the network interface speed information captured by the Prometheus host agent. This revealed that the problem most likely started sometime between June and August of 2019, when the network link speed dropped to 100 Mbit/sec and stayed there, apart from brief periods after some reboots. Over all that time, we didn't notice that the network interface was running one step down from its expected rate, partly because we weren't doing anything performance sensitive over it.
(We now have alerts for this, just in case it ever happens again.)
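As an illustration of the sort of check involved, here is a minimal sketch in Python that compares each interface's negotiated speed against what it should be. It is not necessarily how our alerts work. It assumes a Linux machine, where the kernel reports the link speed in Mbit/sec in /sys/class/net/<interface>/speed. The interface names and expected speeds in it are made up for the example.

```python
from pathlib import Path


def read_link_speed(iface, sysfs_root="/sys/class/net"):
    """Return the negotiated link speed in Mbit/sec, or None if unknown."""
    try:
        text = (Path(sysfs_root) / iface / "speed").read_text().strip()
        speed = int(text)
    except (OSError, ValueError):
        # The read can fail outright if the interface doesn't exist or
        # the link is down, so treat any failure as "unknown".
        return None
    # The kernel uses -1 for "speed unknown".
    return speed if speed > 0 else None


def find_slow_links(expected, sysfs_root="/sys/class/net"):
    """Return (iface, expected, actual) for links below their expected speed.

    'expected' maps interface names to the speed in Mbit/sec that they
    should have. An unknown speed is reported too, with actual as None.
    """
    slow = []
    for iface, want in sorted(expected.items()):
        have = read_link_speed(iface, sysfs_root)
        if have is None or have < want:
            slow.append((iface, want, have))
    return slow


if __name__ == "__main__":
    # Hypothetical interface names and speeds; adjust for the real machine.
    EXPECTED_MBITS = {"eno1": 1000, "eno2": 1000}
    for iface, want, have in find_slow_links(EXPECTED_MBITS):
        actual = "unknown" if have is None else f"{have} Mbit/sec"
        print(f"{iface}: expected {want} Mbit/sec, link is at {actual}")
```

A check like this only catches the case where the link has visibly negotiated down. It would not have caught the period when both ends believed the cable was fine at 1G while pings were failing.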
The second thing I take away from this is that network cables can fail in place even after they've been plugged in and working for months. This network cable wasn't necessarily completely undisturbed in our machine room, but at most it would have gotten brushed and moved around in the rack cable runs as we added and removed other network cables. But the cable still failed over time, either entirely on its own or with quite mild mechanical stress. It's possible that the cable was always flawed to some degree, but if so the flaws got worse, causing the cable to decay from a reliable 1G link down to 100M.
I don't think there's anything we can really do about this except to keep it in mind as a potential cause of otherwise odd or mysterious problems. We're definitely not going to recable everything with fresh cables just in case, and we're probably not even going to use freshly made or bought cables when we rack new machines.
(Over time we'll turn over our cable stock as we move to 10G, but it's going to be a long time before we have all of the machines there.)