Monitoring the status of Linux network interfaces with Prometheus

June 29, 2021

Recently I wrote about how we found out a network cable had quietly gone bad and dropped the link to 100 Mbits/sec and mentioned in passing that we were now monitoring for this sort of thing (and in comments, Ivan asked how we were doing this). We're doing our alerts for this through our existing Prometheus setup, using the metrics that node_exporter extracts from Linux's /sys/class/net data for network interface status, which puts some limits on what we can readily check.

To start with, you get the network interface's link speed in node_network_speed_bytes, in bytes per second. The values I've seen on our hardware are 1250000000 (10G), 125000000 (1G), 12500000 (100M), and -125000 (for an interface that has been configured up but has no carrier). If all of your network ports are at a single speed, say 1G (or 10G), you can just alert on node_network_speed_bytes being anything other than your normal speed. We have a mixture of speeds, so I had to resort to a collection of alerts to cover all of the cases.
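As an illustration of what a collection of speed alerts might look like, here is a minimal sketch of a Prometheus rules file. It is not the exact set of rules described above. The group name, alert names, the 'tengig-.*' instance pattern, the device names, and the 'for' durations are all assumptions that you would replace with whatever identifies your own machines and ports.

```yaml
groups:
  - name: network-speed-sketch   # hypothetical group name
    rules:
      # Machines whose main port is supposed to be 10G. The instance
      # pattern and device name are placeholders. Note that this also
      # fires when the port has no carrier, since the speed is then
      # negative rather than 1250000000.
      - alert: NetworkSpeedNot10G
        expr: node_network_speed_bytes{instance=~"tengig-.*", device="eno1"} != 1250000000
        for: 5m
        annotations:
          summary: "{{ $labels.instance }} {{ $labels.device }} is at {{ $value }} bytes/sec, expected 10G"

      # Any Ethernet-looking interface that has a link (positive speed)
      # but is running below 1G, such as a link that fell back to 100M.
      - alert: NetworkSpeedBelow1G
        expr: (node_network_speed_bytes{device=~"en.*|eth.*"} > 0) < 125000000
        for: 5m
        annotations:
          summary: "{{ $labels.instance }} {{ $labels.device }} is at {{ $value }} bytes/sec, below 1G"
```

The second rule filters on a positive speed first so that interfaces without carrier (which report -125000) are left to a separate no-carrier alert instead of showing up here as 'slow'.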

An Ethernet interface that's been configured up but has no carrier has a node_network_carrier of 0 and also a node_network_speed_bytes that's negative (and it also has a node_network_up of 0). You can use either metric to detect this state and alert on it, which will find both unused network interfaces that your system has decided to try to do DHCP on and network interfaces that are supposed to be live but have no carrier. Unfortunately there's no way to detect the inverse condition of an interface that has carrier but that hasn't been configured up. The Linux kernel doesn't report on the link carrier state for interfaces that aren't UP, and so node_exporter has no metric that can detect this.
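A sketch of an alert for the 'configured up but no carrier' state could look like the following. The group name, alert name, device pattern, and 'for' duration are assumptions; the device pattern in particular is only there to keep the illustration focused on Ethernet-style interface names.

```yaml
groups:
  - name: network-carrier-sketch   # hypothetical group name
    rules:
      # The interface has been configured up but has no carrier.
      # 'node_network_speed_bytes < 0' would detect the same state.
      - alert: NetworkNoCarrier
        expr: node_network_carrier{device=~"en.*|eth.*"} == 0
        for: 5m
        annotations:
          summary: "{{ $labels.instance }} {{ $labels.device }} is configured up but has no carrier"
```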

(I'd like to detect situations where an unused server port has live networking, either because a cable got plugged in or an existing disused cable became live. In our environment, either is a mistake we want to fix.)

These days, almost all network links are full duplex. You can detect links that have come up at half duplex by looking for a 'duplex="half"' label in the node_network_info metric. Since not all network interfaces have a duplex, you can't just look for 'duplex!="full"'. Technically 1G Ethernet can be run at half duplex, although there's nothing that should do this. 10GBASE-T Ethernet is apparently full duplex only.
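As a sketch, a half duplex alert can be a rule whose expression is just that label match. The group name, alert name, and 'for' duration here are assumptions.

```yaml
groups:
  - name: network-duplex-sketch   # hypothetical group name
    rules:
      # Matches only interfaces that explicitly report half duplex.
      # Interfaces with no duplex at all are not matched, which is why
      # this does not use duplex!="full".
      - alert: NetworkHalfDuplex
        expr: node_network_info{duplex="half"}
        for: 5m
        annotations:
          summary: "{{ $labels.instance }} {{ $labels.device }} has come up at half duplex"
```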

The node_network_up metric looks tempting but unfortunately it's a combination of dangerous and pointless. node_network_up is 1 if and only if the interface's operstate is 'up', and not all live network interfaces are 'up' when they're working. Most prominently, the normal operstate of the loopback ('lo') interface is 'unknown', and the same is true of WireGuard interfaces (and PPP interfaces). In addition, an operstate of 'up' requires there to be carrier on the interface. Nor does node_network_up being 1 mean that everything is fine, since an interface can be up without any IP addresses being configured on it.

(But if you want to use node_network_up, you probably want to use 'node_network_up != 1 and (node_network_protocol_type == 1)'. This makes it conditional on the interface being an Ethernet interface, so we know that operstate should be 'up' if it's functional. This is sufficiently complicated that I would rather look for up interfaces without carrier, since that's the only error condition we can actually see for Ethernet interfaces.)
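If you do go this way, the expression drops into an alert rule along these lines. This is only a sketch; the group name, alert name, and 'for' duration are assumptions.

```yaml
groups:
  - name: network-up-sketch   # hypothetical group name
    rules:
      # Only Ethernet interfaces (protocol type 1) are expected to have
      # an operstate of 'up', so loopback, WireGuard, and PPP
      # interfaces are excluded by the 'and' clause.
      - alert: EthernetNotUp
        expr: node_network_up != 1 and (node_network_protocol_type == 1)
        for: 5m
        annotations:
          summary: "{{ $labels.instance }} {{ $labels.device }} is an Ethernet interface whose operstate is not 'up'"
```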

Unfortunately, as far as I know there are no metrics that will tell you if an interface has IPv4 or IPv6 addresses configured on it (whether or not it has carrier and so is up). The 'address' that node_network_info and node_network_address_assign_type talk about is the Ethernet address, not IP addresses (as you can see from the values of the label in node_network_info). My conclusion is that you need to check whatever IP addresses you need to be up through the Blackbox exporter.
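A minimal sketch of the alerting side of that, assuming you already have a Prometheus scrape job that sends the IP addresses you care about through a Blackbox exporter ICMP (ping) module. The job label 'blackbox_icmp', the group name, the alert name, and the 'for' duration are all hypothetical.

```yaml
groups:
  - name: network-address-sketch   # hypothetical group name
    rules:
      # probe_success is 0 when the Blackbox exporter's probe of the
      # target failed, for example because the IP address is not
      # configured on any live interface.
      - alert: AddressNotAnswering
        expr: probe_success{job="blackbox_icmp"} == 0
        for: 5m
        annotations:
          summary: "{{ $labels.instance }} is not answering Blackbox probes"
```

This checks reachability from wherever the Blackbox exporter runs, so it can also fire for problems that have nothing to do with the interface itself.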

Given all of this, under normal circumstances, I think there are three sensible alerts or sets of alerts for network interfaces. One alert or set of alerts is for interface speed, based on node_network_speed_bytes, requiring your interfaces to be at their expected speeds. In many environments, you could then look for node_network_carrier being 0 to detect interfaces that are configured but don't have carrier. Finally, you might as well check for half duplex with 'node_network_info{duplex="half"}'.

(It seems likely that a cable (or a port) that fails enough to force you down to half duplex will trigger other conditions as well, but who knows.)

Written on 29 June 2021.