Monitoring the status of Linux network interfaces with Prometheus
Recently I wrote about how we found out a network cable had quietly
gone bad and dropped the link to 100 Mbits/sec
and mentioned in passing that we were now monitoring for this sort
of thing (and in comments, Ivan asked
how we were doing this). We're doing our alerts for this through
our existing Prometheus setup, using
the metrics that node_exporter extracts from Linux's
/sys/class/net data for network interface status, which puts some limits on
what we can readily check.
To start with, you get the network interface's link speed in
node_network_speed_bytes. The values I've seen on our hardware
are 1250000000 (10G), 125000000 (1G), 12500000 (100M), and -125000
(for an interface that has been configured up but has no carrier).
If all of your network ports are at a single speed, say 1G (or 10G),
you can just alert on
node_network_speed_bytes being anything
other than your normal speed. We have a mixture of speeds, so I had
to resort to a collection of alerts to cover all of the cases:
- any network interface at 100M, excluding the few exceptions
- our small collection of 10G connections not being at 10G
- anything under 100M if its
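
As a rough sketch (not our actual rules), the single-speed case and the first two cases in the list might look like the following Prometheus alert rules. The group name, alert names, the device pattern, the host names, and the 'for' durations are all hypothetical and only there for illustration. The speed values are the ones listed above.

```yaml
groups:
  - name: network-speed-sketch
    rules:
      # Single-speed environment: assumes every matching port should be 1G.
      # The device pattern is a hypothetical way to skip virtual interfaces.
      - alert: NetworkSpeedNot1G
        expr: node_network_speed_bytes{device=~"en.*|eth.*"} != 125000000
        for: 5m
        annotations:
          summary: "{{ $labels.instance }} {{ $labels.device }} is not at 1G"

      # Mixed speeds: anything at 100M, except a few known exceptions.
      # The excluded instance names are hypothetical.
      - alert: NetworkSpeed100M
        expr: node_network_speed_bytes{instance!~"oldbox1:9100|oldbox2:9100"} == 12500000
        for: 5m
        annotations:
          summary: "{{ $labels.instance }} {{ $labels.device }} is at 100M"

      # Mixed speeds: the known 10G ports not being at 10G.
      # The instance pattern and device name are hypothetical.
      - alert: NetworkSpeedNot10G
        expr: node_network_speed_bytes{instance=~"fs[0-9]+:9100", device="enp5s0f0"} != 1250000000
        for: 5m
        annotations:
          summary: "{{ $labels.instance }} {{ $labels.device }} is not at 10G"
```

Note that the 'not 1G' and 'not 10G' rules will also fire for an interface with no carrier, since its speed is reported as -125000.
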
An Ethernet interface that's been configured up but has no carrier
has a node_network_carrier of 0 and also a node_network_speed_bytes
that's negative (and it also has a
node_network_up of 0). You
can use either metric to detect this state and alert on it, which
will find both unused network interfaces that your system has decided
to try to do DHCP on and network interfaces that are supposed to
be live but have no carrier. Unfortunately there's no way to detect
the inverse condition of an interface that has carrier but that
hasn't been configured up. The Linux kernel doesn't report on the
link carrier state for interfaces that aren't UP, and so
node_exporter has no metric that can detect this.
(I'd like to detect situations where an unused server port has live networking, either because a cable got plugged in or an existing disused cable became live. In our environment, either is a mistake we want to fix.)
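
A minimal sketch of the 'up but no carrier' alert, using node_network_carrier, might look like this. The group name, alert name, severity label, and 'for' duration are hypothetical choices.

```yaml
groups:
  - name: network-carrier-sketch
    rules:
      # Fires for interfaces that are configured up but have no carrier.
      # Interfaces that are not UP don't appear here, as discussed above.
      - alert: NetworkNoCarrier
        expr: node_network_carrier == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} {{ $labels.device }} is up but has no carrier"
```
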
These days, almost all network links are full duplex. You can detect
links that have come up at half duplex by looking for a 'duplex="half"'
label in the
node_network_info metric. Since not all network
interfaces have a duplex, you can't just look for 'duplex!="full"'.
(Technically 1G Ethernet can be run at half duplex, although there's
nothing that should do this. 10G-T Ethernet is apparently full duplex only.)
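
As an illustration, a half duplex alert could be sketched as follows. The group name, alert name, and 'for' duration are hypothetical. The selector matches on the duplex label value directly, so interfaces without a duplex are not caught by accident.

```yaml
groups:
  - name: network-duplex-sketch
    rules:
      # node_network_info is an info-style metric with a value of 1;
      # the interesting part is the duplex label.
      - alert: NetworkHalfDuplex
        expr: node_network_info{duplex="half"} == 1
        for: 5m
        annotations:
          summary: "{{ $labels.instance }} {{ $labels.device }} is at half duplex"
```
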
The node_network_up metric looks tempting but unfortunately
it's a combination of dangerous and pointless. node_network_up
is 1 if and only if the interface's
operstate is 'up', and not
all live network interfaces are 'up' when they're working. Prominently,
the loopback ('lo') interface's normal operstate is 'unknown',
as is that of WireGuard interfaces (and PPP interfaces). In addition, an
operstate of 'up' requires there to be carrier on the interface.
Nor does node_network_up being 1 mean that everything is fine,
since an interface can be up without any IP addresses being configured
on it.
(But if you want to use
node_network_up, you probably want to use
'node_network_up != 1 and (node_network_protocol_type ==
1)'. This makes it conditional on the interface being an Ethernet
interface, so we know that
operstate should be 'up' if it's
functional. This is sufficiently complicated that I would rather
look for up interfaces without carrier, since that's the only error
condition we can actually see for Ethernet interfaces.)
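
If you do want to go this way anyway, the expression could be wrapped in an alert rule roughly like this. The group name, alert name, and 'for' duration are hypothetical. The two metrics are assumed to carry the same label set per interface, so the 'and' matches them up by instance and device.

```yaml
groups:
  - name: network-up-sketch
    rules:
      # Only consider Ethernet interfaces (protocol type 1), since their
      # operstate should be 'up' when they are working.
      - alert: NetworkEthernetNotUp
        expr: node_network_up != 1 and (node_network_protocol_type == 1)
        for: 5m
        annotations:
          summary: "{{ $labels.instance }} {{ $labels.device }} is Ethernet but its operstate is not 'up'"
```
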
Unfortunately, as far as I know there are no metrics that will tell
you if an interface has IPv4 or IPv6 addresses configured on it
(whether or not it has carrier and so is up). The 'address' that
node_network_info talks about is the Ethernet address, not IP addresses (as you can see
from the values of the label in that metric). The
conclusion is that you need to check whatever IP addresses you need
to be up through the Blackbox exporter.
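
A sketch of what such a check might look like, assuming a Blackbox exporter running on the Prometheus host at its default port with an 'icmp' module defined in its configuration. The job name, the target IP address, and the alert name are hypothetical. First, the scrape configuration for Prometheus itself:

```yaml
scrape_configs:
  - job_name: blackbox_icmp
    metrics_path: /probe
    params:
      module: [icmp]
    static_configs:
      - targets:
          - 192.0.2.10   # hypothetical IP address that must be up
    relabel_configs:
      # Pass the target as the 'target' URL parameter to the exporter,
      # keep it as the instance label, and scrape the exporter itself.
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9115
```

Then an alert rule on the result of the probe:

```yaml
groups:
  - name: ip-address-sketch
    rules:
      - alert: IPAddressNotResponding
        expr: probe_success{job="blackbox_icmp"} == 0
        for: 5m
        annotations:
          summary: "{{ $labels.instance }} is not answering pings"
```
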
Given all of this, under normal circumstances, I think there are
three sensible alerts or sets of alerts for network interfaces. One
alert or set of alerts is for interface speed, based on
node_network_speed_bytes, requiring your interfaces to be at
their expected speeds. In many environments, you could then look
for node_network_carrier being 0 to detect interfaces that are
configured but don't have carrier. Finally, you might as well check
for half duplex with 'node_network_info{duplex="half"}'.
(It seems likely that a cable (or a port) that fails enough to force you down to half duplex will trigger other conditions as well, but who knows.)