Monitoring the status of Linux network interfaces with Prometheus
Recently I wrote about how we found out a network cable had quietly
gone bad and dropped the link to 100 Mbits/sec
and mentioned in passing that we were now monitoring for this sort
of thing (and in comments, Ivan asked
how we were doing this). We're doing our alerts for this through
our existing Prometheus setup, using
the metrics that node_exporter extracts from Linux's
/sys/class/net
data for network interface status, which puts some limits on
what we can readily check.
To start with, you get the network interface's link speed in
node_network_speed_bytes. The values I've seen on our hardware
are 1250000000 (10G), 125000000 (1G), 12500000 (100M), and -125000
(for an interface that has been configured up but has no carrier).
If all of your network ports are at a single speed, say 1G (or 10G),
you can just alert on node_network_speed_bytes being anything
other than your normal speed. We have a mixture of speeds, so I had
to resort to a collection of alerts to cover all of the cases:
- any network interface at 100M, excluding the few exceptions with unless.
- our small collection of 10G connections not being at 10G.
- anything under 100M if its node_network_up is 1.
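As a rough sketch, the three cases above might look something like the following PromQL expressions, each of which would be its own alert rule. The instance and device names are made up for illustration and are not our real hosts, and the exact label matchers you need will depend on your setup.

```promql
# Any interface at 100M, except for known exceptions.
# (The instance names here are hypothetical placeholders.)
node_network_speed_bytes == 12500000
  unless node_network_speed_bytes{instance=~"oldhost1:9100|oldhost2:9100"}

# A 10G connection that is not at 10G.
# (The instance and device values are hypothetical placeholders.)
node_network_speed_bytes{instance="fileserver:9100", device="enp1s0f0"} != 1250000000

# Anything under 100M on an interface whose node_network_up is 1.
(node_network_speed_bytes < 12500000)
  and on (instance, device) (node_network_up == 1)
```

The 'on (instance, device)' in the last expression is there so that the match doesn't depend on the two metrics having identical label sets.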
An Ethernet interface that's been configured up but has no carrier
has a node_network_carrier of 0 and also a node_network_speed_bytes
that's negative (and it also has a node_network_up of 0). You
can use either metric to detect this state and alert on it, which
will find both unused network interfaces that your system has decided
to try to do DHCP on and network interfaces that are supposed to
be live but have no carrier. Unfortunately there's no way to detect
the inverse condition of an interface that has carrier but that
hasn't been configured up. The Linux kernel doesn't report on the
link carrier state for interfaces that aren't UP, and so
node_exporter has no metric that can detect this.
(I'd like to detect situations where an unused server port has live networking, either because a cable got plugged in or an existing disused cable became live. In our environment, either is a mistake we want to fix.)
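For the case we can detect, an interface that's configured up but has no carrier, either of the following expressions should work as a starting point. This is only a sketch. In practice you may need extra label matchers to exclude interfaces you don't care about.

```promql
# Interfaces that are configured up but have no carrier.
node_network_carrier == 0

# The same state as seen through the reported link speed.
node_network_speed_bytes < 0
```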
These days, almost all network links are full duplex. You can detect
links that have come up at half duplex by looking for a 'duplex="half"'
label in the node_network_info metric. Since not all network
interfaces have a duplex, you can't just look for 'duplex!="full"'.
Technically 1G Ethernet can be run at half duplex, although there's
nothing that should do this. 10G-T Ethernet is apparently full
duplex only.
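To make the difference between the two duplex matchers concrete, here is a small sketch. A negative label matcher in PromQL also matches time series where the label is empty, which is why the second form catches interfaces that have no duplex at all.

```promql
# Interfaces that have come up at half duplex.
node_network_info{duplex="half"}

# Too broad: this also matches interfaces whose duplex label is empty.
node_network_info{duplex!="full"}
```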
The node_network_up metric looks tempting but unfortunately
it's a combination of dangerous and pointless. node_network_up
is 1 if and only if the interface's operstate is 'up', and not
all live network interfaces are 'up' when they're working. Prominently,
the loopback ('lo') interface's normal operstate is 'unknown',
as are Wireguard interfaces (and PPP interfaces). In addition, an
operstate of 'up' requires there to be carrier on the interface.
Nor does node_network_up being 1 mean that everything is fine,
since an interface can be up without any IP addresses being configured
on it.
(But if you want to use node_network_up, you probably want to
use 'node_network_up != 1 and (node_network_protocol_type == 1)'.
This makes it conditional on the interface being an Ethernet
interface, so we know that operstate should be 'up' if it's
functional. This is sufficiently complicated that I would rather
look for up interfaces without carrier, since that's the only error
condition we can actually see for Ethernet interfaces.)
Unfortunately, as far as I know there are no metrics that will tell
you if an interface has IPv4 or IPv6 addresses configured on it
(whether or not it has carrier and so is up). The 'address' that
node_network_info and node_network_address_assign_type talk
about is the Ethernet address, not IP addresses (as you can see
from the values of the label in node_network_info). My
conclusion is that you need to check whatever IP addresses you need
to be up through the Blackbox exporter.
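If you already probe those IP addresses with the Blackbox exporter, the alert expression can be as simple as the sketch below. The job name is a hypothetical placeholder for whatever your Blackbox scrape job is called.

```promql
# probe_success is 0 when a Blackbox probe of the target fails.
# (The job name is a hypothetical placeholder.)
probe_success{job="blackbox_icmp"} == 0
```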
Given all of this, under normal circumstances, I think there are
three sensible alerts or sets of alerts for network interfaces. One
alert or set of alerts is for interface speed, based on
node_network_speed_bytes, requiring your interfaces to be at
their expected speeds. In many environments, you could then look
for node_network_carrier being 0 to detect interfaces that are
configured but don't have carrier. Finally, you might as well check
for half duplex with 'node_network_info{duplex="half"}'.
(It seems likely that a cable (or a port) that fails enough to force you down to half duplex will trigger other conditions as well, but who knows.)