Normal situations should not be warnings (especially not repeated ones)

February 9, 2021

Every so often (or really, too often), people with good intentions build a program that looks at some things or does some things, and they decide to have that program emit warnings or set status results if things are not quite perfect and as expected. This is a mistake, and it makes system administrators who have to deal with the program unhappy. An ordinary system configuration should not cause a program to raise warnings or error markers, even if it doesn't allow all of the things that a program is capable of doing (or that the program wants to do by default). In addition, every warning should be rate-limited in any situation that can plausibly emit them regularly.
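To make the rate-limiting point concrete, here is a minimal sketch of what a rate-limited warning could look like in shell. This is a hypothetical helper of my own invention (the name warn_ratelimited and the message text are made up, not from any real program): it only emits a given warning if enough time has passed since the last one.

```shell
# Hypothetical sketch of rate-limiting a repeating warning in a shell
# program: emit the message only if at least $2 seconds have passed
# since the last emission. 'warn_ratelimited' is an invented name.
warn_ratelimited() {
    msg="$1"
    limit="$2"
    now=$(date +%s)
    # last_warn holds the time of the last emitted warning (0 if never).
    if [ $((now - ${last_warn:-0})) -ge "$limit" ]; then
        echo "warning: $msg" >&2
        last_warn=$now
    fi
}

# The first call emits the warning; an immediate second call with the
# same limit is suppressed.
warn_ratelimited "collector failed" 3600
warn_ratelimited "collector failed" 3600
```

The same once-per-interval idea applies whatever the language; the point is that a condition which persists for hours should not produce a log line on every poll.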

That all sounds abstract, so let's make it concrete with some examples drawn from the very latest version (1.1.0) of the Prometheus host agent. The host agent gathers a bunch of information from your system, which is separated into a bunch of 'collectors' (one for each sort of information). Collectors may be enabled or disabled by default, and as part of the metrics that the host agent emits it can report if a particular collector said that it failed (what constitutes 'failure' is up to the collector to decide).

The host agent has collectors for a number of Linux filesystem types (such as XFS, Btrfs, and ZFS), for networking technologies such as Fibrechannel and Infiniband, and for network stack information such as IP filtering connection tracking ('conntrack'), among other collectors. All of the collectors I've named are enabled by default. Naturally, many systems do not actually have XFS, Btrfs, or ZFS filesystems, or Infiniband networking, or any 'conntrack' state. Unfortunately, of these enabled-by-default collectors, zfs, infiniband, fibrechannel, and conntrack all generate metrics reporting a collector failure on Linux servers that don't use those respective technologies. Without advance knowledge of the specific configuration of every server you monitor, this makes it impossible to tell the difference between a machine that doesn't have one of those things and a real collector failure on a machine that does have one and so should be successfully collecting information about it. But at least these failures only show up in the generated metrics. At least two collectors in 1.1.0 do worse by emitting actual warnings into the host agent's logs.
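One way to see which collectors a given host agent run reported as failing is to look for node_scrape_collector_success metrics with a value of 0. Here is a self-contained sketch; the metric name is real, but the sample values are invented (on a live host you would fetch the agent's /metrics output instead of using an embedded sample):

```shell
# Hypothetical sample of node_exporter's per-collector status metrics;
# the metric name node_scrape_collector_success is real, the values
# here are invented. On a live host you would read the agent's
# /metrics endpoint instead.
metrics='node_scrape_collector_success{collector="cpu"} 1
node_scrape_collector_success{collector="zfs"} 0
node_scrape_collector_success{collector="infiniband"} 0'

# Print the collectors that reported failure (success == 0).
printf '%s\n' "$metrics" | awk '$2 == 0 { print $1 }'
```

On a machine with no ZFS or Infiniband, this reports those collectors as 'failed' in exactly the same way as a genuine failure, which is the problem.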

The first collector is for Linux's new pressure stall information. This is valuable information but of course is only supported on recent kernels, which means recent versions from Linux distributions (so, for example, both Ubuntu 18.04 and CentOS 7 use kernels without this information). However, if the host agent's 'pressure' collector can't find the /proc files it expects, it doesn't just report a collector failure, it emits an error message:

level=error ts=2021-02-08T19:42:48.048Z caller=collector.go:161 msg="collector failed" name=pressure duration_seconds=0.073142059 err="failed to retrieve pressure stats: psi_stats: unavailable for cpu"

At least you can disable this collector on older kernels, and automate that with a cover script that checks for /proc/pressure and disables the pressure collector if it's not there.

The second collector is for ZFS metrics. In addition to a large amount of regular ZFS statistics, recent versions of ZFS on Linux expose kernel information about the overall health of each ZFS pool on the system. This was introduced in ZFS on Linux version 0.8.0, which is more recent than the version of ZoL that is included in, for example, Ubuntu 18.04. Unfortunately, in version 1.1.0 the Prometheus host agent ZFS collector insists on this overall health information being present; if it isn't, the collector emits a warning:

level=warn ts=2021-02-09T01:14:09.074Z caller=zfs_linux.go:125 collector=zfs msg="Not found pool state files"

Since this is only part of the ZFS collector's activity, you can't disable just this pool state collection. Your only options are to either disable the entire collector, losing all ZFS metrics on, say, your Ubuntu 18.04 ZFS fileservers, or have frequent warnings flood your logs. Or you can take the third path of not using version 1.1.0 of the host agent.

(Neither the pressure collector nor the ZFS collector rate-limit these error and warning messages. Instead one such message will be emitted every time the host agent is polled, which is often as frequently as once every fifteen or even ten seconds.)

Comments on this page:

By Oli Gendebien at 2021-02-10 16:49:48:

Somehow I'm running into this issue. You write:

At least you can disable this collector on older kernels, and automate that with a cover script that checks for /proc/pressure and disables the pressure collector if it's not there.

How do you go about doing that? Thank you for your post. Cheers

By cks at 2021-02-10 22:53:29:

Disabling collectors is not well covered in the node exporter documentation from what I remember, but it's done with a command line argument of '--no-collector.<whatever>', so '--no-collector.pressure' here. So what you want is something like a script of:

pressure=""
if [ ! -d /proc/pressure ]; then
    pressure="--no-collector.pressure"
fi

exec /where/ever/node_exporter \
    $pressure \
    [whatever other arguments you use]

(We actually have quite a complicated front end script for the node exporter for this because we have all sorts of variations in what command line arguments we want on different servers. Some servers run NTP daemons, for example, and some don't, so we selectively enable the NTP collector. And back when we started we still had some non-systemd servers, so the systemd collector was only selectively enabled.)
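As a rough sketch of the sort of conditional flag-building such a front end script does (the helper name and paths here are invented; '--no-collector.pressure' is the real node exporter flag):

```shell
# Hypothetical sketch of conditionally building node_exporter flags.
# 'collector_flags' is an invented helper; the directory to probe is a
# parameter so the logic is easy to test.
collector_flags() {
    pressure_dir="$1"    # normally /proc/pressure
    flags=""
    # Kernels without PSI support have no /proc/pressure, so turn the
    # pressure collector off there to avoid the repeated error message.
    if [ ! -d "$pressure_dir" ]; then
        flags="$flags --no-collector.pressure"
    fi
    printf '%s\n' "$flags"
}

# A real wrapper script would end with something like:
#   exec /where/ever/node_exporter $(collector_flags /proc/pressure) ...
```

The same pattern extends to any other per-host variation, such as only enabling an NTP or systemd collector where the relevant daemon actually runs.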

By Oli Gendebien at 2021-02-11 10:47:45:

Thank you Chris it worked!

I added a parameter to the systemctl configuration for the node-exporter service.
