What Prometheus exporters we use (as of the end of 2023)

January 16, 2024

We have a fairly basic and straightforward Prometheus and Grafana setup, but over time we've drifted into using a number of Prometheus exporters, which is the Prometheus term for things that provide or generate metrics. Today I feel like listing them off as a snapshot of our current practices and what we've found useful in our particular and somewhat peculiar environment.

  • node_exporter is the standard Prometheus host agent. We run the latest binary release on our Linux machines and the packaged OpenBSD version on our OpenBSD machines.

    We have a whole collection of scripts that collect various host-specific metrics and push them out through the node exporter's 'textfile' collector; an inventory of these doesn't fit within the margins of this entry.
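
As an illustration of the general shape of these textfile scripts (the metric, the directory default, and the script itself are invented for this example, not one of our real scripts):

```python
import os
import subprocess
import tempfile

# Hypothetical textfile collector script: count logged-in users and write
# the result where node_exporter's textfile collector will pick it up.
# The real directory comes from node_exporter's
# --collector.textfile.directory setting; "." here is just for the sketch.
TEXTFILE_DIR = os.environ.get("TEXTFILE_DIR", ".")

def write_metric():
    try:
        out = subprocess.run(["who"], capture_output=True, text=True).stdout
    except OSError:
        out = ""
    count = len(out.splitlines())
    body = (
        "# HELP local_logged_in_users Number of logged-in users.\n"
        "# TYPE local_logged_in_users gauge\n"
        "local_logged_in_users %d\n" % count
    )
    # Write to a temporary file and rename it into place, so node_exporter
    # never reads a half-written .prom file.
    fd, tmp = tempfile.mkstemp(dir=TEXTFILE_DIR, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        f.write(body)
    final = os.path.join(TEXTFILE_DIR, "users.prom")
    os.replace(tmp, final)
    return final

if __name__ == "__main__":
    print(write_metric())
```

A script like this gets run periodically (from cron or similar), and node_exporter exposes the file's contents on its next scrape.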

  • The Blackbox exporter is the standard solution for probing machines and services from the outside, and for collecting TLS certificate information. We use it for a variety of these checks across ICMP ping, port connections, HTTP checks, and DNS lookups.
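
For illustration, a Blackbox HTTP check is wired up with the standard relabeling dance, where each listed target becomes the probe's 'target' parameter and the scrape itself goes to the Blackbox exporter (the module name, target, and exporter address here are placeholders, not our real configuration):

```yaml
# prometheus.yml fragment: scrape the Blackbox exporter's /probe endpoint.
scrape_configs:
  - job_name: blackbox_http
    metrics_path: /probe
    params:
      module: [http_2xx]          # an HTTP probe module defined in blackbox.yml
    static_configs:
      - targets:
          - https://example.org/  # placeholder target
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9115  # where the Blackbox exporter listens
```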

  • Pushgateway is what we use to publish assorted bits of information, some of it for historical reasons.

  • apache_exporter is how we scrape basic statistics from our collection of Apache web servers. We run it on our central metrics server rather than having it running on each Apache web server for obscure reasons.

  • script_exporter is how we use arbitrary scripts to generate metrics. We use these scripts to perform more intricate service checks than Blackbox supports (letting us check IMAP, Authenticated SMTP, Samba servers, and more) and to pull more complicated information. I prefer the script exporter to the other options for this.
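
As a sketch of the sort of script the script exporter runs: it executes a script and serves whatever Prometheus-format metrics the script prints on standard output. This hypothetical check only verifies that an IMAP port answers (the metric name is invented, and our real checks go further and actually speak the protocols):

```python
import socket
import sys

# script_exporter-style check sketch: print Prometheus text-format metrics
# on stdout. The host, port, and metric name are illustrative.

def check_port(host, port, timeout=5.0):
    """Return 1 if a TCP connection to host:port succeeds, else 0."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return 1
    except OSError:
        return 0

def metrics(host, port=143):
    up = check_port(host, port)
    return (
        "# HELP imap_port_up Whether the IMAP port accepted a connection.\n"
        "# TYPE imap_port_up gauge\n"
        'imap_port_up{host="%s"} %d\n' % (host, up)
    )

if __name__ == "__main__":
    host = sys.argv[1] if len(sys.argv) > 1 else "localhost"
    sys.stdout.write(metrics(host))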

  • We run a locally hacked version of nvidia_exporter on our NVIDIA GPU SLURM nodes. It's somewhat handy for providing usage metrics to tell us how actively the GPUs are being used, how much of their memory gets allocated, and so on.

    (We have one machine with AMD GPUs, for which we use a hacked up version of code I dug out of the depths of Wikipedia's metrics systems; this code is a node exporter textfile collector thing, not a separate exporter.)

  • A few machines run Google's mtail to extract structured information from various logs. These days it's only used for Exim logs for mail metrics.

    (Grafana Loki's promtail component can generate (some) metrics from logs, but I'm not really enthused about Loki these days and anyway we were using mtail before Loki existed.)
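
For illustration, an mtail program attaches actions to regular expressions over log lines; something along these lines, which is a hypothetical fragment and not our real Exim program:

```
# Count log lines that look like completed Exim deliveries.
counter exim_deliveries

/ => / {
  exim_deliveries++
}
```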

  • We use tplink-plug-exporter to give us some additional information from the wifi-connected smart plugs we use to monitor our wireless network. In theory this gives us information like reported 'RSSI' wifi signal strength, but in practice we're mostly scraping this because it's there.

  • We run the Cloudflare ebpf_exporter on our ZFS fileservers and a few other machines to capture detailed per-disk latency histograms, to help us diagnose potential disk IO performance problems. We're using an old version of this for various reasons; someday I need to update to the current version (which changed how it builds and deploys the eBPF instrumentation) and look for additional useful information it can collect.

  • We also run my fork of zfs_exporter on our ZFS fileservers to give us detailed ZFS performance information. Probably too detailed; since we have a lot of pools and report metrics down to individual disks (which are actually partitions on physical disks), we get a lot of time series from this exporter.

  • cert-exporter is used to collect TLS information for a few TLS certificates that we can most conveniently access on disk, instead of through TLS services. These include, for example, our OpenVPN TLS certificates (even though they won't expire for some time, which is a good thing).

  • chrony_exporter collects information from the Chrony NTP server, which we run on both our local NTP servers and our ZFS fileservers.

  • bind_exporter runs on both our stealth master Bind server and our Bind-based resolving DNS servers (after we switched to Bind). This gives us metrics about query volume, which is nice, but it especially gives us all of the zone SOA serial numbers, which lets us raise alerts if things aren't all using the same version of our DNS zones.
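
The serial number alert can be sketched as a rule along these lines, assuming bind_exporter's bind_zone_serial metric (the label name is from my memory of the exporter, and the 'for' duration is illustrative):

```yaml
# Illustrative alert rule: fire when our Bind servers report more than
# one distinct SOA serial for the same zone.
groups:
  - name: dns-zone-serials
    rules:
      - alert: ZoneSerialMismatch
        expr: count by (zone_name) (count_values by (zone_name) ("serial", bind_zone_serial)) > 1
        for: 30m
        annotations:
          summary: "DNS servers disagree on the serial for {{ $labels.zone_name }}"
```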

As you can see from this list, we (well, I) like running exporters for things, although there are exporters we're not running for various reasons. One current big gap in our observability is per-service resource usage information on our Ubuntu servers. The information is there in Linux cgroups (and systemd's use of them), but I haven't found an available exporter that provides the information I'd like in the form I'd like it.
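
To sketch what I mean, here is roughly the sort of thing a hypothetical per-service exporter (or textfile collector script) could produce from cgroup v2 on a systemd machine; the path assumes a unified (v2-only) cgroup hierarchy and the metric name is my invention:

```python
import glob
import os
import sys

# Hedged sketch: report cgroup v2 memory usage per systemd service in
# Prometheus text format. system.slice is where systemd puts services
# on a unified-hierarchy machine.
CGROUP_ROOT = "/sys/fs/cgroup/system.slice"

def service_memory(root=CGROUP_ROOT):
    lines = [
        "# HELP service_memory_current_bytes Current memory usage per service.",
        "# TYPE service_memory_current_bytes gauge",
    ]
    for path in sorted(glob.glob(os.path.join(root, "*.service"))):
        memfile = os.path.join(path, "memory.current")
        if not os.path.exists(memfile):
            continue
        svc = os.path.basename(path)
        with open(memfile) as f:
            value = f.read().strip()
        lines.append('service_memory_current_bytes{service="%s"} %s' % (svc, value))
    return "\n".join(lines) + "\n"

if __name__ == "__main__":
    sys.stdout.write(service_memory())
```

A real exporter would want CPU and IO figures too (cpu.stat, io.stat), and would have to decide what to do about user slices and transient units.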

(It may surprise people to hear that we're not using the SNMP exporter, but we don't actually have anything we want to poll that's set up to report stuff over SNMP. In particular, our core network switches aren't set up for SNMP metrics collection, for historical reasons.)
