What Prometheus exporters we use (as of the end of 2023)
We have a fairly basic and straightforward Prometheus and Grafana setup, but over time we've drifted into using a number of Prometheus exporters, which is the Prometheus term for things that provide or generate metrics. Today I feel like listing them off as a snapshot of our current practices and what we've found useful in our particular and somewhat peculiar environment.
- node_exporter is the
standard Prometheus host agent. We run the latest binary release on our
Linux machines and the packaged OpenBSD version on our OpenBSD machines.
We have a whole collection of scripts that collect various host-specific metrics and push them out through the node exporter's 'textfile' collector; an inventory of these doesn't fit within the margins of this entry.
- The Blackbox exporter
is the standard solution for probing machines and services from
the outside, and for collecting TLS certificate information. We
use it for a variety of these checks across ICMP ping, port connections,
HTTP checks, and DNS lookups.
- Pushgateway is what we
use to publish assorted bits of information, some of it for historical
reasons.
- apache_exporter is
how we scrape basic statistics from our collection of Apache web
servers. For obscure reasons, we run it on our central metrics server
rather than on each Apache web server.
- script_exporter is
how we use arbitrary scripts to generate metrics. We use these scripts
to perform more intricate service checks than Blackbox supports (letting
us check IMAP, Authenticated SMTP, Samba servers, and more) and to pull
more complicated information. I prefer the script exporter to the other
options for this.
- We run a locally hacked version of nvidia_exporter
on our NVIDIA™ GPU SLURM nodes. It's somewhat
handy for providing usage metrics to tell us how actively the GPUs
are being used, how much of their memory gets allocated, and so on.
(We have one machine with AMD GPUs, for which we use a hacked up version of code I dug out of the depths of Wikipedia's metrics systems; this code is a node exporter textfiles thing, not a separate exporter.)
- A few machines run Google's mtail to
extract structured information from various logs. These days it's only used
for Exim logs for mail metrics.
(Grafana Loki's promtail component can generate (some) metrics from logs, but I'm not really enthused about Loki these days and anyway we were using mtail before Loki existed.)
- We use tplink-plug-exporter to give us some
additional information from the wifi-controlled smart plugs we use
to monitor our wifi. In theory this gives
us information like reported 'RSSI' wifi signal strength, but in
practice we're mostly scraping this because it's there.
- We run the Cloudflare ebpf_exporter on our ZFS fileservers and a few other machines to capture
detailed per-disk latency histograms, to help us diagnose potential
disk IO performance problems. We're using an old version of this for
various reasons; someday I need to update to the current version (which
changed how it builds and deploys the eBPF instrumentation) and look for
additional useful information it can collect.
- We also run my fork of zfs_exporter on our ZFS fileservers
to give us detailed ZFS performance information. Probably too detailed;
since we have a lot of pools and report metrics down to individual disks
(which are actually partitions on physical disks), we get a lot of time
series from this exporter.
- cert-exporter is used
to collect TLS information for a few TLS certificates that we can most
conveniently access on disk, instead of through TLS services. These
include, for example, our OpenVPN TLS certificates (even though they won't expire for
some time, which is a good thing).
- chrony_exporter collects
information from the Chrony NTP server,
which we run on both our local NTP servers
and our ZFS fileservers.
- bind_exporter runs on both our stealth master Bind server and our Bind-based resolving DNS servers (after we switched to Bind). This gives us metrics about query volume, which is nice, but it especially gives us all of the zone SOA serial numbers, which lets us raise alerts if things aren't all using the same version of our DNS zones.
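The SOA serial number comparison from the bind_exporter item boils down to grouping the reported serials by zone and flagging any zone where the servers disagree. In practice this sort of check would normally live in a Prometheus alert rule; the Python below is only a minimal sketch of the logic. The function name, the (server, zone, serial) input shape, and the sample hostnames are all made up for illustration and are not taken from bind_exporter's actual output.

```python
from collections import defaultdict


def find_serial_mismatches(samples):
    """Return zones whose SOA serial differs between servers.

    `samples` is an iterable of (server, zone, serial) tuples; this input
    shape is an assumption for the sketch, not a real exporter format.
    The result maps zone -> {serial: [servers reporting that serial]}.
    """
    by_zone = defaultdict(lambda: defaultdict(list))
    for server, zone, serial in samples:
        by_zone[zone][serial].append(server)
    # A zone is consistent only if every server reports one shared serial.
    return {
        zone: {serial: sorted(servers) for serial, servers in serials.items()}
        for zone, serials in by_zone.items()
        if len(serials) > 1
    }


if __name__ == "__main__":
    # Hypothetical sample data: one zone in sync, one zone out of sync.
    samples = [
        ("master", "example.org", 2023123101),
        ("resolver1", "example.org", 2023123101),
        ("master", "example.net", 2023123102),
        ("resolver1", "example.net", 2023123001),
    ]
    for zone, serials in find_serial_mismatches(samples).items():
        print(zone, serials)
```

Anything this returns is a zone where at least one server is behind (or ahead), which is the condition worth alerting on.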
As you can see from this list, we (well, I) like running exporters for things, although there are exporters we're not running for various reasons. One current big gap in our observability is per-service resource usage information on our Ubuntu servers. The information is there in Linux cgroups (and systemd's use of them), but I haven't found an available exporter that provides the information I'd like in the form I'd like it.
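To give a sense of where that cgroup information lives, here is a minimal sketch that reads per-service memory and CPU usage and formats it as Prometheus text-format lines. It assumes a cgroup v2 unified hierarchy mounted at /sys/fs/cgroup with systemd services under system.slice, and that the relevant controllers are enabled for those services. The metric names and function names are hypothetical, and this is not a substitute for a real exporter.

```python
from pathlib import Path


def read_service_usage(root="/sys/fs/cgroup/system.slice"):
    """Collect memory and CPU usage for each *.service cgroup under root.

    Assumes cgroup v2 file names: memory.current (bytes) and the
    usage_usec field of cpu.stat (microseconds of CPU time).
    """
    base = Path(root)
    if not base.is_dir():
        return {}
    usage = {}
    for unit_dir in sorted(base.glob("*.service")):
        stats = {}
        mem = unit_dir / "memory.current"
        if mem.is_file():
            stats["memory_bytes"] = int(mem.read_text().strip())
        cpu = unit_dir / "cpu.stat"
        if cpu.is_file():
            for line in cpu.read_text().splitlines():
                key, _, value = line.partition(" ")
                if key == "usage_usec":
                    stats["cpu_seconds"] = int(value) / 1_000_000
        if stats:
            usage[unit_dir.name] = stats
    return usage


def format_metrics(usage):
    """Render the usage dict as text-format metric lines.

    The example_service_* metric names are invented for this sketch.
    """
    lines = []
    for unit, stats in sorted(usage.items()):
        for name, value in sorted(stats.items()):
            lines.append(f'example_service_{name}{{unit="{unit}"}} {value}\n')
    return "".join(lines)


if __name__ == "__main__":
    print(format_metrics(read_service_usage()), end="")
```

Output in this shape could be written to a file for the node exporter's textfile collector, although that leaves out things like nested slices and per-user sessions.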
(It may surprise people to hear that we're not using the SNMP exporter, but we don't actually have anything we want to poll that's set up to report stuff over SNMP. In particular, our core network switches aren't set up for SNMP metrics collection, for historical reasons.)