How we monitor our Prometheus setup itself
On Mastodon, I said:
> When you have a new alerting and monitoring system, 'who watches the watchmen' becomes an interesting and relevant question. Especially when the watchmen have a lot of separate components and moving parts.
If we had a lot of experience with Prometheus, we probably wouldn't worry about this; we'd be able to assume that everything was just going to work reliably. But we're very new with Prometheus, and so we get to worry about its reliability in general and also the possibility that we'll quietly break something in our configuration or how we're operating things (and we have, actually). So we need to monitor Prometheus itself. If Prometheus were a monolithic system, this would probably be relatively easy, but instead our overall Prometheus environment has a bunch of separate pieces, all of which can have things go wrong.
A lot of how we're monitoring for problems is probably basically standard in Prometheus deployments (or at least standard in simple ones, like ours). The first level of monitoring and alerts is things inside Prometheus:
- We alert on unresponsive host agents (ie, Prometheus node_exporter) as part of our general checking for and alerting on down hosts; this will catch when a configured machine doesn't have the agent installed or it hasn't been started. The one thing it won't catch is a production machine that hasn't been added to our Prometheus configuration. Unfortunately there's no good automated way in our environment to tell what is and isn't a production machine, so we're just going to have to rely on remembering to add machines to Prometheus when we put them into production.

  (This alert uses the Prometheus 'up' metric for our specific host agent job setting; see the sketch of example rules after this list.)

- We also alert if Prometheus can't talk to a number of other metrics sources it's specifically configured to pull from, such as Grafana, Pushgateway, the Blackbox agent itself, Alertmanager, and a couple of instances of an Apache metrics exporter. This is also based on the 'up' metric, excluding the ones for host agents and for all of our Blackbox checks (which generate 'up' metrics themselves; these can be distinguished from regular 'up' metrics because the Blackbox check ones have a non-empty 'probe' label).

- We publish some system-wide information for temperature sensor readings and global disk space usage for our NFS fileservers, so we have checks to make sure that this information is both present at all and not too old. The temperature sensor information is published through Pushgateway, so we leverage its 'push_time_seconds' metric for the check. The disk space usage information is published in a different way, so we rely on its own 'I was created at' metric.

- We publish various per-host information through the host agent's 'textfile' collector, where you put files of metrics you want to publish in a specific directory, so we check to make sure that these files aren't too stale through the 'node_textfile_mtime_seconds' metric. Because we update these files at varying intervals but don't want to have complex alerts here, we use a single measure for 'too old' and it's a quite conservative number.

  (This won't detect hosts that have never successfully published some particular piece of information at all, but I'm currently assuming this is not going to happen. Checking for it would probably be complicated, partly because we'd have to bake in knowledge about what things hosts should be publishing.)
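To give a more concrete feel for what these look like, here is a minimal sketch of the kind of alert rules involved. The job names ('node' and 'sensors'), the alert names, and the staleness thresholds are all invented for illustration; they aren't our actual settings.

```yaml
groups:
  - name: meta-monitoring
    rules:
      # Down host agents, via 'up' for the node_exporter scrape job.
      # (The job name 'node' is illustrative, not our real one.)
      - alert: HostAgentDown
        expr: up{job="node"} == 0
        for: 5m

      # Other scrape targets (Grafana, Pushgateway, Alertmanager, and
      # so on) that Prometheus can't talk to. Host agents are excluded
      # by job, and Blackbox checks are excluded because their 'up'
      # metrics carry a non-empty 'probe' label in our setup.
      - alert: ScrapeTargetDown
        expr: up{job!="node", probe=""} == 0
        for: 5m

      # Pushgateway-published temperature data must exist and be fresh.
      - alert: TemperatureDataMissing
        expr: absent(push_time_seconds{job="sensors"})
      - alert: TemperatureDataStale
        expr: time() - push_time_seconds{job="sensors"} > 2 * 3600

      # Textfile collector metrics files that have gone stale on some
      # host, with a single deliberately conservative threshold.
      - alert: TextfileMetricsStale
        expr: time() - node_textfile_mtime_seconds > 6 * 3600
```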
All of these alerts require their own custom and somewhat ad-hoc rules. In general writing all of these checks feels like a bit of a slog; you have to think about what could go wrong, and then how you could check for it, and then write out the actual alert rule necessary. I was sort of tempted to skip writing the last two sets of alerts, but we've actually quietly broken both the global disk space usage and the per-host information publication at various times.
(In fact I found out that some hosts weren't updating some information
by testing my alert rule expression in Prometheus. I did a topk()
query on it and then went 'hey, some of these numbers are really
much larger than they should be'.)
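(For illustration, that sort of check might be a query along these lines; the expression and the '10' are just an example of the shape of it, not my exact query.)

```
# Show the ten stalest textfile-collector metrics files as ages in
# seconds; implausibly large values point at hosts that have quietly
# stopped updating something.
topk(10, time() - node_textfile_mtime_seconds)
```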
This leaves checking Prometheus itself, and also a useful check on Alertmanager (because if Alertmanager is down, Prometheus can't send out the alerts it detects). In some places the answer to this would be a second Prometheus instance that cross-checks the first and a pair of Alertmanagers that both of them talk to and that coordinate with each other through their gossip protocol. However, this is a bit complicated for us, so my current answer is to have a cron job that tries to ask Prometheus for the status of Alertmanager. If Prometheus answers and says Alertmanager is up, we conclude that we're fine; otherwise, we have a problem somewhere. The cron job currently runs on our central mail server so that it depends on as few other parts of our infrastructure as possible.
(Mechanically this uses curl to make the query through Prometheus's HTTP API and then jq to extract things from the answer.)
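As a rough sketch of what that sort of check looks like (with an invented Prometheus URL, and simplified from whatever the real script does):

```sh
#!/bin/sh
# Sketch of the cron job's check; the Prometheus URL is invented.
PROM=http://prometheus.example.org:9090

# Ask Prometheus which Alertmanagers it is currently talking to and
# count the active ones. If curl fails, this comes out empty.
active=$(curl -s --max-time 30 "$PROM/api/v1/alertmanagers" |
         jq '.data.activeAlertmanagers | length' 2>/dev/null)

# Cron mails us any output, which is how we hear about a problem.
if [ -z "$active" ] || [ "$active" -eq 0 ]; then
    echo "WARNING: Prometheus is unreachable or reports no active Alertmanagers"
fi
```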
We don't currently have any checks to make sure that Alertmanager can actually send alerts successfully. I'm not sure how we'd craft those, because I'm not sure Alertmanager exposes the necessary metrics. Probably we should try to write some alerts in Prometheus and then have a cron job that queries Prometheus to see if the alerts are currently active.
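One possible sketch of that cross-check, assuming we had written such an alert (the alert name here is hypothetical and the Prometheus URL is again invented):

```sh
#!/bin/sh
# Sketch only; a down Prometheus is caught by the separate check above.
PROM=http://prometheus.example.org:9090

# If a (hypothetical) alert about Alertmanager notification failures
# is currently firing, print a warning for cron to mail to us.
if curl -s --max-time 30 "$PROM/api/v1/alerts" |
   jq -e '.data.alerts[] | select(.labels.alertname == "AlertmanagerNotifyFailing" and .state == "firing")' >/dev/null
then
    echo "WARNING: Alertmanager appears to be failing to deliver some notifications"
fi
```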
(Alertmanager exposes a count of successful and failed deliveries for the various delivery methods, such as 'email', but you can't find out when the last successful or failed notification was for one, or whether specific receivers succeeded or failed in some or all of their notifications. There are also no metrics exposed for potential problems like 'template expansion failure', which can happen if you have an error somewhere in one of your templates. If the error is in a rarely used conditional portion of a template, you might not trip over it for a while.)