== How we monitor our Prometheus setup itself

On Mastodon, [[I said https://mastodon.social/@cks/101123682637178682]]:

> When you have a new alerting and monitoring system, 'who watches the
> watchmen' becomes an interesting and relevant question. Especially
> when the watchmen have a lot of separate components and moving parts.

If we had a lot of experience with Prometheus, we probably wouldn't worry about this; we'd be able to assume that everything was just going to work reliably. But we're very new with Prometheus, and so we get to worry about its reliability in general and also the possibility that we'll quietly break something in our configuration or how we're operating things (and we have, actually). So we need to monitor Prometheus itself. If Prometheus was a monolithic system, this would probably be relatively easy, but instead our overall Prometheus environment has a bunch of separate pieces, all of which can have things go wrong.

A lot of how we're monitoring for problems is probably basically standard in Prometheus deployments (or at least standard in simple ones, like ours). The first level of monitoring and alerts is things inside Prometheus (there's a sketch of what such alert rules can look like after this list):

* We alert on unresponsive host agents (ie, [[Prometheus node_exporter https://github.com/prometheus/node_exporter]]) as part of our general checking for and alerting on down hosts; this will catch when a configured machine doesn't have the agent installed or it hasn't been started. The one thing it won't catch is a production machine that hasn't been added to our Prometheus configuration. Unfortunately there's no good automated way in our environment to tell what is and isn't a production machine, so we're just going to have to rely on remembering to add machines to Prometheus when we put them into production. (This alert uses the Prometheus '_up_' metric for our specific host agent job setting.)

* We also alert if Prometheus can't talk to a number of other metrics sources it's specifically configured to pull from, such as Grafana, Pushgateway, the Blackbox agent itself, Alertmanager, and a couple of instances of [[an Apache metrics exporter https://github.com/Lusitaniae/apache_exporter]]. This is also based on the _up_ metric, excluding the ones for host agents and for all of our Blackbox checks (which generate _up_ metrics themselves; these can be distinguished from regular _up_ metrics because the Blackbox check ones have a non-empty _probe_ label).

* We publish some system-wide information for temperature sensor readings and global disk space usage for [[our NFS fileservers ../solaris/ZFSFileserverSetupII]], so we have checks to make sure that this information is both present at all and not too old. The temperature sensor information is published through Pushgateway, so we leverage its ((push_time_seconds)) metric for the check. The disk space usage information is published in a different way, so we rely on its own 'I was created at' metric.

* We publish various per-host information through the host agent's _textfile_ collector, where you put files of metrics you want to publish in a specific directory, so we check to make sure that these files aren't too stale through the ((node_textfile_mtime_seconds)) metric. Because we update these files at varying intervals but don't want to have complex alerts here, we use a single measure for 'too old' and it's a quite conservative number.
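To make this a bit more concrete, here is a rough sketch of what rules along these lines can look like. This is not our actual configuration; the job name 'node', the alert names, the thresholds, and the exact use of the _probe_ label are all stand-ins for whatever a particular setup uses.

  groups:
    - name: self-monitoring
      rules:
        # A host agent (node_exporter) has stopped responding. The job
        # name 'node' is an assumption; use whatever the host agent
        # scrape job is actually called.
        - alert: HostAgentDown
          expr: up{job="node"} == 0
          for: 5m
          annotations:
            summary: "host agent on {{ $labels.instance }} is not responding"

        # Some other directly scraped target (Grafana, Pushgateway,
        # Alertmanager, the Blackbox exporter itself, and so on) is down.
        # Blackbox-generated up metrics are excluded by requiring an
        # empty 'probe' label, per the labeling convention mentioned above.
        - alert: ScrapeTargetDown
          expr: up{job!="node", probe=""} == 0
          for: 5m
          annotations:
            summary: "{{ $labels.job }} target {{ $labels.instance }} is down"

        # Metrics pushed to Pushgateway haven't been refreshed recently.
        - alert: PushedMetricsStale
          expr: time() - push_time_seconds > 3600
          annotations:
            summary: "Pushgateway group {{ $labels.job }} has not been updated recently"

        # Textfile collector files are too old; a single deliberately
        # conservative threshold for everything.
        - alert: TextfileMetricsStale
          expr: time() - node_textfile_mtime_seconds > 6 * 3600
          annotations:
            summary: "textfile metrics {{ $labels.file }} on {{ $labels.instance }} look stale"

Real rules would probably also want routing labels and fuller annotations, but this is the general shape of things.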
(These staleness checks won't detect hosts that have never successfully published some particular piece of information at all, but I'm currently assuming this is not going to happen. Checking for it would probably be complicated, partly because we'd have to bake in knowledge about what things hosts should be publishing.)

All of these alerts require their own custom and somewhat ad-hoc rules. In general, writing all of these checks feels like a bit of a slog; you have to think about what could go wrong, then about how you could check for it, and then write out the actual alert rule necessary. I was sort of tempted to skip writing the last two sets of alerts, but we've actually quietly broken both the global disk space usage and the per-host information publication at various times. (In fact, I found out that some hosts weren't updating some information by testing my alert rule expression in Prometheus; I did a _topk()_ query on it and then went 'hey, some of these numbers are really much larger than they should be'.)

This leaves checking Prometheus itself, and also a useful check on [[Alertmanager https://prometheus.io/docs/alerting/alertmanager/]] (because if Alertmanager is down, Prometheus can't send out the alerts it detects). In some places the answer to this would be a second Prometheus instance that cross-checks the first and a pair of Alertmanagers that both of them talk to and that coordinate with each other through their gossip protocol. However, this is a bit complicated for us, so my current answer is to have a cron job that tries to ask Prometheus for the status of Alertmanager. If Prometheus answers and says Alertmanager is up, we conclude that we're fine; otherwise, we have a problem somewhere. The cron job currently runs on our central mail server so that it depends on the fewest other parts of our infrastructure still working. (Mechanically this uses _curl_ to make the query through Prometheus's HTTP API and then [[_jq_ https://stedolan.github.io/jq/]] to extract things from the answer; there's a rough sketch of such a check at the end of this entry.)

We don't currently have any checks to make sure that Alertmanager can actually send alerts successfully. I'm not sure how we'd craft those, because I'm not sure Alertmanager exposes the necessary metrics. Probably we should try to write some alerts in Prometheus and then have a cron job that queries Prometheus to see if those alerts are currently active.

(Alertmanager exposes a count of successful and failed deliveries for the various delivery methods, such as 'email', but you can't find out when the last successful or failed notification was for one, or whether specific receivers succeeded or failed in some or all of their notifications. There are also no metrics exposed for potential problems like 'template expansion failure', which can happen if you have an error somewhere in one of your templates. If the error is in a rarely used conditional portion of a template, you might not trip over it for a while.)
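As a concrete illustration of the cron job check mentioned above, here is a rough sketch of the general idea. This is not our actual script; the Prometheus URL is made up, and a real version wants more care about error handling and reporting.

  #!/bin/sh
  # Ask Prometheus over its HTTP API which Alertmanagers it is talking to.
  # Cron mails any output, so we only print when something looks wrong.
  PROM="http://prometheus.example.org:9090"

  active=$(curl -s --max-time 30 "$PROM/api/v1/alertmanagers" |
           jq -r '.data.activeAlertmanagers | length' 2>/dev/null)

  if [ -z "$active" ] || [ "$active" -eq 0 ]; then
      echo "problem: Prometheus at $PROM reports no active Alertmanagers (or did not answer)"
  fi

Since any output goes out as cron email from our central mail server, a report of trouble depends on very little of the rest of our infrastructure working.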