2023-10-27
Alerting on sticky configuration reload failures for Prometheus
Recently, I discovered that 'promtool check config' doesn't always fail if you have Prometheus configuration errors (which may be a bug). Fortunately this was only a mild issue because quite some time ago I added an alert for a sticky configuration reload failure. When this alert fired soon after I thought my configuration was good and my reload had succeeded, it was pretty straightforward to work out that 'promtool check config' was more or less lying to me (especially since the Prometheus server did log information about the problem).
How the alert works is straightforward. Prometheus, along with Alertmanager and the Blackbox exporter, exposes a metric that reports whether the last configuration file reload succeeded. In Prometheus's case this is prometheus_config_last_reload_successful (the other two follow a similar naming pattern). So I wrote the obvious alert rule:
  - alert: PrometheusReloadFailed
    expr: prometheus_config_last_reload_successful != 1
    for: 10m
    [...]
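For completeness, here's a sketch of the companion rules for the other two daemons. The metric names are what I believe they expose (alertmanager_config_last_reload_successful and blackbox_exporter_config_last_reload_successful), and the severity label and annotation text are purely illustrative; use whatever your own routing conventions are:

  - alert: AlertmanagerReloadFailed
    expr: alertmanager_config_last_reload_successful != 1
    for: 10m
    labels:
      severity: warning    # illustrative; pick your own severity scheme
    annotations:
      summary: "Alertmanager config reload failing on {{ $labels.instance }}"
  - alert: BlackboxReloadFailed
    expr: blackbox_exporter_config_last_reload_successful != 1
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Blackbox exporter config reload failing on {{ $labels.instance }}"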
You can adjust the ten minutes based on how long it takes you to retry a configuration update after you reload and discover that it's bad. If fixing things is a multi-minute affair, you may want to raise the alert only after well more than ten minutes, so that it won't trigger while you're working on an issue you already know about. In our case, updating the configuration is quite quick, so if it's been ten minutes without a successful reload, either I don't know about it or something has gone quite badly wrong.
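One refinement I haven't needed, but which might be handy if your fix cycle is long: Prometheus also exposes a companion timestamp metric, prometheus_config_last_reload_success_timestamp_seconds, and you can combine it with the success flag so that the alert's value is how long the running configuration has been stale. A sketch, with the annotation text purely illustrative:

  - alert: PrometheusReloadFailed
    # The expression's value is "seconds since the last good reload",
    # filtered ("and") so it only exists while the reload flag reports failure.
    expr: >
      (time() - prometheus_config_last_reload_success_timestamp_seconds)
      and (prometheus_config_last_reload_successful != 1)
    for: 10m
    annotations:
      summary: >
        Config reload failing; last good reload was
        {{ $value | humanizeDuration }} ago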
Having now gone through this experience, I'd like all Prometheus exporters to have some metric like this if they have configuration files and can (try to) reload them. However, I think that hot-reloading configuration files is relatively uncommon in third-party exporters, and most of them take the approach of having you restart them instead. Generally this is okay.
(The obvious advantage of supporting reloading is that if you do it right, you keep going with the old configuration if the new one is bad. For the Prometheus daemon specifically, restarting can take some time and you don't collect metrics while it's starting up, so you really don't want to restart it unless you have to.)
PS: All of this is part of our Prometheus self-monitoring, and also of our Alertmanager monitoring.
PPS: I don't have strong opinions either way on whether a failed configuration reload should make a general health metric go to 'unhealthy'. On the one hand, there is definitely something wrong right now; on the other hand, presumably the service is otherwise working normally and maybe you don't want to raise that big a red flag just yet.