You should do lint checks on your Prometheus alert (and recording) rules
I turned Cloudflare's Pint Prometheus linter loose on our alert rules, with our Prometheus server configured so it could check that metrics exist, and wow, it found a bunch of problems (once again, on top of the basic label checks I did before).
Today's learning experience: If you leave off the 0x on a hex number that starts with a letter in a Prometheus alert rule, like 'c0a8fdff', Prometheus interprets it as a metric name (and finds nothing).
As you might guess from the threading, Pint is what found my 'c0a8fdff' mistake.
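To make the hex mistake concrete, here is a rough Python sketch of how a bare token gets classified. The regular expressions are simplified approximations of PromQL's number and metric name syntax, not Prometheus's actual lexer, and `classify_token` is a name made up for this illustration.

```python
import re

# Simplified approximations of PromQL token shapes; the real lexer handles more cases.
HEX_NUMBER = re.compile(r"0[xX][0-9a-fA-F]+")
DECIMAL_NUMBER = re.compile(r"(?:[0-9]+\.?[0-9]*|\.[0-9]+)(?:[eE][+-]?[0-9]+)?")
METRIC_NAME = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")


def classify_token(token: str) -> str:
    """Roughly classify a bare token the way a PromQL-like parser would."""
    if HEX_NUMBER.fullmatch(token) or DECIMAL_NUMBER.fullmatch(token):
        return "number"
    if token.lower() in ("inf", "nan"):
        return "number"
    # Anything that starts with a letter, '_' or ':' and continues with
    # identifier characters is a perfectly valid metric name.
    if METRIC_NAME.fullmatch(token):
        return "metric name"
    return "invalid"


for token in ("0xc0a8fdff", "c0a8fdff", "12345678"):
    print(f"{token}: {classify_token(token)}")
```

Under this model 'c0a8fdff' is a syntactically valid metric name, so nothing complains at parse time; the expression simply never matches any series. Only something that asks whether the metric actually exists can flag it.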
If you're extremely meticulous about writing, reviewing, testing, and double checking your Prometheus rules, Pint might not find anything. I thought I was pretty good about writing alert rules and testing them, but I was very clearly wrong. Configuring and using Pint has been quite valuable, as has teaching it how to connect to our Prometheus server so that it can check for things like the existence of metrics.
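As an illustration of what a metric existence check involves, here is a minimal Python sketch built on the Prometheus HTTP API endpoint `/api/v1/label/__name__/values`, which lists the metric names the server knows about. This is not how Pint does it internally; the function names are invented for the example, and the set of metric names used by your rules is assumed to have been extracted already.

```python
import json
import urllib.request


def parse_metric_names(body: str) -> set[str]:
    """Extract metric names from a /api/v1/label/__name__/values response body."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"Prometheus query failed: {payload.get('error', 'unknown error')}")
    return set(payload["data"])


def fetch_metric_names(base_url: str) -> set[str]:
    """Ask a Prometheus server for every metric name it currently knows."""
    url = base_url.rstrip("/") + "/api/v1/label/__name__/values"
    with urllib.request.urlopen(url, timeout=10) as response:
        return parse_metric_names(response.read().decode("utf-8"))


def missing_metrics(used: set[str], known: set[str]) -> list[str]:
    """Return the metric names used in rules that the server does not know."""
    return sorted(used - known)


# A canned response stands in for a live server so the example runs offline.
sample_body = '{"status": "success", "data": ["node_load1", "up"]}'
known = parse_metric_names(sample_body)
print(missing_metrics({"up", "c0a8fdff"}, known))
```

Against a real server you would call `fetch_metric_names()` with your Prometheus base URL in place of the canned response. A missing name is not always a bug, since some metrics only appear when something has gone wrong, which is one reason a linter's findings still need human judgment.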
No linter is perfect, Pint included; I had to turn off a number of warnings for various reasons. But linters are a lot better than nothing and they're usually fairly easy to set up, which gives them a good return for your time. If you're really energetic you can write unit tests for your rules, but this won't catch everything (it doesn't necessarily check that you're putting in the labels that you should be, for example) and it's a lot more work. The odds that I'll ever write any significant number of alert rule unit tests are very low; the odds that I will run Pint over our alert rules after I make changes are now very high.
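The label check mentioned above is the sort of thing that is simple to do mechanically and tedious to do by eye. Here is a small Python sketch, assuming the rule file has already been parsed into the usual groups and rules structure; the required `severity` label is a hypothetical local policy, not something Prometheus or Pint mandates.

```python
# Hypothetical site policy: every alert rule must set these labels.
REQUIRED_LABELS = frozenset({"severity"})


def rules_missing_labels(groups, required=REQUIRED_LABELS):
    """Return (alert name, missing labels) pairs for alert rules lacking required labels."""
    problems = []
    for group in groups:
        for rule in group.get("rules", []):
            if "alert" not in rule:
                continue  # recording rules are skipped in this sketch
            labels = rule.get("labels") or {}
            missing = set(required) - set(labels)
            if missing:
                problems.append((rule["alert"], sorted(missing)))
    return problems


# Made-up rules in the shape of a parsed Prometheus rule file.
groups = [
    {
        "name": "example",
        "rules": [
            {"alert": "HostDown", "expr": "up == 0", "labels": {"severity": "page"}},
            {"alert": "HighLoad", "expr": "node_load1 > 10"},
            {"record": "job:up:sum", "expr": "sum by (job) (up)"},
        ],
    }
]
print(rules_missing_labels(groups))
```

A rule unit test that only checks whether an alert fires on given input would pass for both alerts here, even though one of them would be missing a label that alert routing might depend on.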
PS: As with all linters, I strongly suggest getting your rules to the state where they don't generate any lint check warnings as they are. It's much easier to tell the difference between zero warnings and some (new) warnings than it is to spot some new warnings in a sea of existing ones. Our alert rules have some silenced Pint warnings that I'm not entirely happy about for this reason; it's better to make it very obvious if there's a new warning than to keep nagging me about an existing issue I'm not going to fix right now.