Wandering Thoughts archives

2022-03-20

Prometheus: using gauge-like things as if they were counters

I recently wrote about how I wish ZFS pools kept a persistent count of errors, instead of more or less a current count that you're expected to clear to verify that you've dealt with your problems. In a comment, Daniel Ebdrup Jensen said that this (and other things) was better collected and tracked in an event monitoring system. My first reaction was that you couldn't really do this because a persistent, always accumulating count of errors (or whatever) is what Prometheus really wants to see; in other words, a Prometheus counter. The 'current count of uncleared errors' is what I think of as a Prometheus gauge.

Then I realized that under some circumstances, you can treat gauge-like things as if they were counters, although weird counters. What makes this work is that Prometheus's counter functions, such as rate() and increase(), look for counter resets in the range vector they're given; any drop in value from one sample to the next is taken as a reset. Normally, counter resets are infrequent; they happen when you reboot systems or restart daemons or whatever. But a ZFS count of current errors that you always deal with by clearing completely (so that it drops to zero) can be properly handled by this process of checking for counter resets. When your current error count of six checksum errors goes to zero checksum errors, Prometheus will conclude that the counter has reset, and then when it goes up again, functions such as increase() will properly add the new errors to the earlier six to work out your total accumulated checksum errors over some time range.
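
As a rough illustration of the reset handling, here is a minimal Python sketch of the idea. It is not Prometheus's actual code: the function name and the sample values are made up for the example, and it ignores the extrapolation that the real increase() applies at the edges of the time range.

```python
def reset_aware_increase(samples):
    """Total increase across samples, treating any drop as a reset to zero.

    Hypothetical helper for illustration; not part of any Prometheus library.
    """
    total = 0.0
    prev = None
    for value in samples:
        if prev is not None:
            if value < prev:
                # A drop is taken as a counter reset: assume the series
                # restarted from zero and has since climbed to `value`.
                total += value
            else:
                total += value - prev
        prev = value
    return total


# Six checksum errors accumulate, the pool is cleared (back to zero),
# and then four new errors show up.
samples = [0, 2, 6, 6, 0, 1, 4]
print(reset_aware_increase(samples))  # 10.0
```

The raw series never goes above six, but the reset-aware total over the range is ten, which is the 'persistent count' that the current value alone can't give you.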

Of course this comes with a bunch of qualifications and limitations. The most obvious qualification is that you always have to use Prometheus functions in order to work out the current cumulative value even over short time ranges; you can never just look at the current value for a 'cumulative recent errors' value. In turn this means that if you want the cumulative value over a long time range, you're going to be loading a lot of samples and the query will likely be slow.

The second issue is that this only works if you always fully clear ZFS errors (or whatever sort of pseudo-counter you have). If the value drops but doesn't go to zero, Prometheus's counter reset code will assume that it was reset to zero and then grew from there to the new value, so it will count errors that were already there as new ones. Only Prometheus gauges are allowed to drop; things you want to treat as counters can only be completely reset to zero. The third issue is that this can't handle situations where the accumulated count of whatever should be reset to zero, for example because you replaced a disk that had checksum errors with a new one (that got the same name), and so the new disk's count of total checksum errors should start from zero, not have the previous disk's count added to it.
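
Both failure modes can be seen in a small Python sketch of the reset logic. As before, this is an illustration under assumptions: the helper name and the sample series are invented, and the real increase() also extrapolates to the edges of the range, which this ignores.

```python
def reset_aware_increase(samples):
    """Sum of increases, treating any drop as a reset to zero.

    Hypothetical helper for illustration only.
    """
    total = 0
    for prev, cur in zip(samples, samples[1:]):
        # On a drop, assume a reset to zero followed by growth to `cur`.
        total += cur if cur < prev else cur - prev
    return total


# Partial clear: six errors, four of them cleared, then three new ones.
# Only three new errors really happened, but the two left-over errors
# get counted again as if they had appeared after a reset.
partial_clear = [6, 2, 5]
print(reset_aware_increase(partial_clear))  # 5

# Disk replacement: the old disk reached six errors, the new disk (same
# name) has three. The accumulated total mixes both disks together.
replaced_disk = [0, 6, 0, 3]
print(reset_aware_increase(replaced_disk))  # 9
```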

All in all, using a 'current count of uncleared errors' as a Prometheus counter is a hack. It can work and it's better than nothing, but you're better off having a true counter, one that behaves more like Prometheus expects.

(This is also giving me thoughts about using increase() on selected gauge metrics, much as you can already use resets() or changes() on them. For that matter, perhaps I should be using changes() on some counters to provide interesting information, like how many times particular SMART disk attributes went up.)
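
For comparison, here is a minimal Python sketch of what resets() and changes() count over a range of samples: resets() counts the drops and changes() counts every sample that differs from the previous one. The helper names mirror the PromQL functions but the code and sample values are only an illustration, not Prometheus's implementation.

```python
def changes(samples):
    """Number of times the value differs from the previous sample."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if cur != prev)


def resets(samples):
    """Number of times the value drops below the previous sample."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if cur < prev)


# A made-up SMART attribute that only ever goes up: it rose twice.
smart_attribute = [5, 5, 7, 7, 7, 12]
print(changes(smart_attribute))  # 2
print(resets(smart_attribute))   # 0

# A made-up 'current uncleared errors' series with one clear in the middle.
uncleared_errors = [0, 2, 6, 6, 0, 1, 4]
print(changes(uncleared_errors))  # 5
print(resets(uncleared_errors))   # 1
```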

sysadmin/PrometheusGaugesAsCounters written at 23:19:29

