Some notes on heatmaps and histograms in Prometheus and Grafana
On Mastodon (or if you prefer, the Fediverse), I mentioned:
I have now reached the logical end point of running Prometheus on my desktop, which is that I have installed Grafana so I can see heatmap graphs of my disk IO latency distributions generated from the Cloudflare eBPF exporter.
It's kind of neat once I got all the bits going.
This isn't my first go-around on heatmaps and histograms, but this time around I found new clever mistakes to make on top of my existing confusions. So it's time for some notes, in the hopes that they will make next time easier.
Grafana can make heatmaps out of at least
two different sorts of Prometheus metrics, showing the distribution
of numeric values over time (a value heatmap). The first sort, which is simpler and
the default if you set up a heatmap panel, is gauges or gauge-like
things, such as the number of currently active Apache processes or
the amount of CPU usage over the past minute (which you would
generate with rate()
from the underlying counters). You could
visualize these metrics in a conventional graph, but in many cases
the graph would wiggle around madly and it would be hard to see
much in it. Showing the same data in a heatmap may provide more
useful and readable information.
When used this way, Grafana automatically works out the heatmap buckets to use from the data values and groups everything together, and it is all very magical. Grafana takes multiple samples for each bucket's time range, but not all that many samples, and there is no real way to control this. In particular, as the time range goes up, Grafana will sample your metric at steadily coarser resolution, even though it could use a finer resolution to get more detailed information for each bucket. As a consequence, for gauges you almost certainly want to use avg_over_time() or max_over_time() instead of the raw metric. (Using rate() on a counter already gives you this implicitly.)
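As a concrete sketch (with an invented metric name), the heatmap query for a gauge might look something like:

    avg_over_time(apache_active_processes[$__interval])

or the same thing with max_over_time() if you care more about peaks than averages. Here apache_active_processes is a hypothetical gauge; $__interval is Grafana's built-in variable for the panel's current query step.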
The other sort of Grafana heatmap is made from Prometheus histogram metrics, which the Grafana documentation calls 'pre-bucketed data'. With these, you have to go to the Axes tab of the panel and set the "Data format" to "Time series buckets", and you also normally set the "Legend format" to '{{le}}' in the Metrics tab so that the Y axis comes out right. Failing to change the data format will give you very puzzling heatmaps, and it's not at all obvious what's wrong or how to fix it.
(It's a real pity that Grafana doesn't auto-detect that this is a Prometheus histogram metric and automatically switch the data format and so on for you. It would make things much more usable and friendly.)
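For what it's worth, the query side of such a panel looks roughly like this (with a made-up metric name), together with a 'Legend format' of '{{le}}' and the 'Time series buckets' data format:

    rate(myapp_request_duration_seconds_bucket[$__rate_interval])

The important bits are that you query the '_bucket' time series and that you don't aggregate away the 'le' label. Whether you actually need the rate() depends on what sort of histogram it is, which is the next wrinkle.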
Prometheus histogram metrics can be either counters or gauges. A histogram of the number of IMAP connections per user would be a gauge histogram, because it changes up and down as people log on and off. A histogram of disk IO latency is a counter histogram; it will normally only count up. You need to rate() or increase() counter histograms in order to get useful heatmap displays; gauge histograms can be used as-is, although you probably want to consider running them through avg_over_time() or max_over_time().
(Prometheus's metric type information doesn't distinguish between these two sorts of histograms. If you're lucky, the HELP text for a particular histogram tells you which it is; if you're not, you get to deduce it from what the histogram is measuring and how it behaves over time.)
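To make the difference concrete, here is roughly how I would query each sort in a heatmap panel, using invented metric names:

    # counter histogram (eg disk IO latency): rate it first
    rate(disk_io_latency_seconds_bucket[$__rate_interval])

    # gauge histogram (eg current IMAP connections): use as-is or smooth it
    max_over_time(imap_connections_bucket[$__interval])

Both metric names are hypothetical placeholders; the real point is the rate() (or increase()) on the counter version.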
One easy mistake to make is to have your heatmap metric query in Grafana actually return more than one metric sequence. For instance, when I first set up a heatmap for my disk latency metrics, I didn't realize that they came in a 'read' and a 'write' version for each disk. The resulting combined heatmap was rather confusing, with all sorts of nonsensical bucket counts. In theory you can put such multiple metrics in the same heatmap by creating separate names in the legend format, for example '{{le}} {{operation}}', but in practice this gives you two (or more) heatmaps stacked on top of each other, which is not necessarily what you want. As far as I know, there's no way to combine or superimpose two metrics in the same heatmap. Sadly, this does result in an explosion of heatmaps for things like disk latency, so you probably want to use some Grafana dashboard variables to select what disk (or perhaps disks) you want heatmaps from.
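As a sketch of how that works out for disk latency (with placeholder metric and label names), you might have a dashboard variable called $disk and a per-operation query along the lines of:

    rate(bio_latency_seconds_bucket{device="$disk", operation="read"}[$__rate_interval])

with a separate panel (or at least a separate query) for the 'write' operation. The actual metric and label names depend on how your exporter is configured, so treat these as illustrations only.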
It seems surprisingly hard to find a colour scheme for Grafana heatmaps that both has a pleasant variation from common to uncommon values while still clearly showing that uncommon values are present. By default, Grafana seems to want to fade uncommon values out almost to invisibility, which is not what I want; I want uncommon values to stand out, because they are one of the important things I'm looking for with heatmaps and histograms in general. Perhaps this is a sign that Grafana heatmaps are not actually the best way of looking for unusual values in Prometheus histograms, although they are probably a good way of looking at details once I know that some are present.
(I've also learned some hard lessons about hand-building histogram metrics for Prometheus. My overall advice there is to delegate the job to someone else's code if you have the choice, because it's really hard to get right if you're doing it all yourself.)
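To give a sense of why it's tricky, a hand-built histogram has to come out in Prometheus's exposition format with cumulative buckets, a '+Inf' bucket, and matching _sum and _count series, something like this sketch with made-up names and numbers:

    myhist_seconds_bucket{le="0.1"} 240
    myhist_seconds_bucket{le="1"} 255
    myhist_seconds_bucket{le="+Inf"} 260
    myhist_seconds_sum 42.7
    myhist_seconds_count 260

Keeping the buckets cumulative (each 'le' bucket includes everything below it) and keeping _count equal to the '+Inf' bucket are exactly the sorts of details that are easy to quietly get wrong.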
PS: for things like disk IO latency distributions, where the tail end is multiple seconds but involves fractions like '2.097152', it helps to explicitly set the Y axis decimals to '1' instead of leaving it on auto. This helps the Y axis label take up less space so the buckets get more of it. For disk IO sizes, I even set the decimals to '0'. Grafana's obsession with extreme precision in this sort of thing is both impressive and irksome.