Some notes on heatmaps and histograms in Prometheus and Grafana

February 17, 2019

On Mastodon (or if you prefer, the Fediverse), I mentioned:

I have now reached the logical end point of running Prometheus on my desktop, which is that I have installed Grafana so I can see heatmap graphs of my disk IO latency distributions generated from the Cloudflare eBPF exporter.

It's kind of neat once I got all the bits going.

This isn't my first go-around on heatmaps and histograms, but this time around I found new clever mistakes to make on top of my existing confusions. So it's time for some notes, in the hopes that they will make next time easier.

Grafana can make heatmaps out of at least two different sorts of Prometheus metrics, showing the distribution of numeric values over time (a value heatmap). The first sort, which is simpler and the default if you set up a heatmap panel, is gauges or gauge-like things, such as the number of currently active Apache processes or the amount of CPU usage over the past minute (which you would generate with rate() from the underlying counters). You could visualize these metrics in a conventional graph, but in many cases the graph would wiggle around madly and it would be hard to see much in it. Showing the same data in a heatmap may provide more useful and readable information.

When used this way, Grafana automatically works out the heatmap buckets to use from the data values and groups everything together, and it is all very magical. Grafana takes multiple samples for every bucket's time range, but not all that many samples, and there is no real way to control this. In particular, as the time range goes up Grafana will sample your metric at steadily coarser resolution, even though it could use a finer resolution to get more detailed information for buckets. As a consequence, for gauges you almost certainly want to use avg_over_time or max_over_time instead of the raw metric.

(Using rate() on a counter already gives you this implicitly.)
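As a concrete sketch of this (the metric name here is invented; $__interval is Grafana's query-interval variable), a gauge destined for a heatmap might be queried as:

```promql
# Smooth a gauge over Grafana's query step instead of point-sampling it;
# apache_active_processes is a hypothetical metric name.
avg_over_time(apache_active_processes[$__interval])

# Or, if you care about seeing peaks rather than averages:
max_over_time(apache_active_processes[$__interval])
```

Either way every underlying sample in the step contributes to the value, instead of whatever single sample Grafana's coarse resolution happens to land on.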

The other sort of Grafana heatmap is made from Prometheus histogram metrics, which the Grafana documentation calls 'pre-bucketed data'. With these, you have to go to the Axes tab of the panel and set the "Data format" to "Time series buckets", and you also normally set the "Legend format" to '{{le}}' in the Metrics tab so that the Y axis can come out right. Failing to change the data format will give you very puzzling heatmaps, and it is not at all obvious what's wrong or how to fix it.

(It's a real pity that Grafana doesn't auto-detect that this is a Prometheus histogram metric and automatically switch the data format and so on for you. It would make things much more usable and friendly.)
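For what it's worth, a minimal query for this sort of heatmap looks something like the following (the metric name is made up, but real Prometheus histogram metrics do end in _bucket and carry an le label):

```promql
# Per-second rate of each histogram bucket. Combined with a legend
# format of '{{le}}' and the 'Time series buckets' data format, this
# gives Grafana one series per bucket boundary.
rate(disk_io_latency_seconds_bucket[$__interval])
```

With 'Legend format' set to '{{le}}' in the Metrics tab and 'Data format' set to 'Time series buckets' in the Axes tab, the Y axis comes out as the histogram's bucket boundaries.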

Prometheus histogram metrics can be either counters or gauges. A histogram of the number of IMAP connections per user would be a gauge histogram, because it changes up and down as people log on and off. A histogram of disk IO latency is a counter histogram; it will normally only count up. You need to rate() or increase() counter histograms in order to get useful heatmap displays; gauge histograms can be used as-is, although you probably want to consider running them through avg_over_time or max_over_time.

(Prometheus's metric type information doesn't distinguish between these two sorts of histograms. If you're lucky, the HELP text for a particular histogram tells you which it is; if you're not, you get to deduce it from what the histogram is measuring and how it behaves over time.)

One easy mistake to make is to have your heatmap metric query in Grafana actually return more than one metric sequence. For instance, when I first set up a heatmap for my disk latency metrics, I didn't realize that they came in a 'read' and a 'write' version for each disk. The resulting combined heatmap was rather confusing, with all sorts of nonsensical bucket counts. In theory you can put such multiple metrics in the same heatmap by creating separate names in the legend format, for example '{{le}} {{operation}}', but in practice this gives you two (or more) heatmaps stacked on top of each other, which is not necessarily what you want. As far as I know, there's no way to combine or superimpose two metrics in the same heatmap. Sadly, this does result in an explosion of heatmaps for things like disk latency, so you probably want to use some Grafana dashboard variables to select what disk (or perhaps disks) you want heatmaps from.
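A sketch of the fix (the device and operation label names here are assumptions; your exporter may use different ones):

```promql
# Pin down every extra label so each 'le' value maps to exactly one
# time series, using a Grafana dashboard variable for the disk.
rate(disk_io_latency_seconds_bucket{device="$disk", operation="read"}[$__interval])
```

With the extra labels constrained, each bucket boundary is a single series again and the bucket counts stop being nonsense.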

It seems surprisingly hard to find a colour scheme for Grafana heatmaps that has a pleasant variation from common to uncommon values while still clearly showing that uncommon values are present. By default, Grafana seems to want to fade uncommon values out almost to invisibility, which is not what I want; I want uncommon values to stand out, because they are one of the important things I'm looking for with heatmaps and histograms in general. Perhaps this is a sign that Grafana heatmaps are not actually the best way of looking for unusual values in Prometheus histograms, although they are probably a good way of looking at details once I know that some are present.

(I've also learned some hard lessons about hand-building histogram metrics for Prometheus. My overall advice there is to delegate the job to someone else's code if you have the choice, because it's really hard to get right if you're doing it all yourself.)

PS: for things like disk IO latency distributions, where the tail end is multiple seconds but involves fractions like '2.097152', it helps to explicitly set the Y axis decimals to '1' instead of leaving it on auto. This helps the Y axis label take up less space so the buckets get more of it. For disk IO sizes, I even set the decimals to '0'. Grafana's obsession with extreme precision in this sort of thing is both impressive and irksome.
