How I'm visualizing health check history in Grafana

October 28, 2018

In our in-progress Prometheus and Grafana setup, we're doing an assortment of black-box health checks on various machines and services. Once you have health checks, one of the obvious things to want is a visualization of their history; when did health checks fail, and how many failed at that time, and so on. Among other things, this is useful if you want to look for flaky health checks that fail sometimes but not for long enough to trigger alerts. Do your pings or DNS lookups or whatever occasionally fail, or even regularly fail, and for how long?

(This is especially useful if you add a new health check and want to know if it's reliable before you turn on alerting based on it.)

I have made a number of attempts to visualize this history information in Grafana, and today I'm going to run down what I've tried and how well I think each approach has worked.

The first and most obvious attempt is with a Grafana graph panel, graphing the result of 'count(probe_success == 0) by (probe)' (and using '{{probe}}' as the label). You probably want to turn on stacking (because otherwise probes with the same number of failures will overlap) and likely points (to make it easier to see what a spot means). This works in that you get a clear display of what failed and how much, but it has the limitation that it's pretty hard to see how long a failure lasted; you're reduced to carefully eyeballing the start and end points, or hovering your mouse over the lines to see the timestamps.
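(To make it concrete what that query computes, here is a minimal Python sketch of its logic at a single evaluation time. The sample data, host names, and probe names are all invented for illustration; only 'probe_success' and the 'probe' label come from the query above. It is a model of the query's behaviour, not of how Prometheus implements it.)

```python
from collections import Counter

# Hypothetical instant-vector samples for probe_success: (labels, value).
# The instance and probe names are made up for this example.
samples = [
    ({"probe": "ping", "instance": "hosta"}, 1),
    ({"probe": "ping", "instance": "hostb"}, 0),
    ({"probe": "ping", "instance": "hostc"}, 0),
    ({"probe": "dns", "instance": "hosta"}, 1),
    ({"probe": "ssh", "instance": "hostb"}, 0),
]

def count_failed_by_probe(samples):
    """Mimic count(probe_success == 0) by (probe) for one evaluation time."""
    # '== 0' acts as a filter: samples that don't match are dropped.
    failed = [labels for labels, value in samples if value == 0]
    # 'count ... by (probe)' counts the surviving samples per probe label.
    return dict(Counter(labels["probe"] for labels in failed))

result = count_failed_by_probe(samples)
print(result)
```

Note that the 'dns' probe doesn't appear in the result at all, since nothing survived the filter for it. The query produces no series for a probe that has no failures, not a series with the value zero.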

What should be a natural display for this is a heatmap, with the X axis being time, the Y axis being probes, and the cells coloured by how many of each probe were failing at the time (or over the time period). Unfortunately Grafana's heatmap panel cannot do this. This sort of heatmap is apparently called a utilization heatmap, and Grafana currently only does what I will call value heatmaps, where the Y axis is always numeric.

My next attempt was with the third party Statusmap panel (GitHub), which is explained in this article by its creators. Statusmap looked like a natural fit for our situation, but it didn't work out so well in practice, probably because I'm using it in a situation it's not really intended for. As far as I can tell from reading the documentation, it is primarily designed for a situation where you have a few discrete statuses and you're willing to pre-process them in a fairly elaborate way (as described in the article and in the GitHub readme). This isn't our situation, and when I tried a Statusmap I ran into an issue where I would only see failures if the time resolution was small enough. Also, I wasn't getting start and end times.

The current solution I'm using is another third party panel, the Discrete panel (GitHub), which is intended to show discrete values in a horizontal graph. This gives me pretty much what I want; the X axis is time, the Y axis is the health check probe name, and the 'discrete value' is the count of how many failed health checks there were. This seems to reliably notice and show failures even at reasonably large time scales (although if you zoom out far enough, short failures disappear) and hovering over a particular 'discrete value' will give me the duration (necessarily rounded to the query interval that Grafana is using).
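(The idea behind this sort of display can be sketched in a few lines of Python: collapse the query's points into runs of equal consecutive values, and report each run's start, end, and value. This is only my illustration of the concept, with a hypothetical function name and invented data, and is not taken from the Discrete panel's actual code.)

```python
from itertools import groupby

def discrete_segments(points):
    """Collapse (timestamp, value) points into runs of equal consecutive values.

    Returns (start, end, value) tuples. A run ends where the next run
    starts; the final run ends at its own last timestamp.
    """
    segments = []
    for value, run in groupby(points, key=lambda p: p[1]):
        run = list(run)
        segments.append([run[0][0], run[-1][0], value])
    # Extend each run up to the start of the next one, so that durations
    # come out as multiples of the step between points.
    for cur, nxt in zip(segments, segments[1:]):
        cur[1] = nxt[0]
    return [tuple(s) for s in segments]

# Hypothetical failure counts for one probe, one point every 60 seconds.
points = [(0, 0), (60, 0), (120, 2), (180, 2), (240, 1), (300, 0)]
for start, end, value in discrete_segments(points):
    print(f"{value} failing from t={start} to t={end} ({end - start}s)")
```

Since a run can only start or end on a query point, every duration is rounded to the query interval, and a failure shorter than the interval can fall between two points and never show up.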

The Discrete panel is probably not a good fit for a true utilization heatmap where you have constantly changing values (for instance, if you were doing a utilization heatmap of disk latencies over time), but that's generally not the case here. Most of the time our count of failed health checks is zero, and it's always an integer (hopefully a small one). But perhaps someday someone will create a true utilization heatmap panel for Grafana.

(I'm honestly surprised that no one has yet, even and especially Grafana themselves; it seems such an obvious need and desire. There are a ton of things you would naturally display in a utilization heatmap. But, well, it's open source. I get to keep all of the pieces, and scratch my own itches if I care enough.)

PS: It's entirely possible that I'm missing an obvious good way to do this, too, since my experience with Grafana so far is relatively shallow and limited. I did try some Internet searches but couldn't find very much.

Sidebar: Where a Grafana heatmap works well

Suppose that you have a Prometheus gauge metric that measures how many busy Apache worker processes your web server has (probably gathered through a third party exporter, likely this one although there's also apparently this one), and you want to visualize how busy your Apache server is. You could do this as a straight graph, but it would probably be spiky and jumpy, especially over larger time ranges, and the result could be hard to read.

(You'd probably want to use Prometheus's avg_over_time aggregation, because otherwise you're sampling the instantaneous value at the end of every query step through the time range. You could also use max_over_time if you wanted to know the high water mark of active workers.)
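(Here is a small Python sketch of the difference, assuming invented samples and a simplified window rule. The helper names mirror the Prometheus functions but this is only a model of the idea; real Prometheus has its own rules for range boundaries and for how far back an instant query looks for a sample.)

```python
# Hypothetical busy-worker samples, one every 15 seconds: (timestamp, value).
values = [2, 30, 4, 3, 25, 5, 2, 6]
samples = list(zip(range(0, 120, 15), values))

def window(samples, end, width):
    """Values with end - width < timestamp <= end (a simplified range selector)."""
    return [v for t, v in samples if end - width < t <= end]

def instant(samples, end, width):
    """Only the most recent sample at the end of the step."""
    return window(samples, end, width)[-1]

def avg_over_time(samples, end, width):
    vals = window(samples, end, width)
    return sum(vals) / len(vals)

def max_over_time(samples, end, width):
    return max(window(samples, end, width))

# Two query steps of 60 seconds each, evaluated at t=45 and t=105.
for end in (45, 105):
    print(end, instant(samples, end, 60),
          avg_over_time(samples, end, 60),
          max_over_time(samples, end, 60))
```

With this made-up data, the instantaneous value at the second step is 6 even though the worker count hit 25 within that step; the average and the maximum both reflect the spike, each in its own way.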

A heatmap is a good fit for this information, because it shows you a two dimensional summary of the worker count over a large time range. Grafana will work out the buckets for 'number of active workers' on its own, and then it will count how many times the number of workers falls into each bucket and show you how that distribution looks. If you zoom right in to a small time range, this decays into an imprecise version of a graph of the same data, but at larger scales it's useful.
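(The bucketing itself is simple to sketch in Python. Here I'm assuming fixed bucket widths that I picked by hand and invented data; Grafana chooses its own bucket sizes, and this is not its actual algorithm, just the general idea of counting points per cell.)

```python
from collections import Counter

def heatmap_cells(points, time_width, value_width):
    """Count points per (time bucket start, value bucket start) cell."""
    cells = Counter()
    for t, v in points:
        time_bucket = t // time_width * time_width
        value_bucket = v // value_width * value_width
        cells[(time_bucket, value_bucket)] += 1
    return dict(cells)

# Hypothetical busy-worker counts, one point every 60 seconds.
points = [(0, 3), (60, 7), (120, 12), (180, 4), (240, 25),
          (300, 8), (360, 9), (420, 31), (480, 2), (540, 5)]

# Five-minute time buckets, worker-count buckets that are 10 wide.
cells = heatmap_cells(points, 300, 10)
for (t, v), n in sorted(cells.items()):
    print(f"t={t} workers {v}-{v + 9}: {n} points")
```

Each cell's count is what would drive its colour. With only one point per time bucket, every column has a single occupied cell, which is the 'imprecise graph' case you get when zoomed right in.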

(It's also straightforward to configure. Your query is just the raw metric, possibly averaged over time, and the only panel options you'll probably want to play with are the heatmap colours and maybe whether to show the histogram in the tooltip. Grafana does everything else itself.)

Written on 28 October 2018.

Last modified: Sun Oct 28 18:54:56 2018