Your options for displaying status over time in Grafana 9

December 14, 2022

Once upon a time, there was a straightforward and good way of displaying things like alerts over time or health check failures over time in Grafana, as I wrote about in How I'm visualizing health check history in Grafana. Unfortunately, Grafana broke the (once) very nice Discrete panel starting in 8.4, either through an unfixed bug or through an incompatible API change (in a minor release). As of the current Grafana 9.3.1 (as I write this), I've managed to find only five potential options among first and third party panels, none of them excellent.

To set the stage, we want to visualize a count of each type of (firing) Prometheus alert over time. Our basic range query for all panels is:

sum( max_over_time ( ALERTS[$__interval] ) ) by (alertname)

When one or more instances of an alert are firing, this will give us a time series point with the alertname label and a non-zero value. When the alert isn't firing at all, there will be no time series point.
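
To make the shape of that result concrete, here is a small Python sketch that imitates what the query computes for a single step. It is a simplified model and not how Prometheus evaluates anything internally: the tuple layout, the `instance` label, and the function name `alert_counts` are all made up for illustration, and the window boundaries are only approximate.

```python
from collections import defaultdict

def alert_counts(samples, step_start, step_end):
    """Imitate sum(max_over_time(ALERTS[window])) by (alertname) for one step.

    samples is a list of (alertname, instance, timestamp, value) tuples;
    this layout is hypothetical and only stands in for ALERTS series.
    """
    # max_over_time: for each series, the largest sample inside the window.
    per_series = {}
    for alertname, instance, ts, value in samples:
        if step_start < ts <= step_end:
            key = (alertname, instance)
            per_series[key] = max(per_series.get(key, value), value)

    # sum(...) by (alertname): add up the per-series maxima per alert name.
    totals = defaultdict(float)
    for (alertname, _instance), value in per_series.items():
        totals[alertname] += value
    return dict(totals)

# ALERTS series have the value 1 while an alert is active.
samples = [
    ("HostDown", "a", 10, 1),
    ("HostDown", "b", 20, 1),
    ("DiskFull", "a", 70, 1),
]
for start in (0, 60, 120):
    print(start, alert_counts(samples, start, start + 60))
```

The thing to notice is that a step in which nothing fired yields an empty result rather than a zero, which is the 'no time series point' behaviour that all of the panels below have to cope with.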

Here, in order from best to worst, is my view of your options for displaying this in a useful, readable, and accurate way:

  1. A state timeline panel, with 'merge equal consecutive values' and (I believe) more or less default values for everything else. I use the 'green yellow red' colour scheme and a tooltip mode of 'single'. This comes the closest to how the old Discrete panel looked, and it's even a more or less supported usage, as covered in Time series data with thresholds. However, don't set thresholds; if you leave things alone, it more or less works out to different colours for each numeric count. One unfortunate limitation of state timelines is that they'll always jam every label you have into the Y axis, even if this makes the label text overlap (the old Discrete panel was willing to scroll vertically in this situation).

    (State timelines are officially classified as 'beta' in Grafana 9.3.1, but they seem to work for me.)

  2. A stacked bar graph time series panel, with the stacking mode set to 'normal' and the tooltip mode set to 'all' (and you'll want the legend on). This works and correctly displays everything, but it can be hard to see and track small periods for one alert if another alert was going on at the same time; the short alert's bar just isn't very visible. But it will give you a reasonably good idea of what's going on and how many things are going on, with distinct colors for every different alert.

  3. A stacked line graph time series panel with the point size turned up, a solid line style, and 'Connect null values' set to 'never'. This can be easier to read than the stacked bar graph, but it has a long-running issue where Grafana will assign incorrect colours to various points and lines. If multiple alerts appeared and disappeared, you'll need to switch them on and off one by one to be sure of when each specific one actually happened. Bar graphs don't have this mis-colouring issue but are harder to read.

    (This issue of wrong colours has existed for a long time, even back in the old Graph panel type. I filed a bug report about it back in the Grafana 6.2.5 era in 2019, but it was closed.)

  4. A heatmap panel. This will look like it's working but if you look closely it suffers from a number of limitations, including dropping the labels for some series (ie, some alert names) if there are too many of them, displaying heatmap boxes that are too large horizontally, and often not displaying accurate and minimal time ranges for when things happened. But hey, at least it displays something and you can get a general idea of what happened (and all things seem to be present in the heatmap even if there's no label for it, so you can mouse over something to see it).

    (If you use a heatmap for its generally intended purpose, dropping labels is okay because they're numeric labels and you can interpolate the range. I'm abusing the heatmap panel here, and my abuse is catching up to me. The moral lesson is not to do that.)

  5. A status history panel, which sounds like it should be exactly what you use here except for the small issue that it mostly doesn't display anything. Grafana officially describes the status history panel as 'beta' in 9.3.1; I would describe it as 'alpha' or even 'pre-alpha'. This has been filed as Grafana issue #51259, and has been present since at least Grafana 9.0.1. When the status history panel displays things, it doesn't seem to be particularly superior to a state timeline panel for this usage.

    (Like the state timeline panel, the current status history panel will jam every label into the Y axis even if this causes them to overlap.)
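
For reference, the settings for the first three options can be sketched as panel JSON, built here in Python. Treat this as an approximation: the key names (`mergeValues`, `stacking`, `spanNulls`, and so on) are what I believe Grafana 9 writes into dashboard JSON, the helper names and the point size of 8 are arbitrary choices of mine, and you should compare the output against a panel exported from your own Grafana before relying on it.

```python
import json

QUERY = "sum( max_over_time ( ALERTS[$__interval] ) ) by (alertname)"

def base_panel(title, panel_type):
    # Minimal skeleton shared by all three sketches (hypothetical helper).
    return {
        "title": title,
        "type": panel_type,
        "targets": [{"expr": QUERY}],
        "options": {},
        "fieldConfig": {"defaults": {"custom": {}}, "overrides": []},
    }

def state_timeline():
    # Option 1: merge equal consecutive values, single tooltip,
    # green-yellow-red colour scheme, no explicit thresholds.
    panel = base_panel("Alerts (state timeline)", "state-timeline")
    panel["options"] = {"mergeValues": True, "tooltip": {"mode": "single"}}
    panel["fieldConfig"]["defaults"]["color"] = {"mode": "continuous-GrYlRd"}
    return panel

def stacked_bars():
    # Option 2: bars, stacking 'normal'; the UI's 'All' tooltip mode
    # is assumed to be stored as "multi".
    panel = base_panel("Alerts (stacked bars)", "timeseries")
    panel["options"] = {"tooltip": {"mode": "multi"}}
    panel["fieldConfig"]["defaults"]["custom"] = {
        "drawStyle": "bars",
        "stacking": {"mode": "normal"},
    }
    return panel

def stacked_lines():
    # Option 3: solid lines, larger points, never connect null values.
    panel = base_panel("Alerts (stacked lines)", "timeseries")
    panel["fieldConfig"]["defaults"]["custom"] = {
        "drawStyle": "line",
        "lineStyle": {"fill": "solid"},
        "showPoints": "always",
        "pointSize": 8,  # arbitrary 'turned up' size
        "spanNulls": False,
        "stacking": {"mode": "normal"},
    }
    return panel

if __name__ == "__main__":
    print(json.dumps(state_timeline(), indent=2))
```

The heatmap and status history panels are left out here, since neither one works well enough for this purpose to be worth writing down.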

Only the bar graph and line graph versions allow you to selectively turn on and off displaying the timelines of some alerts, so you can easily see when one specific type was active or drop a couple of especially active alerts to get a better view of the other ones. However, I find the state timeline and things like it to be a better visual overview of what's going on and when it was happening.

For our purposes, I would probably use state timeline panels as a replacement for the old Discrete panel, and continue using the (flawed) stacked line graph time series panel alongside them, since no one panel is ideal for everything. In practice we are (still) freezing at the last Grafana 8.3 release and probably will be for a long time, partly because of this issue.

(Looking at my screenshot in issue #51259, I see that the Heatmap rendering seems to have gotten worse since 9.0.1, where it would at least render small boxes some of the time and might only have had the issue that labels could disappear if there were too many of them to fit.)

PS: I'd love to find out about better alternatives for this sort of 'status over time' display in Grafana. To me it seems like such an obvious thing to want so I'm always a bit surprised that there seems to be no really good solution yet.


Comments on this page:

For the status history panel:

Try using this as your query: present_over_time((count(ALERTS{severity=~"[1234]"}) by (alertname))[$__rate_interval:$__rate_interval])

Set the minimum time interval to something sane (not 30s). Set the "No Value" value to 0, delete thresholds. Tell it to "merge equal consecutive values".

You can then set the value mappings so:

  0 -> OK -> Green
  1 -> Alert -> Red

Works pretty well. Still gets pretty bunched up if the panel is short and you have a bunch of alerts. We attach a severity label to all our alerts in alertmanager config, so you can filter by using a regex against the ALERTS metric.

By cks at 2022-12-19 22:43:37:

This approach unfortunately isn't a good solution for us. We want to see a count of how many instances of an alert have been firing, not just a zero or active indicator, and we don't want the no-alert case to show anything so that the points with alerts are easier to pick out. In addition, setting 'No value' to 0 instead of the default blank and fiddling with either value mappings or thresholds doesn't seem to make the Status History panel render any more often than before; it's still mostly blank (in the query I gave), even with a large minimum step interval.

(It is obtaining query results, because it shows the labels you'd expect.)

Written on 14 December 2022.

Last modified: Wed Dec 14 22:41:40 2022