Your options for displaying status over time in Grafana 9
Once upon a time, there was a straightforward good way of displaying things like alerts over time or health check failures over time in Grafana, as I wrote about in How I'm visualizing health check history in Grafana. Unfortunately Grafana broke the (once) very nice Discrete panel starting in 8.4, either through an unfixed bug or through an incompatible API change (in a minor release). As of the current Grafana 9.3.1 (as I write this), I've managed to find only five potential options among first and third party panels, none of them excellent.
To set the stage, we want to visualize a count of each type of (firing) Prometheus alert over time. Our basic range query for all panels is:
sum( max_over_time ( ALERTS[$__interval] ) ) by (alertname)
When one or more instances of an alert are firing, this will give
us a time series point with the
alertname label and a non-zero
value. When the alert isn't firing at all, there will be no time
series point for it.
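To make the shape of this query's result concrete, here is a minimal Python sketch that imitates it on made-up data. It is only an illustration, not how Prometheus evaluates the query: the function name `alert_counts`, the `(timestamp, alertname, instance)` sample format, and the half-open window are all assumptions of mine, and it ignores every label other than `alertname` and `instance`.

```python
from collections import defaultdict

def alert_counts(samples, start, end, step):
    """Imitate sum(max_over_time(ALERTS[step])) by (alertname).

    samples: iterable of (timestamp, alertname, instance) tuples, one per
    scrape at which that ALERTS series existed (hypothetical format).
    Returns {alertname: {timestamp: count}}; timestamps at which an alert
    had no samples in the window are simply absent.
    """
    result = defaultdict(dict)
    t = start
    while t <= end:
        # Which distinct series had at least one sample in (t - step, t]?
        active = defaultdict(set)
        for ts, alertname, instance in samples:
            if t - step < ts <= t:
                active[alertname].add(instance)
        # Each such series contributes 1, summed per alertname.
        for alertname, instances in active.items():
            result[alertname][t] = len(instances)
        t += step
    return dict(result)

samples = [
    (10, "HostDown", "a"),
    (20, "HostDown", "a"),
    (20, "HostDown", "b"),
    (50, "DiskFull", "a"),
]
print(alert_counts(samples, start=0, end=60, step=20))
# {'HostDown': {20: 2}, 'DiskFull': {60: 1}}
```

The gaps are the important part: every panel below has to cope with series that have points only while the alert exists.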
Here, in order from best to worst, is my view of your options for displaying this in a useful, readable, and accurate way:
- A state timeline
panel, with 'merge equal consecutive values' and (I believe) more or
less default values for everything else. I use the 'green yellow
red' colour scheme and a tooltip mode of 'single'. This comes the
closest to how the old Discrete panel looked, and it's even a more
or less supported usage, as covered in Time series data with thresholds.
However, don't set thresholds; if you leave things alone, it more
or less works out to different colours for each numeric count.
One unfortunate limitation of state timelines is that they'll
always jam every label you have into the Y axis, even if this
makes the label text overlap (the old Discrete panel was willing
to scroll vertically in this situation).
(State timelines are officially classified as 'beta' in Grafana 9.3.1, but they seem to work for me.)
- A stacked bar graph time series
panel, with the stacking mode set to 'normal' and the tooltip mode
set to 'all' (and you'll want the legend on). This works and correctly
displays everything, but it can be hard to see and track small periods for
one alert if another alert was going on at the same time; the short alert's
bar just isn't very visible. But it will give you a reasonably good idea
of what's going on and how many things are going on, with distinct colours
for every different alert.
- A stacked line graph time series panel with the point size turned
up, a solid line style, and 'Connect null values' set to 'never'.
This can be easier to read than the stacked bar graph, but it has
a long-running issue where Grafana will assign incorrect colours
to various points and lines. If multiple alerts appeared and
disappeared, you'll need to switch them on and off one by one to
be sure of when each specific one actually happened. Bar graphs
don't have this mis-colouring issue but are harder to read.
(This issue of wrong colours has existed for a long time, even back in the old Graph panel type. I filed a bug report about it back in the Grafana 6.2.5 era in 2019, but it was closed.)
- A heatmap
panel. This will look like it's working but if you look closely
it suffers from a number of limitations, including dropping the
labels for some series (ie, some alert names) if there are too
many of them, displaying heatmap boxes that are too large
horizontally, and often not displaying accurate and minimal time
ranges for when things happened. But hey, at least it displays
something and you can get a general idea of what happened (and
all things seem to be present in the heatmap even if there's no
label for them, so you can mouse over something to see it).
(If you use a heatmap for its generally intended purpose, dropping labels is okay because they're numeric labels and you can interpolate the range. I'm abusing the heatmap panel here, and my abuse is catching up to me. The moral lesson is not to do that.)
- A status history
panel, which sounds like it should be exactly what you use here
except for the small issue that it mostly doesn't display anything.
Grafana officially describes the status history panel as 'beta'
in 9.3.1; I would describe it as 'alpha' or even 'pre-alpha'. This
has been filed as Grafana issue #51259, and has been
present since at least Grafana 9.0.1. When the status history
panel displays things, it doesn't seem to be particularly superior
to a state timeline panel for this usage.
(Like the state timeline panel, the current status history panel will jam every label into the Y axis even if this causes them to overlap.)
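As an aside on the state timeline option above, the idea behind 'merge equal consecutive values' can be sketched in a few lines of Python. This is my own illustration of the concept, not Grafana's implementation; the function name `merge_runs`, the input format, and the choice to end a run at any missing point are assumptions.

```python
def merge_runs(points, step):
    """Collapse one series into runs of equal consecutive values.

    points: {timestamp: value} for a single alert name, with timestamps
    spaced `step` apart and missing wherever the alert wasn't firing.
    Returns a list of (start, end, value) tuples.
    """
    runs = []
    for t in sorted(points):
        value = points[t]
        # Extend the previous run only if the value is unchanged and
        # there is no gap between the previous point and this one.
        if runs and runs[-1][2] == value and runs[-1][1] == t - step:
            runs[-1] = (runs[-1][0], t, value)
        else:
            runs.append((t, t, value))
    return runs

print(merge_runs({0: 1, 20: 1, 40: 2, 80: 2}, step=20))
# [(0, 20, 1), (40, 40, 2), (80, 80, 2)]
```

Each tuple corresponds to one coloured block on the timeline, which is why different counts end up as differently coloured stretches.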
Only the bar graph and line graph versions allow you to selectively turn on and off displaying the timelines of some alerts, so you can easily see when one specific type was active or drop a couple of especially active alerts to get a better view of the others. However, I find the state timeline and things like it to be a better visual overview of what's going on and when it was happening.
For our purposes, I would probably use state timeline panels as a replacement for the old Discrete panel, and continue using the (flawed) stacked line graph time series panel alongside them, since no one panel is ideal for everything. In practice we are (still) freezing at the last Grafana 8.3 release and probably will be for a long time, partly because of this issue.
(Looking at my screenshot in issue #51259, I see that the Heatmap rendering seems to have gotten worse since 9.0.1, where it would at least render small boxes some of the time and might only have had the issue that labels could disappear if there were too many of them to fit.)
PS: I'd love to find out about better alternatives for this sort of 'status over time' display in Grafana. To me it seems like such an obvious thing to want, so I'm always a bit surprised that there seems to be no really good solution yet.