Wandering Thoughts archives


How I'm visualizing health check history in Grafana

In our in-progress Prometheus and Grafana setup, we're doing an assortment of black-box health checks on various machines and services. Once you have health checks, one of the obvious things to want is a visualization of their history; when did health checks fail, and how many failed at that time, and so on. Among other things, this is useful if you want to look for flaky health checks that fail sometimes but not for long enough to trigger alerts. Do your pings or DNS lookups or whatever occasionally fail, or even regularly fail, and for how long?

(This is especially useful if you add a new health check and want to know if it's reliable before you turn on alerting based on it.)

I have made a number of attempts to visualize this history information in Grafana, and today I'm going to run down what I've tried, how well I think each has worked, and what I feel about them.

The first and most obvious attempt is with a Grafana graph panel, graphing the result of 'count(probe_success == 0) by (probe)' (and using '{{probe}}' as the label). You probably want to turn on stacking (because otherwise probes with the same number of failures will overlap) and likely points (to make it easier to see what a spot means). This works in that you get a clear display of what failed and how much, but it has the limitation that it's pretty hard to see how long the failure lasted for; you're reduced to trying to carefully eyeball the start and end points, or get your mouse over the lines to see the date stamps.
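
To make it concrete what that query produces at a single instant, here is a minimal Python sketch of the same counting. The 'probe' label comes from the query above; the instance names and sample values are invented for illustration. One thing it shows is that a probe with no failures produces no result at all rather than a zero, which is my understanding of how a filtered PromQL expression like this behaves.

```python
from collections import Counter

def count_failures_by_probe(samples):
    """Rough equivalent of count(probe_success == 0) by (probe).

    samples is an iterable of (labels, value) pairs for probe_success
    at one point in time.
    """
    counts = Counter()
    for labels, value in samples:
        # 'probe_success == 0' keeps only the failing series.
        if value == 0:
            counts[labels["probe"]] += 1
    return dict(counts)

# Hypothetical sample data; the instance names are made up.
samples = [
    ({"probe": "ping", "instance": "hosta"}, 1),
    ({"probe": "ping", "instance": "hostb"}, 0),
    ({"probe": "ping", "instance": "hostc"}, 0),
    ({"probe": "dns", "instance": "ns1"}, 1),
    ({"probe": "ssh", "instance": "hosta"}, 0),
]
failures = count_failures_by_probe(samples)
print(failures)
```

Here 'dns' is absent from the result instead of being reported as 0, so in a graph it shows up as a gap rather than a flat line at zero.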

What should be a natural display for this is a heatmap, with the X axis being time, the Y axis being probes, and the cells coloured by how many of each probe were failing at the time (or over the time period). Unfortunately Grafana's heatmap panel cannot do this. This sort of heatmap is apparently called a utilization heatmap, and Grafana currently only does what I will call value heatmaps, where the Y axis is always numeric.

My next attempt was with the third party Statusmap panel (Github), which is explained in this article by its creators. Statusmap looked like a natural fit for our situation, but it didn't work out so well in practice, probably because I'm using it in a situation it's not really intended for. As far as I can tell from reading the documentation, it is primarily designed for a situation where you have a few discrete statuses and you're willing to pre-process them in a fairly elaborate way (as described in the article and in the Github readme). This isn't our situation, and when I tried a Statusmap I ran into an issue where I would only see failures if the time resolution was small enough. Also, I wasn't getting start and end times.

The current solution I'm using is another third party panel, the Discrete panel (Github), which is intended to show discrete values in a horizontal graph. This gives me pretty much what I want; the X axis is time, the Y axis is the health check probe name, and the 'discrete value' is the count of how many failed health checks there were. This seems to reliably notice and show failures even at reasonably large time scales (although if you zoom out far enough, short failures disappear) and hovering over a particular 'discrete value' will give me the duration (necessarily rounded to the query interval that Grafana is using).
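
The idea of turning a series of failure counts into 'discrete values' with durations can be sketched in a few lines of Python. This is only an illustration of the concept under my own assumptions, not how the Discrete panel is actually implemented; the function name, the evenly spaced points, and the sample data are all hypothetical.

```python
from itertools import groupby

def discrete_runs(points, step):
    """Collapse (timestamp, value) points into runs of equal values.

    points must be sorted by time and spaced 'step' seconds apart.
    Returns (start, end, value) tuples; a run is assumed to last
    until one step after its final point.
    """
    runs = []
    for value, group in groupby(points, key=lambda p: p[1]):
        group = list(group)
        start = group[0][0]
        end = group[-1][0] + step
        runs.append((start, end, value))
    return runs

# Hypothetical failure counts for one probe, one point per minute.
points = [(0, 0), (60, 0), (120, 2), (180, 2), (240, 1), (300, 0)]
runs = discrete_runs(points, step=60)

# Keep only the runs where something was failing: (start, duration, count).
failures = [(start, end - start, value) for start, end, value in runs if value > 0]
print(failures)
```

Because the points only exist once per step, every duration comes out as a multiple of the step, which is the same rounding to the query interval mentioned above. It also suggests why very short failures can vanish when you zoom out: if the step is larger than the failure, there may be no point that lands inside it.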

The Discrete panel is probably not a good fit for a true utilization heatmap where you have constantly changing values (for instance, if you were doing a utilization heatmap of disk latencies over time), but that's generally not the case here. Most of the time our count of failed health checks is zero, and it's always an integer (hopefully a small one). But perhaps someday someone will create a true utilization heatmap panel for Grafana.

(I'm honestly surprised that no one has yet, even and especially Grafana themselves; it seems such an obvious need and desire. There are a ton of things you would naturally display in a utilization heatmap. But, well, it's open source. I get to keep all of the pieces, and scratch my own itches if I care enough.)

PS: It's entirely possible that I'm missing an obvious good way to do this, too, since my experience with Grafana so far is relatively shallow and limited. I did try some Internet searches but couldn't find very much.

Sidebar: Where a Grafana heatmap works well

Suppose that you have a Prometheus gauge metric that measures how many busy Apache worker processes your web server has (probably gathered through a third party exporter, likely this one although there's also apparently this one), and you want to visualize how busy your Apache server is. You could do this as a straight graph, but it would probably be spiky and jumpy, especially over larger time ranges, and the result could be hard to read.

(You'd probably want to use Prometheus's avg_over_time aggregation, because otherwise you're sampling the instantaneous value at the end of every query step through the time range. You could also use max_over_time if you wanted to know the high water mark of active workers.)
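
The difference between these three choices is easy to see with made-up numbers. The following Python sketch groups raw samples into fixed windows and reports the last value, the average, and the maximum for each. It is a simplification: real Prometheus range selectors look backward from each evaluation time instead of using aligned windows, and the function name and sample values here are hypothetical.

```python
def summarize_windows(samples, step):
    """Group (timestamp, value) samples into windows of 'step' seconds.

    For each window, report the last sample (roughly what plain
    sampling at each query step sees), the average, and the maximum.
    """
    windows = {}
    for ts, value in samples:
        windows.setdefault(ts // step, []).append(value)
    rows = []
    for k in sorted(windows):
        values = windows[k]
        rows.append({
            "window_start": k * step,
            "last": values[-1],
            "avg": sum(values) / len(values),
            "max": max(values),
        })
    return rows

# Hypothetical busy-worker counts scraped every 15 seconds.
samples = [(0, 2), (15, 40), (30, 3), (45, 1),
           (60, 5), (75, 5), (90, 30), (105, 6)]
rows = summarize_windows(samples, step=60)
for row in rows:
    print(row)
```

In the first window the last sample is 1 even though the worker count briefly hit 40, so plain sampling would miss the spike entirely; the average smooths it and the maximum preserves it.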

A heatmap is a good fit for this information, because it shows you a two dimensional summary of the worker count over a large time range. Grafana will work out the buckets for 'number of active workers' on its own, and then it will count how many times the number of workers falls into each bucket and show you how that distribution looks. If you zoom right in to a small time range, this decays into an imprecise version of a graph of the same data, but at larger scales it's useful.
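
The bucketing behind such a heatmap can be sketched as follows. Grafana picks the bucket sizes itself; the fixed sizes, the function name, and the sample data here are assumptions made purely to show the counting.

```python
from collections import Counter

def heatmap_cells(samples, time_bucket, value_bucket):
    """Count how many samples fall into each (time, value) cell.

    Each cell is keyed by the lower edge of its time bucket and the
    lower edge of its value bucket.
    """
    cells = Counter()
    for ts, value in samples:
        cell = (ts // time_bucket * time_bucket,
                value // value_bucket * value_bucket)
        cells[cell] += 1
    return dict(cells)

# Hypothetical busy-worker counts scraped every 15 seconds.
samples = [(0, 2), (15, 40), (30, 3), (45, 1),
           (60, 5), (75, 5), (90, 30), (105, 6)]
cells = heatmap_cells(samples, time_bucket=60, value_bucket=10)
for cell, count in sorted(cells.items()):
    print(cell, count)
```

Each time column ends up with a small distribution (mostly 0 to 9 workers, with one sample in a much higher bucket), which is the information the cell colours convey.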

(It's also straightforward to configure. Your query is just the raw metric, possibly averaged over time, and the only panel options you'll probably want to play with are the heatmap colours and maybe whether to show the histogram in the tooltip. Grafana does everything else itself.)

sysadmin/GrafanaVisualizeHistory written at 18:54:56

Link: HiDPI on dual 4K monitors with Linux

Vincent Bernat's article HiDPI on dual 4K monitors with Linux (via) is about what you'd expect it to be about and is, as they say, relevant to my interests. Especially relevant to me is the section on HiDPI support on Linux with X11, which runs down a collection of issues and contains a very useful chart about what is supported in what application and toolkit, which added some information that I hadn't known.

Note that Bernat's experience with xterm and rxvt doesn't match mine, perhaps because we're setting the X-level DPI information in somewhat different ways. My experience, as covered here, is that plain X applications using XFT fonts scale them appropriately once you get the DPI set everywhere (ie, if you tell xterm to use Monospace-12, you will get an actual 12 point size on your HiDPI monitor, not 12 points at 96 DPI and thus tiny fonts). If you use bitmap fonts, though, you're in trouble and unfortunately xterm still uses those by default for some things, like its popup menus.

(It's the nature of these articles to become out of date over time as HiDPI support improves and changes, but it's still a useful snapshot and some of these applications will probably never change.)

links/HiDPIOnDualMonitors written at 16:17:29

The obviousness of inheritance blinded me to the right solution

This is a Python programming war story.

I recently wrote a program to generate things to drive low disk space alerts for our ZFS filesystems in our in-progress Prometheus monitoring system. ZFS filesystems are grouped together into ZFS pools, and in our environment it makes sense to alert on low free space in either or both (ZFS filesystems can run out of space without their pool running out of space). Since we have a lot of filesystems and many fewer pools, it also makes sense to be able to set a default filesystem alert level on a per-pool basis (and then perhaps override it for specific filesystems). The actual data that drives Prometheus must be on a per-object basis, so one thing the program has to do is expand those default alert levels out to be specific alerts for every filesystem in the pool without a specific alert level.

When I began coding the Python to parse the configuration file and turn it into a data representation, I started by thinking about the data representation. It seemed intuitively clear and obvious that a ZFS pool and a ZFS filesystem are almost the same thing, except that a ZFS pool has a bit more information, and therefore they should be in a general inheritance relationship with a fundamental base class (written here using attrs):

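import attr

# Common attributes shared by everything we can alert on.
@attr.s
class AlertObj:
  name = attr.ib()
  level = attr.ib()
  email = attr.ib()

@attr.s
class FSystem(AlertObj):
  pass

# A pool also carries the default alert level for its filesystems.
@attr.s
class Pool(AlertObj):
  fs_level = attr.ib()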

I wrote the code and it worked, but the more code I wrote, the more awkward things felt. As I got further and further in, I wound up adding ispool() methods and calling them here and there, and there was a tangle of things operating on this and that. It all just felt messy. Something was wrong but I couldn't really see what at the time.

For unrelated reasons, we wound up wanting to significantly revise how we drove low disk space alerts and rather than modify my first program, I opted to start over from scratch. One reason for this was because with the benefit of a little bit of distance from my own code, I could see that inheritance was the wrong data model for my situation. The right natural data representation was to have two completely separate sets of objects, one set for directly set alert levels, which lists both pools and filesystems, and one for default alert levels (which only contains pools because they're the only thing that creates default alert levels). The objects all have the same attributes (they only need name, level, and email).

This made the processing logic much simpler. Parsing the configuration file returns both sets of objects, the direct set and the defaultable set. Then we go through the second set and for each pool entry in it, we look up all of the filesystems in that pool and add them to the first set if they aren't already there. There is no Python inheritance in sight and everything is obviously right and straightforward.
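
The shape of that processing can be sketched roughly like this. It is not the actual program; it uses a standard library dataclass as a stand-in for attrs so that it runs without extra packages, and the names (Alert, expand_defaults, filesystems_in_pool), the level strings, and the sample pool contents are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Alert:
    # The same three attributes serve for direct and default alerts.
    name: str
    level: str
    email: str

def expand_defaults(direct, defaults, filesystems_in_pool):
    """Return the full set of alerts, keyed by object name.

    direct maps pool and filesystem names to explicitly set alerts.
    defaults is a list of per-pool default alerts. filesystems_in_pool
    is a callable returning the filesystem names in a given pool.
    """
    result = dict(direct)
    for pool_default in defaults:
        for fs in filesystems_in_pool(pool_default.name):
            # An explicitly set alert always wins over the pool default.
            if fs not in result:
                result[fs] = Alert(fs, pool_default.level, pool_default.email)
    return result

# Hypothetical configuration and pool contents.
direct = {
    "tank": Alert("tank", "10G", "ops@example.org"),
    "tank/home/alice": Alert("tank/home/alice", "1G", "alice@example.org"),
}
defaults = [Alert("tank", "5G", "ops@example.org")]
pool_contents = {"tank": ["tank/home/alice", "tank/home/bob"]}

result = expand_defaults(direct, defaults,
                         lambda pool: pool_contents.get(pool, []))
for name in sorted(result):
    print(result[name])
```

Nothing here needs to know whether an object is a pool or a filesystem; the only distinction that matters is which set an entry came from.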

In the new approach, it would also be relatively easy to add default alert levels that are driven by other sorts of things, for instance an idea of who owns a particular entity (pools are often owned collectively by groups, but individual filesystems may be 'owned' and used by specific people, some of whom may not care unless their filesystems are right out of space). The first version's inheritance-based approach would have just fallen over in the face of this; a default alert level based on ownership has no 'is-sort-of-a' relationship with ZFS filesystems or pools at all.

I've always known that inheritance wasn't always the right answer, partly because I have the jaundiced C programmer's view of object orientation; OO's fundamental purpose is to make my code simpler, and if it doesn't do that I don't use it. In theory this should have made me skip inheritance here; in practice, inheritance was such an obvious and shiny hammer that once I saw some of it, I proceeded to hit all of my code with it no matter what.

(If nothing else, the whole experience serves me as a useful learning experience. Maybe the next time around I will more readily listen to the feeling that my code is awkward and maybe something is wrong.)

python/BlindedByInheritance written at 00:49:22
