2020-03-30
It's worth documenting the obvious (before it stops being obvious)
I often feel a little bit silly when I write entries about things like making bar graphs in Grafana or tags for Grafana dashboard variables because when I write them up it's all pretty straightforward and even obvious. This is an illusion. It's all straightforward and obvious to me right now because I've been in the middle of doing this with Grafana, and so I have a lot of context and contextual knowledge. Not only do I know how to do things, I also know what they're called and roughly where to find information about them in Grafana's official documentation. All of this is going to fade away over time, as I stop making and updating our Grafana dashboards.
Writing down these obvious things has two uses. First and foremost, I'll have specific documentation for when I want to do this again in six months or a year or whatever (provided that I can remember that I wrote some entries on this and that I haven't left out crucial context, which I've done in the past). Second, actually writing down my own documentation forces me to understand things more thoroughly and hopefully helps fix them more solidly in my mind, so perhaps I won't even need my entries (or at least not need them so soon).
There's a lot of obvious things and obvious context that we don't document explicitly (in our worklog system or otherwise), which I've noticed before. Some of those obvious things don't really need to be documented because we do them all of the time, but I'm sure there's other things I'm dealing with right now that I won't be in six months. And even for the things that we do all the time, maybe it wouldn't hurt to explicitly write them up once (or every so often, or at least re-check the standard 'how we do X' documentation every so often).
(Also, just because we do something all the time right now doesn't mean we always will. What we do routinely can shift over time, and we won't even necessarily directly notice the shift; it may just slowly be more and more of this and less of that. Or perhaps we'll introduce a system that automates a lot of something we used to do by hand.)
The other side of this, and part of why I'm writing this entry, is that I shouldn't feel silly about documenting the obvious, or at least I shouldn't let that feeling stop me from doing it. There's value in doing it even if the obvious remains obvious to me, and I should keep on doing a certain amount of it.
(Telling myself not to feel things is probably mostly futile. Humans are not rational robots, no matter how much we tell ourselves that we are.)
Notes on Grafana 'value groups' for dashboard variables
Suppose, not hypothetically, that you have some sort of Grafana overview dashboard that can show you multiple hosts at once in some way. In many situations, you're going to want to use a Grafana dashboard variable to let you pick some or all of your hosts. If you're getting the data for what hosts should be in your list from Prometheus, often you'll want to use label_values() to extract the data you want. For example, suppose that you have a label field called 'cshost' that is your local short host name for a host. Then a plausible Grafana query for 'all of our hosts' for a dashboard variable would be:
label_values( node_load1, cshost )
(Pretty much every Unix that the Prometheus host agent runs on will supply a load average, although they may not supply other metrics.)
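As a rough mental model (not Grafana's actual implementation), label_values(metric, label) amounts to collecting the distinct values of one label across every series of a metric. The sketch below imitates that in Python over a hand-written list of series; the host names, the second metric, and the data layout are all made up for illustration, and the sorting is only there to give a stable result.

```python
# Toy model of label_values(metric, label); SERIES is invented sample data.
SERIES = [
    {"__name__": "node_load1", "cshost": "cpunode2"},
    {"__name__": "node_load1", "cshost": "apps0"},
    {"__name__": "node_load1", "cshost": "cpunode2"},  # same host seen twice
    {"__name__": "node_memory_MemTotal_bytes", "cshost": "fs1"},
]


def label_values(series, metric, label):
    """Return the sorted, de-duplicated values of `label` on series named `metric`."""
    return sorted(
        {s[label] for s in series if s["__name__"] == metric and label in s}
    )


hosts = label_values(SERIES, "node_load1", "cshost")
print(hosts)
```

Only hosts that actually report node_load1 show up, which is why picking a metric that essentially every host supplies matters.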
However, if you have a lot of hosts, this list can be overwhelming, and you may also have sub-groupings of hosts (such as all SLURM nodes) that you want to make it convenient to narrow down to. To support this, Grafana has a dashboard variable feature called value groups or just 'tags'. Value groups are a bit confusing and aren't as well documented as dashboard variables as a whole.
There are two parts to setting up a value group; you need a query that will give Grafana the names of all of the different groups (aka tags), and then a second query that will tell Grafana which hosts are in a particular group. Suppose that we have a metric to designate which classes a particular host is in:
cslab_class{ cshost="cpunode2", class="comps" } 1
cslab_class{ cshost="cpunode2", class="slurmcpu" } 1
cslab_class{ cshost="cpunode2", class="c6220" } 1
We can use this metric for both value group queries. The first query is to get all the tags, which are all the values of class:
label_values( cslab_class, class )
Note that we don't have to de-duplicate the result; Grafana will do that for us (although we could do it ourselves if we wanted to make a slightly more complex query).
The second query is to get all of the values for a particular group (or tag), which is to say the hosts for a specific class. In this query, we have a special Grafana-provided $tag variable that refers to the current class, so our query is now for the cshost label for things with that class:
label_values( cslab_class{ class="$tag" }, cshost )
It's entirely okay for this query to return some additional hosts (values) that aren't in our actual dashboard variable; Grafana will quietly ignore them for the most part.
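To make the relationship between the two queries concrete, here is a small Python sketch of how I think of it. This is a simplified model, not how Grafana is actually implemented; the sample series, the host names other than cpunode2, and the function names are all invented for illustration.

```python
# Toy model of value groups (tags); all sample data here is invented.
CLASS_SERIES = [
    {"cshost": "cpunode2", "class": "comps"},
    {"cshost": "cpunode2", "class": "slurmcpu"},
    {"cshost": "cpunode3", "class": "slurmcpu"},
    {"cshost": "oldnode1", "class": "slurmcpu"},  # not a dashboard variable value
    {"cshost": "apps0", "class": "comps"},
]
# Stand-in for the result of the main dashboard variable query.
VARIABLE_VALUES = ["apps0", "cpunode2", "cpunode3"]


def tags_query(series):
    """Stand-in for label_values(cslab_class, class): the distinct group names."""
    return sorted({s["class"] for s in series})


def tag_values_query(series, tag):
    """Stand-in for label_values(cslab_class{class="$tag"}, cshost)."""
    return sorted({s["cshost"] for s in series if s["class"] == tag})


def select_by_tag(series, variable_values, tag):
    """Hosts picked when a tag is chosen: group members that are also variable values."""
    members = set(tag_values_query(series, tag))
    return [v for v in variable_values if v in members]


print(tags_query(CLASS_SERIES))
print(select_by_tag(CLASS_SERIES, VARIABLE_VALUES, "slurmcpu"))
```

In this model, oldnode1 is returned by the second query but is dropped because it isn't one of the dashboard variable's values, which mirrors the "quietly ignored" behavior described above.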
Although you'll often want to use the same metric in both queries, it's not required. Both queries can be arbitrary and don't have to be particularly related to each other. However, the results from the second query do have to exactly match the values you have in the dashboard variable itself. Unfortunately you don't have regexp rewriting for your results the way you do for the main dashboard variable query, so with Prometheus you may need to do some rewriting in the query itself using label_replace().
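As an illustration of the sort of rewriting this means, the following Python sketch imitates what Prometheus's label_replace() does to a single series' labels. It assumes a hypothetical situation where the group data only carries a fully qualified 'instance' label while the dashboard variable uses short host names; the label value and the regular expression are made up. Note that Prometheus writes capture groups in the replacement as $1, where Python's re module uses \1.

```python
import re


def label_replace(labels, dst, replacement, src, regex):
    """Imitate PromQL label_replace() on one series' label set.

    The regex must match the whole source value (Prometheus anchors it).
    If it doesn't match, the labels come back unchanged.
    """
    match = re.fullmatch(regex, labels.get(src, ""))
    if match is None:
        return dict(labels)
    rewritten = dict(labels)
    rewritten[dst] = match.expand(replacement)
    return rewritten


# Hypothetical series that has only a fully qualified instance label.
series = {"instance": "cpunode2.example.org:9100", "class": "slurmcpu"}
fixed = label_replace(series, "cshost", r"\1", "instance", r"([^.:]+).*")
print(fixed["cshost"])
```

The point is that the short name has to be manufactured inside the query, because nothing after the query will trim the result for you.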
Also, there's no preview of what value groups (tags) your query
generates, or what values are in what groups; you have to go play
around with the dashboard to see what you get.