The problem of paying too much attention to our dashboards
On Mastodon, I said:
Our Grafana dashboards are quite shiny, at least to me (since I built them), but I really should start resisting the compulsive urge to take a look at them all the time just to see what's going on and look at the pretty zigzagging lines.
I have a bad habit of looking at shiny things that I've put together, and dashboards are extremely shiny (even if some of them are almost all text). There are two problems with this, the obvious and the somewhat subtle.
The obvious problem is that, well, I'm spending my time staring somewhat mindlessly at pretty pictures. It's interesting to look at lines wiggle around or collections of numbers, but it's generally not informative. It's especially not informative for our systems because our systems spend almost all of their time working fine, which means that there is no actual relevant information to be had from all of these dashboards. In terms of what I spend (some) time on, I would be better off if we had one dashboard with one box that said 'all is fine'.
This is a general issue with dashboards for healthy environments; if things are fine, your dashboards are probably telling you nothing or at least nothing that is of general interest and importance.
(Your dashboards may be telling you details and at some point you may want access to those details, like how many email messages you typically have in your mailer queues, but they are not generally important.)
The more subtle problem is the general problem of metrics, which is a variant of Goodhart's law. Once you have a metric and you pay attention to the metric, you start to focus on the metric. If you have a dashboard of metrics, it's natural to pay attention to the metrics and to exceptions in the metrics, whether or not they actually matter. It may or may not matter that a machine has an unusually high load average, but if it's visible, you're probably going to focus on it and maybe dig into it. Perhaps there is a problem, but often there isn't, especially if you're surfacing a lot of things on your dashboards because they could be useful.
(One of the things behind this is that all measures have some amount of noise and natural variation, but as human beings we have a very strong drive to uncover patterns and meaning in what we see. If you think you see some exceptional pattern, it may or may not be real but you can easily spend a bunch of time trying to find out and testing theories.)
My overall conclusion from my own experiences with our new dashboards and metrics system is that if you have good alerts, you (or at least I) would be better off only looking at dashboards if there are indications of actual problems, or if you have specific questions you'd like to answer. In practice, trawling for 'is there anything interesting' in our dashboards is a great way to spend some time and divert myself down any number of alleyways, most of them not useful ones.
(In a way the worst times are the times when looking at our dashboards actually is useful, because that just encourages me to do it more.)
PS: This is not the first time I've seen the effects of something like this; I wrote about an earlier occasion way back in Metrics considered dangerous.
Understanding how to pull in labels from other metrics in Prometheus
Brian Brazil recently wrote Analyse a metric by kernel version, where he shows how to analyze a particular metric in a new way by, essentially, adding a label from another metric to the metric, in this case the kernel version. His example is a neat trick, but it's also reasonably tricky to understand how it works, so today I'm going to break it down (partly so that I can remember this in six months or a year from now, when my PromQL knowledge has inevitably rusted).
The query example is:
avg without (instance)( node_sockstat_TCP_tw * on(instance) group_left(release) node_uname_info )
The simple version of what's happening here is that because node_uname_info's value is always 1, we're using '*' as a do-nothing arithmetic operator so we can essentially do a join between node_sockstat_TCP_tw and node_uname_info to grab a label from the latter. We have to go to these lengths because PromQL does not have an explicit 'just do a join' operator that can be used with group_left.
There are several things in here. Let's start with the basic one, which is the '* on(instance)' portion. This is one to one vector matching with a restriction on what label is being used to match up pairs of entries; we're implicitly restricting the multiplication to pairs of entries with matching 'instance' labels. Normally 'instance' will be the same for all metrics scraped from a single host's exporter, so it makes a good label for finding the node_uname_info metric that corresponds to a particular host's node_sockstat_TCP_tw metric.
(We have to use 'on (...)' because not all labels match. After all, we're pulling in the 'release' label from the node_uname_info metric; if it was already available as a label on node_sockstat_TCP_tw, we wouldn't need to do this work at all.)
Next is the group_left, which is being used here for its side effect of incorporating the 'release' label from node_uname_info in the label set of the results. I wrote about the basics of group_left's operation in Using group_* vector matching for database lookups, where I used group_left basically as a database join between a disk space usage metric and an alert level metric that also carried an additional label we wanted to include for who should get alerted. Brian Brazil's overall query here is similar to my case, except that here we don't care about the value that the node_uname_info metric has; we are only interested in its 'release' label.
In an ideal world, we could express this directly in PromQL to say 'match between these two metrics based on instance and then copy the release label from the secondary one'. In this world, group_left and group_right have the limitation that they can only be used with arithmetic and comparison operators. In my earlier entry this wasn't a problem because we already wanted to compare the values of the two metrics. Here, we don't care about the value of node_uname_info at all.
Since we need an arithmetic or comparison operator in order to use group_left and we want to ignore the value of node_uname_info, we need an operator that will leave node_sockstat_TCP_tw's value unchanged. Because the value of node_uname_info is always 1, we can simply use '*', as multiplying by one will do nothing here.
(In theory we could instead use a comparison operator, which would leave node_sockstat_TCP_tw's value unchanged (more or less). However, in practice it's often tricky to find a comparison operator that will always be true. You might not have any sockets in TIME_WAIT, so a greater-than comparison could be false here, for example. Using an arithmetic operator that will have no effect is simpler.)
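To make the mechanics concrete, here is a minimal Python sketch of what the '* on(instance) group_left(release)' part does to label sets and values. This is a toy model under my own assumptions, not how Prometheus implements vector matching; the function name join_labels, the instance names, and the kernel release strings are all made up for illustration.

```python
def join_labels(left, right, on, include):
    """Toy model of `left * on(on...) group_left(include...) right`.

    Each vector is a list of (labels, value) pairs; labels is a dict.
    """
    # Index the "one" side by its matching labels. Many-to-one matching
    # needs each key to appear at most once on this side.
    index = {}
    for labels, value in right:
        key = tuple(labels.get(name) for name in on)
        if key in index:
            raise ValueError(f"duplicate right-hand series for {key}")
        index[key] = (labels, value)

    result = []
    for labels, value in left:
        key = tuple(labels.get(name) for name in on)
        if key not in index:
            continue  # no partner: this sample drops out of the result
        r_labels, r_value = index[key]
        merged = dict(labels)
        for name in include:
            if name in r_labels:
                merged[name] = r_labels[name]  # copy the extra label over
        result.append((merged, value * r_value))
    return result


# Made-up samples standing in for node_sockstat_TCP_tw and node_uname_info.
tcp_tw = [
    ({"instance": "a:9100"}, 12.0),
    ({"instance": "b:9100"}, 0.0),
    ({"instance": "c:9100"}, 7.0),  # no uname info scraped for this host
]
uname_info = [
    ({"instance": "a:9100", "release": "4.15.0-46-generic"}, 1.0),
    ({"instance": "b:9100", "release": "4.4.0-143-generic"}, 1.0),
]

joined = join_labels(tcp_tw, uname_info, on=["instance"], include=["release"])
for labels, value in joined:
    print(labels, value)
```

In this model the values come through unchanged because the right side is always 1, each result gains a 'release' label, and the host with no matching node_uname_info entry simply disappears from the output.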
The case of a secondary metric that's always 1 is the easy case, as we've seen. What about a secondary metric with a label you want that isn't necessarily always 1, and in fact may have an arbitrary value? Fortunately, Brian Brazil has provided the answer to that too. The simple but clever trick is to multiply the metric by zero and then add it:
node_sockstat_TCP_tw + on(instance) group_left(release) (node_uname_info * 0)
This works with arbitrary values; multiplying by zero turns the value for the right side to 0, and then adding 0 has no effect on node_sockstat_TCP_tw's value.
As a side note, this illustrates a good reason to have '1' be the value of any metric that exists to publish its labels, as is the case for node_uname_info or metrics that publish, say, the version of your program. The value these metrics have is arbitrary in one sense, but '1' is both conventional and convenient.
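The difference between the '*' join and the 'multiply by zero, then add' join can be sketched in Python. As before this is only a toy model of many-to-one matching, and the many_to_one helper, the instance names, and the secondary metric's values are hypothetical.

```python
import operator


def many_to_one(left, right, on, include, op):
    """Toy model of `left <op> on(...) group_left(...) right`."""
    # Assumes each key appears only once on the right-hand side.
    by_key = {tuple(lbls.get(n) for n in on): (lbls, val) for lbls, val in right}
    out = []
    for lbls, val in left:
        partner = by_key.get(tuple(lbls.get(n) for n in on))
        if partner is None:
            continue  # unmatched left-hand samples are dropped
        r_lbls, r_val = partner
        merged = {**lbls, **{n: r_lbls[n] for n in include if n in r_lbls}}
        out.append((merged, op(val, r_val)))
    return out


tcp_tw = [({"instance": "a:9100"}, 12.0), ({"instance": "b:9100"}, 3.0)]
# Hypothetical secondary metric whose values are arbitrary, not 1.
secondary = [
    ({"instance": "a:9100", "release": "4.15.0"}, 1554000000.0),
    ({"instance": "b:9100", "release": "4.4.0"}, 42.0),
]

# (secondary * 0): same labels, every value forced to zero.
zeroed = [(lbls, val * 0) for lbls, val in secondary]

# tcp_tw + on(instance) group_left(release) (secondary * 0)
result = many_to_one(tcp_tw, zeroed, ["instance"], ["release"], operator.add)

# A plain '*' join would instead scale by whatever the secondary metric holds.
scaled = many_to_one(tcp_tw, secondary, ["instance"], ["release"], operator.mul)
```

Here 'result' keeps the original 12.0 and 3.0 while picking up the 'release' label, and 'scaled' shows why a bare '*' is only safe when the secondary metric is always 1. One caveat that follows from ordinary floating point arithmetic is that a NaN or infinite value times zero is NaN, not zero.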
My new favorite tool for looking at TLS things is certigo
For a long time I've used the OpenSSL command line tools to do
things like looking at certificates and chasing
certificate chains (although OpenSSL is
no longer what you want to use to make self-signed certificates). This works, and is in many ways
the canonical and most complete way to do this sort of stuff, but
if you've ever used the
openssl command and its many sub-options
you know that it's kind of a pain in the rear. As a result of this,
for some years now I've been using Square's
certigo command instead.
Certigo has two main uses. My most common case is to connect to some TLS-using service to see what its active certificate and certificate chain is (and try to verify it), as well as some TLS connection details:
    $ certigo connect www.cs.toronto.edu:https
    ** TLS Connection **
    Version: TLS 1.2
    Cipher Suite: ECDHE_RSA key exchange, AES_128_GCM_SHA256 cipher

    ** CERTIFICATE 1 **
    Valid: 2018-04-17 00:00 UTC to 2020-04-16 23:59 UTC
    Subject: [...]
Certigo will attempt to verify the certificate's OCSP status, but some OCSP verifiers seem to dislike its queries. In particular, I've never seen it succeed with Let's Encrypt certificates; it appears to always report 'ocsp: error from server: unauthorized'.
(Some digging suggests that Certigo is getting this 'unauthorized' response when it queries the OCSP status of the intermediate Let's Encrypt certificate.)
Certigo can connect to things that need STARTTLS using a variety of protocols, including SMTP but unfortunately not (yet) IMAP. For example:
$ certigo connect -t smtp smtp.cs.toronto.edu:smtp
(Fortunately IMAP servers usually also listen on
imaps, port 993,
which is TLS from the start.)
My other and less frequent use of Certigo is to dump the details of a particular certificate that I have sitting around on disk, with 'certigo dump ...'. If you're dumping a certificate that's in anything except PEM format, you may have to tell Certigo what format it's in.
Certigo also has a 'certigo verify' operation that will attempt
to verify a certificate chain that you provide it (against a
particular host name). I don't find myself using this very much,
because it's not necessarily representative of what either browsers
or other sorts of clients are going to do (partly because it uses
your local OS's root certificate store, which is not necessarily
anything like what other programs will use). Generally if I want
to see a client-based view of how a HTTPS server's certificate chain
looks, I turn to the SSL server test from Qualys SSL Labs.
All Certigo sub-commands take a '-v' argument to make them report
more detailed things. Their normal output is relatively minimal,
although not completely so.
Certigo is written in Go and uses Go's standard libraries for TLS, which means that it's limited to the TLS ciphers that Go supports. As a result I tend to not pay too much attention to the initial connection report unless it claims something pretty unusual.
(It also turns out that you can get internal errors in Certigo if you compile it with the very latest development version of Go, which may have added TLS ciphers that Certigo doesn't yet have names for. The moral here is likely that if you compile anything with bleeding edge, not yet released Go versions, you get to keep both pieces if something breaks.)
What we'll want in a new Let's Encrypt client
Over on Twitter, I said:
It looks like we're going to need a new Let's Encrypt client to replace acmetool (which we love); acmetool uses the v1 API and seems to no longer be actively developed, and the v1 API runs into problems in November: <link: End of Life Plan for ACMEv1>
(There is an unfinished ACMEv2 branch of acmetool, but it comes with caveats. It would be ideal if the community stepped forward to continue acmetool development, but sadly I don't see signs of that happening so far and I can't help with such work myself.)
November is when Let's Encrypt will turn off new account registrations through ACMEv1, which is a problem for us because we don't normally re-use Let's Encrypt accounts (for good reasons, and because it's easier). So in November, we would stop being able to install acmetool on new machines without changing our procedures to deliberately reuse accounts. Since doing so would only prolong things, we should get a new client instead. As it happens, we would like something that is as close to acmetool as possible, because acmetool is basically how we want to handle things.
Rather than try to write a lot of words about why we like acmetool so much (with our custom configuration file), I think it's simpler to demonstrate it by showing you the typical install steps for a machine:
    apt-get install acmetool
    mkdir /var/lib/acme/conf
    cp <master>/responses /var/lib/acme/conf/
    acmetool quickstart
    acmetool want NAME1 ALIAS2 ...
(Alternately, we copy /var/lib/acme from the live version of the server. We may do both, using 'acmetool want' during testing and then overwriting it with the official version when we go to production.)
After this sequence, we have a new Let's Encrypt account, a cron job that automatically renews certificates at some random time of the day when they are 30 days (or less) from expiry, and a whole set of certificates, intermediate chains, and keys accessible through per-name paths under /var/lib/acme for NAME1, ALIAS2, and so on, with appropriate useful permissions (keys are root only normally, but everything else is generally readable). When a certificate is renewed, acmetool will reload or restart any potentially certificate-using service that is active on the machine.
If we want to add additional certificates for different names, that's another 'acmetool want NAME2' (and then the existing cron
job automatically renews them). All of this works on machines that
aren't running a web server as well as machines that are running a
properly configured one (and these days the Ubuntu 18.04 acmetool
package sets that up for Apache).
(We consider it a strong feature that acmetool doesn't otherwise attempt to modify the configurations of other programs to improve their ability to automatically do things with Let's Encrypt certificates.)
Acmetool accomplishes this with a certain amount of magic. Not only does it keep track of state (including what names you want certificates for, even if you haven't been able to get them yet), but it also has some post-issuance hook scripts that do that magic reloading. The reloading is blind (if you're running Apache, it gets restarted whether or not it's using TLS or acmetool's certificates), but this hasn't been a problem for us and it sure is convenient.
We can probably duplicate a lot of this by using scripts on top of some other client, such as lego. But I would like us to not need a collection of home-grown scripts (and likely data files) to mimic the simplicity of operation that acmetool provides. Possibly we should explore Certbot, the more or less officially supported client, despite my long-ago previous experiences with it as a heavyweight, complex, and opinionated piece of software that wanted to worm its way into your systems. Certbot seems like it supports all of what we need and can probably be made to cooperate, and it has a very high chance of continuing to be supported in the future.
(A lot of people like minimal Let's Encrypt clients that leave you to do much of the surrounding work yourself. We don't, partly because such additional work adds many more steps to install instructions and opens the door to accidents like getting a certificate but forgetting to add a cron job that renews it.)
(My only experiments with Certbot were so long ago that it wasn't called 'certbot' yet. I'm sure that a lot has changed since then, and that may well include the project's focus. At the time I remember feeling that the project was very focused on people who were entirely new to TLS certificates and needed a great deal of hand-holding and magic automation, even if that meant Certbot modifying their system in all sorts of nominally helpful ways.)
Some implications of using offset instead of delta() in Prometheus
I previously wrote about how delta() can be inferior to subtraction with offset, because delta() has to load the entire range of metric points and offset doesn't. In light of the issue I ran into recently with stale metrics and range queries, there turn out to be some implications and complexities of using offset in place of delta(), even if it lets you make queries that you couldn't otherwise make.
Let's start with the basics, which is that 'delta(mymetric[30d])' can theoretically be replaced with 'mymetric - mymetric offset 30d' to get the same result with far fewer metric points having to be loaded by Prometheus. This is an important issue for us, because we have some high-cardinality metrics that it turns out we want to query over long time scales like 30 or 90 days.
The first issue with the offset replacement is what happens when a particular set of labels for the metric didn't exist 30 days ago. Just like PromQL boolean operators, PromQL math operators on vectors are filters, so you'll ignore all current metric points for mymetric that didn't exist 30 days ago. The fix for this is the inverse of ignoring stale metrics:
(mymetric - mymetric offset 30d) or mymetric
Here, if mymetric didn't exist 30 days ago we implicitly take its starting value as 0 and just consider the delta to be the current value of mymetric. Under some circumstances you may want a different delta value for 'new' metrics, which will require a different expression.
The inverse of the situation is metric labels that existed 30 days ago but don't exist now. As we saw in an earlier entry, the range query in the delta() version will include those metrics, so they will flow through to the delta() calculation and be included in your final result set. Although the documentation for delta() sort of claims otherwise, the actual code implementing delta() reasonably doesn't currently extrapolate samples that start and end significantly far away from the full time range, so the delta() result will probably be just the change over the time series points available. In some cases this will go to zero, but in others it will be uninteresting and you would rather pretend that the time series is now 0. Unfortunately, as far as I know there's no good way to do that.
If you only care about time series (ie label sets) that existed at the start of the time period, I think you can extend the previous case to:
((mymetric - mymetric offset 30d) or mymetric) or -(mymetric offset 30d)
(As before, this assumes that a time series that disappears is implicitly going to zero.)
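How these three variants treat appearing and disappearing time series can be sketched in Python. This is only a toy model of PromQL's vector '-', 'or', and unary minus, with each instant vector reduced to a dict from a label value to a sample value; the helper names and the 'alice', 'bob', and 'carol' series are invented for the example.

```python
def sub(a, b):
    """Model `a - b`: only label sets present on both sides survive."""
    return {key: a[key] - b[key] for key in a if key in b}


def or_(a, b):
    """Model `a or b`: everything in a, plus b's entries that a lacks."""
    out = dict(a)
    for key, value in b.items():
        out.setdefault(key, value)
    return out


def neg(a):
    """Model unary minus applied to a whole vector."""
    return {key: -value for key, value in a.items()}


# 'now' stands in for mymetric, 'then' for mymetric offset 30d.
now = {"alice": 120.0, "carol": 30.0}   # carol is new
then = {"alice": 100.0, "bob": 50.0}    # bob has since disappeared

# mymetric - mymetric offset 30d
plain = sub(now, then)

# (mymetric - mymetric offset 30d) or mymetric
with_new = or_(sub(now, then), now)

# ((mymetric - mymetric offset 30d) or mymetric) or -(mymetric offset 30d)
with_gone = or_(with_new, neg(then))

print(plain)      # only alice, the one series present at both ends
print(with_new)   # adds carol, with her current value as the delta
print(with_gone)  # adds bob, with his old value negated
```

In this model the plain subtraction silently loses both carol and bob, the one-'or' version recovers carol, and the two-'or' version also recovers bob as a drop to zero. The order of the 'or' clauses matters, since the first clause that has a given label set wins.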
If you care about time series that existed in the middle of the time range but not at either the beginning or the end, I think you're out of luck. The only way to sweep those up is a range query with delta(), which runs the risk of a 'too many metric points loaded' error.
Unfortunately all of this is increasingly verbose, especially if
you're using label matches restricting
mymetric to only some
values (because then you need to propagate these label restrictions
into at least the 'or' clauses). It's a pity that PromQL doesn't
have any function to do this for us.
I also have to modify something I said in my first entry on delta(). Given all of these issues with appearing and disappearing time series, it's clear that changing delta() to not require the entire range is not as simple as it looks. It would probably require some deep hooks into the storage engine to say 'we don't need all the points, just the start and the end points and their timestamps', and that stuff would only be useful for gauges (since counters already have to load the entire range set and sweep over it looking for counter resets).
In our current usage we care more about how the current metrics got
there than what the situation was in the past; we are essentially
looking backward to ask what disk space usage grew or shrank. If
some past usage went to zero and disappeared, it's okay to exclude
it entirely. There are some potentially tricky cases that might
cause me to rethink that someday, but for now I'm going to use the
shorter version that only has one 'or', partly because Grafana makes
it a relatively large pain to write complicated PromQL queries.