There are multiple uses for metrics (and collecting metrics)

May 23, 2024

In a comment on my entry on the overhead of the Prometheus host agent's 'perf' collector, a commentator asked a reasonable question:

Not to be annoying, but: is any of the 'perf data' you collect here honestly 'actionable data' ? [...] In my not so humble opinion, you should only collect the type of data that you can actually act on.

It's true that the perf data I might collect isn't actionable data (and thus not actionable metrics), but in my view this is far from the only reason to collect metrics. I can readily see at least three or four different reasons to collect metrics.

The first and obvious purpose is actionable metrics, things that will get you to do things, often by triggering alerts. This can be the metric by itself, such as free disk space on the root of a server (or the expiry time of a TLS certificate), or the metric in combination with other data, such as detecting that the DNS SOA record serial number for one of your DNS zones doesn't match across all of your official DNS servers.
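(As a purely illustrative sketch and not our actual configuration, alerts along these lines might look something like this as Prometheus alerting rules. The filesystem metrics are standard host agent ones; 'dns_soa_serial' is a hypothetical metric that something would have to export, per zone and per DNS server, with the zone's SOA serial as its value.)

    groups:
      - name: actionable-examples
        rules:
          # Alert when the root filesystem drops below 10% free space.
          - alert: RootDiskSpaceLow
            expr: |
              node_filesystem_avail_bytes{mountpoint="/"}
                / node_filesystem_size_bytes{mountpoint="/"} < 0.10
            for: 15m
          # Alert when the official DNS servers disagree about a zone's
          # SOA serial (more than one distinct 'dns_soa_serial' value).
          - alert: DNSZoneSerialMismatch
            expr: |
              count by (zone) (count_values by (zone) ("serial", dns_soa_serial)) > 1
            for: 30m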

The second reason is to use the metrics to help understand how your systems are behaving; here your systems might be either physical (or at least virtual) servers, or software systems. Often a big reason to look at this information is because something mysterious happened and you want to look at relatively detailed information on what was going on at the time. While you could collect this data only when you're trying to better understand ongoing issues, my view is that you also want to collect it when things are normal so that you have a baseline to compare against.

(And since sometimes things go bad slowly, you want to have a long baseline. We experienced this with our machine room temperatures.)

Sometimes, having 'understanding' metrics available will allow you to head off problems beforehand, because metrics that you thought were only going to be for understanding problems as and after they happened can be turned into warning signs of a problem so you can mitigate it. This happened to us when server memory usage information allowed us to recognize and then mitigate a kernel memory leak (there was also a case with SMART drive data).
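(As a hedged sketch of what that kind of check might look like, rather than anything we actually alert on: the host agent's meminfo collector exposes kernel slab memory as 'node_memory_Slab_bytes', and you can ask Prometheus whether it has been creeping steadily upward. The instance label here is just a placeholder.)

    # Roughly how fast kernel slab memory is growing on one server, based on
    # the last day of data; a slope that stays positive for days on end is a
    # hint of a slow kernel memory leak.
    deriv(node_memory_Slab_bytes{instance="someserver:9100"}[1d])

    # Or compare against a week ago to see the absolute growth.
    node_memory_Slab_bytes - node_memory_Slab_bytes offset 1w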

The third reason is to understand how (and how much) your systems are being used and how that usage is changing over time. This is often most interesting when you look at relatively high-level metrics instead of what are effectively low-level metrics from the innards of your systems. One popular sub-field of this is projecting future resource needs, both hardware-level things like CPU, RAM, and disk space and larger-scale things like the likely future volume of requests and other actions your (software) systems may be called on to handle.
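(For the resource projection side, a minimal sketch, again using the standard host agent filesystem metrics and an illustrative '/data' mountpoint, is to linearly extrapolate recent history and see if you run out within your planning horizon:)

    # Will /data run out of space within the next 90 days if the trend of
    # the last 30 days continues? Returns a result for filesystems that will.
    predict_linear(node_filesystem_avail_bytes{mountpoint="/data"}[30d], 90 * 24 * 3600) < 0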

(Both of these reasons can combine in exploring casual questions about your systems, questions that are enabled by having metrics available.)

A fourth semi-reason to collect metrics is as an experiment, to see if they're useful or not. You can usually tell what are actionable metrics in advance, but you can't always tell what will be useful for understanding your various systems or understanding how they're used. Sometimes metrics turn out to be uninformative and boring, and sometimes metrics turn out to reveal surprises.

My impression of the modern metrics movement is that the general wisdom is to collect everything that isn't too expensive (either to collect or to store), because more data is better than less data and you're usually not sure in advance what's going to be meaningful and useful. You create alerts carefully and to a limited extent (and in modern practice, often focusing on things that people using your services will notice), but for the underlying metrics, the more the better, potentially.


Comments on this page:

By Anonymous at 2024-05-24 09:02:43:

I tend to agree with a lot of your points: you want to have historic data that goes back 'far enough', and you want to have a baseline to know how the data looked when 'everything was still ok'. But I do feel that your examples of 'machine room temperature', 'kernel memory leak' and 'system usage (CPU, RAM, disk space)' (your 2nd and 3rd reasons) still are or turned out to be actionable data.

And although it is true that you do not always know what data you are going to need in advance, I'm also not sure I agree with the apparent modern general wisdom that the more metrics you have, the better it is. You will reach a point where you will be drowning in data, and focus on or be lost in totally irrelevant data just because 'something' for a specific metric has changed between then and now (without it having any impact on the reason you are now looking at the metrics).

Oh, well. Agree to disagree ? Thank you for taking the time to write this post as a response to my comment on your earlier post. And thank you for all the other posts, I tend to check out your blog on a daily basis.

It seems like there's significant overlap between deciding "how many metrics to collect" and "how much to log." You can log too much, and lose track of the wheat in all the chaff coming through. But it's also extremely easy to log too little, leaving you (to stretch the metaphor) hungry in an emergency-response situation.

The latter is usually where I find myself, even after lots of experience in writing good log messages for what I expect to be impossible errors.

I guess the crux of the problem is the stakes: if you collect a few too many metrics and they don't cost too much (to collect or to store), the downside is much smaller than that of not collecting metrics you later find out were necessary.

By cks at 2024-05-24 12:11:56:

Sorry, I was unclear in some pieces. Machine room temperature is obviously actionable data (we alert if it gets too high), and we consider system CPU usage to be actionable in some situations. Kernel memory is 'actionable' if you can detect a leak through it, so perhaps it's actionable in a broad sense, although we don't try to alert on it.

A concrete example of a metric for understanding that I don't think we could ever alert on is detailed histograms of disk IO latency; another is various detailed ZFS metrics, such as ones that report on the ZFS ZIL. These are only 'actionable' in a broad sense in that if they give us a better understanding of what's going on or going wrong and we can do something about it, we will. There are a lot of system metrics more or less like this, at least for us (another example is network bandwidth levels).
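(To give a rough, hypothetical idea of how such a histogram gets used in ad-hoc queries rather than alerts, assuming an illustrative 'disk_write_latency_seconds' histogram metric, you might ask Prometheus for a percentile over a recent window:)

    # 95th percentile write latency per device over the last five minutes,
    # using a hypothetical 'disk_write_latency_seconds' histogram metric.
    histogram_quantile(0.95, sum by (device, le) (rate(disk_write_latency_seconds_bucket[5m])))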

A concrete example of a usage metric is the metrics we gather on the usage of our VPNs and our SLURM cluster. These will never trigger alerts and outside of weird situations they're not something we'll ever look at to diagnose problems.

By cks at 2024-05-24 12:22:33:

One thing that strongly influences my views on collecting metrics is that with modern metrics stacks, what you collect doesn't have to be what you visualize. I believe that our dashboards are unusually dense and cluttered, and we still collect many more metrics into Prometheus than are on any dashboard. Some of them will eventually wind up being looked at in ad-hoc queries; others may never be looked at at all.

(Metrics are generally easier than logs in this, in that extra logs get in the way of queries more than extra metrics.)

By Anonymous at 2024-05-24 12:55:54:

Again, thanks for the reply. Just a quick few thoughts on your last comment, and then I'll crawl back under my rock:

I have absolutely zero knowledge about ZFS (despite all your posts on the subject on this very blog), so I really cannot comment on that. But 'network bandwidth levels' seems to be sort-ish actionable (in the sense of when the max capacity of your NICs is reached), by adding additional NICs for 'NIC trunking/teaming' (not sure if this is a common practice anymore these days) or by 'horizontal scaling': adding more instances/VMs/machines of your software and then load balancing across these instances (assuming the software you are using allows for this). Perhaps even the usage of your VPN can be monitored for reaching 'max capacity' (however you choose to define that), and then you decide to scale up/out.

You may not want to alert for these in the sense that the designated stand-by gets called out of bed at 3 in the morning, but perhaps you do in the sense of 'we need to buy/add/configure' additional resources to deal with the increased demand/load.

Anyway, the only thing that really triggered me was the examples of the perf metrics you mentioned; it made me think about past endless and pointless discussions with colleagues about how many 'context switches per second' was acceptable, and what to do about it or could be done about it at all.

I don't know. Perhaps I'm just getting old ;).

By Anonymous at 2024-05-24 13:09:47:

And perhaps I meant 'EtherChannel' there, instead of 'trunking/teaming'. The network side of things has always been my weak spot. Oh, well.
