There are multiple uses for metrics (and collecting metrics)
In a comment on my entry on the overhead of the Prometheus host agent's 'perf' collector, a commentator asked a reasonable question:
Not to be annoying, but: is any of the 'perf data' you collect here honestly 'actionable data' ? [...] In my not so humble opinion, you should only collect the type of data that you can actually act on.
It's true that the perf data I might collect isn't actionable data (and thus not actionable metrics), but in my view this is far from the only reason to collect metrics. I can readily see at least three or four different reasons to do so.
The first and obvious purpose is actionable metrics, things that will get you to do things, often by triggering alerts. This can be the metric by itself, such as free disk space on the root of a server (or the expiry time of a TLS certificate), or the metric in combination with other data, such as detecting that the DNS SOA record serial number for one of your DNS zones doesn't match across all of your official DNS servers.
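To make this concrete, here is a rough sketch of what such actionable metrics can look like as Prometheus alerting rules. The metric names assume the standard node exporter and blackbox exporter, and the thresholds and durations are arbitrary illustrations, not anything we actually use:

    groups:
      - name: example-actionable-alerts
        rules:
          # Fire when the root filesystem drops below 10% free space
          # (the node_filesystem_* metrics come from the node exporter).
          - alert: RootFilesystemLow
            expr: |
              node_filesystem_avail_bytes{mountpoint="/"}
                / node_filesystem_size_bytes{mountpoint="/"} < 0.10
            for: 15m
            labels:
              severity: warning
            annotations:
              summary: "Root filesystem on {{ $labels.instance }} is under 10% free"

          # Fire when a probed TLS certificate is within 14 days of expiring
          # (probe_ssl_earliest_cert_expiry comes from the blackbox exporter).
          - alert: TLSCertificateExpiringSoon
            expr: probe_ssl_earliest_cert_expiry - time() < 14 * 24 * 3600
            for: 1h
            labels:
              severity: warning
            annotations:
              summary: "TLS certificate on {{ $labels.instance }} expires in under 14 days"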
The second reason is to use the metrics to help understand how your systems are behaving; here your systems might be either physical (or at least virtual) servers, or software systems. Often a big reason to look at this information is that something mysterious happened and you want relatively detailed information on what was going on at the time. While you could collect this data only when you're trying to better understand ongoing issues, my view is that you also want to collect it when things are normal so that you have a baseline to compare against.
(And since sometimes things go bad slowly, you want to have a long baseline. We experienced this with our machine room temperatures.)
Sometimes, having 'understanding' metrics available will let you head off problems beforehand, because metrics that you thought would only be useful for understanding problems after they happened can turn out to be warning signs of a problem, letting you mitigate it. This happened to us when server memory usage information allowed us to recognize and then mitigate a kernel memory leak (there was also a case with SMART drive data).
The third reason is to understand how (and how much) your systems are being used and how that usage is changing over time. This is often most interesting when you look at relatively high-level metrics instead of what are effectively low-level metrics from the innards of your systems. One popular sub-field of this is projecting future resource needs, both hardware-level things like CPU, RAM, and disk space and larger-scale things like the likely future volume of requests and other actions your (software) systems may be called on to handle.
(Both of these reasons can combine in exploring casual questions about your systems, questions that are enabled by having metrics available.)
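For the projection side of this, metrics systems often let you extrapolate trends directly from data you're already collecting. As another hedged sketch (again assuming node exporter metric names and arbitrary time horizons, not our actual configuration), a Prometheus rule can warn when the recent trend of free disk space predicts running out within a few days:

    # A fragment of an alerting rule: fire if a linear extrapolation of the
    # last six hours of free space says the root filesystem will fill up
    # within four days.
    - alert: RootFilesystemFillingUp
      expr: predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 4 * 24 * 3600) < 0
      for: 1h
      labels:
        severity: warning
      annotations:
        summary: "Root filesystem on {{ $labels.instance }} is predicted to fill up within four days"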
A fourth semi-reason to collect metrics is as an experiment, to see if they're useful or not. You can usually tell in advance which metrics will be actionable, but you can't always tell what will be useful for understanding your various systems or understanding how they're used. Sometimes metrics turn out to be uninformative and boring, and sometimes they turn out to reveal surprises.
My impression of the modern metrics movement is that the general wisdom is to collect everything that isn't too expensive (either to collect or to store), because more data is better than less data and you're usually not sure in advance what's going to be meaningful and useful. You create alerts carefully and to a limited extent (and in modern practice, often focusing on things that people using your services will notice), but for the underlying metrics, the more the (potentially) better.