2018-03-16
Wrestling with metrics to get meaningful, useful ones
I'm currently working on hacking together something
to show us useful information about the most active NFS filesystems
on a client (what I called nfsiotop
in yesterday's entry). Linux has copious per-mount statistics and the program that I
started from already read them all, so a great deal of what I've
been doing has been wrestling with the available raw data to come
up with useful metrics and to figure out good ways of displaying them.
This is a common experience; I have some version of it almost every
time I wind up trying to boil a flood of raw data down to some
useful summaries of it.
The first part of this wrestling is just figuring out what pieces
of the raw data are even useful in practice. Looking at the actual
data on live systems always produces a certain number of surprises;
for example, one promising-looking field turned out to be zero on
all of our systems. Others can just be too noisy, not quite mean
what you understood them to mean, or not behave the way you thought
they would when the system is under load or otherwise in
an interesting state. One common thing to discover is that in
practice, certain detailed breakdowns in the raw data aren't
interesting and you actually want much more aggregated versions
(then you get to figure out how to aggregate in useful ways that
still keep things meaningful). In the specific case of Linux NFS
filesystem statistics, you could present various data separately
for each different NFS operation, but you don't really want to; you
probably don't care about, for example, how many MKDIR
operations
a second were done on the filesystem. At the same time you might
care about some broad categories since different NFS operations
have different impacts on the server.
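As an illustration of the kind of aggregation I have in mind, here is a hypothetical Python sketch that collapses per-operation counts into a few broad categories. The operation names are ordinary NFS ones, but the particular groupings are my own invention for illustration, not something the kernel statistics define.

    # Hypothetical sketch: collapse per-operation NFS counts into a few
    # broad categories; the groupings are illustrative choices only.
    READ_OPS = {"READ", "READDIR", "READDIRPLUS", "READLINK"}
    WRITE_OPS = {"WRITE", "COMMIT"}
    META_OPS = {"GETATTR", "SETATTR", "LOOKUP", "ACCESS", "CREATE",
                "MKDIR", "REMOVE", "RMDIR", "RENAME"}

    def aggregate(op_counts):
        """Turn an {op name: count} dict into read/write/meta/other totals."""
        totals = {"read": 0, "write": 0, "meta": 0, "other": 0}
        for op, count in op_counts.items():
            if op in READ_OPS:
                totals["read"] += count
            elif op in WRITE_OPS:
                totals["write"] += count
            elif op in META_OPS:
                totals["meta"] += count
            else:
                totals["other"] += count
        return totals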
The second part of this wrestling is figuring out if I can use some tempting piece of raw data in a useful and meaningful way, and the mirror image of this: whether there is some way to torture the raw data that I have so that it creates a useful metric that I really want. There are a great many metrics you can calculate from raw statistics, but a lot of them don't necessarily mean anything much or can be misleading. It's tempting to believe that a particular calculation you've come up with means something useful, especially if it seems to correlate with load or some other interesting state, but it isn't necessarily so. I find it all too easy to have my desire for a particular useful metric wind up blinding me to the flaws in what I'm calculating; I want to believe that I've come up with a clever trick to give me something I want, even if I haven't.
(I'm very aware of this since years ago I wound up being quite
annoyed that Linux's iostat
was confidently presenting a metric
that was very desirable but couldn't actually be calculated accurately
from the available information (see here).
I don't want to do that to myself in my own tools; if I print out
a metric, I want it to be meaningful, useful, and not misleading.)
For a concrete example of this, let's talk about a hypothetical 'utilization' metric for NFS mounts, by analogy to the utilization stat for disks, where 100% utilization of an NFS mount would mean that there was always at least one outstanding NFS operation during the particular time period. Utilization is nice because it tells you more about how busy something is than a raw operation count does. Is 100 operations a second busy or nothing? It depends on how fast the server responds, how many operations you issue in parallel, and so on.
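To make that concrete with some purely made-up numbers: 100 operations a second that each take 1 millisecond add up to only 100 milliseconds of waiting, so the mount is busy at most 10% of the time; 100 operations a second that each take 50 milliseconds add up to 5 seconds of waiting per second of wall clock time, which is only possible if a lot of them are outstanding in parallel.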
The current Linux kernel NFS client statistics don't directly expose enough data to generate this number. But they do expose the total cumulative time spent waiting for the server to reply to each request (you have to sum it up from each separate NFS operation, but it's there). Is it meaningful to compare this total time to the time period and compute, say, a ratio or a percentage? On the one hand, if the total cumulative time is less than the time period, your utilization has to be under 100%; if you spent only half a second waiting for all operations issued over a second, then at least half of the time there had to be nothing outstanding. On the other hand, a high cumulative time doesn't necessarily mean high utilization, because you can easily have multiple outstanding requests that the server processes in parallel.
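As a sketch of what 'summing it up' looks like in practice, here is an illustrative Python fragment (not my actual code). It assumes the usual per-op line layout in /proc/self/mountstats, where the seventh counter after the operation name is that operation's cumulative RTT in milliseconds; that's my reading of the format, so treat it as an assumption.

    def total_rtt_ms(mount_point):
        """Sum the cumulative per-operation RTT (in ms) for one NFS mount."""
        total = 0
        in_mount = False
        in_per_op = False
        with open("/proc/self/mountstats") as f:
            for line in f:
                if line.startswith("device "):
                    in_mount = (" " + mount_point + " ") in line
                    in_per_op = False
                elif in_mount and line.strip() == "per-op statistics":
                    in_per_op = True
                elif in_mount and in_per_op:
                    fields = line.split()
                    # per-op lines look like 'READ: <numeric counters>';
                    # the seventh counter is the cumulative RTT in ms.
                    if len(fields) >= 8 and fields[0].endswith(":"):
                        total += int(fields[7])
        return total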
Let's call the ratio of cumulative time to elapsed time the 'saturation'. This metric does mean something, but it may not be useful and it may be misleading. How do we want to present it, if we present it at all? As a percentage clamped to 100%? As a percentage that can go above 100%? As a raw ratio? Is it mostly useful if it's below 100%, because then it's clearly signaling that we can't possibly have 100% utilization, or is it meaningful to see how much over 100% it goes? I don't currently have answers for any of these questions.
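For what it's worth, the mechanics of computing and printing this figure are trivial; it's the interpretation that isn't. Here is a hypothetical continuation of the sketch above, using the total_rtt_ms() helper and a placeholder mount point, that shows the raw ratio alongside the clamped and unclamped percentage presentations:

    import time

    interval = 5.0                      # seconds between snapshots
    before = total_rtt_ms("/mnt/nfs")   # '/mnt/nfs' is a placeholder
    time.sleep(interval)
    after = total_rtt_ms("/mnt/nfs")

    saturation = (after - before) / (interval * 1000.0)
    print("raw ratio:  %.3f" % saturation)
    print("clamped:    %5.1f%%" % (min(saturation, 1.0) * 100.0))
    print("unclamped:  %5.1f%%" % (saturation * 100.0))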
All of this is typical of the sort of wrestling with metrics that I wind up doing. I work out some metrics, I fiddle around with printing them in various ways, I try to see if they tell me things that look useful when I know that various sorts of load or stress are happening, and then I try to convince myself that they mean something and I'm not fooling myself.
PS: After you've convinced yourself that a metric means something (and what it means, and why), do write it all down in a comment in the code to capture the logic before it falls out of your head. And by 'you' I mean 'me'.