Having metrics has taught me I didn't really know how our systems behave
I'll start with a story.
I tweeted some screenshots from our Prometheus setup recently. As it happens, there's something interesting going on in the first one, which you can find a 1920x1010 'full sized' PNG of here. This is a dashboard of one of our Amanda backup servers during backups and thus as it writes data to the disk drives we use as 'tapes'. If you look at the 'disk write bandwidth' graph, you can see the steadily declining write rate to the yellow disk; it drops from about 125 MBytes/sec at the start of writing data to just under 100 MBytes/sec by the end. The disk is 100% utilized during these write-out periods where its bandwidth slowly drops, and over the same time the average write time also goes up. All of this is consistent; it happens on all of our backup servers and on all of our disks.
I don't know what causes any of this or what it means, or if it's important. But what I do know is we had no idea that this was happening until we looked at the metrics. And it's not the only thing about our systems that I've discovered by building dashboards and poking around in them. Over and over I've discovered things from our new metrics system, for example that our FTP area is surprisingly active.
What building a metrics system and using it has taught me is that I didn't really know how our systems behaved. I had some idea of the large scale, obvious things, and for some issues I had looked in specific even before we built our metrics system, but beyond that I was mostly ignorant. This is not really surprising in any complex system, which most environments are. There's just too much going on, too many factors influencing each other, too many things you don't know about until you measure and study the system as it really is, not how you imagine it to be.
(This is pretty much the corollary of how you don't know what matters for performance in a complex system, which means you can't really know what your systems are doing until you measure them.)
Having a metrics system is not the only way to discover surprising, previously unknown things about your systems, but it's one of the most wide-ranging ones. By collecting data broadly, without trying to restrict either what you gather or when you gather it to what you expect to be interesting, you enable retrospective serendipity (in addition to the power of being able to explore casual questions, which can itself lead to discoveries).
Of course, our new metrics don't magically let us fully understand our systems. Beyond the limitation created by what data we collect and don't collect, which constrains what we can even possibly see into, there's two more limitations. The first is that we only collect metrics every so often, so transient events that happen faster than that are either invisible or smoothed out (depending on what sort of metrics we're collecting). The second is that most of our metrics are only averages, and even over pretty short periods of time averages can be misleading (see here and here).
To put this in concrete terms, that write IO bandwidth decline we see in the graphs of our Amanda backup servers is real, but it's probably not the full story; all of the cautions I wrote about in my entry on Prometheus disk IO stats definitely apply. If we wanted to fully and deeply understand how our backup servers were behaving here, we'd need more detailed information, either histograms or even block IO traces.
(And even with a lot more data that let us precisely characterize what was going on, we might not understand why the drives were slowing down, because modern drives are opaque black boxes with a lot of things going on inside.)
PS: Of course, if you suspect that your servers are perhaps doing something undesirable, for instance your IMAP server might be doing excessive NFS operations, then it's a good idea to turn your metrics spotlight on that specific aspect of things. But that's a slightly different case, one where you sort of have a hypothesis and you want to test it. An additional advantage of having a metrics system is that when such cases come up you're much more likely to take some easy steps to check your theories of what's wrong, instead of just assuming you're right. As we've found out, it's important to make sure you actually understand your problem.
How I wound up finding a bug in GNU Tar
Every so often something happens in my work that makes me think, even if I don't know what conclusions to really draw from it. I recently mentioned that we'd found a bug in GNU Tar, and the story of how that happened is one of those times.
We back up our fileservers
through Amanda and GNU Tar. For a long time, we've had a problem
where every so often, fortunately quite rarely, tar would freak out
while backing up the filesystem that held
huge amounts of output. Most of the time this would go on forever
and we'd have to kill the dump eventually; other times it would
eventually finish, having produced terabyte(s) of output that
fortunately seemed to compress very well. At one point we captured
such a giant tar file and I subjected it to some inspection, which
revealed that the runaway area was a giant sea of null bytes, which
tar -t' didn't like, but after a while things returned to normal.
(This led to me wondering if null bytes were naturally occurring in people's inboxes. It turns out that hunting for null bytes in text files is not quite as easy as you'd like, and yes, people's inboxes have some.)
We recently moved the filesystem with
/var/mail to our new Linux
fileservers, which are on Ubuntu 18.04
and so have a more recent and more mainline version of GNU Tar than
our OmniOS machines. We hoped that this would solve our GNU Tar
issues, but then we almost immediately had one of these runaway tar
incidents occur. This time around, with GNU Tar running on an Ubuntu
machine where I felt fully familiar with all of the debugging tools
available, I did some inspection of the running
tar process. This
inspection revealed that
tar was issuing an endless stream of
read()s that were all returning 0 bytes:
read(6, "", 512) = 0 read(6, "", 512) = 0 [...] read(6, "", 512) = 0 write(1, "\0\0\0\0\0"..., 10240) = 10240 read(6, "", 512) = 0 [...]
lsof said that file descriptor 6 was someone's mailbox.
apt-get source tar', I fetched the source code to Ubuntu's
version of GNU Tar and went rummaging around through it for
system calls that didn't check for end of file. Once I decoded some
levels of indirection, there turned out to be one obvious place that
seemed to skip it, in the
sparse_dump_region function in sparse.cs.
A little light went on in my head.
A few months ago, we ran into a NFS problem with Alpine. While working on that bug, I
an Alpine process and noticed, among other things, that it was using
ftruncate() to change the size of mailboxes; sometimes it extended
them, temporarily creating a sparse section of the file until it
filled it in, and perhaps sometimes it shrunk them too. This seemed
to match what I'd spotted; sparseness was related, and shrinking a
file's size with
ftruncate() would create a situation where tar
hit end of file before it was expecting to.
(This even provides an explanation for why tar sometimes recovered; if something later delivered more mail to the mailbox, taking it back to or above the size tar expected, tar would stop getting this unexpected end of file.)
I did some poking around in GDB, using Ubuntu's debugging symbols
and the tar package source code I'd fetched, and I can reproduce
the bug, although it's somewhat different than my initial theory.
It turns out that
sparse_dump_region is not dumping sparse
regions of a file, it's dumping non-sparse ones (of course), and
it's used on all files (sparse or not) if you run tar with the
--sparse argument. So the actual bug is if you run GNU Tar with
--sparse and a file shrinks while tar is reading it, tar fails
to properly handle the resulting earlier than expected end of file.
If the file grows again, tar recovers.
(Except if a file that is sparse at the end shrinks purely in that sparse section. In that case you're okay.)
What is interesting to me about this is that there's nothing here
I could not have done years ago on our OmniOS fileservers, in theory.
OmniOS has ways of tracing a program's system call activity, and
it has general equivalents of
lsof, and I could have probably
found and looked at the source code for its version of GNU Tar and
run it under some OmniOS debugger (although we don't seem to have
any version of GDB installed), and so on. But I didn't. Instead we
shrugged a bit and moved on. It took moving this filesystem to an
Ubuntu based environment to get me to dig into the issue.
(It wasn't just an issue of tools and environment, either; part of it was that we automatically assumed that the OmniOS version of GNU Tar was some old unsupported version that there was no reason to look at, because surely the issue was fixed in a newer one.)
PS: Our short term fix is likely to be to tell Amanda to run GNU
--sparse when backing up this filesystem. Mailboxes
shouldn't be sparse, and if they are we're compressing this
filesystem's backups anyway so all those null bytes
will compress really well.
PPS: I haven't tried to report this as a bug to the GNU Tar people because I only confirmed it Friday and the university is now on its winter break. Interested parties should feel free to beat me to it.
Update: The bug has been reported to the GNU Tar people and is now fixed in commit c15c42c.
My current trick for keeping reasonably ready virtual machine images
A lot of my use of virtualization is to test and develop our Ubuntu systems and changes to them. If I have a significant configuration change for one of our Exim machines, for example, I will usually spin up a test VM, build a version of that machine, and test the configuration there before I risk deploying it onto the real machine. Similarly, I prototype a lot of the build and operating procedures for machines on VMs, because if something goes wrong I can pave and rebuild easily (or simply revert to a snapshot).
For a long time, I redid the Ubuntu install on these VMs from scratch every time I needed to change what I was doing with one of them, which was reasonably frequently (or at least I mostly did, because sometimes I was lazy and reused an existing Ubuntu install that wasn't too mutated from our baseline, and then sometimes I was wrong about how mutated it was). Years ago I toyed with the idea of preconfigured virtual machine templates, but I never went anywhere with it for various reasons, including how many potential image variants there were.
Recently I not so much figured something out as got lazy. Most of the VMs I build are of the latest version of Ubuntu and there's definitely one portion of our install system that changes extremely rarely, which is the initial install from our CD image. So at some point, I started just making a snapshot of that initial install in every test machine VM. This still means I have to go through our postinstall process, but that's mostly hands-off and especially I don't have to interact with the VM on the console; I can do it all through SSH.
(The CD stage of the install requires configuring networking and partitioning the local disk before Ubuntu can be installed and brought up on our network. After that it's deliberately accessible via SSH.)
This doesn't help if I need something other than Ubuntu 18.04, but
in that case I do '
dd if=/dev/zero of=/dev/sda' and start from
scratch with a CD install. I don't bother deleting the snapshot;
I'll revert back to it when I go back to wanting this VM to be an
18.04 machine. Fortunately the CD install stage doesn't take too
(Most of our Exim machines are still running 16.04, so testing an Exim change is one recent reason I needed a 16.04 VM. I just recycled that one, in fact.)
PS: I can't keep a generic image and clone it to make new VMs, because our install CD sets the machine up with a specific IP address (whatever you enter). It's easier to keep frozen snapshots for several VMs, each with its own IP address in it, than to fix that somehow after I clone the image into a VM.
Sidebar: Something I should try out with VMWare Workstation
Normally I think of VMWare Workstation snapshots as happening in a straight line of time, but if I remember right you can actually fork them into a tree. If this works well, the obvious approach to dealing with different Ubuntu versions would be to keep forked snapshots of each. However, even if this is something you can do reasonably in the VMWare UI, I wonder what sort of performance impact it has on disk IO in the VM. I don't want to slow down my VMs too much for something that I only use occasionally.
Exploring casual questions with our new metrics system
On Mastodon a while back, I said:
What's surprised me about having our new metrics system is how handy it is to be able to answer casual questions, like 'does this machine's CPU get used much', and how often I poke around with that sort of little question or small issue I'm curious about.
(I was expecting us to only really care about metrics when trying to diagnose problems or crises.)
Putting together some sort of metrics and performance statistics system for our servers has been one of my intentions for many years now (see, for example, this 2014 entry, or this one on our then goals). Over all of that time, I have assumed that what mattered for this hypothetical system and what we'd use it for was being able to answer questions about problems (often serious ones), partly to make sure we actually understood our problem, or do things like check for changes in things we think are harmless. Recently we actually put together such a system, based around Prometheus and Grafana, and my experience with it so far has been rather different than I expected.
Over and over again, I've turned to our metrics system to answer relatively casual or small questions where it's simply useful to have the answers, not important or critical. Sometimes it's because we have questions such as how used a compute machine's CPU or memory is; sometimes it's let me confirm an explanation for a little mystery. Some of the time I don't even have a real question, I'm just curious about what's going on with a machine or a service. For instance, I've looked into what our Amanda servers are doing during backups and turned up interesting patterns in disk IO, as well as confirming and firming up some vague theories we had about how they performed and what their speed limits were.
(And just looking at systems has turned up interesting information, simply because I was curious or trying to put together a useful dashboard.)
The common element in all of this is that having a metrics system now makes asking questions and getting answers a pretty easy process. It took a lot of work to get to this point, but now that I've reached it I can plug PromQL queries into Prometheus or look at the dashboards I've built up and pull out a lot with low effort. Since it only takes a little effort to look, I wind up looking a fair bit, even for casual curiosities that we would never have bothered exploring before.
I didn't see this coming at all, not over all of the time that I've been circling around system performance stats and metrics and so on. Perhaps this is to be expected; our focus from the start has been on looking for problems and dealing with them, and when people talk about metrics systems it's mostly about how their system let them see or figure out something important about their environment.
(This focus is natural, since 'it solved our big problem' is a very good argument for why you want a metric system and why investing the time to set one up was a smart decision.)
PS: This is of course yet another example of how reducing friction increases use and visibility and so on. When it is easy to do something, you often wind up doing it more often, as I've seen over and over again.
One situation where you absolutely can't use
irate() in Prometheus
This is a story about me shooting myself in the foot repeatedly, until I finally noticed.
We have Linux NFS client machines, and we would like to know NFS client performance information about them so we can see things which filesystems they use the most. The Linux kernel provides copious per-mount information on NFS mounts and the Prometheus node exporter can turn all of it into metrics, but for us the raw node exporter stats are just too overwhelming in our environment; a typical machine generates over 68,000 different node_mountstats metrics values. So we wrote our own program to digest the raw mountstats metrics and do things like sum all of the per NFS operation information together for a given mount. We run the program from cron once a minute and publish its information through the node exporter's extremely handy textfile collector.
(The node exporter's current mountstats collector also has the same
misunderstanding about the
xprt: NFS RPC information that I did, and so reports it on
a per-mount basis.)
For a while I've been trying to use the cumulative total RPC time
with the total RPC count to generate an average time per RPC graph.
Over and over I've plugged in some variation on 'cumulative seconds
/ cumulative operations' for various sources of both numbers, put
irate() around both, graphed this, and gotten either nothing
at all or maybe a few points. Today I really dug into this and the
penny finally dropped while I was brute-forcing things. The problem
was that I was reflexively using
irate() instead of
The one situation where you absolutely can't use
irate() is that
you can't apply
irate() to oversampled metrics, metrics
that Prometheus is scraping more often than they're being generated.
We're generating NFS metrics once a minute but scraping the node
exporter every fifteen seconds, so three out of four of our recorded
NFS metrics are the same.
irate() very specifically uses the last two data points in your sample
interval, and when you're oversampling, those two data points are
very often going to be the exact same actual metric and so have no
In other words, a great deal of my graph was missing because it was all 'divide by zero' NaNs, since the irate() of the 'cumulative operations' count often came out as zero.
(Oversampled metric is my term. There may be an official term for 'scraped more often than it's generated' that I should be using instead.)
Looking back, I specifically mentioned this issue in my entry on
irate(), but apparently
it didn't sink in thoroughly enough. Especially I didn't think
through the consequences of dividing by the
irate() of an oversampled
metric, which is where things really go off the rails.
(If you merely directly graph the
irate() of an oversampled metric,
you just get a (very) spiky graph.)
Part of what happened is that when I was writing my exploratory
PromQL I was thinking in terms of how often the metric was generated,
not how often it was scraped, so I didn't even consciously consider
that I had an oversampled metric. For example, I was careful to use
irate(...[3m])', so I would be sure to have at least two valid metric
points for a once a minute metric. This is probably a natural mindset
to have, but that just means it's something I'm going to have to try to
keep in mind for the future for node exporter textfile metrics and
Sidebar: Why I reflexively reach for
rate() is a
choice of convenience and what time range and query step you're
working on. Since the default Prometheus graph is one hour, I'm
almost always exploring with a small enough query step that
irate() is the easier way to go. Even
when I'm dealing with an oversampled metric, I can get sort of get
away with this if I'm directly graphing the
irate() result; it
will just be (very) spiky, it won't be empty or clearly bogus.
The brute force cron-based way of flexibly timed repeated alerts
Suppose, not hypothetically, that you have a cron job that monitors something important. You want to be notified relatively fast if your Prometheus server is down, so you run your cron job frequently, say once every ten minutes. However, now we have the problem that cron is stateless, so if our Prometheus server goes down and our cron job starts alerting us, it will re-alert us every ten minutes. This is too much noise (at least for us).
There's a standard pattern for dealing with this in cron jobs that send alerts; once the alert happens, you create a state file somewhere and as long as your current state is the same as the state file, you don't produce any output or send out your warning or whatever. But this leads to the next problem, which is that you alert once and are then silent forever afterward, leaving it to people to remember that the problem (still) exists. It would be better to re-alert periodically, say once every hour or so. This isn't too hard to do; you can check to see if the state file is more than an hour old and just re-send the alert if it is.
(One way to do this is with '
find <file> -mmin +... -print'.
Although it may not be Unixy, I do rather wish for
newerthan utilities as a standard and widely available thing. I
know I can write them in a variety of ways, but it's not the same.)
But this isn't really what we want, because we aren't around all of the time. Re-sending the alert once an hour in the middle of the night or the middle of the weekend will just give us a big pile of junk email to go through when we get back in to the office; instead we want repeats only once every hour or two during weekdays.
When I was writing our checker script, I got to this point and started planning out how I was going to compare against the current hour and day of weeek in the script to know when I should clear out the state file and so on. Then I had a flash of the obvious and realized that I already had a perfectly good tool for flexibly specifying various times and combinations of time conditions, namely cron itself. The simple way to reset the state file and cause re-alerts at whatever flexible set of times and time patterns I want is to do it through crontab entries.
So now I have one cron entry that runs every ten minutes for the main script, and another cron entry that clears the state file (if it exists) several times a day during the weekday. If we decide we want to be re-notified once a day during the weekend, that'll be easy to add as another cron entry. As a bonus, everyone here understands cron entries, so it will be immediately obvious when things run and what they do in a way that it wouldn't be if all of this was embedded in a script.
(It's also easy for anyone to change. We don't have to reach into a script; we just change crontab lines, something we're already completely familiar with.)
As it stands this is slightly too simplistic, because it clears the
state file without caring how old it is. In theory we could generate
an alert shortly before the state file is due to cleared, clear the
state file, and then immediately re-alert. To deal with that I
decided to go the extra distance and only clear the state file if
it was at least a minimum age (using
find to see if it was old
enough, because we make do with the tools Unix gives us).
(In my actual implementation, the main script takes a special
argument that makes it just clear the state file. This way only the
script has to know where the state file is or even just what to do
to clear the 'do not re-alert' state; the crontab entry just runs