Wandering Thoughts

2019-02-08

Making more use of keyboard control over window position and size

For a long time, I've done things like move and resize windows in my highly customized X environment through the mouse. Using the mouse for this is the historical default in X window managers and X window manager setups, and I have a whole set of fvwm mouse bindings to make this fast and convenient. I may have used keyboard bindings for things like raising and lowering windows and locking the screen, but not beyond that.

(I was familiar with the idea, since keyboard-driven tiling window managers are very much a thing in X circles. But I happen to like the mouse.)

Then I got a work laptop with a screen that was big enough to actually need window management and discovered that my mouse-driven window manipulation didn't really work well, because a touchpad is not a good mouse. This led to me exploring Cinnamon's keyboard bindings for re-positioning and re-sizing windows (once I discovered that they exist), and progressively using them more and more rather than shuffling windows around by hand on the touchpad.

(Unfortunately there doesn't seem to be a good list of Cinnamon's standard keybindings, or even anything obvious to tell you that they exist. I'm not sure how I stumbled over them.)

These keybindings have turned out to be quite pleasant to use and make for a surprisingly fluid experience. Cinnamon's basic positioning keybindings let you tile windows to the halves and quarters of the screen, which works out about right on my laptop, and of course it supports maximize. Recently I've also been using a keybinding for flipping to a new virtual desktop ('workspace' in Cinnamon terminology) and taking the current window with you, which is helpful for de-cluttering a screen that's getting too overgrown.

My laptop experience has pushed me into adding a certain number of similar keyboard bindings to my desktop fvwm setup. Most of these are pure positioning bindings that do things like move a window to the bottom left or bottom right corner of the screen, because this turns out to be an operation I want to do a fair bit on my desktop (partly so I can just randomly place a new window then snap it into position). I also created a keybinding that resets terminals and browser windows to my preferred size for them (along with a 'maximize window' keybinding), which has turned out to be very handy. Now that it's trivial to reset a window to the right size, I'm much more likely to temporarily resize windows to whatever is convenient for right now.

(A maximized Firefox on either my work or my home display is a little bit ridiculous, but sometimes I just want to hit things with a giant hammer and there are cases where it really is what I want.)

Clearly one of the things that's going on here is the usual issue of reducing friction, much as with creating keyboard controls for sound volume and other cases. Having specific, easily accessible controls for various sorts of operations makes it more likely that I will actually bother to do them, and when they're useful operations that's a good thing.

(It's not sufficient to have accessible controls, as I'm not using all of the fvwm keyboard bindings for this that I added. Some superficially attractive positioning and resizing operations turn out to be too infrequent for me to bother to remember and use the keybindings.)

KeyboardWindowControl written at 21:31:25

2019-01-30

How having a metrics system centralized information and got me to check it

We recently had a little incident with one of the SSDs on our new fileservers. To condense the story, ZFS started detecting a few checksum errors over time (starting January 13th), we assumed they were a one-time thing, we arranged to scrub the pool this weekend, and during the scrub ZFS discovered dozens more checksum errors (and faulted the SSD). With the Linux kernel reporting no drive errors, we turned to smartctl to see if there was anything in the SMART attributes. And indeed there was; when I looked on Friday, before the scrub, I noticed right away that the drive had a 'reallocated event count' and 'reallocate(d) NAND block count' of 1.

A while back, we arranged to periodically collect all SMART attributes for all of our fileserver drives and dump them into our Prometheus system as local metrics (we had reasons to want to keep an eye on these). Shortly after that, I built a very brute force Grafana dashboard that let me see the state of those metrics as of some point in time and what had changed in them from an earlier point in time. On Friday, I used the historical log of metrics in Prometheus to see that the reallocated counts had both gone from 0 to 1 early on January 13th, not long before the first checksum errors were reported.

(There is probably a clever way in Prometheus to extract the time when this happened. I did it the brute force way of graphing the attributes and then backing up in time until the graph changed.)
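
As a rough illustration, the heart of such a 'what changed' dashboard panel can be a single PromQL expression that compares each attribute's current raw value against its value at some earlier point. The metric name below is a hypothetical stand-in for whatever your local SMART collection actually produces:

# 'smart_attribute_raw_value' is a hypothetical local metric name.
smart_attribute_raw_value - smart_attribute_raw_value offset 1d != 0

Only attributes whose raw value differs from a day ago survive the comparison, with the change as the resulting value; shrinking the offset by hand is essentially the 'back up in time until the graph changed' approach.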

After the scrub had run into all of those problems, I took a second look at smartctl output for the drive to see if anything new had appeared, and it didn't look like it; certainly both reallocated counts were still the '1' that they had been on Friday. Then I also checked my brute force Grafana dashboard and it promptly showed me another difference; the 'Raw Read Error Rate' had gone up by a decent amount. There were two reasons I hadn't spotted this earlier; first, it was a reasonably large number that looked like the other reasonably large numbers in other SMART attributes, and second, it had also been non-zero on Friday, before the scrub.

(smartctl output is dozens of lines with a bunch of fields for each attribute; there is a lot of noise, and it's easy to glaze over yet another set of things. The Grafana dashboard made things jump out by only presenting changed attributes and the changes in their raw values, which reduced it to about five or six much easier to read attributes.)

Some quick graphing later and I could see that the raw read error rate had been zero before January 13th and had been steadily counting up ever since then (with a sudden jump during the scrub). This didn't look like a SSD that had one NAND block that had gone bad, taking out some random collection of sectors and ZFS blocks; this looked like a SSD that was dying. Or maybe not, because SSDs are notoriously magical at a low level so perhaps it was routine for our Crucial SSDs to see some raw read errors and to count up that SMART attribute. And if it was actually a problem indicator, were we seeing it on any other fileserver drives?

Since I had all of our SMART metrics for all of our fileserver drives in one place, in Prometheus, I could easily do some quick PromQL checks and see that basically no other Crucial SSD had a non-zero 'Raw Read Error Rate'. We had a real problem but it was only with this one drive.
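
The check itself is the sort of one-liner that makes this low effort. A hedged sketch, again with hypothetical metric and label names for our locally collected SMART data:

# List all drives currently reporting a non-zero Raw Read Error Rate
# (hypothetical metric and label names).
smart_attribute_raw_value{attr="raw_read_error_rate"} > 0

An empty result, or one that contains only the suspect drive, answers the question on the spot.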

There's nothing in this story that I couldn't have done without Prometheus, at least in theory. We could have created a historical log of SMART metrics in some other way, I could have paid more attention to smartctl output and noticed the non-zero 'Raw Read Error Rate' (either on Friday or after the scrub explosion), I could have manually diff'd or otherwise compared two smartctl reports to see the increased RRER, and I could have gone around to all of the fileservers to check all of the disks for their RRER. But in practice very little of this would have happened without Prometheus or something equivalent to it.

What Prometheus did for us here is two things. First, it centralized all of this information in one place and thus gave us an easy global view, just like a central syslog server. The second is that it reduced the friction of looking at all of these things (over and above centralizing them in one place). Reducing friction is always and evermore a huge subtle advantage, one that you shouldn't underrate; over and over, reducing friction past a crucial point has caused a sea change in how I do things.

(This is related to how our metrics system gets me to explore casual questions. This wasn't a casual question, but the low friction caused me to go further than I otherwise would have tried to.)

MetricsReducedFriction written at 00:27:29

2019-01-23

A little surprise with Prometheus scrape intervals, timeouts, and alerts

Prometheus pulls metrics from metric sources or, to put it in Prometheus terms, scrapes targets. Every scrape configuration and thus every target has a scrape interval and a scrape timeout as part of its settings; these can be specified explicitly or inherited from global values. In a perfect world where scraping a target either completes or fails in zero time, this results in simple timing; a target is scraped at time T, then T + interval, then T + interval + interval, and so on. However, the real world is not simple and scraping a target can take a non-zero amount of time, possibly quite a lot if the scrape times out. You might sensibly wonder if the next scrape is pushed back by the non-zero scrape time.

The answer is that it is not, or at least it is sort of not. Regardless of the amount of time a scrape at time T takes, the next scrape is scheduled for T + interval and will normally happen then. Scrapes are driven by a ticker, which runs independently of how long a scrape took and adjusts things as necessary to keep ticking exactly on time.

So far, so good. But this means that slow scrapes can have an interesting and surprising interaction with alerting rules and Alertmanager group_wait settings. The short version is that you can get a failing check and then a successful one in close succession, close enough to suppress an Alertmanager alert that you would normally expect to fire.

To make this concrete, suppose that you perform SSH blackbox checks every 90 seconds, time out at 60 seconds, trigger a Prometheus alert rule the moment a SSH check fails, and have a one minute group_wait in Alertmanager. Then if a SSH check times out instead of failing rapidly, you can have a sequence where you start the check at T, have it fail via timeout at T + 60, send a firing alert to Alertmanager shortly afterward, have the next check succeed at T + 90, and withdraw the alert shortly afterward from Alertmanager, before the one minute group_wait is up. The net result is that your 'alert immediately' SSH alert rule has not sent you an alert despite a SSH check failing.

It's natural to expect this result if your scrape interval is less than your group_wait, because then it's obvious that you can get a second scrape in before Alertmanager makes the alert active. It's not as obvious when the second scrape is possible only because the difference between the scrape interval and the scrape timeout is less than group_wait.

(If nothing else, this is going to make me take another look at our scrape timeout settings. I'm going to have to think carefully about just what all of the interactions are here, especially given all of the other alert delays. Note that a resolved alert is immediately sent to Alertmanager.)

PS: It's a pity that there's no straightforward way that I know of to get either Prometheus or Alertmanager to write a log record of pending, firing, and cleared alerts (with timestamps and details). The information is more or less captured in Prometheus metrics, but getting the times when things happened is a huge pain; being able to write optional logs of this would make some things much easier.

(I believe both report this if you set their log level to 'debug', but of course then you get a flood of other information that you probably don't want.)

Sidebar: How Prometheus picks the start time T of scrapes

If you've paid attention to your logs from things like SSH blackbox checks, you'll have noticed that Prometheus does not hit all of your scrape targets at exactly the same time, even if they have the same scrape interval. How Prometheus picks the start time for each scrape target is not based on when it learns about the scrape target, as you might expect; instead, well, let me quote the code:

// How far it is from 'now' to the next interval boundary.
base   = interval - now%interval
// A stable per-target offset within the interval, derived from the
// target's hash.
offset = t.hash() % interval
next   = base + offset

// Wrap around so the initial delay is always less than one interval.
if next > interval {
   next -= interval
}

All of these values are in nanoseconds, and t.hash() is a 64-bit hash value, hopefully randomly distributed. The resulting next value is the offset that Prometheus waits before starting the scrape interval ticker for that target.

In short, Prometheus randomly smears the start time for scrape targets across the entire interval, hopefully resulting in a more or less even distribution.

PrometheusScrapeIntervalBit written at 01:36:30

2019-01-09

On right and wrong ways to harvest system-level performance stats

One of the great things about blogging, especially over a long time, is that it gives you opportunities to be spectacularly wrong in public. (I'm actually sort of serious about this being a good thing, but that's another entry.)

Back in 2012, I wrote an entry called "The wrong way to harvest system-level performance stats", in which I maintained that single-shot stats gathering of absolute counters was the wrong way and that the host agent should compute and publish ongoing deltas between measurements. This issue matters partly because it fundamentally affects how you can gather stats; if you have single-shot stats, you can collect them with simple stateless scripts that run and exit, but if you need differences, you also need continually running scripts to gather those differences.

There is no nice way to put it: I'm wrong here. System level stats should be gathered as absolute (total) values whenever possible, not as differences (deltas) over some time interval. Perhaps I was somewhat less wrong back in 2012 than I would be today, but today you definitely want to gather only absolute measurements and then leave everything else up to your metrics system backend. And even in your metrics backend, you want to store the absolute values in addition to any deltas that you pre-compute for convenience and lower CPU usage.

In 2012 it was at least somewhat more likely that you would be dealing with rapidly rolling 32-bit counters and an environment where you could only collect samples relatively infrequently. Today, 64-bit counters are everywhere and more powerful systems mean that we can afford to collect stats quite frequently. Also, I think metrics system backends are clearly more powerful today than they were six years ago; today's systems on today's hardware can reasonably do quite sophisticated computations over fairly large amounts of data.

(In other words, you're no longer setting your CPUs on fire if you want to compute even fine-grained deltas with counter reset checks every time you (re)display a graph.)

The first fundamental reason for using absolute values in preference to anything else is that it's much easier to go from absolute values to differences over the time range of your choice than it is to go from differences over a specific time range to either absolute values or differences over another time range. You can do the second, but it is a bunch more work and you run into various potential precision issues. Going from absolute counter values to deltas over various times is relatively straightforward; you just need frequent enough samples and then to scan for counter resets.
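
In Prometheus, for example, this is what rate() and increase() give you over absolute counters. A quick sketch using a standard node_exporter disk counter (assuming current metric names; the device label is picked purely for illustration):

# Per-second write bandwidth over the last five minutes, computed from
# the absolute bytes-written counter; rate() also handles counter resets.
rate(node_disk_written_bytes_total{device="sda"}[5m])

# The same counter turned into a total over a completely different
# window, long after the fact.
increase(node_disk_written_bytes_total{device="sda"}[12h])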

The second fundamental reason is that gathering the actual stats is almost always easier. Operating systems mostly export absolute counters because absolute counters are the easiest thing to do, and then your stats gathering agent can be stateless and simply dump out the raw values from the system. You can even do this with scripts, as people were advocating (and implementing) back in 2012.

As a bonus, you wind up relatively indifferent to whether any particular stat is a counter that is supposed to always go up or a gauge that goes up and down. Interpretation of this is up to the metrics system or even to the user writing queries against your data, and they're the best placed to know what they want and what things really mean. Your stats gathering process can simply blindly read and export everything it can get its hands on.
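
On the query side the difference is just whether you wrap the metric in rate(); a small sketch, again assuming current node_exporter metric names and with made-up label values:

# A counter: always read it through rate() or increase().
rate(node_network_receive_bytes_total{device="eth0"}[5m])

# A gauge: graph it or alert on it directly.
node_filesystem_avail_bytes{mountpoint="/var"}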

So why have I realized that I was wrong? Simple; I got practical experience by building and using a metrics system that used a lot of absolute stats as its raw values and computed deltas from them. Prometheus is perfectly happy to do this (although there are issues), and having easy access to absolute stats has also made it easy to use them, which is sometimes the right way to present information. Now that I've used a system based around absolute stats, I've been smacked in the nose by how clearly off my previous entry is and it's time to admit it and set the record straight.

StatsGatheringNoDeltas written at 01:00:34

2019-01-04

Planning ahead in documentation worked out for us

Several years ago I wrote a kind of war story about planning ahead in documentation, which can be summarized in my tweet from then:

Current status: changing documentation to leave notes for myself that we'll need in three or four years. Yes, this is planning way ahead.

Specifically, I documented various test procedures and additional requirements for our internal self-serve DHCP portal and its overall system. We weren't going to need this information before we updated the DHCP portal from Ubuntu 14.04 (which it was upgraded to back in 2015) to something else.

Given that Ubuntu 18.04 came out last year and put Ubuntu 14.04 on life support, you can guess the follow-up to that old entry. It's a pretty straightforward one. This fall we updated our DHCP portal from 14.04 to 18.04, more or less on schedule for what we might have expected back in 2015, and when my co-worker did that, apparently my documentation notes from back in 2015 came in handy. So my planning way ahead actually paid off, which doesn't really surprise me but does please me in a quiet little way (since it means that I didn't waste my time back then).

(We missed some things, as always happens, but this time around they were either completely external things or things that were due to the various changes in Ubuntu 18.04. There were also some changes to the overall plans; for example, here in 2018 we've switched from Bind to NSD. We probably didn't have to do that, but we're switching to NSD on all our other machines too.)

One of the reasons that this worked out is that we didn't try to make any fundamental changes in how the DHCP portal worked. Partly we did this because we exist in an environment with long term stability, but another part of it is that we're pretty conservative once we have a system that works. In some other environments, people might have taken this as an opportunity to try something completely new (there's probably some canned 'captive portal' open source software out there, for example). In our environment, we save our energy for other things.

This sort of thing doesn't always pay off, of course. Sometimes our plans or needs change. Who knows what we'll have to do in an IPv6 world, for example. But generally I've been happy with what's come out of planning (way) ahead, and I should probably think about how to do it more.

(Every system here is eventually going to be replaced, and so it's probably worth spending at least a little bit of time thinking about how that might be done and what the obstacles to it might be when putting a new version of the system together. What's going to be a general problem? What little workaround is hopefully not going to be needed in the next version of Ubuntu? And so on.)

DocumentingPlanningAheadII written at 00:55:21

2018-12-27

Having metrics has taught me I didn't really know how our systems behave

I'll start with a story.

I tweeted some screenshots from our Prometheus setup recently. As it happens, there's something interesting going on in the first one, which you can find a 1920x1010 'full sized' PNG of here. This is a dashboard of one of our Amanda backup servers during backups and thus as it writes data to the disk drives we use as 'tapes'. If you look at the 'disk write bandwidth' graph, you can see the steadily declining write rate to the yellow disk; it drops from about 125 MBytes/sec at the start of writing data to just under 100 MBytes/sec by the end. The disk is 100% utilized during these write-out periods where its bandwidth slowly drops, and over the same time the average write time also goes up. All of this is consistent; it happens on all of our backup servers and on all of our disks.
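
The graphs on such a dashboard come from ordinary node_exporter disk metrics. A rough sketch of the PromQL involved, assuming current node_exporter metric names and with the device label picked purely for illustration:

# Disk write bandwidth, in bytes per second.
rate(node_disk_written_bytes_total{device="sdb"}[1m])

# Utilization: the fraction of each second the disk spent busy on IO.
rate(node_disk_io_time_seconds_total{device="sdb"}[1m])

# Average time per write, in seconds.
  rate(node_disk_write_time_seconds_total{device="sdb"}[1m])
/ rate(node_disk_writes_completed_total{device="sdb"}[1m])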

I don't know what causes any of this or what it means, or if it's important. But what I do know is we had no idea that this was happening until we looked at the metrics. And it's not the only thing about our systems that I've discovered by building dashboards and poking around in them. Over and over I've discovered things from our new metrics system, for example that our FTP area is surprisingly active.

What building a metrics system and using it has taught me is that I didn't really know how our systems behaved. I had some idea of the large scale, obvious things, and there were some specific issues I had looked into even before we built our metrics system, but beyond that I was mostly ignorant. This is not really surprising in any complex system, which most environments are. There's just too much going on, too many factors influencing each other, too many things you don't know about until you measure and study the system as it really is, not how you imagine it to be.

(This is pretty much the corollary of how you don't know what matters for performance in a complex system, which means you can't really know what your systems are doing until you measure them.)

Having a metrics system is not the only way to discover surprising, previously unknown things about your systems, but it's one of the most wide-ranging ones. By collecting data broadly, without trying to restrict either what you gather or when you gather it to what you expect to be interesting, you enable retrospective serendipity (in addition to the power of being able to explore casual questions, which can itself lead to discoveries).

Of course, our new metrics don't magically let us fully understand our systems. Beyond the limitation created by what data we collect and don't collect, which constrains what we can even possibly see into, there are two more limitations. The first is that we only collect metrics every so often, so transient events that happen faster than that are either invisible or smoothed out (depending on what sort of metrics we're collecting). The second is that most of our metrics are only averages, and even over pretty short periods of time averages can be misleading (see here and here).

To put this in concrete terms, that write IO bandwidth decline we see in the graphs of our Amanda backup servers is real, but it's probably not the full story; all of the cautions I wrote about in my entry on Prometheus disk IO stats definitely apply. If we wanted to fully and deeply understand how our backup servers were behaving here, we'd need more detailed information, either histograms or even block IO traces.

(And even with a lot more data that let us precisely characterize what was going on, we might not understand why the drives were slowing down, because modern drives are opaque black boxes with a lot of things going on inside.)

PS: Of course, if you suspect that your servers are perhaps doing something undesirable, for instance your IMAP server might be doing excessive NFS operations, then it's a good idea to turn your metrics spotlight on that specific aspect of things. But that's a slightly different case, one where you sort of have a hypothesis and you want to test it. An additional advantage of having a metrics system is that when such cases come up you're much more likely to take some easy steps to check your theories of what's wrong, instead of just assuming you're right. As we've found out, it's important to make sure you actually understand your problem.

MetricsAndUnderstandingMore written at 20:59:02

2018-12-23

How I wound up finding a bug in GNU Tar

Every so often something happens in my work that makes me think, even if I don't know what conclusions to really draw from it. I recently mentioned that we'd found a bug in GNU Tar, and the story of how that happened is one of those times.

We back up our fileservers through Amanda and GNU Tar. For a long time, we've had a problem where every so often, fortunately quite rarely, tar would freak out while backing up the filesystem that held /var/mail, producing huge amounts of output. Most of the time this would go on forever and we'd have to kill the dump eventually; other times it would eventually finish, having produced terabyte(s) of output that fortunately seemed to compress very well. At one point we captured such a giant tar file and I subjected it to some inspection, which revealed that the runaway area was a giant sea of null bytes, which 'tar -t' didn't like, but after a while things returned to normal.

(This led to me wondering if null bytes were naturally occurring in people's inboxes. It turns out that hunting for null bytes in text files is not quite as easy as you'd like, and yes, people's inboxes have some.)

We recently moved the filesystem with /var/mail to our new Linux fileservers, which are on Ubuntu 18.04 and so have a more recent and more mainline version of GNU Tar than our OmniOS machines. We hoped that this would solve our GNU Tar issues, but then we almost immediately had one of these runaway tar incidents occur. This time around, with GNU Tar running on an Ubuntu machine where I felt fully familiar with all of the debugging tools available, I did some inspection of the running tar process. This inspection revealed that tar was issuing an endless stream of read()s that were all returning 0 bytes:

read(6, "", 512)   = 0
read(6, "", 512)   = 0
[...]
read(6, "", 512)   = 0
write(1, "\0\0\0\0\0"..., 10240) = 10240
read(6, "", 512)   = 0
[...]

lsof said that file descriptor 6 was someone's mailbox.

Using 'apt-get source tar', I fetched the source code to Ubuntu's version of GNU Tar and went rummaging around through it for read() system calls that didn't check for end of file. Once I decoded some levels of indirection, there turned out to be one obvious place that seemed to skip it, in the sparse_dump_region function in sparse.c. A little light went on in my head.

A few months ago, we ran into a NFS problem with Alpine. While working on that bug, I strace'd an Alpine process and noticed, among other things, that it was using ftruncate() to change the size of mailboxes; sometimes it extended them, temporarily creating a sparse section of the file until it filled it in, and perhaps sometimes it shrunk them too. This seemed to match what I'd spotted; sparseness was related, and shrinking a file's size with ftruncate() would create a situation where tar hit end of file before it was expecting to.

(This even provides an explanation for why tar sometimes recovered; if something later delivered more mail to the mailbox, taking it back to or above the size tar expected, tar would stop getting this unexpected end of file.)

I did some poking around in GDB, using Ubuntu's debugging symbols and the tar package source code I'd fetched, and I can reproduce the bug, although it's somewhat different than my initial theory. It turns out that sparse_dump_region is not dumping sparse regions of a file, it's dumping non-sparse ones (of course), and it's used on all files (sparse or not) if you run tar with the --sparse argument. So the actual bug is that if you run GNU Tar with --sparse and a file shrinks while tar is reading it, tar fails to properly handle the resulting earlier-than-expected end of file. If the file grows again, tar recovers.

(Except if a file that is sparse at the end shrinks purely in that sparse section. In that case you're okay.)

What is interesting to me about this is that there's nothing here I could not have done years ago on our OmniOS fileservers, in theory. OmniOS has ways of tracing a program's system call activity, and it has general equivalents of lsof, and I could have probably found and looked at the source code for its version of GNU Tar and run it under some OmniOS debugger (although we don't seem to have any version of GDB installed), and so on. But I didn't. Instead we shrugged a bit and moved on. It took moving this filesystem to an Ubuntu based environment to get me to dig into the issue.

(It wasn't just an issue of tools and environment, either; part of it was that we automatically assumed that the OmniOS version of GNU Tar was some old unsupported version that there was no reason to look at, because surely the issue was fixed in a newer one.)

PS: Our short term fix is likely to be to tell Amanda to run GNU Tar without --sparse when backing up this filesystem. Mailboxes shouldn't be sparse, and if they are we're compressing this filesystem's backups anyway so all those null bytes will compress really well.

PPS: I haven't tried to report this as a bug to the GNU Tar people because I only confirmed it Friday and the university is now on its winter break. Interested parties should feel free to beat me to it.

Update: The bug has been reported to the GNU Tar people and is now fixed in commit c15c42c.

TarFindingTruncateBug written at 02:07:49

2018-12-17

My current trick for keeping reasonably ready virtual machine images

A lot of my use of virtualization is to test and develop our Ubuntu systems and changes to them. If I have a significant configuration change for one of our Exim machines, for example, I will usually spin up a test VM, build a version of that machine, and test the configuration there before I risk deploying it onto the real machine. Similarly, I prototype a lot of the build and operating procedures for machines on VMs, because if something goes wrong I can pave and rebuild easily (or simply revert to a snapshot).

For a long time, I redid the Ubuntu install on these VMs from scratch every time I needed to change what I was doing with one of them, which was reasonably frequently (or at least I mostly did, because sometimes I was lazy and reused an existing Ubuntu install that wasn't too mutated from our baseline, and then sometimes I was wrong about how mutated it was). Years ago I toyed with the idea of preconfigured virtual machine templates, but I never went anywhere with it for various reasons, including how many potential image variants there were.

Recently I not so much figured something out as got lazy. Most of the VMs I build are of the latest version of Ubuntu and there's definitely one portion of our install system that changes extremely rarely, which is the initial install from our CD image. So at some point, I started just making a snapshot of that initial install in every test machine VM. This still means I have to go through our postinstall process, but that's mostly hands-off and especially I don't have to interact with the VM on the console; I can do it all through SSH.

(The CD stage of the install requires configuring networking and partitioning the local disk before Ubuntu can be installed and brought up on our network. After that it's deliberately accessible via SSH.)

This doesn't help if I need something other than Ubuntu 18.04, but in that case I do 'dd if=/dev/zero of=/dev/sda' and start from scratch with a CD install. I don't bother deleting the snapshot; I'll revert back to it when I go back to wanting this VM to be an 18.04 machine. Fortunately the CD install stage doesn't take too long.

(Most of our Exim machines are still running 16.04, so testing an Exim change is one recent reason I needed a 16.04 VM. I just recycled that one, in fact.)

PS: I can't keep a generic image and clone it to make new VMs, because our install CD sets the machine up with a specific IP address (whatever you enter). It's easier to keep frozen snapshots for several VMs, each with its own IP address in it, than to fix that somehow after I clone the image into a VM.

Sidebar: Something I should try out with VMWare Workstation

Normally I think of VMWare Workstation snapshots as happening in a straight line of time, but if I remember right you can actually fork them into a tree. If this works well, the obvious approach to dealing with different Ubuntu versions would be to keep forked snapshots of each. However, even if this is something you can do reasonably in the VMWare UI, I wonder what sort of performance impact it has on disk IO in the VM. I don't want to slow down my VMs too much for something that I only use occasionally.

KeepingReadyVMImages written at 22:11:56

Exploring casual questions with our new metrics system

On Mastodon a while back, I said:

What's surprised me about having our new metrics system is how handy it is to be able to answer casual questions, like 'does this machine's CPU get used much', and how often I poke around with that sort of little question or small issue I'm curious about.

(I was expecting us to only really care about metrics when trying to diagnose problems or crises.)

Putting together some sort of metrics and performance statistics system for our servers has been one of my intentions for many years now (see, for example, this 2014 entry, or this one on our then goals). Over all of that time, I have assumed that what mattered for this hypothetical system and what we'd use it for was being able to answer questions about problems (often serious ones), partly to make sure we actually understood our problem, or do things like check for changes in things we think are harmless. Recently we actually put together such a system, based around Prometheus and Grafana, and my experience with it so far has been rather different than I expected.

Over and over again, I've turned to our metrics system to answer relatively casual or small questions where it's simply useful to have the answers, not important or critical. Sometimes it's because we have questions such as how heavily a compute machine's CPU or memory is being used; sometimes it's because looking let me confirm an explanation for a little mystery. Some of the time I don't even have a real question, I'm just curious about what's going on with a machine or a service. For instance, I've looked into what our Amanda servers are doing during backups and turned up interesting patterns in disk IO, as well as confirming and firming up some vague theories we had about how they performed and what their speed limits were.
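
To make the CPU question concrete, the sort of low-effort PromQL involved is something like the following, using standard node_exporter metrics (the instance label here is a made-up hostname):

# Overall CPU usage of one machine as a 0-1 fraction, averaged across
# all CPUs over the last five minutes.
1 - avg(rate(node_cpu_seconds_total{instance="comps1:9100", mode="idle"}[5m]))

# How much memory is actually available on it right now.
node_memory_MemAvailable_bytes{instance="comps1:9100"}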

(And just looking at systems has turned up interesting information, simply because I was curious or trying to put together a useful dashboard.)

The common element in all of this is that having a metrics system now makes asking questions and getting answers a pretty easy process. It took a lot of work to get to this point, but now that I've reached it I can plug PromQL queries into Prometheus or look at the dashboards I've built up and pull out a lot with low effort. Since it only takes a little effort to look, I wind up looking a fair bit, even for casual curiosities that we would never have bothered exploring before.

I didn't see this coming at all, not over all of the time that I've been circling around system performance stats and metrics and so on. Perhaps this is to be expected; our focus from the start has been on looking for problems and dealing with them, and when people talk about metrics systems it's mostly about how their system let them see or figure out something important about their environment.

(This focus is natural, since 'it solved our big problem' is a very good argument for why you want a metric system and why investing the time to set one up was a smart decision.)

PS: This is of course yet another example of how reducing friction increases use and visibility and so on. When it is easy to do something, you often wind up doing it more often, as I've seen over and over again.

MetricsExploringCasualThings written at 00:56:28

2018-12-12

One situation where you absolutely can't use irate() in Prometheus

This is a story about me shooting myself in the foot repeatedly, until I finally noticed.

We have Linux NFS client machines, and we would like to know NFS client performance information about them so we can see things like which filesystems they use the most. The Linux kernel provides copious per-mount information on NFS mounts and the Prometheus node exporter can turn all of it into metrics, but for us the raw node exporter stats are just too overwhelming in our environment; a typical machine generates over 68,000 different node_mountstats metrics values. So we wrote our own program to digest the raw mountstats metrics and do things like sum all of the per NFS operation information together for a given mount. We run the program from cron once a minute and publish its information through the node exporter's extremely handy textfile collector.

(The node exporter's current mountstats collector also has the same misunderstanding about the xprt: NFS RPC information that I did, and so reports it on a per-mount basis.)

For a while I've been trying to use the cumulative total RPC time with the total RPC count to generate an average time per RPC graph. Over and over I've plugged in some variation on 'cumulative seconds / cumulative operations' for various sources of both numbers, put an irate() around both, graphed this, and gotten either nothing at all or maybe a few points. Today I really dug into this and the penny finally dropped while I was brute-forcing things. The problem was that I was reflexively using irate() instead of rate().

The one situation where you absolutely can't use irate() is that you can't apply irate() to oversampled metrics, metrics that Prometheus is scraping more often than they're being generated. We're generating NFS metrics once a minute but scraping the node exporter every fifteen seconds, so three out of four of our recorded NFS metrics are the same. irate() very specifically uses the last two data points in your sample interval, and when you're oversampling, those two data points are very often going to be the exact same actual metric and so have no change.

In other words, a great deal of my graph was missing because it was all 'divide by zero' NaNs, since the irate() of the 'cumulative operations' count often came out as zero.
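
To put the difference in PromQL terms, with hypothetical metric names standing in for what our digesting program actually publishes:

# Average time per NFS RPC. With metrics generated once a minute but
# scraped every fifteen seconds, the irate() version frequently divides
# by zero, because the last two samples in the range are often the same
# underlying data point.
  irate(our_nfs_rpc_seconds_total[3m])
/ irate(our_nfs_rpc_operations_total[3m])

# The rate() version uses the whole range and so keeps producing
# sensible values despite the duplicated samples.
  rate(our_nfs_rpc_seconds_total[3m])
/ rate(our_nfs_rpc_operations_total[3m])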

(Oversampled metric is my term. There may be an official term for 'scraped more often than it's generated' that I should be using instead.)

Looking back, I specifically mentioned this issue in my entry on rate() versus irate(), but apparently it didn't sink in thoroughly enough. Especially I didn't think through the consequences of dividing by the irate() of an oversampled metric, which is where things really go off the rails.

(If you merely directly graph the irate() of an oversampled metric, you just get a (very) spiky graph.)

Part of what happened is that when I was writing my exploratory PromQL I was thinking in terms of how often the metric was generated, not how often it was scraped, so I didn't even consciously consider that I had an oversampled metric. For example, I was careful to use 'irate(...[3m])', so I would be sure to have at least two valid metric points for a once a minute metric. This is probably a natural mindset to have, but that just means it's something I'm going to have to try to keep in mind for the future for node exporter textfile metrics and Pushgateway metrics.

Sidebar: Why I reflexively reach for irate()

Normally irate() versus rate() is a choice of convenience and what time range and query step you're working on. Since the default Prometheus graph is one hour, I'm almost always exploring with a small enough query step that irate() is the easier way to go. Even when I'm dealing with an oversampled metric, I can sort of get away with this if I'm directly graphing the irate() result; it will just be (very) spiky, it won't be empty or clearly bogus.

PrometheusWhenNotIrate written at 01:53:32
