Wandering Thoughts


Getting a CPU utilization breakdown in Prometheus's query language, PromQL

A certain amount of Prometheus's query language is reasonably obvious, but once you start getting into the details and the clever tricks you wind up needing to wrap your mind around how PromQL wants you to think about its world. Today I want to tackle one apparently obvious thing, which is getting a graph (or numbers) of CPU utilization.

Prometheus's host agent (its 'node exporter') gives us per-CPU, per mode usage stats as a running counter of seconds in that mode (which is basically what the Linux kernel gives us). A given data point of this looks like:

node_cpu_seconds_total{cpu="1", instance="comps1:9100", job="node", mode="user"} 3632.28

Suppose that we want to know how our machine's entire CPU state breaks down over a time period. Our starting point is the rate over non-idle CPU modes:

irate(node_cpu_seconds_total {mode!="idle"} [1m])

(I'm adding some spaces here to make things wrap better here on Wandering Thoughts; in practice, it's conventional to leave them all out.)

Unfortunately this gives us the rate of individual CPUs (expressed as time in that mode per second, because rate() gives us a per second rate). No problem, let's sum that over everything but the CPUs:

sum(irate(node_cpu_seconds_total {mode!="idle"} [1m])) without (cpu)

If you do this on a busy system with multiple CPUs, you will soon observe that the numbers add up to more than 1 second. This is because we're summing over multiple CPUs; if each of them is in user mode for all of the time, the summed rate of user mode is however many CPUs we have. In order to turn this into a percentage, we need to divide by how many CPUs the machine has. We could hardcode this, but we may have different numbers of CPUs on different machines. So how do we count how many CPUs we have in a machine?

As a stand-alone expression, counting CPUs is (sort of):

count(node_cpu_seconds_total) without (cpu)

Let's break this down, since I breezed over 'without (cpu)' before. This takes our per-CPU, per-host node_cpu_seconds_total Prometheus metric, and counts up how many things there are in each distinct set of labels when you ignore the cpu label. This doesn't give us a CPU count number; instead it gives us a CPU count per CPU mode:

{instance="comps1:9100", job="node", mode="user"} 32

Fortunately this is what we want in the full expression:

(sum(irate(node_cpu_seconds_total {mode!="idle"} [1m])) without (cpu)) / count(node_cpu_seconds_total) without (cpu)

Our right side is a vector, and when you divide by vectors in PromQL, you divide by matching elements (ie, the same set of labels). On the left we have labels and values like this:

{instance="comps1:9100", job="node", mode="user"} 2.9826666666675776

And on the right we have a matching set of labels, as we saw, that gives us the number '32'. So it all works out.

In general, when you're doing this sort of cross-metric operation you need to make it so that the labels come out the same on each side. If you try too hard to turn your CPU count into a pure number, well, it can work if you get the magic right but you probably want to go at it the PromQL way and match the labels the way we have.

(I'm writing this down today because while it all seems obvious and clear to me now, that's because I've spent much of the last week immersed in Prometheus and Grafana. Once we get our entire system set up, it's quite likely that I'll not deal with Prometheus for months at a time and thus will have forgotten all of this 'obvious' stuff by the next time I have to touch something here.)

PS: The choice of irate() versus rate() is a complicated subject that requires an entry of its own. The short version is that if you are looking at statistics over a short time range with a small query step, you probably want to use irate() with a range selector that is normally a couple of times your basic sampling interval.

PrometheusCPUStats written at 21:47:03; Add Comment

How Prometheus's query steps (aka query resolution) work

Prometheus and the combination of Prometheus and Grafana have many dark corners and barely explained things that you seem to be expected to just understand. One of them is what is variously called query resolution or query steps (in, for example, the Grafana documentation for using Prometheus). Here is what I think I understand about this area, having poked at a number of things and scrutinized the documentation carefully.

In general, when you write a simple Prometheus PromQL query, it is evaluated at some point in time (normally the current instant, unless you use an offset modifier). This includes queries with range vector selectors; the range vector selector chooses how far back to go from the current instant. This is the experience you will get in Prometheus's expression browser console. However, something different happens when you want to graph something, either directly in Prometheus's expression browser or through Grafana, because in order to graph things we need multiple points spread over time, and that means we have to somehow pick which points.

In a Prometheus graphing query, there is a range of time you're covering and then there is the query step. How Prometheus appears to work is that your expression is repeatedly evaluated at instants throughout the time range, starting at the first instant of the time range and then moving forward by the query step until things end. The query step or query resolution (plus the absolute time range) determines how many points you will get back. The HTTP API documentation for range queries makes this more or less explicit in its example; in a query against a 30-second range with a query step of 15 seconds, there are three data points returned, one at the start time, one in the middle, and one at the end time.

A range query's query step is completely independent from any range durations specified in the PromQL expression it evaluates. If you have 'rate(http_requests_total[5m])', you can evaluate this at a query step of 15 seconds and Prometheus doesn't care either way. What happens is that every 15 seconds, you look back 5 minutes and take the rate between then and now. It is rather likely that this rate won't change much on a 15 second basis, so you'll probably get a smooth result. On the other hand, if you use a very large query step with this query, you may see your graphs go very jagged and spiky because you're sampling very infrequently. You may also get surprisingly jagged and staircased results if you have very small query steps.

The Prometheus expression browser's graph view will tell you the natural query step in the top right hand corner (it calls this the query resolution), and it will also let you manually set the query step without changing anything else about the query. This is convenient for getting a hang on what happens to a graph of your data as you change the resolution of a given expression. In Grafana, you have to look at the URL you can see in the editor's query inspector; you're looking for the '&step=<something>' at the end. In Grafana, the minimum step is (or can be) limited in various ways, both for the entire query (in the data source 'Options') and in the individual metrics queries ('Min step', which Grafana grumbles about in the Grafana documentation for using Prometheus).

This unfortunately means that there is no universal range duration that works across all time ranges for Prometheus graphs. Instead the range duration you want is quite dependent on both the query resolution and how frequently your data updates; roughly speaking, I think you want the maximum of the query resolution and something slightly over your metric's minimum update period. Unfortunately I don't believe you can express this in Grafana. This leaves you deciding in advance on the primary usage of your graphs, especially in Grafana; you want to decided if you are mostly going to look at large time ranges with large query steps or small time ranges with fine grained query steps.

(You can get very close to generating the maximum of two times here, but then you run aground on a combination of the limitations of Grafana's dashboard variables and what manipulations you can do in PromQL.)

(This is one of those entries that I write partly for myself in the future, where I am unfortunately probably going to need this.)

PrometheusQuerySteps written at 02:15:48; Add Comment


Some notes on Prometheus's Blackbox exporter

To make a long story short, I'm currently enthusiastically experimenting with Prometheus. As part of this I'm trying out Prometheus's support for 'black box' status and health checks, where you test services and so on from the outside (instead of the 'white box' approach of extracting health metrics from them directly). The Prometheus people don't seem to be too enthusiastic about black box metrics, so it's perhaps not surprising that the official Prometheus blackbox exporter is somewhat underdocumented and hard to understand.

The three important components in setting up blackbox metrics are targets, modules, and probers. A prober is the low level mechanism for making a check, such as making a HTTP request or a TCP connection; the very limited set of probers is built into the code of the blackbox exporter (and Prometheus is probably unenthused about adding more). A module specifies a collection of parameters for a specific prober that are used together to check a target. More than one module may use the same prober, presumably with different parameters. Modules are specified in the blackbox exporter's configuration file. Finally, a target is whatever you are checking with a module and its prober, and it comes from your Prometheus configuration.

The names of probers are set because they are built into the code of the blackbox exporter. The names of modules are arbitrary; you may call them whatever you want and find convenient. Although the official examples give modules names that are related to their prober (such as http_2xx and imap_starttls), this doesn't matter and doesn't influence the prober's behavior, such as what port the TCP prober connects to. This was quite puzzling to me for a long time because it was far from obvious where the TCP prober got the port to connect to from (and it isn't documented).

When Prometheus makes a blackbox check, the blackbox exporter is passed the module and the target in the URL of the request:


The target is the only per-target parameter that is passed in to the blackbox exporter, so everything that the prober allows you to vary or specify on a per-target basis is encoded into it (and all other prober configuration comes from the specific module you use). How things are encoded in the target and what you can put there depends on the specific prober.

For the icmp prober, the target is a host name or an IP address. No further per-target things can be provided, and the module parameters are sort of minimal too.

For the dns prober, the target is a host name or IP address of the DNS server to query, plus the port, formatted as 'host:port' (and so normally 'host:53'). What DNS query to make is set in the module's parameters, as is what reply to expect, whether to use TCP or UDP, and so on. There is no way to specify these as part of the target, so if you want to query different DNS names, you need different modules. This is not particularly scalable if you want to query the same DNS server for several names, but then I suspect that the Prometheus people would tell you to write a script for that sort of thing.

(It turns out that if you leave off ':port', it defaults to 53, but I had to read the code to find this out.)

For the http prober, the target is the full URL to be requested (or, as an undocumented feature, you can leave off the 'http://' at the start of the URL). What to require from the result is configured through the module's parameters, as is various aspects of TLS. As with DNS probes, if you want to check that some URLs return a 2xx status and some URLs redirect, you will need two separate modules. The http prober automatically choses HTTP or HTTPS based on the scheme of the target, as you'd expect, which is why a single http prober based module can be used for URLs from either scheme. Under normal circumstances, using HTTPS URLs automatically verifies the certificate chain through whatever system certificate store Go is using on your machine.

For the tcp prober, the target is the host:port to connect to. As we should expect by now, everything else is configured through the module's parameters, including everything to do with TLS; this means that unlike HTTP, you need different modules for checking non-TLS connections and TLS connections. The tcp prober lets the module control whether or not to do TLS on connection (normally with server certificate verification), and you can set up a little chat dialog to test the service you're connecting to (complete with switching to TLS at some suitable point in the dialog). Contrary to the documentation, the expect: strings in the chat dialog are regular expressions, not plain strings.

Much of Prometheus's official blackbox checking would be more flexible if you could pass optional additional parameters outside of the target; the obvious case is DNS checks.

Sidebar: Understanding blackbox relabeling

The Prometheus configuration examples for the blackbox exporter contain a great deal of magic use of relabel_configs. Perhaps what it is doing and what is required is obvious to experienced Prometheus people, but in any case I am not one right now.

The standard example from the README is:

metrics_path: /probe
  module: [AMODULE]
  - targets:
     - ATARGET
  - source_labels: [__address__]
    target_label: __param_target
  - source_labels: [__param_target]
    target_label: instance
  - target_label: __address__

The __address__ label starts out being set to ATARGET, our blackbox exporter target, because that is what we told Prometheus; if we were using an ordinary exporter, this would be the host and port that Prometheus was going to scrape. Since the blackbox exporter is special, we must instead turn it into the target= parameter of the URL we will scrape, which we do by relabeling it to __param_target. We also save it into the instance label, which will propagate through to the final metrics that come from this (so that we can later find things by our targets). Finally, we set the __address__ label to the actual host:port to scrape from, because if we didn't Prometheus wouldn't even talk to the blackbox exporter.

We also need a module= URL parameter. Here, that comes from the module: in the params section; during relabeling it is __param_module, and it will be set to AMODULE.

PrometheusBlackboxNotes written at 01:12:39; Add Comment


Thinking about what we want to be alerted about

Thinking about the broad subject of what we probably want for metrics, alerting, and so on leads pretty much straight to the obvious question of what do we want to be alerted about in the first place. It may seem peculiar to have to ask this, but we've sort of drifted into our current set of alerts and non-alerts over time (probably like many places). So our current alerts are some combination of things that seemed obvious at the time, things we added in reaction to stuff that happened, and things that haven't been annoying enough yet to turn off. This is probably not what we actually want to alert on or what we would alert on if we were starting over from scratch (which we probably are going to).

For many people the modern day answer here is pretty straightforward (eg); you alert if your user-facing service is down or significantly impaired so that it's visible to people. You may also alert on early warning signs that this is going to happen if you don't do something fairly soon. We're in an unusual environment in that we don't really run services like this, and in many cases there is nothing we can do about impaired user visible stuff. An obvious case is that if people fill up their filesystem, well, that's how much storage they bought. A less obvious case is that if our IMAP server is wallowing, it's quite possible that there's nothing wrong as such that we can fix, it's just slow because lots of people are using it.

My current thoughts are that we want to be alerted on the following things:

  • Actual detected outages, for both hosts and some readily checkable services. Almost all of our hosts are pets, so we definitely care if one goes down and we're going to actively try to bring it back up immediately.

    I expect this to be our primary source of alerts.

  • Indicators that are good signs of actual outages that we can't detect directly, such as a significant rise in machine room temperature as a good sign of AC failure or serious problems.

  • Things that very strongly presage actual outages. One example for us is /var/mail getting too full, because if it fills up entirely a lot of email stuff will stop working very well.

(I'm not sure we have very many of the latter two types.)

If we use a weak sense of 'alerting' that is more 'informing' than 'poking us to do something', there may also be a use in alerting us about things that are sufficiently crucial but that we probably can't do anything about. If the department's administrative staff fill up all of their disk space, our hands are tied but at least we can know why they're suddenly having problems. This only works if the resulting alerts are infrequent.

(One possible answer here is that we should deal with 'be informed' cases by having some dashboards instead. Then if someone reports problems, we can turn to our dashboards and say 'ah, it looks like <X> happened'.)

Detecting that hosts are down is fairly straightforward; our traditional approach is to check to see if a host pings and if it answers on port 22 with a SSH banner. Detecting when services are down is potentially quite complicated, so I suspect that we want to limit ourselves to simple checks of straightforward things that are definitely indicators of problems rather than spending a lot of effort building end to end tests or figuring out excessively clever ways of, say, checking that our DHCP server is actually giving out DHCP leases. Checking whether all of our mail handling machines respond to SMTP connections is crude, but it has the twin virtues that it's easy and if it fails, we definitely know that we have a problem.

I'm not sure if this is less or more alerts than what we've currently wound up with, and in a sense it doesn't matter. What I'm most interesting in is having a framework where we can readily answer the question 'should we alert on <X>?', or at least have a general guide for it.

(One implication of our primary source of alerts being detected outages is that status checks are probably the most important thing to have good support for in a future alert system. Another one is that we need to figure out if and how we can detect certain sorts of outages, like a NFS server stopping responding instead of just getting really slow.)

WhatToAlertUsOn written at 22:53:14; Add Comment


Thinking about what we probably want for monitoring, metrics, etc

I tweeted:

The more I poke at this, the more it feels like we should almost completely detach collecting metrics from our alerting. Most of what we want to alert on aren't metrics, and most plausible ongoing metrics aren't alertable. (Machine room temperature is a rare exception.)

Partly this is because there is almost no use in alerting on high system-level metrics that we can't do anything about. Our alertable conditions are mostly things like 'host down'.

(Yes, we are an all-pets place.)

Right now, what we have for all of this is basically a big ball of mud, hosted on an Ubuntu 14.04 machine (so we have to do something about it pretty soon). Today I wound up looking at Prometheus because it was mentioned to me that they'd written code to parse Linux's /proc/self/mountstats, and I was impressed by their 'getting started' demo, and it started thoughts circulating in my head.

Prometheus is clearly a great low-effort way to pull a bunch of system level metrics out of our machines (via their node exporter). But a significant amount of what we use for alerts with our current software for is status checks such as 'is the host responding to SSH connections', and it isn't clear that status checks fit very well into a Prometheus world. I'm sure we could make things work, but perhaps a better choice is to not try to fit a square peg into a round hole.

In contemplating this, I think we have four things all smashed together currently: metrics (how fast do IMAP commands work, what network bandwidth is one of our fileservers using), monitoring (amount of disk space used on filesystems, machine room temperature), status checks (does a host respond to SSH, is our web server answering queries), and alerting, which is mostly driven by status checks but sometimes comes from things we monitor (eg, machine room temperature). Metrics are there for their history alone; we'll never alert on them, often because there's nothing we can do about them in the first place. For monitoring we want both history and alerting, at least some of the time (although who gets the alerts varies). Our status checks are almost always there to drive alerts, and at the moment we mostly don't care about their history in that we never look at it.

(It's possible that we could capture and use some additional status information to help during investigations, to see the last captured state of things before a crash, but in practice we almost never do this with our existing status information.)

In the past when I focused my attention on this area I was purely thinking about adding metrics collection along side our existing system of alerting, status checking, monitoring, and some metrics. I don't think I had considered actively yanking alerting and status checks out from the others (for various reasons), and now it at least feels more likely that we'll do something this time around.

(Four years ago I planned to use graphite and collectd for metrics, but that never went anywhere. I don't know what we'd use today and I'm wary of becoming too entranced with Prometheus after one good early experience, although I do sort of like how straightforward it is to grab stats from hosts. Nor do I know if we want to try to connect our metrics & monitoring solution with our status checks & alerting solution. It might be better to use two completely separate systems that each focus on one aspect, even if we wind up driving a few alerts from the metrics system.)

MetricsAndAlertsForUs written at 21:09:14; Add Comment


Why I mostly don't use ed(1) for non-interactive edits in scripts

One of the things that is frequently said about ed(1) is that it remains useful for non-interactive modifications to files, for example as part of shell scripts. I even mentioned this as a good use of ed today in my entry on why ed is not a good (interactive) editor today, and I stand by that. But, well, there is a problem with using ed this way, and that problem is why I only very rarely actually use ed for scripted modifications to files.

The fundamental problem is that non-interactive editing with ed has no error handling. This is perfectly reasonable, because ed was originally written for interactive editing and in interactive editing the human behind the keyboard does the error handling, but when you apply this model to non-interactive editing it means that your stream of ed commands is essentially flying blind. If the input file is in the state that you expected it to be, all will go well. If there is something different about the input file, so that your line numbers are off, or a '/search/' address doesn't match what you expect (or perhaps at all), or any number of other things go wrong, then you can get a mess, sometimes a rapidly escalating one, and then you will get to the end of your ed commands and 'w' the resulting mess into your target file.

As a result of this, among other issues, ed tends to be my last resort for non-interactive edits in scripts. I would much rather use sed or something else that is genuinely focused on stream editing if I can, or put together some code in a language where I can include explicit error checking so I'll handle the situation where my input file is not actually the way I thought it was going to be.

(If I did this very often, I would probably dust off my Perl.)

If I was creating an ideal version of ed for non-interactive editing, I would definitely have it include some form of conditionals and 'abort with a non-zero exit status if ...' command. Perhaps you'd want to model a lot of this on what sed does here with command blocks, b, t (and T in GNU sed), and so on, but I can't help but think that there has to be a more readable and clear version with things like relatively explicit if conditions.

(I have a long standing sed script that uses some clever tricks with b and the pattern space and so on. I wrote it in sed to deliberately explore these features and it works, but it's basically a stunt and I would probably be better off if I rewrote the script in a language where the actual logic was not hiding in the middle of a Turing tarpit.)

PS: One place this comes up, or rather came up years ago and got dealt with then, is in what diff format people use for patch. In theory you can use ed scripts; in practice, everyone considers those to be too prone to problems and uses other formats. These days, about the only thing I think ed format diffs are used for is if you want to see a very compact version of the changes. Even then I'm not convinced by their merits against 'diff -u0', although we still use ed format diffs in our worklogs out of long standing habit.

Sidebar: Where you definitely need ed instead of sed

The obvious case is if you want to move text around (or copy it), especially if you need to move text backwards (to earlier in the file). As a stream editor, sed can change lines and it can move text to later in the file if you work very hard at it, but it can never move text backward. I think it's also easier to delete a variable range of lines in ed, for example 'everything from a start line up to but not including an end marker'.

Ed will also do in-place editing without the need to write to a temporary file and then shuffle the temporary file into place. I'm neutral on whether this is a feature or not, and you can certainly get ed to write your results to a new file if you want to.

EdScriptErrorProblem written at 00:18:31; Add Comment


A surprise discovery about procmail (and wondering about what next)

I've been using procmail for a very long time now, and over that time I generally haven't paid much attention to the program itself. It was there in the operating systems I used, it worked, and so everything was fine; it was just sort of there, like cat. Thus, I was rather surprised to stumble over the 2010 LWN article Reports of procmail's death are not terribly exaggerated (via, sort of via, via, via Planet Debian), which covers how procmail development and maintenance had stopped. Things don't exactly seem to have gotten more lively since 2010 (for example, the procmail domain seems to have mostly vanished, and then there's the message from Philip Guenther that's linked to from the wikipedia page). This raises a number of questions.

The obvious question is whether this even matters (as LWN notes in the original article). Procmail still works fine, and just as importantly, it's still being packaged by Debian, Ubuntu, and so on. There are outstanding Debian bugs, but Debian appears to also be fixing issues in their patches (and there's a 2017 patch in there, so it's not all old stuff). While we have quite a few users that depend a lot on procmail and we'd thus have real problems if, say, Ubuntu stopped packaging it, this doesn't appear likely to happen any time soon.

(Actually, if Ubuntu dropped procmail our answer would likely be to start building the package ourselves. It's not like it changes much.)

But, well, procmail is sort of Internet software, and I've said before that Internet software decays if not actively maintained. Knowing that procmail is only sort of being looked after does make me a little bit uncomfortable. However, this raises the question of what alternatives I (and we) would have for equivalent mail filtering systems. Many people seem to use Sieve, but I believe that has to be integrated into your MTA instead of run through a program in the way that procmail operates, and I don't think it can run external programs (which is important for some people). The closest thing to procmail that I've read about is maildrop, but it's slightly more limited than procmail in several spots and I'm not sure it could fully cover the various ways people here use procmail for spam filtering and running spam filters.

Exim itself has its own filtering system (documented here). These are more powerful than Exim-based Sieve filters (they can deliver to external programs, for example) but of course they require Exim specifically and couldn't be moved to another mailer. They're still not quite as capable as procmail; specifically Exim filters can't directly write to MH format directories (which matters to me because of how I now do a bunch of mail filtering).

We've historically declined to enable either Sieve based filtering or Exim's own filtering in our mail system on the grounds that we wanted to preserve our freedom to change mailers. In light of what I've now learned about procmail, I'm wondering if that's still the right choice. We also don't currently have maildrop installed on our central mail machine (where people already run procmail); perhaps we should change that as well, to give people the option (even if they most likely won't take it).

PS: A quick check suggests that we have around 195 people or so who are using procmail (in that they have it set up in their .forward), which is actually more than I expected. Not all of them are necessarily using our mail system much any more, though.

ProcmailWhatNext written at 01:48:09; Add Comment


Our future IPv6 access control problems due to non-DHCP6 machines

Back almost two years ago, I wrote about how I suspected a lot of IPv6 hosts wouldn't have reverse DNS because they would be using stateless address autoconfiguration (SLAAC) where they essentially assign themselves one or more random IPv6 addresses when they show up on your network. For us, this presents a problem much larger than just DNS, because control over what hosts DHCP will give addresses to (and what addresses it will assign) are how we force machines to be registered on our laptop network and our wireless network before we give them network access.

The specific driver of IPv6 SLAAC is Android devices, which don't do DHCP6 at all; unfortunately this also includes ChromeOS, which means Chromebooks. But once you enable SLAAC on your network, any number of things may decide to grab themselves SLAAC addresses and then use them, even if they also do DHCP6 and so get whatever address you give them there (this is the iOS behavior I observed a couple of years ago; I don't know how Windows, macOS, and so on behave here). If the IPv6 address and routing they get via DHCP6 doesn't seem to work, I suspect that quite a lot of devices will be perfectly happy to route via their SLAAC address and route, and if that doesn't work, well, the Android and ChromeOS devices aren't getting on the Internet.

There are a number of approaches I can think of. One possible brute force answer is to simply not do SLAAC, only DHCP6 and (IPv4) DHCP. This would mean that SLAAC-only devices would only get IPv4 addresses, but that's not likely to be a practical problem for a long time to come. I think this is our most likely short term answer, because it's the easiest approach and we can always get more complicated later. The other brute force approach is some sort of MAC filtering on our firewalls, but we use OpenBSD and my understanding is that there are a number of issues around MAC filtering in OpenBSD PF.

The officially approved answer is probably to move to IEEE 802.1X on our networks that require this sort of access control. This is infeasible for multiple reasons, including that I believe it would require a wholesale replacement of our network switches on the affected networks. For extra bonus points we don't even run much of the infrastructure that provides our wireless network, which is one of the networks we need this access control on (this is not as crazy as it sounds, but that's another entry).

All of this is yet another reason why any migration to IPv6 will be neither fast nor easy for us, and thus why we still haven't done more than vaguely look in the direction of IPv6. Someday, maybe, when IPv6 appears to actually be important for something.

(And when we do start doing IPv6, it's highly likely to start out being only for a few servers with static IP addresses. Extending it to people's own 'client' devices is likely to be one of the last things we get around to.)

(I was reminded of all of this today by cweiske's question on my old entry.)

IPv6AccessControlProblem written at 00:00:52; Add Comment


Configurations can quietly drift away from working over time, illustrated

At this point, we've been running various versions of Ubuntu LTS for over ten years. While we reinstall individual systems when we move from LTS version to LTS version, we almost never rebuild our local customizations from scratch unless we're forced to; instead we carry forward the customizations from the last LTS version, only changing what seems to need it. This is true both for the configuration of our systems and also for the configuration of things we build on top of Ubuntu, such as our user-run web servers. However, one of the hazards of carrying forward configurations for long enough is that they can silently drift away from actually working or making sense. For example, you can set (or try to set) Linux sysctls that don't exist any more and often nothing will complain loudly enough for you to notice. Today, I had an interesting illustration of how far this can go without anything obvious breaking or anyone saying anything.

For our user-run web servers, we supply a set of configurations for Apache, PHP, and MySQL that works out of the box, so users with ordinary needs don't have to muck around with that stuff themselves. Although some people customize their setups (or run web servers other than Apache), most people just use the defaults. In order to make Ubuntu version to Ubuntu version upgrades relatively transparent, most of this configuration is central and maintained by us, instead of being copied to each user's Apache configuration area and so on. This has basically worked over all of the years and all of the Ubuntu LTS versions; generally the only version to version change people have had to do in their user-run web server is to run a magic MySQL database update process. Everything else is handled by us changing the our central configurations.

(I'm quite thankful that both Apache and MySQL have 'include' directives in their configuration file formats. You may also detect that we know very little about operating MySQL.)

One of the things that we customize for user-run web server is the MySQL settings in PHP, because the stock settings are set up to try to talk to the system MySQL and we don't run a system MySQL (especially not one that people can interact with). We do this with a custom php.ini, and that php.ini is configured in the Apache configuration in a little .conf snippet. Here is the current one, faithfully carried forward from no more recently than 2009 and currently running on our Ubuntu 16.04 web server since the fall of 2016 or so:

<IfModule mod_php5.c>
  PHPIniDir conf/php.ini

Perhaps you can see the problem.

Ubuntu 16.04 doesn't ship with PHP 5 any more; it only ships with PHP 7. That makes the IfModule directive here false, which means that PHP is using its standard system Apache php.ini. For that matter, I'm not certain this directive was actually working for Ubuntu 14.04's PHP 5 either.

This means that for at least the past two years or so, people have been operating their user-run web servers without our PHP customizations that are supposed to let their PHP code automatically talk to their MySQL instances. I'm not sure that no one noticed anything but at the very least no one said anything to us about the situation, and I know that plenty of people have user-run web servers with database-driven stuff installed, such as WordPress. Apparently everyone who needed to was able to set various parameters so that they could talk to their MySQL anyway.

(This is probably not surprising, since 'configure your database settings' is likely a standard part of the install process for a lot of software. It does seem to be part of WordPress's setup, for example.)

On the one hand, that this slipped past us is a bit awkward (although understandable; it's not as if this makes PHP not load at all). On the other hand, it doesn't seem to have done any real harm and it means that we can apparently discard our entire php.ini customization scheme and make our lives simpler, since clearly it's not actually necessary in practice.

(I stumbled over this in the process of preparing our user-run webserver system for an upgrade to 18.04. How I noticed it actually involve another bit of quiet configuration drift, although that's story for another entry.)

QuietConfigurationDrift written at 22:59:54; Add Comment


Our problem with (Amanda) backups of many files, especially incrementals

Our fileserver-based filesystems have a varying number of inodes in use on them, ranging from not very many (often on filesystems with not a lot of space used) to over 5.7 million. Generally our Amanda backups have no problems handling the filesystems with not too many inodes used, even when they're quite full, but the filesystems with a lot of inodes used seem to periodically give our backups a certain amount of heartburn. This seems to be especially likely if we're doing incremental backups instead of full ones.

(We have some filesystems with 450 GB of space used in only a few hundred inodes. The filesystems with millions of inodes used tend to have a space used to inodes used ratio from around 60 KB per inode up to 200 KB or so, so they're also generally quite full, but clearly being full by itself doesn't hurt us.)

Our Amanda backups use GNU Tar to actually read the filesystem and generate the backup stream. GNU Tar works through the filesystem and thus the general Unix POSIX filesystem interface, like most backup systems, and thus necessarily has some general challenges when dealing with a lot of files, especially during incremental backups.

When you work through the filesystem, you can only back up files by opening them and you can only check if a file needs to be included in an incremental backup by stat()ing it to get its modification time and change time. Both of these activities require the Unix kernel and filesystem to have access to the file's inode; if you have a filesystem with a lot of inodes, this will generally mean reading it off the disk. On HDs, this is reasonably likely to be a seek-limited activity, although fortunately it clearly requires less than one seek per inode.

Reading files is broadly synchronous but in practice the kernel will start doing readahead for you almost immediately. Doing stat()s is equally synchronous, and then things get a bit complicated. Stat() probably doesn't have any real readahead most of the time (for ZFS there's some hand waving here because in ZFS inodes are more or less stored in files), but you also get 'over-reading' where more data than you immediately need is read into the kernel's cache, so some number of inodes around the one you wanted will be available in RAM without needing further disk fetches. Still, during incremental backups of a filesystem with a lot of files where only a few of them have changed, you're likely to spend a lot of time stat()ing files that are unchanged, one after another, with only a few switches to read()ing files. On full backups, GNU Tar is probably switching back and forth between stat() and read() as it backs up each file in turn.

(On a pragmatic level it's clear that we have more problems with incrementals than with full backups.)

I suspect that you could speed up this process somewhat by doing several stat()s in parallel (using multiple threads), but I doubt that GNU Tar is ever going to do that. Traditionally you could also often get a speedup by sorting things into order by inode number, but this may or may not work on ZFS (and GNU Tar may already be doing it). You might also get a benefit by reading in several tiny files at once in parallel, but for big files you probably might as well read them one at a time and enjoy the readahead.

I'm hoping that all of this will be much less of a concern and a problem when we move from our current fileservers to our new ones, which have local SSDs and so are going to be much less affected by a seek-heavy worklog (among other performance shifts). However this is an assumption; we might find that there are bottlenecks in surprising places in the whole chain of software and hardware involved here.

(I have been tempted to take a ZFS copy of one of our problem filesystems, put it on a test new fileserver, and see how backing it up goes. But for various reasons I haven't gone through with that yet.)

PS: Now you know why I've recently been so interested in knowing where in a directory hierarchy there were a ton of files (cf).

ManyFilesBackupProblem written at 23:07:03; Add Comment

(Previous 10 or go back to August 2018 at 2018/08/20)

Page tools: See As Normal.
Login: Password:
Atom Syndication: Recent Pages, Recent Comments.

This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.