Wandering Thoughts archives

2018-11-30

I've learned that sometimes the right way to show information is a simple one

When I started building some Grafana dashboards, I of course reached for everyone's favorite tool, the graph. And why not? Graphs are informative and beyond that, they're fun. It simply is cool to fiddle around for a bit and have a graph of your server's network bandwidth usage or disk bandwidth right there in front of you, to look at the peaks and valleys, to be able to watch a spike of activity, and so on.

For a while I made pretty much everything a graph; for things like bandwidth, this was obviously a graph of the rate. Then one day I was looking at a graph of filesystem read and write activity on one of our dashboards, with one filesystem bouncing up here and another one bouncing up there over Grafana's default six hour time window, and I found myself wondering which of these filesystems was the most active one over the entire time period. In theory the information was in the graph; in practice, it was inaccessible.

As an experiment, I added a simple bar graph of 'total volume over the time range'. It was a startling revelation. Not only did it answer my question, but suddenly things that had been buried in the graphs jumped out at me. Our web server turned out to use our old FTP area far more than I would have guessed, for example. The simple bar graph also made it much easier to confirm things that I thought I was seeing in the more complex and detailed graphs. When one filesystem looked like it was surprisingly active in the over-time graph, I could look down to the bar graph and confirm that yes, it was (and also see how much its periodic peaks of activity added up to).
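As an illustration of the sort of query involved (the metric here is a stand-in example, not what our dashboards actually use), a Grafana 'total volume over the time range' panel can be driven by something as simple as:

sum(increase(node_disk_written_bytes_total[$__range])) by (device)

where $__range is Grafana's variable for the dashboard's current time range.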

Since that first experience I have become much more appreciative of the power of simple ways to show summary information. Highly detailed graphs have an important place and they're definitely showing us things we didn't know, but simple summaries reveal things too.

(I'd love the ability to get ad-hoc simple summaries from more complex graphs. I don't need 'average bandwidth over the graph's entire time range' very often, but sometimes I'd like to have it rather than having to guess by eyeball. It's sort of a pity that you can't give Grafana graphs alternate visualizations that you can cycle through, or otherwise have two (or more) panels share the same space so you can flip between them. As it stands, we have some giant dashboards.)

SimpleGraphsAdvantage written at 01:15:42

2018-11-25

How we monitor our Prometheus setup itself

On Mastodon, I said:

When you have a new alerting and monitoring system, 'who watches the watchmen' becomes an interesting and relevant question. Especially when the watchmen have a lot of separate components and moving parts.

If we had a lot of experience with Prometheus, we probably wouldn't worry about this; we'd be able to assume that everything was just going to work reliably. But we're very new with Prometheus, and so we get to worry about its reliability in general and also the possibility that we'll quietly break something in our configuration or how we're operating things (and we have, actually). So we need to monitor Prometheus itself. If Prometheus were a monolithic system, this would probably be relatively easy, but instead our overall Prometheus environment has a bunch of separate pieces, all of which can have things go wrong.

A lot of how we're monitoring for problems is probably basically standard in Prometheus deployments (or at least standard in simple ones, like ours). The first level of monitoring and alerts is things inside Prometheus:

  • We alert on unresponsive host agents (ie, Prometheus node_exporter) as part of our general checking for and alerting on down hosts; this will catch when a configured machine doesn't have the agent installed or it hasn't been started. The one thing it won't catch is a production machine that hasn't been added to our Prometheus configuration. Unfortunately there's no good automated way in our environment to tell what is and isn't a production machine, so we're just going to have to rely on remembering to add machines to Prometheus when we put them into production.

    (This alert uses the Prometheus 'up' metric for our specific host agent job setting; there's a sketch of what rules along these lines can look like after this list.)

  • We also alert if Prometheus can't talk to a number of other metrics sources it's specifically configured to pull from, such as Grafana, Pushgateway, the Blackbox agent itself, Alertmanager, and a couple of instances of an Apache metrics exporter. This is also based on the up metric, excluding the ones for host agents and for all of our Blackbox checks (which generate up metrics themselves, which can be distinguished from regular up metrics because the Blackbox check ones have a non-empty probe label).

  • We publish some system-wide information for temperature sensor readings and global disk space usage for our NFS fileservers, so we have checks to make sure that this information is both present at all and not too old. The temperature sensor information is published through Pushgateway, so we leverage its push_time_seconds metric for the check. The disk space usage information is published in a different way, so we rely on its own 'I was created at' metric.

  • We publish various per-host information through the host agent's textfile collector, where you put files of metrics you want to publish in a specific directory, so we check to make sure that these files aren't too stale through the node_textfile_mtime_seconds metric. Because we update these files at varying intervals but don't want to have complex alerts here, we use a single measure for 'too old' and it's a quite conservative number.

    (This won't detect hosts that have never successfully published some particular piece of information at all, but I'm currently assuming this is not going to happen. Checking for it would probably be complicated, partly because we'd have to bake in knowledge about what things hosts should be publishing.)
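To make this more concrete, here is a minimal sketch of what rules along these lines can look like. The job name, staleness thresholds, and 'for' durations are illustrative assumptions, not our actual configuration:

groups:
  - name: self-monitoring
    rules:
      - alert: HostAgentDown
        expr: up{job="node"} == 0
        for: 5m
      - alert: ScrapeTargetDown
        expr: up{job!="node",probe=""} == 0
        for: 5m
      - alert: PushedMetricsStale
        expr: time() - push_time_seconds > 2 * 3600
      - alert: TextfileMetricsStale
        expr: time() - node_textfile_mtime_seconds > 2 * 3600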

All of these alerts require their own custom and somewhat ad-hoc rules. In general writing all of these checks feels like a bit of a slog; you have to think about what could go wrong, and then how you could check for it, and then write out the actual alert rule necessary. I was sort of tempted to skip writing the last two sets of alerts, but we've actually quietly broken both the global disk space usage and the per-host information publication at various times.

(In fact I found out that some hosts weren't updating some information by testing my alert rule expression in Prometheus. I did a topk() query on it and then went 'hey, some of these numbers are really much larger than they should be'.)

This leaves checking Prometheus itself, and also a useful check on Alertmanager (because if Alertmanager is down, Prometheus can't send out the alert it detects). In some places the answer to this would be a second Prometheus instance that cross-checks the first and a pair of Alertmanagers that both of them talk to and that coordinate with each other through their gossip protocol. However, this is a bit complicated for us, so my current answer is to have a cron job that tries to ask Prometheus for the status of Alertmanager. If Prometheus answers and says Alertmanager is up, we conclude that we're fine; otherwise, we have a problem somewhere. The cron job currently runs on our central mail server so that it depends on the fewest other parts of our infrastructure still working.

(Mechanically this uses curl to make the query through Prometheus's HTTP API and then jq to extract things from the answer.)
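As a hedged sketch of what this can boil down to (the host names, job label, and email address here are hypothetical, not our actual script), such a cron job might be roughly:

#!/bin/sh
# Ask Prometheus whether its Alertmanager scrape target is currently up.
ans=$(curl -s -G --max-time 30 \
        --data-urlencode 'query=up{job="alertmanager"}' \
        http://prometheus.example.org:9090/api/v1/query |
      jq -r '.data.result[0].value[1]' 2>/dev/null)
if [ "$ans" != "1" ]; then
  # Either Prometheus didn't answer or it says Alertmanager is down.
  echo "Prometheus or Alertmanager looks down (got: ${ans:-no answer})" |
    mail -s "monitoring self-check failed" sysadmins@example.org
fi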

We don't currently have any checks to make sure that Alertmanager can actually send alerts successfully. I'm not sure how we'd craft those, because I'm not sure Alertmanager exposes the necessary metrics. Probably we should try to write some alerts in Prometheus and then have a cron job that queries Prometheus to see if the alerts are currently active.

(Alertmanager exposes a count of successful and failed deliveries for the various delivery methods, such as 'email', but you can't find out when the last successful or failed notification was for one, or whether specific receivers succeeded or failed in some or all of their notifications. There's also no metrics exposed for potential problems like 'template expansion failure', which can happen if you have an error somewhere in one of your templates. If the error is in a rarely used conditional portion of a template, you might not trip over it for a while.)

PrometheusSelfMonitoring written at 01:46:43

2018-11-20

When Prometheus Alertmanager will tell you about resolved alerts

We've configured Prometheus to notify us when our alerts clear up. Recently, we got such a notification email and its timing attracted my attention because while the alert in it had a listed end time of 9:43:55, the actual email was only sent at 9:45:40, which is not entirely timely. This got me curious about what affects how soon you get notified about alerts clearing in Prometheus, which led to two significant surprises. The overall summary is that the current version of Alertmanager can't give you prompt notifications of resolved alerts, and on top of that right now Alertmanager's 'ended-at' times are generally going to be late by as much as several minutes.

Update: It turns out that the second issue is fixed in the recently released Alertmanager 0.15.3. You probably want to upgrade to it.

Alertmanager's relatively minimal documentation describes the group_interval setting as follows:

How long to wait before sending a notification about new alerts that are added to a group of alerts for which an initial notification has already been sent. (Usually ~5m or more.)

If you read this documentation, you would expect that this is a 'no more often than' interval. By this I mean that you have to wait at least group_interval before Alertmanager will send another notification, but that once this time has elapsed, Alertmanager will immediately send a new alert out. This turns out to not be the case. Instead, this is a tick interval; Alertmanager only sends a new notification every group_interval (if there is anything new). For example, if you have a group_interval of 10m and a new alert in the group shows up 11 minutes after the first notification, Alertmanager will not send out a notification until the 20 minute mark, which is the next 'tick' of 'every 10 minutes'. Resolved alerts are treated just the same as new alerts here.

This makes a certain amount of sense in Alertmanager's apparent model of the alerting world, but it's quite possibly not what you either expect or want. Certainly it's not what we want, and it's probably going to cause us to change our group_interval settings to basically be the same as (or shorter than) our group_wait settings.

For the rest of this, I'm going to go through the flow of what happens when an alert ends, as far as I can tell.

When a Prometheus alert triggers and enters the firing state, Prometheus sends the alert to Alertmanager. As covered in Alertmanager's API documentation, Prometheus will then re-send the alert every so often. At the moment, Prometheus throttles itself to re-sending a particular alert only once a minute, although you can change this with a command line option. Re-sending an alert doesn't change the labels (one hopes), but it can change the annotations; Prometheus will re-create them every time (as part of re-evaluating the alert rule) and Alertmanager always uses the most recently received annotations if it needs to generate a new notification.

(Of course this doesn't matter if your annotations are constant. Our annotations can include the current state of affairs, which means that a succession of alert notifications can have different annotation text as, say, the reported machine room temperature fluctuates.)

When a firing alert's rule condition stops being true, Prometheus doesn't just drop and delete the alert (although this is what it looks like in the web UI). Instead, it sets the rule's 'ResolvedAt' time, switches it to a state that Prometheus internally labels as 'inactive', and keeps it around for another 15 minutes. During these 15 minutes, Prometheus will continue to send it to Alertmanager, which is how Alertmanager theoretically learns that it's been resolved. The first (re-)send after an alert has been resolved is not subject to the regular 60 second minimum re-send interval; it happens immediately, so in theory Alertmanager should immediately know that the alert is resolved. As a side note, the annotations on a resolved alert will be the annotations from the last pre-resolution version of it.

(It turns out that Prometheus always sends alerts to Alertmanager with an 'EndsAt' time filled in. If the rule has been resolved, this is the 'ResolvedAt' time; if it hasn't been resolved, the 'EndsAt' is an internally calculated timeout that's some distance into the future. This may be relevant for alert notification templates. This also appears to mean that Alertmanager's resolve_timeout setting is unused, because the code makes it seem like it's only used for alerts without their own EndsAt time.)

Then we run into an Alertmanager issue that is probably fixed by this recent commit, where the actual resolved alert that Prometheus sent to Alertmanager effectively gets ignored and Alertmanager fails to notice that the alert is actually resolved. Instead, the alert has to reach its earlier expiry time before it becomes resolved, which is generally going to be about three minutes from the last time Prometheus sent the still-firing alert to Alertmanager. In turn, that time may be anywhere up to nearly 60 seconds before Prometheus decided that the alert was resolved.

(Prometheus will often evaluate the alert rule more often than once every 60 seconds, but if the alert rule is still true, it will only send that result to Alertmanager once every 60 seconds.)

This Alertmanager delay in recognizing that the alert is resolved combines in an unfortunate way with the meaning of group_interval, because it can make you miss the group_interval 'tick' and then have your 'alert is resolved' notification delayed until the next tick, however many minutes away it is. To minimize this time, you need to reduce group_interval down to whatever you can tolerate and then set --rules.alert.resend-delay down relatively low, say 20 or 30 seconds. With a 20 second resend delay, the expiry timeout is only a minute, which means a bit less than a minute's delay at most before Alertmanager notices that your alert has resolved.

(You also need your evaluation_interval to be no more than your resend delay.)
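As a hedged illustration of where these knobs live (the specific values are examples, not recommendations), this works out to something like:

# prometheus.yml
global:
  evaluation_interval: 15s
# plus starting Prometheus with --rules.alert.resend-delay=20s

# alertmanager.yml
route:
  group_wait: 30s
  group_interval: 30s
  repeat_interval: 4h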

(When the next Alertmanager release comes out with this bug fixed, you can stop setting your resend delay but you'll still want to have a low group_interval. It seems unlikely that its behavior will change, given the discussion in issue 1510.)

PS: If you want a full dump of what Alertmanager thinks of your current alerts, the magic trick is:

curl -s http://localhost:9093/api/v1/alerts | jq .

This gives you a bunch more information than Alertmanager's web UI will show. It does exclude alerts that Alertmanager thinks are resolved but hasn't deleted yet, though.

PrometheusAlertsClearingTime written at 01:31:20

2018-11-13

Our pragmatic attachment to OpenBSD PF for our firewall needs

Today on Twitter, I asked:

#Sysadmin people: does anyone have good approaches for a high-performance 10G OpenBSD firewall (bridging or routing)? Is the best you can do still 'throw the fastest single-core CPU you can find at it'?

A number of people made the reasonable suggestion of looking into FreeBSD or Linux instead of OpenBSD for our 10G Ethernet firewall needs. We have done some investigation of this (and certainly our Linux machines have no problem with 10G wire speeds, even with light firewall rules in place) but it's not a very attractive solution. The problem is that we're very attached to OpenBSD PF for pragmatic reasons.

At this point, we've been using OpenBSD based firewalls with PF for fifteen years or more. In the process we've built up a bunch of familiarity with the quirks of OpenBSD and of PF, but more importantly we've ended up with thousands of lines of PF rulesets, some in relatively complicated firewall configurations, and all of which are only documented implicitly in the PF rules themselves because, well, that's what we wrote our firewall rules in.

Moving to anything other than OpenBSD PF means both learning a new rule language and translating our current firewall rulesets to that language. We'd need to do this for at least the firewalls that need to migrate to 10G (one of which is our most complicated firewall), and we'd probably want to eventually do it for all firewalls, just so that we didn't have to maintain expertise in two different firewall languages and environments. We can do this if we have to, but we would very much rather not; OpenBSD works well for us in our environment and we have a solid, reliable setup (including pfsync).

(We don't use CARP, but we count on pfsync to maintain hot spare firewalls in a 'ready to be made live' state. Having pfsync has made shifting between live and hot spare firewalls into something that users barely notice, where in the old pre-pfsync days a firewall shift required scheduled downtimes because it broke everyone's connections. One reason we shift between live and hot spare firewalls is if we think the live firewall needs a reboot or some hardware work.)

We also genuinely like PF; it seems to operate at about the right level of abstraction for what we want to do, and we rarely find ourselves annoyed at it. We would probably not be enthused about trying to move to something that was either significantly higher level or significantly lower level. And, barring our issues with getting decent 10G performance, OpenBSD PF has performed well and been extremely solid for us; our firewalls are routinely up for more than a year and generally we don't have to think about them. Anything that proposes to supplant OpenBSD in a firewall role here has some quite demanding standards to live up to.

PS: For our purposes, FreeBSD PF is a different thing than OpenBSD PF because it hasn't picked up the OpenBSD features and syntax changes since 4.5, and we use any number of those in our PF rules (you have to, since OpenBSD loves changing PF syntax). Regardless of how well FreeBSD PF works and how broadly familiar it would be, we'd have to translate our existing rulesets from OpenBSD PF to FreeBSD PF. This might be easier than translating them to anything else, but it would still be a non-trivial translation step (with a non-trivial requirement for testing the result).

OpenBSDPFAttachment written at 23:48:51

2018-11-11

Easy configuration for lots of Prometheus Blackbox checks

Suppose, not entirely hypothetically, that you want to do a lot of Prometheus Blackbox checks, and worse, these are all sorts of different checks (not just the same check against a lot of different hosts). Since the only way to specify a lot of Blackbox check parameters is with different Blackbox modules, this means that you need a bunch of different Blackbox modules. The examples of configuring Prometheus Blackbox probes that you'll find online all set the Blackbox module as part of the scrape configuration; for example, straight from the Blackbox README, we have this in their example:

- job_name: 'blackbox'
  metrics_path: /probe
  params:
    module: [http_2xx]
  [...]

You can do this for each of the separate modules you need to use, but that means many separate scrape configurations and for each separate scrape configuration you're going to need those standard seven lines of relabeling configuration. This is annoying and verbose, and it doesn't take too many of these before your Prometheus configuration file is so overgrown with many Blackbox scrapes that it's hard to see anything else.

(It would be great if Prometheus could somehow macro-ize these or include them from a separate file or otherwise avoid repeating everything for each scrape configuration, but so far, no such luck. You can't even move some of your scrape configurations into a separate included file; they all have to go in the main prometheus.yml.)
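For reference, the Blackbox modules themselves are defined in the exporter's own configuration file. A minimal sketch of modules like the ones used in the examples below might look like this (the probe options here are illustrative, not our real settings):

modules:
  http_2xx:
    prober: http
    timeout: 10s
  ssh_banner:
    prober: tcp
    timeout: 10s
    tcp:
      query_response:
        - expect: "^SSH-2.0-"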

Fortunately, with some cleverness in our relabeling configuration we can actually embed the name of the module we want to use into our Blackbox target specification, letting us use one Blackbox scrape configuration for a whole bunch of different modules. The trick is that what's necessary for Blackbox checks is that by the end of setting up a particular scrape, the module parameter is in the __param_module label. Normally it winds up there because we set it in the param section of the scrape configuration, but we can also explicitly put it there through relabeling (just as we set __address__ by hand through relabeling).

So, let's start with nominal declared targets that look like this:

- ssh_banner,somehost:25
- http_2xx,http://somewhere/url

This encodes the Blackbox module before the comma and the actual Blackbox target after it (you can use any suitable separator; I picked comma for how it looks).

Our first job with relabeling is to split this apart into the module and target URL parameters, which are the magic __param_module and __param_target labels:

relabel_configs:
  - source_labels: [__address__]
    regex: ([^,]*),(.*)
    replacement: $1
    target_label: __param_module
  - source_labels: [__address__]
    regex: ([^,]*),(.*)
    replacement: $2
    target_label: __param_target

(It's a pity that there's no way to do multiple targets and replacements in one rule, or we could make this much more compact. But I'm probably far from the first person to observe that Prometheus relabeling configurations are very verbose. Presumably Prometheus people don't expect you to be doing very much of it.)

Since we're doing all of our Blackbox checks through a single scrape configuration, we won't normally be able to easily tell which module (and thus which check) failed. To make life easier, we explicitly save the Blackbox module as a new label, which I've called probe:

  - source_labels: [__param_module]
    target_label: probe

Now the rest of our relabeling is essentially standard; we save the Blackbox target as the instance label and set the actual address of our Blackbox exporter:

  - source_labels: [__param_target]
    target_label: instance
  - target_label: __address__
    replacement: 127.0.0.1:9115
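Putting all of the pieces together, the whole scrape configuration ends up looking something like the following sketch (the job name is an arbitrary choice for the example):

- job_name: 'blackbox-misc'
  metrics_path: /probe
  static_configs:
    - targets:
        - 'ssh_banner,somehost:25'
        - 'http_2xx,http://somewhere/url'
  relabel_configs:
    - source_labels: [__address__]
      regex: ([^,]*),(.*)
      replacement: $1
      target_label: __param_module
    - source_labels: [__address__]
      regex: ([^,]*),(.*)
      replacement: $2
      target_label: __param_target
    - source_labels: [__param_module]
      target_label: probe
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: 127.0.0.1:9115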

All of this works fine, but there turns out to be one drawback of putting all or a lot of your blackbox checks in a single scrape configuration, which is that you can't set the Blackbox check interval on a per-target or per-module basis. If you need or want to vary the check interval for different checks (ie, different Blackbox modules) or even different targets, you'll need to use separate scrape configurations, even with all of the extra verbosity that that requires.

(As you might suspect, I've decided that I'm mostly fine with a lot of our Blackbox checks having the same frequency. I did pull ICMP ping checks out into a separate scrape configuration so that we can do them a lot more frequently.)

PS: If you wanted to, you could go further than this in relabeling; for instance, you could automatically add the :25 port specification on the end of hostnames for SSH banner checks. But it's my view that there's a relatively low limit on how much of this sort of rewriting one should do. Rewriting to avoid having a massive prometheus.yml is within my comfort limit here; rewriting just to avoid putting a ':25' on hostnames is not. There is real merit to being straightforward and sticking as close to normal Prometheus practice as possible, without extra magic.

(I think that the 'module,real-target' format of target names I've adopted here is relatively easy to see and understand even if you don't know how it works, but I'm biased and may be wrong.)

PrometheusBlackboxBulkChecks written at 22:35:04

2018-11-10

Why Prometheus turns out not to be our ideal alerting system

What we want out of an alert system is relatively straightforward (and was probably once typical for sysadmins who ran machines). We would like to get notified once and only once for any new alert that shows up (and for some of them to get notified again when they go away), and we'd also like these alerts to be aggregated together to some degree so we aren't spammed to death if a lot of things go wrong at once.

(It would be ideal if the degree of aggregation was something we could control on the fly. If only a few machines have problems we probably want to get separate emails about each machine, but if a whole bunch of machines all suddenly have problems, please, just send us one email with everything.)

Unfortunately Prometheus doesn't do this, because its Alertmanager has a fundamentally different model of how alert notification should work. Alertmanager's core model is that instead of sending you new alerts, it will send you the entire current state of alerts any time that state changes. So, if you group alerts together and initially there are two alerts in a group and then a third shows up later, Alertmanager will first notify you about the initial two alerts and then later re-notify you with all three alerts. If one of the three alerts clears and you've asked to be notified about cleared alerts, you'll get another notification that lists the now-cleared alert and the two alerts that are still active. And so on.

(One way to put this is to say that Alertmanager is sort of level triggered instead of edge triggered.)

This is not a silly or stupid thing for Alertmanager to do, and it has some advantages; for instance, it means that you only need to read the most recent notification to get a full picture of everything that's currently wrong. But it also means that if you have an escalating situation, you may need to carefully read all of the alerts in each new notification to realize this, and in general you risk alert fatigue if you have a lot of alerts that are grouped together; sooner or later the long list of alerts is just going to blur together. Unfortunately this describes our situation, especially if we try to group things together broadly.

(Alertmanager also sort of assumes other things, for example that you have a 24/7 operations team who deal with issues immediately. If you always deal with issues when they come up, you don't need to hear about an alert clearing because you almost certainly caused that and if you didn't, you can see the new state on your dashboards. We're not on call 24/7 and even when we're around we don't necessarily react immediately, so it's quite possible for things to happen and then clear up without us even looking at anything. Hence our desire to hear about cleared alerts, which is not the Alertmanager default.)

I consider this an unfortunate limitation in Alertmanager. Alertmanager internally knows what alerts are new and changed (since that's part of what drives it to send new notifications), but it doesn't expose this anywhere that you can get at it, even in templating. However I suspect that the Prometheus people wouldn't be interested in changing this, since I expect that distinguishing between new and old alerts doesn't fit their model of how alerting should be done.

On a broader level, we're trying to push a round solution into a square hole and this is one of the resulting problems. Prometheus's documentation is explicit about the philosophy of alerting that it assumes; basically it wants you to have only a few alerts, based on user-visible symptoms. Because we look after physical hosts instead of services (and to the extent that we have services we have a fair number of them), we have a lot of potential alerts about a lot of potential situations.

(Many of these situations are user visible, simply because users can see into a lot of our environment. Users will notice if any particular general access login or compute server goes down, for example, so we have to know about it too.)

Our current solution is to make do. By grouping alerts only on a per-host basis, we hope to keep the 'repeated alerts in new notifications' problem down to a level where we probably won't miss significant new problems, and we have some hacks to create one-time notifications (basically, we make sure that some alerts just can't group together with anything else, which is more work than you'd think).
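As an illustration only, per-host grouping in Alertmanager looks something like the following sketch; the receiver name and the label we group on are assumptions for the example, not our actual configuration:

route:
  receiver: sysadmins-email
  # group alerts by the host they're about, so each host gets its own notifications
  group_by: ['host']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 8h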

(It's my view that using Alertmanager to inhibit 'less severe' alerts in favour of more severe ones is not a useful answer for us for various reasons beyond the scope of this entry. Part of it is that I think maintaining suitable inhibition rules would take a significant amount of care in both the Alertmanager configuration and the Prometheus alert generation, because Alertmanager doesn't give you very much power for specifying what inhibits what.)

Sidebar: Why we're using Prometheus for alerting despite this

Basically, we don't want to run a second system just for alerting unless we really have to, especially since a certain number of alerts are naturally driven from information that Prometheus is collecting for metrics purposes. If we can make Prometheus work for alerting and it's not too bad, we're willing to live with the issues (at least so far).

PrometheusAlertsProblem written at 23:35:56

2018-11-09

Getting CPU utilization breakdowns efficiently in Prometheus

I wrote before about getting a CPU utilization breakdown in Prometheus, where I detailed building up a query that would give us a correct 0.0 to 1.0 CPU utilization breakdown. The eventual query is:

(sum(irate(node_cpu_seconds_total {mode!="idle"} [1m])) without (cpu)) / count(node_cpu_seconds_total) without (cpu)

(As far as using irate() here goes, see rate() versus irate().)

This is a beautiful and correct query, but as it turns out you may not want to actually use it. The problem is that in practice, it's also an expensive query when evaluated over a sufficient range, especially if you're using some version of it for multiple machines in the same graph or Grafana dashboard. In some reasonably common cases, I saw Prometheus query durations of over a second for our setup. Once I realized how slow this was, I decided to try to do better.

The obvious way to speed up this query is to precompute the number that's essentially a constant, namely the number of CPUs (the thing we're dividing by). To make my life simpler, I opted to compute this so that we get a separate metric for each mode, so we don't have to use group_left in the actual query. The recording rule we use is:

- record: instance_mode:node_cpus:count
  expr: count(node_cpu_seconds_total) without (cpu)

(The name of this recording rule metric is probably questionable, but I don't understand the best practices suggestions here.)
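With this recording rule in place, the utilization query can divide by the precomputed count instead; roughly:

(sum(irate(node_cpu_seconds_total {mode!="idle"} [1m])) without (cpu)) / instance_mode:node_cpus:count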

This cuts out a significant amount of the query cost (anywhere from one half to two thirds or so in some of my tests), but I was still left with some relatively expensive versions of this query (for instance, one of our dashboards wants to display the amount of non-idle CPU utilization across all of our machines). To do better, I decided to try to pre-compute the sum() of the CPU modes across all CPUs, with this recording rule:

- record: instance_mode:node_cpu_seconds_total:sum
  expr: sum(node_cpu_seconds_total) without (cpu)
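Using both recording rules together, the query becomes, roughly:

irate(instance_mode:node_cpu_seconds_total:sum {mode!="idle"} [1m]) / instance_mode:node_cpus:count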

In theory this should provide basically the same result with a clear saving in Prometheus query evaluation time. In practice this mostly works but occasionally there are some anomalies that I don't understand, where a rate() or irate() of this will exceed 100% (ie, will return a result greater than the number of CPUs in the machine). These excessive results are infrequent and you do save a significant amount of Prometheus query time, which means that there's a tradeoff to be made here; do you live with the possibility of rare weird readings in order to get efficient general trends and overviews, or do you go for complete correctness even at the price of higher CPU costs (and graphs that take a bit of time to refresh or generate themselves)?

(If you know that you want a particular resolution of rate() a lot, you can pre-compute that (or pre-compute an irate()). But you have to know the resolution, or know that you want irate(), and you may not, especially if you're using Grafana and its magic $__interval template variable.)

I've been going back and forth on this question since I discovered this issue. Right now my answer is that I'm defaulting to correct results even at more CPU cost unless the CPU cost becomes a real, clear problem. But we have the luxury that our dashboards aren't likely to be used very much.

Sidebar: Why I think the sum() in this recording rule is okay

The documentation for both rate() and irate() tells you to always take the rate() or irate() before sum()'ing, in order to detect counter resets. However, in this case all of our counters are tied together; all CPU usage counters for a host will reset at the same time, when the host reboots, and so rate() should still see that reset even over a sum().

(And the anomalies I've seen have been over time ranges where the hosts involved haven't been rebooting.)

I have two wild theories for why I'm seeing problems with this recording rule. First, it could be that the recording rule is summing over a non-coherent set of metric points, where the node_cpu_seconds_total values for some CPUs come from one Prometheus scrape and others come from some other scrape (although one would hope that metrics from a single scrape appear all at once, atomically). Second, perhaps the recording rule is being evaluated twice against the same metric points from the same scrape, because it is just out of synchronization with a slow scrape of a particular node_exporter. This would result in a flat result for one point of the recording rule and then a doubled result for another one, where the computed result actually covers more time than we expect.

(Figuring out which it is is probably possible through dedicated extraction and processing of raw metric points from the Prometheus API, but I lack the patience or the interest to do this at the moment. My guess is currently the second theory, partly based on some experimentation with changes().)

PrometheusCPUStatsII written at 23:40:14

2018-11-08

The future of our homedir-based mail server system design

In a comment on my entry on our self-serve system for autoreplies, I was asked a very good question:

Do you think it will ever be possible for you to move to a non-homedir-based mail server at all?

My answer is that I no longer think it would be a good thing for us to move to a non-homedir-based mail system.

Most mail systems segregate all mail storage and mail processing away from regular user files. As the commentator is noting, our mail system doesn't work this way and instead does things like allow users to have .forward files in their home directories. Sometimes this causes us difficulties, and so the question here is a sensible one. In the past I would have told you that we would eventually have to move to an IMAP-only environment where mail handling and storage was completely segregated from user home directories. Today I've changed my mind; I now think that we should not move mail out of people's home directories. In fact I would like more of their mail to live under their home directories; in an IMAP-only environment, I would like to put people's INBOXes somewhere in there too, instead of in the shared /var/mail filesystem that we have today.

The reasons for this ultimately come down to that eternal issue of storage allocation, plus the fact that any number of our users have quite a lot of email (gigabytes and gigabytes of it). No matter what you do about email, it has to live somewhere, and someone has to pay for the storage space, the backups, and so on. In our environment, how we allocate storage in general is that people get however much disk space they're willing to pay for. There are various obvious good reasons to stick with this for mail storage space, and once we're doing that there are good reasons to stick with our standard model of providing disk space when providing mail folder space, including that we already have an entire collection of systems for managing it, backing it up, and so on. Since we use ZFS pools, in theory this mail storage space doesn't have to go in people's home directories; we could make separate 'mail storage' filesystems for every person and every group. In practice, we already have home directory filesystems.

It's possible that IMAP-only mail storage in some dedicated format would be more efficient and faster than what we currently have (most people keep their mail in mbox format files). In practice we don't have a mail environment that is that demanding (and since we're moving to an all-SSD setup for all storage, our IO rates are about to get much better anyway).

As far as things like .forwards go, my view is that this is a pragmatic tradeoff. Supporting .forwards ties our hands to some degree, but it also means that we don't have to build and commit to a user accessible server side mail filtering system, with all of the many questions it would raise. As with mail folder storage, using people's home directories and all of the regular Unix infrastructure is the easiest and most obvious approach.

PS: Our Unix-focused approach is not one I would necessarily recommend for other environments. It works here for various reasons, including that we already have general Unix login servers for people to use and that we don't have that many users.

MailAndHomedirs written at 23:54:27

2018-11-06

Our self-serve system for 'vacation' autoreplies and its surprising advantage

In the old days, it was just broadly and tacitly assumed that everyone around here could and would learn (and use) Unix to get things done in our environment, and so we could provide services purely through traditional Unix command line things. This has been less and less true for years, and so there has been a slow drive to provide access to various services in ways that don't require logging in and using a shell. One of the traditional painful experiences for people was setting up a vacation or out-of-office autoreply for email, because the traditional Unix way of doing this involves editing your .forward to stick in a magic vacation command, or going even further to use procmail to filter some things out before you run vacation. This was never very popular with our users, and the Points of Contact spent a certain amount of their time helping people deal with the whole mess.

So, many years ago (it turns out to be just over a decade), we added a special feature to our Exim configuration to handle this automatically. If you had both $HOME/.vacation.msg (with your vacation message itself) and the magic flag file $HOME/VACATION-ACTIVE, our central Exim mail server took care of generating the autoreply for you. If you removed VACATION-ACTIVE, this stopped. Later we added a little web CGI frontend so that you could write your vacation message in a HTML text box and toggle things on and off without ever having to touch the Unix command line.

Perhaps unsurprisingly, this feature has turned out to be quite popular. Almost no one runs vacation from their .forward files any more (either directly or through procmail); pretty much everyone uses the VACATION-ACTIVE magic, even people who are perfectly familiar with Unix (and I can't blame them, it's a lot easier than fiddling around with .forward).

It turns out that there's another advantage, one that we hadn't really noticed until recently, which is that how autoreplies are generated is under our full control. The program or script that gets run, the arguments it gets run with, and any filtering it does on what gets autoreplied to are all ours to both decide and change as we need to, because the users just tell us their intention (by creating the files) instead of specifying the mechanics, the way they would be doing with .forward files. We don't even provide an officially supported mechanism for controlling how frequently people get autoreplies; instead it's officially just some appropriate interval.

(For a long time this was not something we needed and so we never really noticed it. For reasons beyond the scope of this entry, changing parts of how our autoreply system works has become necessary, so I'm suddenly very happy about this advantage and how many people use this system. The few people who still use vacation through .forwards are going to be harder to deal with.)

Sidebar: The mechanics of this in Exim

The important mechanics are an Exim router and a transport to go with it. The guts of the router are:

user_vacation:
  [...]
  driver = accept
  transport = vacation_delivery
  check_local_user
  require_files = $local_part:$home/VACATION-ACTIVE:$home/.vacation.msg
  unseen = true
  [...]

The important bit is the require_files directive, which checks for them (as the user, because of NFS permissions issues).

The core of the transport is just:

vacation_delivery:
  driver = pipe
  [...]
  command = /our/vacation/command $local_part
  [...]

(I have left out various less important and more obvious directives.)

If you do this, you'll need to decide whether to do your filtering on what email doesn't get autoreplies in Exim, in the vacation command you run (possibly with the help of information passed from Exim to it in environment variables), or both.

OurSelfserveAutoreplies written at 23:32:09

2018-11-05

rate() versus irate() in Prometheus (and Grafana)

Prometheus's PromQL query language has two quite similar functions for calculating the rate of things (well, of counters), rate() and irate(). When I was starting out writing PromQL things, I found people singing the praises of each of them and sometimes suggesting that you avoid the other as misleading. In particular, it's often said that you should use irate() lest you miss brief activity spikes, which is both true and not true depending on how exactly you use irate(). To explain that, I need to start with what these two functions do and go on to the corollaries.

rate() is the simpler function to describe. Ignoring things like counter resets, rate() gives you the per second average rate of change over your range interval by using the first and the last metric point in it (whatever they are, and whatever their timestamps are). Since it is the average over the entire range, it necessarily smooths out any sudden spikes; the only things that matter are the start and the end values and the time range between them.

As for irate(), I'll quote straight from the documentation:

[irate()] calculates the per-second instant rate of increase of the time series in the range vector. This is based on the last two data points.

(Emphasis mine.)

In other words, irate() is the per second rate of change at the end of your range interval. Everything else in your range interval is ignored (if there are more than two data points in it).

There are some immediate corollaries that follow from this. First, there's no point in giving irate() a particularly large range interval; all you need is one that's large enough to insure that it has two points in it. You'll need this to be somewhat larger than twice your scrape interval, but there's little point in going further than three or four times that. Second, irate() is not going to be very useful on metrics that change less often than you scrape them, because some of the time the last two points that irate() will use are the same metric update, just scraped twice, and irate() will therefore report no change.

(Consider, for example, a metric that you expose through the node exporter's textfile collector and generate once a minute from cron, while your Prometheus configuration scrapes the node exporter itself once every fifteen seconds. Three out of every four metric points collected by Prometheus are actually the same thing; only every fourth one represents a genuine new data point. Similar things can happen with metrics scraped from Pushgateway.)

Obviously, the difference between rate() and irate() basically vanish when the range interval gets sufficiently small. If your range interval for rate() only includes two metric points, it's just the same as irate(). However, it's easier to make irate() reliable at small range intervals; if you use rate(), it may be chancy to insure that your range interval always has two and only two points. irate() automatically arranges that for you by how it works.
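As a concrete illustration with a stand-in metric (network receive bytes from the host agent), the two might be used like this, with the range intervals being illustrative choices:

rate(node_network_receive_bytes_total[5m])

irate(node_network_receive_bytes_total[1m])

The first is the average receive bandwidth over the last five minutes at each point; the second is the rate between the last two scrapes inside the one-minute window, ignoring everything before them.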

Often when we use either rate() or irate(), we want to graph the result. Graphing means moving through time with query steps and that means we get into interactions between the query step and both the range interval and the function you're using. In particular, as the query step grows large enough, irate() will miss increasingly large amounts of changes. This is because it is the instant rate of change at the end of your range interval (using the two last metric points). When the query steps include more than two points in each interval, you lose the information from those extra points. As an extreme example, imagine a query step of five minutes and a metric that updates every thirty seconds. If you use irate(), you're only seeing the last minute out of every five minute slice of time; you have no idea of what happened in the other four minutes, including if there was an activity spike in them. If you use rate() instead, you can at least have some visibility into the total changes across those five minutes even if you don't capture any short term activity spikes.

There are two unfortunate corollaries of this for graphing. First, the range interval you likely want to use for rate() depends partly on your query step. For many graphs, there is no fixed one size fits all range interval for rate(). Depending on what you want to see, you might want a rate() range interval of either the query step (which will give you the rate() of disjoint sets of metric points) or somewhat more than the query step (which will give you somewhat overlapping metric points, and perhaps smooth things out). The second is that whether you want to use rate() or irate() depends on whether your query step is small or large. With a small query step and thus a high level of detail, you probably want to use irate() with a range interval that's large enough to cover three or four metric points. But for large query steps and only broad details, irate() will throw away huge amounts of information and you need to use rate() instead.

This is where we come to Grafana. Current versions of Grafana offer a $__interval Grafana templating variable that is the current query step, and you can plug that into a rate() expression. However, Grafana offers no way to switch between rate() and irate() depending on the size of this interval, and it also ties together the minimum interval and the minimum query step (as I sort of grumbled about in this entry). This makes Grafana more flexible than basic interactive Prometheus usage, since you can make your rate() range intervals auto-adjust to fit the query steps for your current graphs and their time ranges. However you can't entirely get a single graph that will show you both fine details (complete with momentary spikes) over small time ranges and broad details over large time ranges.
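For example, a hedged sketch of the sort of Grafana panel expression involved (again with a stand-in metric) is:

rate(node_network_receive_bytes_total[$__interval])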

(People who are exceedingly clever might be able to do something with Grafana's $__interval_ms and multiple metric calculations.)

For our purposes, I think we want to use rate() instead of irate() in our regular Grafana dashboard graphs because we're more likely to be looking at broad overviews with fairly wide timescales. If we're looking at something in fine detail, we can turn to Prometheus and by-hand irate() based queries and graphs.

PS: Some of this is grumbling. For many practical purposes, a Grafana graph with a rate() range interval and query step that's down to twice your Prometheus metric update interval is basically as good as an irate() based graph at your metric update interval. And if you regularly need a Grafana dashboard for fine-grained examination of problems, you can build one that specifically uses irate() in places where it's actually useful.

Sidebar: The minimum range interval versus the minimum query step

Suppose that you have a continuously updating metric that Prometheus scrapes every fifteen seconds. To do a rate() or irate() of this, you need at least two metric points and thus a range interval of thirty seconds (at least; in practice you need a somewhat larger interval). However, you get a new metric value roughly every fifteen seconds and thus a potentially new rate of change every fifteen seconds as well (due to your new metric point). This means it's reasonable to look at a graph with a query step of fifteen seconds and an interval of thirty seconds.

(And you can't go below an interval of thirty seconds or so here, because once your range durations only include one metric point, both rate() and irate() stop returning anything and your graph disappears.)

PrometheusRateVsIrate written at 22:44:01

