Wandering Thoughts

2018-12-12

One situation where you absolutely can't use irate() in Prometheus

This is a story about me shooting myself in the foot repeatedly, until I finally noticed.

We have Linux NFS client machines, and we would like to know NFS client performance information about them so we can see things like which filesystems they use the most. The Linux kernel provides copious per-mount information on NFS mounts and the Prometheus node exporter can turn all of it into metrics, but for us the raw node exporter stats are just too overwhelming in our environment; a typical machine generates over 68,000 different node_mountstats metrics values. So we wrote our own program to digest the raw mountstats metrics and do things like sum all of the per-NFS-operation information together for a given mount. We run the program from cron once a minute and publish its information through the node exporter's extremely handy textfile collector.

(The node exporter's current mountstats collector also has the same misunderstanding about the xprt: NFS RPC information that I did, and so reports it on a per-mount basis.)

For a while I've been trying to use the cumulative total RPC time with the total RPC count to generate an average time per RPC graph. Over and over I've plugged in some variation on 'cumulative seconds / cumulative operations' for various sources of both numbers, put an irate() around both, graphed this, and gotten either nothing at all or maybe a few points. Today I really dug into this and the penny finally dropped while I was brute-forcing things. The problem was that I was reflexively using irate() instead of rate().

The one situation where you absolutely can't use irate() is with oversampled metrics, metrics that Prometheus is scraping more often than they're being generated. We're generating NFS metrics once a minute but scraping the node exporter every fifteen seconds, so three out of four of our recorded NFS metrics are the same. irate() very specifically uses the last two data points in your sample interval, and when you're oversampling, those two data points are very often going to be the exact same actual metric and so show no change.

In other words, a great deal of my graph was missing because it was all 'divide by zero' NaNs, since the irate() of the 'cumulative operations' count often came out as zero.
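To make this concrete, here's the shape of the queries involved; the metric names are made-up stand-ins for our digested mountstats metrics, not our actual ones. This version frequently produces NaNs when the metrics are oversampled:

irate(nfs_rpc_seconds_total[3m]) / irate(nfs_rpc_operations_total[3m])

This version works, because rate() computes its result from the first and last points in the whole range instead of just the last two points:

rate(nfs_rpc_seconds_total[3m]) / rate(nfs_rpc_operations_total[3m])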

(Oversampled metric is my term. There may be an official term for 'scraped more often than it's generated' that I should be using instead.)

Looking back, I specifically mentioned this issue in my entry on rate() versus irate(), but apparently it didn't sink in thoroughly enough. In particular, I didn't think through the consequences of dividing by the irate() of an oversampled metric, which is where things really go off the rails.

(If you merely directly graph the irate() of an oversampled metric, you just get a (very) spiky graph.)

Part of what happened is that when I was writing my exploratory PromQL I was thinking in terms of how often the metric was generated, not how often it was scraped, so I didn't even consciously consider that I had an oversampled metric. For example, I was careful to use 'irate(...[3m])', so I would be sure to have at least two valid metric points for a once a minute metric. This is probably a natural mindset to have, but that just means it's something I'm going to have to try to keep in mind in the future for node exporter textfile metrics and Pushgateway metrics.

Sidebar: Why I reflexively reach for irate()

Normally, choosing between irate() and rate() is a matter of convenience and of what time range and query step you're working with. Since the default Prometheus graph is one hour, I'm almost always exploring with a small enough query step that irate() is the easier way to go. Even when I'm dealing with an oversampled metric, I can sort of get away with this if I'm directly graphing the irate() result; it will just be (very) spiky, it won't be empty or clearly bogus.

PrometheusWhenNotIrate written at 01:53:32

2018-12-05

The brute force cron-based way of flexibly timed repeated alerts

Suppose, not hypothetically, that you have a cron job that monitors something important. You want to be notified relatively fast if your Prometheus server is down, so you run your cron job frequently, say once every ten minutes. However, now we have the problem that cron is stateless, so if our Prometheus server goes down and our cron job starts alerting us, it will re-alert us every ten minutes. This is too much noise (at least for us).

There's a standard pattern for dealing with this in cron jobs that send alerts; once the alert happens, you create a state file somewhere and as long as your current state is the same as the state file, you don't produce any output or send out your warning or whatever. But this leads to the next problem, which is that you alert once and are then silent forever afterward, leaving it to people to remember that the problem (still) exists. It would be better to re-alert periodically, say once every hour or so. This isn't too hard to do; you can check to see if the state file is more than an hour old and just re-send the alert if it is.

(One way to do this is with 'find <file> -mmin +... -print'. Although it may not be Unixy, I do rather wish for olderthan and newerthan utilities as a standard and widely available thing. I know I can write them in a variety of ways, but it's not the same.)
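As an illustration of the whole pattern, here's a sketch of such a checker script; the state file location, the actual check, and the mail command are stand-ins, not our real script:

#!/bin/sh
# Alert if Prometheus seems down, but no more than once an hour.
STATE=/var/run/check-promserver.state

if prometheus-is-healthy; then
    # All is well; forget any remembered alert state.
    rm -f "$STATE"
    exit 0
fi

# Alert if we haven't already, or if our last alert was over an hour ago.
if [ ! -e "$STATE" ] || [ -n "$(find "$STATE" -mmin +60 -print)" ]; then
    touch "$STATE"
    echo "Prometheus server appears to be down" |
        mail -s "Prometheus alert" sysadmins@example.org
fi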

But this isn't really what we want, because we aren't around all of the time. Re-sending the alert once an hour in the middle of the night or the middle of the weekend will just give us a big pile of junk email to go through when we get back in to the office; instead we want repeats only once every hour or two during weekdays.

When I was writing our checker script, I got to this point and started planning out how I was going to compare against the current hour and day of week in the script to know when I should clear out the state file and so on. Then I had a flash of the obvious and realized that I already had a perfectly good tool for flexibly specifying various times and combinations of time conditions, namely cron itself. The simple way to reset the state file and cause re-alerts at whatever flexible set of times and time patterns I want is to do it through crontab entries.

So now I have one cron entry that runs every ten minutes for the main script, and another cron entry that clears the state file (if it exists) several times a day on weekdays. If we decide we want to be re-notified once a day during the weekend, that'll be easy to add as another cron entry. As a bonus, everyone here understands cron entries, so it will be immediately obvious when things run and what they do in a way that it wouldn't be if all of this was embedded in a script.

(It's also easy for anyone to change. We don't have to reach into a script; we just change crontab lines, something we're already completely familiar with.)
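As a sketch, the crontab side of this looks something like the following; the times, paths, and script name are illustrative:

# run the checker every ten minutes
*/10 * * * *        /opt/scripts/check-promserver
# allow re-alerts a few times a day, on weekdays (Monday to Friday) only
30 9,13,17 * * 1-5  rm -f /var/run/check-promserver.state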

As it stands this is slightly too simplistic, because it clears the state file without caring how old it is. In theory we could generate an alert shortly before the state file is due to be cleared, clear the state file, and then immediately re-alert. To deal with that I decided to go the extra distance and only clear the state file if it was at least a minimum age (using find to see if it was old enough, because we make do with the tools Unix gives us).

(In my actual implementation, the main script takes a special argument that makes it just clear the state file. This way only the script has to know where the state file is or even just what to do to clear the 'do not re-alert' state; the crontab entry just runs 'check-promserver --clear'.)
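A minimal sketch of that '--clear' handling, again with the state file location and the minimum age made up:

# inside check-promserver
STATE=/var/run/check-promserver.state
if [ "$1" = "--clear" ]; then
    # Only clear the state file if it's old enough, so that we don't
    # re-alert immediately after a regular alert just went out.
    if [ -n "$(find "$STATE" -mmin +50 -print 2>/dev/null)" ]; then
        rm -f "$STATE"
    fi
    exit 0
fi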

RepeatingAlertsViaCron written at 01:23:02

2018-11-30

I've learned that sometimes the right way to show information is a simple one

When I started building some Grafana dashboards, I of course reached for everyone's favorite tool, the graph. And why not? Graphs are informative and beyond that, they're fun. It simply is cool to fiddle around for a bit and have a graph of your server's network bandwidth usage or disk bandwidth right there in front of you, to look at the peaks and valleys, to be able to watch a spike of activity, and so on.

For a while I made pretty much everything a graph; for things like bandwidth, this was obviously a graph of the rate. Then one day I was looking at a graph of filesystem read and write activity on one of our dashboards, with one filesystem bouncing up here and another one bouncing up there over Grafana's default six hour time window, and I found myself wondering which of these filesystems was the most active one over the entire time period. In theory the information was in the graph; in practice, it was inaccessible.

As an experiment, I added a simple bar graph of 'total volume over the time range'. It was a startling revelation. Not only did it answer my question, but suddenly things that had been buried in the graphs jumped out at me. Our web server turned out to use our old FTP area far more than I would have guessed, for example. The simple bar graph also made it much easier to confirm things that I thought I was seeing in the more complex and detailed graphs. When one filesystem looked like it was surprisingly active in the over-time graph, I could look down to the bar graph and confirm that yes, it was (and also see how much its periodic peaks of activity added up to).
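(In Grafana, this sort of 'total volume over the time range' bar graph can be driven by increase() over the dashboard's time range. A sketch, with an example metric rather than our actual ones:

increase(node_disk_read_bytes_total[$__range])

Recent versions of Grafana have a $__range template variable that expands to the dashboard's current time range, so the panel follows along as you change the time window.)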

Since that first experience I have become much more appreciative of the power of simple ways to show summary information. Highly detailed graphs have an important place and they're definitely showing us things we didn't know, but simple summaries reveal things too.

(I'd love the ability to get ad-hoc simple summaries from more complex graphs. I don't need 'average bandwidth over the graph's entire time range' very often, but sometimes I'd quite like to have it rather than having to guess by eyeball. It's sort of a pity that you can't give Grafana graphs alternate visualizations that you can cycle through, or otherwise have two (or more) panels share the same space so you can flip between them. As it stands, we have some giant dashboards.)

SimpleGraphsAdvantage written at 01:15:42

2018-11-25

How we monitor our Prometheus setup itself

On Mastodon, I said:

When you have a new alerting and monitoring system, 'who watches the watchmen' becomes an interesting and relevant question. Especially when the watchmen have a lot of separate components and moving parts.

If we had a lot of experience with Prometheus, we probably wouldn't worry about this; we'd be able to assume that everything was just going to work reliably. But we're very new with Prometheus, and so we get to worry about its reliability in general and also the possibility that we'll quietly break something in our configuration or how we're operating things (and we have, actually). So we need to monitor Prometheus itself. If Prometheus was a monolithic system, this would probably be relatively easy, but instead our overall Prometheus environment has a bunch of separate pieces, all of which can have things go wrong.

A lot of how we're monitoring for problems is probably basically standard in Prometheus deployments (or at least standard in simple ones, like ours). The first level of monitoring and alerts is things inside Prometheus:

  • We alert on unresponsive host agents (ie, Prometheus node_exporter) as part of our general checking for and alerting on down hosts; this will catch when a configured machine doesn't have the agent installed or it hasn't been started. The one thing it won't catch is a production machine that hasn't been added to our Prometheus configuration. Unfortunately there's no good automated way in our environment to tell what is and isn't a production machine, so we're just going to have to rely on remembering to add machines to Prometheus when we put them into production.

    (This alert uses the Prometheus 'up' metric for our specific host agent job setting.)

  • We also alert if Prometheus can't talk to a number of other metrics sources it's specifically configured to pull from, such as Grafana, Pushgateway, the Blackbox agent itself, Alertmanager, and a couple of instances of an Apache metrics exporter. This is also based on the up metric, excluding the ones for host agents and for all of our Blackbox checks (which generate up metrics themselves, which can be distinguished from regular up metrics because the Blackbox check ones have a non-empty probe label).

  • We publish some system-wide information for temperature sensor readings and global disk space usage for our NFS fileservers, so we have checks to make sure that this information is both present at all and not too old. The temperature sensor information is published through Pushgateway, so we leverage its push_time_seconds metric for the check. The disk space usage information is published in a different way, so we rely on its own 'I was created at' metric.

  • We publish various per-host information through the host agent's textfile collector, where you put files of metrics you want to publish in a specific directory, so we check to make sure that these files aren't too stale through the node_textfile_mtime_seconds metric. Because we update these files at varying intervals but don't want to have complex alerts here, we use a single measure for 'too old' and it's a quite conservative number. (There are sketches of what these sorts of rule expressions can look like just after this list.)

    (This won't detect hosts that have never successfully published some particular piece of information at all, but I'm currently assuming this is not going to happen. Checking for it would probably be complicated, partly because we'd have to bake in knowledge about what things hosts should be publishing.)
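To give a flavor of what these look like, here are sketches of such alert rules; the 'node' job name and the one-hour staleness thresholds are illustrative, not our actual values:

groups:
- name: sketches
  rules:
  # Non-Blackbox scrape targets that are down. Blackbox-generated 'up'
  # metrics have a non-empty probe label, so we exclude them, along
  # with the host agent job (called 'node' here).
  - alert: ScrapeTargetDown
    expr: up{probe="",job!="node"} == 0
  # Pushgateway-published metrics that haven't been pushed recently.
  - alert: PushedMetricsStale
    expr: time() - push_time_seconds > 3600
  # Textfile collector files that haven't been updated recently.
  - alert: TextfileMetricsStale
    expr: time() - node_textfile_mtime_seconds > 3600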

All of these alerts require their own custom and somewhat ad-hoc rules. In general writing all of these checks feels like a bit of a slog; you have to think about what could go wrong, and then how you could check for it, and then write out the actual alert rule necessary. I was sort of tempted to skip writing the last two sets of alerts, but we've actually quietly broken both the global disk space usage and the per-host information publication at various times.

(In fact I found out that some hosts weren't updating some information by testing my alert rule expression in Prometheus. I did a topk() query on it and then went 'hey, some of these numbers are really much larger than they should be'.)

This leaves checking Prometheus itself, and also a useful check on Alertmanager (because if Alertmanager is down, Prometheus can't send out the alert it detects). In some places the answer to this would be a second Prometheus instance that cross-checks the first and a pair of Alertmanagers that both of them talk to and that coordinate with each other through their gossip protocol. However, this is a bit complicated for us, so my current answer is to have a cron job that tries to ask Prometheus for the status of Alertmanager. If Prometheus answers and says Alertmanager is up, we conclude that we're fine; otherwise, we have a problem somewhere. The cron job currently runs on our central mail server so that it depends on the fewest other parts of our infrastructure still working.

(Mechanically this uses curl to make the query through Prometheus's HTTP API and then jq to extract things from the answer.)
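(A sketch of the sort of pipeline involved, where the port and the 'alertmanager' job label are assumptions on my part:

curl -sg 'http://localhost:9090/api/v1/query?query=up{job="alertmanager"}' |
    jq -r '.data.result[0].value[1]'

This prints 1 if Prometheus answers and says Alertmanager's scrape target is up; anything else, including curl failing outright, means there's a problem somewhere. The -g stops curl from trying to interpret the {} characters itself.)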

We don't currently have any checks to make sure that Alertmanager can actually send alerts successfully. I'm not sure how we'd craft those, because I'm not sure Alertmanager exposes the necessary metrics. Probably we should try to write some alerts in Prometheus and then have a cron job that queries Prometheus to see if the alerts are currently active.

(Alertmanager exposes a count of successful and failed deliveries for the various delivery methods, such as 'email', but you can't find out when the last successful or failed notification was for one, or whether specific receivers succeeded or failed in some or all of their notifications. There are also no metrics exposed for potential problems like 'template expansion failure', which can happen if you have an error somewhere in one of your templates. If the error is in a rarely used conditional portion of a template, you might not trip over it for a while.)

PrometheusSelfMonitoring written at 01:46:43

2018-11-20

When Prometheus Alertmanager will tell you about resolved alerts

We've configured Prometheus to notify us when our alerts clear up. Recently, we got such a notification email and its timing attracted my attention because while the alert in it had a listed end time of 9:43:55, the actual email was only sent at 9:45:40, which is not entirely timely. This got me curious about what affects how soon you get notified about alerts clearing in Prometheus, which led to two significant surprises. The overall summary is that the current version of Alertmanager can't give you prompt notifications of resolved alerts, and on top of that right now Alertmanager's 'ended-at' times are generally going to be late by as many as several minutes.

Update: It turns out that the second issue is fixed in the recently released Alertmanager 0.15.3. You probably want to upgrade to it.

Alertmanager's relatively minimal documentation describes the group_interval setting as follows:

How long to wait before sending a notification about new alerts that are added to a group of alerts for which an initial notification has already been sent. (Usually ~5m or more.)

If you read this documentation, you would expect that this is a 'no more often than' interval. By this I mean that you have to wait at least group_interval before Alertmanager will send another notification, but that once this time has elapsed, Alertmanager will immediately send a new alert out. This turns out not to be the case. Instead, this is a tick interval; Alertmanager only sends a new notification every group_interval (if there is anything new). For example, if you have a group_interval of 10m and a new alert in the group shows up 11 minutes after the first notification, Alertmanager will not send out a notification until the 20 minute mark, which is the next 'tick' of 'every 10 minutes'. Resolved alerts are treated just the same as new alerts here.

This makes a certain amount of sense in Alertmanager's apparent model of the alerting world, but it's quite possibly not what you either expect or want. Certainly it's not what we want, and it's probably going to cause us to change our group_interval settings to basically be the same (or shorter) than our group_wait settings.
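In Alertmanager's route configuration, the change amounts to something like this (the values are illustrative):

route:
  group_wait: 30s
  # with the 'tick' behavior, a low group_interval is effectively
  # 'notify about group changes promptly' instead of 'no more often
  # than every five minutes'
  group_interval: 30s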

For the rest of this, I'm going to go through the flow of what happens when an alert ends, as far as I can tell.

When a Prometheus alert triggers and enters the firing state, Prometheus sends the alert to Alertmanager. As covered in Alertmanager's API documentation, Prometheus will then re-send the alert every so often. At the moment, Prometheus throttles itself to re-sending a particular alert only once a minute, although you can change this with a command line option. Re-sending an alert doesn't change the labels (one hopes), but it can change the annotations; Prometheus will re-create them every time (as part of re-evaluating the alert rule) and Alertmanager always uses the most recently received annotations if it needs to generate a new notification.

(Of course this doesn't matter if your annotations are constant. Our annotations can include the current state of affairs, which means that a succession of alert notifications can have different annotation text as, say, the reported machine room temperature fluctuates.)

When a firing alert's rule condition stops being true, Prometheus doesn't just drop and delete the alert (although this is what it looks like in the web UI). Instead, it sets the rule's 'ResolvedAt' time, switches it to a state that Prometheus internally labels as 'inactive', and keeps it around for another 15 minutes. During these 15 minutes, Prometheus will continue to send it to Alertmanager, which is how Alertmanager theoretically learns that it's been resolved. The first (re-)send after an alert has been resolved is not subject to the regular 60 second minimum re-send interval; it happens immediately, so in theory Alertmanager should immediately know that the alert is resolved. As a side note, the annotations on a resolved alert will be the annotations from the last pre-resolution version of it.

(It turns out that Prometheus always sends alerts to Alertmanager with an 'EndsAt' time filled in. If the rule has been resolved, this is the 'ResolvedAt' time; if it hasn't been resolved, the 'EndsAt' is an internally calculated timeout that's some distance into the future. This may be relevant for alert notification templates. This also appears to mean that Alertmanager's resolve_timeout setting is unused, because the code makes it seem like it's only used for alerts without their own EndsAt time.)

Then we run into an Alertmanager issue that is probably fixed by this recent commit, where the actual resolved alert that Prometheus sent to Alertmanager effectively gets ignored and Alertmanager fails to notice that the alert is actually resolved. Instead, the alert has to reach its earlier expiry time before it becomes resolved, which is generally going to be about three minutes from the last time Prometheus sent the still-firing alert to Alertmanager. In turn, that time may be anywhere up to nearly 60 seconds before Prometheus decided that the alert was resolved.

(Prometheus will often evaluate the alert rule more often than once every 60 seconds, but if the alert rule is still true, it will only send that result to Alertmanager once every 60 seconds.)

This Alertmanager delay in recognizing that the alert is resolved combines in an unfortunate way with the meaning of group_interval, because it can make you miss the group_interval 'tick' and then have your 'alert is resolved' notification delayed until the next tick, however many minutes away it is. To minimize this time, you need to reduce group_interval down to whatever you can tolerate and then set --rules.alert.resend-delay down relatively low, say 20 or 30 seconds. With a 20 second resend delay, the expiry timeout is only a minute, which means a bit less than a minute's delay at most before Alertmanager notices that your alert has resolved.

(You also need your evaluation_interval to be no more than your resend delay.)
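Put together, my understanding is that this combination of settings looks something like the following; the values are illustrative, not a recommendation:

# on the Prometheus command line:
#   --rules.alert.resend-delay=30s
# and in prometheus.yml:
global:
  evaluation_interval: 15s

(Plus a group_interval in your Alertmanager configuration that's as low as you can stand.)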

(When the next Alertmanager release comes out with this bug fixed, you can stop setting your resend delay but you'll still want to have a low group_interval. It seems unlikely that its behavior will change, given the discussion in issue 1510.)

PS: If you want a full dump of what Alertmanager thinks of your current alerts, the magic trick is:

curl -s http://localhost:9093/api/v1/alerts | jq .

This gives you a bunch more information than Alertmanager's web UI will show. It does exclude alerts that Alertmanager thinks are resolved but hasn't deleted yet, though.

PrometheusAlertsClearingTime written at 01:31:20

2018-11-13

Our pragmatic attachment to OpenBSD PF for our firewall needs

Today on Twitter, I asked:

#Sysadmin people: does anyone have good approaches for a high-performance 10G OpenBSD firewall (bridging or routing)? Is the best you can do still 'throw the fastest single-core CPU you can find at it'?

A number of people made the reasonable suggestion of looking into FreeBSD or Linux instead of OpenBSD for our 10G Ethernet firewall needs. We have done some investigation of this (and certainly our Linux machines have no problem with 10G wire speeds, even with light firewall rules in place) but it's not a very attractive solution. The problem is that we're very attached to OpenBSD PF for pragmatic reasons.

At this point, we've been using OpenBSD based firewalls with PF for fifteen years or more. In the process we've built up a bunch of familiarity with the quirks of OpenBSD and of PF, but more importantly we've ended up with thousands of lines of PF rulesets, some in relatively complicated firewall configurations, and all of which are only documented implicitly in the PF rules themselves because, well, that's what we wrote our firewall rules in.

Moving to anything other than OpenBSD PF means both learning a new rule language and translating our current firewall rulesets to that language. We'd need to do this for at least the firewalls that need to migrate to 10G (one of which is our most complicated firewall), and we'd probably want to eventually do it for all firewalls, just so that we didn't have to maintain expertise in two different firewall languages and environments. We can do this if we have to, but we would very much rather not; OpenBSD works well for us in our environment and we have a solid, reliable setup (including pfsync).

(We don't use CARP, but we count on pfsync to maintain hot spare firewalls in a 'ready to be made live' state. Having pfsync has made shifting between live and hot spare firewalls into something that users barely notice, where in the old pre-pfsync days a firewall shift required scheduled downtimes because it broke everyone's connections. One reason we shift between live and hot spare firewalls is if we think the live firewall needs a reboot or some hardware work.)

We also genuinely like PF; it seems to operate at about the right level of abstraction for what we want to do, and we rarely find ourselves annoyed at it. We would probably not be enthused about trying to move to something that was either significantly higher level or significantly lower level. And, barring our issues with getting decent 10G performance, OpenBSD PF has performed well and been extremely solid for us; our firewalls are routinely up for more than a year and generally we don't have to think about them. Anything that proposes to supplant OpenBSD in a firewall role here has some quite demanding standards to live up to.

PS: For our purposes, FreeBSD PF is a different thing than OpenBSD PF because it hasn't picked up the OpenBSD features and syntax changes since 4.5, and we use any number of those in our PF rules (you have to, since OpenBSD loves changing PF syntax). Regardless of how well FreeBSD PF works and how broadly familiar it would be, we'd have to translate our existing rulesets from OpenBSD PF to FreeBSD PF. This might be easier than translating them to anything else, but it would still be a non-trivial translation step (with a non-trivial requirement for testing the result).

OpenBSDPFAttachment written at 23:48:51

2018-11-11

Easy configuration for lots of Prometheus Blackbox checks

Suppose, not entirely hypothetically, that you want to do a lot of Prometheus Blackbox checks, and worse, these are all sorts of different checks (not just the same check against a lot of different hosts). Since the only way to specify a lot of Blackbox check parameters is with different Blackbox modules, this means that you need a bunch of different Blackbox modules. The examples of configuring Prometheus Blackbox probes that you'll find online all set the Blackbox module as part of the scrape configuration; for example, straight from the Blackbox README, we have this in their example:

- job_name: 'blackbox'
  metrics_path: /probe
  params:
    module: [http_2xx]
  [...]

You can do this for each of the separate modules you need to use, but that means many separate scrape configurations, and for each separate scrape configuration you're going to need those standard seven lines of relabeling configuration. This is annoying and verbose, and it doesn't take too many of these before your Prometheus configuration file is so overgrown with Blackbox scrape configurations that it's hard to see anything else.

(It would be great if Prometheus could somehow macro-ize these or include them from a separate file or otherwise avoid repeating everything for each scrape configuration, but so far, no such luck. You can't even move some of your scrape configurations into a separate included file; they all have to go in the main prometheus.yml.)

Fortunately, with some cleverness in our relabeling configuration we can actually embed the name of the module we want to use into our Blackbox target specification, letting us use one Blackbox scrape configuration for a whole bunch of different modules. The trick is that what's necessary for Blackbox checks is that by the end of setting up a particular scrape, the module parameter is in the __param_module label. Normally it winds up there because we set it in the param section of the scrape configuration, but we can also explicitly put it there through relabeling (just as we set __address__ by hand through relabeling).

So, let's start with nominal declared targets that look like this:

- ssh_banner,somehost:25
- http_2xx,http://somewhere/url

This encodes the Blackbox module before the comma and the actual Blackbox target after it (you can use any suitable separator; I picked comma for how it looks).

Our first job with relabeling is to split this apart into the module and target URL parameters, which are the magic __param_module and __param_target labels:

relabel_configs:
  - source_labels: [__address__]
    regex: ([^,]*),(.*)
    replacement: $1
    target_label: __param_module
  - source_labels: [__address__]
    regex: ([^,]*),(.*)
    replacement: $2
    target_label: __param_target

(It's a pity that there's no way to do multiple targets and replacements in one rule, or we could make this much more compact. But I'm probably far from the first person to observe that Prometheus relabeling configurations are very verbose. Presumably Prometheus people don't expect you to be doing very much of it.)

Since we're doing all of our Blackbox checks through a single scrape configuration, we won't normally be able to easily tell which module (and thus which check) failed. To make life easier, we explicitly save the Blackbox module as a new label, which I've called probe:

  - source_labels: [__param_module]
    target_label: probe

Now the rest of our relabeling is essentially standard; we save the Blackbox target as the instance label and set the actual address of our Blackbox exporter:

  - source_labels: [__param_target]
    target_label: instance
  - target_label: __address__
    replacement: 127.0.0.1:9115
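Putting it all together, my understanding is that the first example target winds up being probed through a Blackbox URL along these lines:

http://127.0.0.1:9115/probe?module=ssh_banner&target=somehost:25

with 'ssh_banner' as its probe label and 'somehost:25' as its instance label.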

All of this works fine, but there turns out to be one drawback of putting all or a lot of your Blackbox checks in a single scrape configuration, which is that you can't set the Blackbox check interval on a per-target or per-module basis. If you need or want to vary the check interval for different checks (ie, different Blackbox modules) or even different targets, you'll need to use separate scrape configurations, even with all of the extra verbosity that that requires.

(As you might suspect, I've decided that I'm mostly fine with a lot of our Blackbox checks having the same frequency. I did pull ICMP ping checks out into a separate scrape configuration so that we can do them a lot more frequently.)

PS: If you wanted to, you could go further than this in relabeling; for instance, you could automatically add the :25 port specification on the end of hostnames for SSH banner checks. But it's my view that there's a relatively low limit on how much of this sort of rewriting one should do. Rewriting to avoid having a massive prometheus.yml is within my comfort limit here; rewriting just to avoid putting a ':25' on hostnames is not. There is real merit to being straightforward and sticking as close to normal Prometheus practice as possible, without extra magic.

(I think that the 'module,real-target' format of target names I've adopted here is relatively easy to see and understand even if you don't know how it works, but I'm biased and may be wrong.)

PrometheusBlackboxBulkChecks written at 22:35:04

2018-11-10

Why Prometheus turns out not to be our ideal alerting system

What we want out of an alert system is relatively straightforward (and was probably once typical for sysadmins who ran machines). We would like to get notified once and only once for any new alert that shows up (and for some alerts, to get notified again when they go away), and we'd also like these alerts to be aggregated together to some degree so we aren't spammed to death if a lot of things go wrong at once.

(It would be ideal if the degree of aggregation was something we could control on the fly. If only a few machines have problems we probably want to get separate emails about each machine, but if a whole bunch of machines all suddenly have problems, please, just send us one email with everything.)

Unfortunately Prometheus doesn't do this, because its Alertmanager has a fundamentally different model of how alert notification should work. Alertmanager's core model is that instead of sending you new alerts, it will send you the entire current state of alerts any time that state changes. So, if you group alerts together and initially there are two alerts in a group and then a third shows up later, Alertmanager will first notify you about the initial two alerts and then later re-notify you with all three alerts. If one of the three alerts clears and you've asked to be notified about cleared alerts, you'll get another notification that lists the now-cleared alert and the two alerts that are still active. And so on.

(One way to put this is to say that Alertmanager is sort of level triggered instead of edge triggered.)

This is not a silly or stupid thing for Alertmanager to do, and it has some advantages; for instance, it means that you only need to read the most recent notification to get a full picture of everything that's currently wrong. But it also means that if you have an escalating situation, you may need to carefully read all of the alerts in each new notification to realize this, and in general you risk alert fatigue if you have a lot of alerts that are grouped together; sooner or later the long list of alerts is just going to blur together. Unfortunately this describes our situation, especially if we try to group things together broadly.

(Alertmanager also sort of assumes other things, for example that you have a 24/7 operations team who deal with issues immediately. If you always deal with issues when they come up, you don't need to hear about an alert clearing because you almost certainly caused that and if you didn't, you can see the new state on your dashboards. We're not on call 24/7 and even when we're around we don't necessarily react immediately, so it's quite possible for things to happen and then clear up without us even looking at anything. Hence our desire to hear about cleared alerts, which is not the Alertmanager default.)

I consider this an unfortunate limitation in Alertmanager. Alertmanager internally knows what alerts are new and changed (since that's part of what drives it to send new notifications), but it doesn't expose this anywhere that you can get at it, even in templating. However I suspect that the Prometheus people wouldn't be interested in changing this, since I expect that distinguishing between new and old alerts doesn't fit their model of how alerting should be done.

On a broader level, we're trying to push a round solution into a square hole and this is one of the resulting problems. Prometheus's documentation is explicit about the philosophy of alerting that it assumes; basically it wants you to have only a few alerts, based on user-visible symptoms. Because we look after physical hosts instead of services (and to the extent that we have services, we have a fair number of them), we have a lot of potential alerts about a lot of potential situations.

(Many of these situations are user visible, simply because users can see into a lot of our environment. Users will notice if any particular general access login or compute server goes down, for example, so we have to know about it too.)

Our current solution is to make do. By grouping alerts only on a per-host basis, we hope to keep the 'repeated alerts in new notifications' problem down to a level where we probably won't miss significant new problems, and we have some hacks to create one time notifications (basically, we make sure that some alerts just can't group together with anything else, which is more work than you'd think).
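The per-host grouping itself is straightforward in the Alertmanager route configuration; it's something like the following sketch, although the exact label to group on depends on your setup:

route:
  # group by host only, so one host's problems arrive together but
  # different hosts generate separate notifications
  group_by: ['instance']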

(It's my view that using Alertmanager to inhibit 'less severe' alerts in favour of more severe ones is not a useful answer for us for various reasons beyond the scope of this entry. Part of it is that I think maintaining suitable inhibition rules would take a significant amount of care in both the Alertmanager configuration and the Prometheus alert generation, because Alertmanager doesn't give you very much power for specifying what inhibits what.)

Sidebar: Why we're using Prometheus for alerting despite this

Basically, we don't want to run a second system just for alerting unless we really have to, especially since a certain number of alerts are naturally driven from information that Prometheus is collecting for metrics purposes. If we can make Prometheus work for alerting and it's not too bad, we're willing to live with the issues (at least so far).

PrometheusAlertsProblem written at 23:35:56

2018-11-09

Getting CPU utilization breakdowns efficiently in Prometheus

I wrote before about getting a CPU utilization breakdown in Prometheus, where I detailed building up a query that would give us a correct 0.0 to 1.0 CPU utilization breakdown. The eventual query is:

(sum(irate(node_cpu_seconds_total{mode!="idle"}[1m])) without (cpu)) / count(node_cpu_seconds_total) without (cpu)

(As far as using irate() here goes, see rate() versus irate().)

This is a beautiful and correct query, but as it turns out you may not want to actually use it. The problem is that in practice, it's also an expensive query when evaluated over a sufficient range, especially if you're using some version of it for multiple machines in the same graph or Grafana dashboard. In some reasonably common cases, I saw Prometheus query durations of over a second for our setup. Once I realized how slow this was, I decided to try to do better.

The obvious way to speed up this query is to precompute the number that's essentially a constant, namely the number of CPUs (the thing we're dividing by). To make my life simpler, I opted to compute this so that we get a separate metric for each mode, so we don't have to use group_left in the actual query. The recording rule we use is:

- record: instance_mode:node_cpus:count
  expr: count(node_cpu_seconds_total) without (cpu)

(The name of this recording rule metric is probably questionable, but I don't understand the best practices suggestions here.)

This cuts out a significant amount of the query cost (anywhere from one half to two thirds or so in some of my tests), but I was still left with some relatively expensive versions of this query (for instance, one of our dashboards wants to display the amount of non-idle CPU utilization across all of our machines). To do better, I decided to try to pre-compute the sum() of the CPU modes across all CPUs, with this recording rule:

- record: instance_mode:node_cpu_seconds_total:sum
  expr: sum(node_cpu_seconds_total) without (cpu)

In theory this should provide basically the same result with a clear saving in Prometheus query evaluation time. In practice this mostly works, but occasionally there are anomalies that I don't understand, where a rate() or irate() of this will exceed 100% (ie, will return a result greater than the number of CPUs in the machine). These excessive results are infrequent and you do save a significant amount of Prometheus query time, which means that there's a tradeoff to be made here; do you live with the possibility of rare weird readings in order to get efficient general trends and overviews, or do you go for complete correctness even at the cost of higher CPU usage (and graphs that take a bit of time to refresh or generate themselves)?
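With both recording rules in place, the cheap (if occasionally anomalous) version of the original per-mode query looks something like this:

irate(instance_mode:node_cpu_seconds_total:sum{mode!="idle"}[1m]) / instance_mode:node_cpus:count

Both sides carry instance, job, and mode labels after the without (cpu), so the division matches up element by element without needing group_left.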

(If you know that you want a particular resolution of rate() a lot, you can pre-compute that (or pre-compute an irate()). But you have to know the resolution, or know that you want irate(), and you may not, especially if you're using Grafana and its magic $__interval template variable.)

I've been going back and forth on this question since I discovered this issue. Right now my answer is that I'm defaulting to correct results even at more CPU cost unless the CPU cost becomes a real, clear problem. But we have the luxury that our dashboards aren't likely to be used very much.

Sidebar: Why I think the sum() in this recording rule is okay

The documentation for both rate() and irate() tells you to always take the rate() or irate() before sum()'ing, in order to detect counter resets. However, in this case all of our counters are tied together; all CPU usage counters for a host will reset at the same time, when the host reboots, and so rate() should still see that reset even over a sum().

(And the anomalies I've seen have been over time ranges where the hosts involved haven't been rebooting.)

I have two wild theories for why I'm seeing problems with this recording rule. First, it could be that the recording rule is summing over a non-coherent set of metric points, where the node_cpu_seconds_total values for some CPUs come from one Prometheus scrape and others come from some other scrape (although one would hope that metrics from a single scrape appear all at once, atomically). Second, perhaps the recording rule is being evaluated twice against the same metric points from the same scrape, because it is just out of synchronization with a slow scrape of a particular node_exporter. This would result in a flat result for one point of the recording rule and then a doubled result for another one, where the computed result actually covers more time than we expect.

(Figuring out which it is is probably possible through dedicated extraction and processing of raw metric points from the Prometheus API, but I lack the patience or the interest to do this at the moment. My guess is currently the second theory, partly based on some experimentation with changes().)

PrometheusCPUStatsII written at 23:40:14

2018-11-08

The future of our homedir-based mail server system design

In a comment on my entry on our self-serve system for autoreplies, I was asked a very good question:

Do you think it will ever be possible for you to move to a non-homedir-based mail server at all?

My answer is that I no longer think it would be a good thing for us to move to a non-homedir-based mail system.

Most mail systems segregate all mail storage and mail processing away from regular user files. As the commentator is noting, our mail system doesn't work this way and instead does things like allow users to have .forward files in their home directories. Sometimes this causes us difficulties, and so the question here is a sensible one. In the past I would have told you that we would eventually have to move to an IMAP-only environment where mail handling and storage was completely segregated from user home directories. Today I've changed my mind; I now think that we should not move mail out of people's home directories. In fact I would like more of their mail to live under their home directories; in an IMAP-only environment, I would like to put people's INBOXes somewhere in there too, instead of in the shared /var/mail filesystem that we have today.

The reasons for this ultimately come down to that eternal issue of storage allocation, plus the fact that any number of our users have quite a lot of email (gigabytes and gigabytes of it). No matter what you do about email, it has to live somewhere, and someone has to pay for the storage space, the backups, and so on. In our environment, how we allocate storage in general is that people get however much disk space they're willing to pay for. There are various obvious good reasons to stick with this for mail storage space, and once we're doing that there are good reasons to stick with our standard model of providing disk space when providing mail folder space, including that we already have an entire collection of systems for managing it, backing it up, and so on. Since we use ZFS pools, in theory this mail storage space doesn't have to go in people's home directories; we could make separate 'mail storage' filesystems for every person and every group. In practice, we already have home directory filesystems.

It's possible that IMAP-only mail storage in some dedicated format would be more efficient and faster than what we currently have (most people keep their mail in mbox format files). In practice we don't have a mail environment that is that demanding (and since we're moving to an all-SSD setup for all storage, our IO rates are about to get much better anyway).

As far as things like .forwards go, my view is that this is a pragmatic tradeoff. Supporting .forwards ties our hands to some degree, but it also means that we don't have to build and commit to a user accessible server side mail filtering system, with all of the many questions it would raise. As with mail folder storage, using people's home directories and all of the regular Unix infrastructure is the easiest and most obvious approach.

PS: Our Unix-focused approach is not one I would necessarily recommend for other environments. It works here for various reasons, including that we already have general Unix login servers for people to use and that we don't have that many users.

MailAndHomedirs written at 23:54:27
