Wandering Thoughts archives

2020-02-29

OpenBSD versus Prometheus (and Go)

We have a decent number of OpenBSD machines that do important things (and that have sometimes experienced problems like running out of disk space), and we have a Prometheus based metrics and monitoring system. The Prometheus host agent has enough support for OpenBSD to be able to report on critical metrics, including things like local disk space. Despite all of this, after some investigation I've determined that it's not really sensible to even try to deploy the host agent on our OpenBSD machines. This is due to a combination of factors that have at their root OpenBSD's lack of ABI stability.

Prometheus and its host agent are written in Go. On OpenBSD, the host agent uses Go's cgo feature to call native C libraries, which means that it can't readily be cross-compiled and must be built natively on an OpenBSD machine. Because of OpenBSD's lack of ABI stability, any given version of Go officially supports only a relatively narrow range of OpenBSD versions (generally the versions that were supported at the time it was released), and in my experiments it often doesn't build or work on OpenBSD versions outside of that range. Similarly, Go binaries built with a particular version of Go will often only work on the narrow range of OpenBSD versions that that Go release supports. Trying to use them on either a too-old or a too-new OpenBSD runs into either new things that aren't supported on the old versions or old things that aren't supported on the new ones.

(These things can be fundamental issues, like types of runtime dynamic loading relocation that aren't supported on older OpenBSD versions.)

Current versions of the Prometheus host agent generally won't build with versions of Go that are too old (nor will past versions of the host agent, although their minimum Go version drops as you go back to older releases). On OpenBSD, this means that a given version of the host agent effectively has a narrow range of OpenBSD versions it will ever run on. It will probably not run on newer OpenBSDs after a while, and it definitely won't run on older ones. And finally, metrics have historically been renamed and shuffled around between versions of the host agent.

This means that if you have a bunch of OpenBSD machines of various different versions (as we do), at a minimum you must run different versions of the host agent on different versions of OpenBSD, and they will expose some metrics with different names, which makes it hard to create unified dashboards, alerts, and so on. When a new version of the host agent comes out, you'll only be able to upgrade some of your OpenBSD machines to use it, not all of them. And in practice the range of OpenBSD versions where a particular version of the host agent works well seems to be even narrower than the range that it will run on.

(All of this assumes that you can even manage to build old versions of Go and old versions of the host agent on older OpenBSD machines. I was not entirely successful at this, even for Go versions that were nominally supported on my OpenBSD versions.)

Running a whole collection of different versions of the Prometheus host agent on different machines, and having to freeze magical golden binaries that we may or may not be able to rebuild later, is not a very attractive proposition. Even if we ignored our old OpenBSD machines and only deployed the latest host agent on the latest, currently supported OpenBSD version, we would sooner or later not be able to upgrade the host agent any more and then we would have metric drift (or we would have to stop running the host agent on those machines). For a relatively minor increase in observability on machines that we almost never have problems with in the first place, the whole thing isn't worth it.

(It's possible that the upcoming 1.0 release of the host agent will promise to stop changing the current metric names, which would change the calculations here a bit. I suppose I should ask about that on the Prometheus mailing list.)

OpenBSDVsPrometheusAndGo written at 19:04:23

2020-02-27

Some alert inhibition rules we use in Prometheus Alertmanager

One of the things you can do with Alertmanager is to make one alert inhibit notifications for some other alerts. This is set up through inhibition rules. You can find some generic examples in sample Alertmanager configuration files, and today I'm writing up some of the specific inhibition rules that we use.

First off, inhibition rules are very closely tied to the structure of your alerts, your labels, and your overall system; they can't be understood or written outside of that. This is because all of those determine both which alerts should suppress which other alerts and how you can match them against each other. In our case, we group and aggregate alerts by host, and all alerts have a 'cstype' label that says what type of alert they are (a per-host alert, a temperature alert, a disk space alert, etc), and host alerts have a 'cshost' label that is (more or less) the host's canonical host name.

We have some hosts that have multiple network interfaces that we check; for instance, one of our backup servers is on our non-routed firewalls network so that it can back up our firewalls. We check both the backup server's main interface and its firewall interface, because otherwise we might not notice in time if it only dropped off the firewall network (we'd find out only when the nightly backups failed). At the same time, when the host goes down we don't want to get two sets of alerts, one for its main interface and one for its firewall interface. So we have an inhibition rule like this:

- target_match:
    cshost: HOST.fw.sandbox
  source_match:
    cshost: HOST
  equal: ['cstype', 'alertname', 'sendto', 'send']

The source match defines the alert that will suppress other alerts; the target match defines what other alerts will be suppressed. So this is saying that an alert for HOST can inhibit notification for alerts for the firewall sandbox version of HOST. But not all alerts; in order to be inhibited, the alerts must be of the same type, be the same alert (the alertname), and be going to the same destination (our 'sendto' label) and in the same way (our 'send' label). All of that is set by the 'equal:' portion of the inhibition rule.
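To make the matching concrete, here is the sort of pair of alerts that this rule ties together for one of these hosts (the alert name and the routing label values are invented for illustration):

alertname=HostDown  cstype=host  cshost=HOST             sendto=sysadmins  send=email
alertname=HostDown  cstype=host  cshost=HOST.fw.sandbox  sendto=sysadmins  send=email

The first alert, for the host's main interface, is the source; because the second alert agrees with it on 'cstype', 'alertname', 'sendto', and 'send', its notification is suppressed for as long as the first alert is firing.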

A somewhat more complicated case is our special 'there is a large scale problem' alert that inhibits per-host alerts. The top level inhibition rule is simple:

- source_match:
    cstype: largescale
    scale: all
  target_match:
    cstype: host

This says that an alert marked as being a large scale alert for all hosts inhibits all 'host' type alerts, which is all of our per-host alerts. Other types of alerts will be allowed through (eg temperature alerts), if they can still be triggered from the data that's still available to Prometheus. Because this inhibits across two completely different types of alerts, we don't have any labels that we could sensibly require to be equal in an 'equal:' portion.

(In a large scale problem we probably can't talk to temperature sensors, get disk usage information, and so on, so any problems there will go unnoticed anyway.)
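For context, the large scale alert itself is an ordinary Prometheus alerting rule that simply carries these labels. A minimal sketch of such a rule might look like the following; the alert name, expression, and threshold here are invented for illustration and aren't our actual ones:

- alert: LargeScaleProblem
  expr: count(up == 0) > 10
  for: 5m
  labels:
    cstype: largescale
    scale: all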

For reasons outside the scope of this entry, we also have another large scale problem alert if there are enough problems with the machines in our SLURM compute cluster. The existence of this alert creates a problem; if there is a real large scale problem, we will have enough problematic SLURM nodes to also trigger this alert. So we also inhibit the more specific SLURM large scale alert when the general large scale alert is active:

- source_match:
    cstype: largescale
    scale: all
  target_match_re:
    cstype: largescale
    scale: 'slurm'

We have to use 'scale' here in the target match (and have it at all), because if we left it out Alertmanager would happily allow our global large scale problems alert to inhibit itself. (The use of 'target_match_re' instead of 'target_match' is a historical relic and should be changed.)
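For reference, the changed version would be the same rule with a plain label match:

- source_match:
    cstype: largescale
    scale: all
  target_match:
    cstype: largescale
    scale: slurm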

(We could use 'alertname' instead of our own custom label to tell the two apart, but I prefer a custom label to make things explicit.)

Finally, the 'large scale SLURM problems' alert looks like the general one but has to apply only to the SLURM nodes, not to all machines. We currently do this by regular expression matching instead of trying to have a suitable label on everything to do with those machines:

- source_match:
    cstype: largescale
    scale: slurm
  target_match_re:
    cstype: "(host|notify)"
    cshost: "(cpunode|gpunode|cpuramnode|amdgpunode).*"

(Here the use of target_match_re is required, since we really are matching regular expressions.)

This inhibits both per-host alerts and our reboot notifications. We don't inhibit reboot notifications in our large scale problems inhibition, because it's handy to get some sign that machines are back, but this isn't interesting for SLURM nodes.

There is a subtle bit of behavior here. Inhibition only stops notifications for an alert; the alert continues to exist in general inside Alertmanager, and in particular it can inhibit other alerts. So when we have a large scale problem, the large scale alert inhibits notification about the large scale SLURM alert and in turn the large scale SLURM alert inhibits our reboot notifications for the SLURM nodes. This is sufficiently tricky that I should probably add a comment about it to the Alertmanager configuration file.

PrometheusAlertsOurInhibitions written at 22:00:31

2020-02-26

The magic settings to make a bar graph in Grafana

Suppose, not hypothetically, that you want to make a Grafana panel that shows a bar chart (or graph), because bar charts are a great way to show a summary of something instead of getting lost in the details of a standard graph over time. Grafana fully supports this, but exactly how you do it is sufficiently non-obvious that I've forgotten and had to re-discover it at least twice now. So now I'm writing it down so I can find it again.

(This is sort of covered in the documentation, but you have to know where to look and part of it isn't clear.)

Bar charts aka bar graphs are a form of graph, so they fall under the 'graph' panel type. Although current versions of Grafana have a bar gauge panel type, you probably don't want to use it for this and it's not currently the normal approach anyway. To create a bar graph, you want three settings together (the corresponding panel JSON fields are sketched after this list):

  • In the 'Query' tab, set each of your queries to be an 'instant' query and leave their format as the default of 'time series'.

  • In the 'Visualization' tab, set the draw mode to 'bars' instead of anything else, then the crucial magic step is to set the X-axis mode to 'series' instead of the default of 'time'. Since we're doing an instant query, the X-axis 'value' can be left at 'total' or changed to 'current' to be clearer.

    (I believe that if you did a non-instant query, the various options for 'value' here would determine what value the bars showed for each separate series.)
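If it helps to see where these settings live outside of the UI, they correspond to a few fields in the graph panel's JSON model (Grafana will show you a panel's JSON through the panel menu). The fragment below is a hand-trimmed sketch in the Grafana 6.x style with a made-up example Prometheus query; treat it as illustrative rather than authoritative:

"type": "graph",
"bars": true,
"lines": false,
"xaxis": {
  "mode": "series",
  "values": ["current"]
},
"targets": [
  { "expr": "node_filesystem_avail_bytes", "format": "time_series", "instant": true }
]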

If you set the query to be instant but change nothing else, you get an empty or almost entirely empty graph (with the actual values maybe visible on the far right, since 'instant' is also 'right now'). If you also switch the draw mode to 'bars', you get your single bar value for each series smeared across the entire graph, which looks rather confusing.

I was going to gripe about Grafana not deducing that you want a bar graph when you set an instant query, but there are enough moving parts here that it's not that simple. Even with an instant query you could perhaps do a histogram X-axis instead of a bar graph, for example. I wish Grafana would at least put up a warning to the effect that this combination (an instant query with the default time X-axis) doesn't make much sense and that you probably want to change the X-axis mode.

PS: If you set the X-axis mode to series but don't change the draw mode to 'bars', you get dots (or points) at the top of what would be each bar. This might be what you want but probably not. You can set multiple draw modes, which will give you things like bars with points at the top.

GrafanaMakeBarGraph written at 23:08:29

2020-02-23

Our (unusual) freedom to use alerts as notifications

Many guides to deciding what to alert on draw a strong distinction between alerts and less important things (call them 'notifications'). The distinction generally ultimately comes about because alerts will disturb the people who are on call outside of working hours, and that should be reserved for serious things that they can and should take action on. This is often threaded through assumptions and guidelines in metrics and alerting systems; for example, Prometheus implicitly follows this in their guide to alerting, and the philosophy document they link to assumes that alerts will page people and so should be minimized.

Our alerts in our Prometheus setup don't follow this. I've already written up our reboot notifications, which are implemented as special Prometheus alerts that explicitly call themselves 'notifications' and are handled specially, but the practice extends beyond them. We generate 'alerts' for things that we merely want to keep track of; one example is our automated reboots of hung Dell C6220 blades (which alert as if a machine went down and then came back, because it did). These alerts are just the same as the ones we would get for any machine that went down and then came back up.
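To make this concrete, a notification of this kind is just an ordinary Prometheus alerting rule with our usual labels on it. As a minimal sketch (the alert name, expression, and time window are illustrative, not our actual rule), a 'this machine was recently rebooted' notification could be expressed as:

- alert: HostRebooted
  expr: (time() - node_boot_time_seconds) < (15 * 60)
  labels:
    cstype: notify

An alert like this fires for a while after a machine boots and then resolves on its own, which is exactly the sort of notification-like behavior I'm talking about.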

(This is also part of why we have set Alertmanager to also send us email about resolved alerts (cf). Paging people to tell them something is now over would probably not be well received.)

The reason we have this freedom is not that we've done clever design in our Prometheus setup to avoid paging people for such notifications. Instead, it's because no one is on call here and so these notification alerts are not disturbing anyone when they trigger outside of working hours (even inside of working hours, they're just another email message). If we started having people on call, we would have to change this so that only genuine alerts paged people.

(This doesn't mean that limiting what we alert on is unimportant. Even for alerts that are just notifying us about things, we have to both genuinely care about the thing and find the notification useful. Generally our notifications are kept down so that they fire only rarely unless we're having real problems, such as a lot of Dell C6220 blades crashing all the time for some reason. And we might filter those out if they started becoming ubiquitous, or perhaps take the blades out of service on the grounds that they're now too unreliable.)

This blurring of alerts and notifications is not without its hazards, most obviously if we become acclimatized to notifications and treat a real problem (an alert) as merely a less important notification that can be let sit for a bit. But it's also been important for what alerts we actually create; our freedom to 'alert' on things that are perhaps not an immediate crisis allows us to watch more things and to be less cautious and conservative about the levels of things to alert on (and accept some false positives in the name of, say, alerting early about things that are real problems).

(I mentioned our situation with alerts not paging us back in why we generate alert notifications about rebooted machines, but I didn't think or talk about the impact it's had on what we alert about. It's sort of a 'fish in water' thing; I didn't think about how it affected what we alert on until recently.)

AlertsAsNotificationsFreedom written at 01:39:41

2020-02-19

Load average is now generally only a secondary problem indicator

For a long time I've been in the habit of considering a high load average (or an elevated one) to be a primary indicator of problems. It was one of the first numbers I looked at on a system to see how it was, I ran xloads on selected systems to watch it more or less live, I put it on Grafana dashboards, and we've triggered alerts on it for a long time (well before our current metrics and alert setup was set up). But these days I've been moving away from that, because of things like how our login server periodically has brief load average spikes and our IMAP server's elevated load average has no clear cause or impact.

When I started planning this entry, I was going to ask if load average even matters any more. But that's going too far. In a good number of situations, looking at the load average will tell you a fair bit about whether you have a significant problem or perhaps the system is operating as expected but close to its limits. For instance, if a machine has a high CPU usage, it might be a single process that is running a lot (which could be expected), or it could be that you have more running processes than the machine can cope with; the load average will help you tell which is which. But a low load average doesn't mean the machine is fine and a high load average doesn't mean it's in trouble. You need to look for primary problem indicators first, and then use load average to assess how much of a problem you have.

(There are echoes of Brendan Gregg's USE method here. In USE terms, I think that load average is mostly a crude measure of saturation, not necessarily of utilization.)

Despite my shifting view on this, we're probably going to keep using load average in our alerts and our dashboards. It provides some information and more importantly it's what we're used to; there's value in keeping with history, assuming that the current state of things isn't too noisy (which it isn't; our load average alerts are tuned to basically never go off). But I'm running fewer xloads and spending less time actually looking at load average, unless I want to know about something I know is specifically reflected in it.

LoadAverageSecondarySign written at 23:37:41

How and why we regularly capture information about running processes

In a recent entry, I mentioned that we periodically capture ps and top output on our primary login server, and in fact we do it on pretty much all of our servers. There are three parts to this: the history of how we wound up here, how we do it, and why we've come to do it as a routine thing on our servers.

We had another monitoring system before our current Prometheus based one. One of its handy features was that when it triggered a load average alert, the alert email would include 'top' output rather than just have the load average. Often this led us right to the cause (generally a user running some CPU-heavy thing), even if it had gone away by the time we could look at the server. Prometheus can't do this in any reasonable way, so I did the next best thing by setting up a system to capture 'top' and 'ps' information periodically and save it on the machine for a while. The process information wouldn't be right in the email any more, but at least we could still go look it up later.

Mechanically, this is a cron job and a script that runs every minute and saves 'top' and 'ps' output to a file called 'procs-<HH>:<MM>' (eg 'procs-23:10') in a specific local directory for this purpose (in /var on the system). Using a file naming scheme based on the hour and minute the cron job started and overwriting any current file with that name means that we keep the last 24 hours of data (under normal circumstances). The files are just plain text files without any compression, because disk space is large these days and we don't need anything fancier. On a busy server this amounts to 230 MBytes or so for 24 hours of data; on less active servers it's often under 100 MBytes.
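As a sketch of what this looks like (the directory, the 'top' flags, and the cron entry here are illustrative; ours differ in detail):

#!/bin/sh
# Illustrative capture script; run from cron once a minute, for example via
# an /etc/cron.d entry like:
#   * * * * *  root  /usr/local/sbin/capture-procinfo
DIR=/var/procinfo
OUT="$DIR/procs-$(date +%H:%M)"
{
        date
        top -b -n 1        # batch mode, one iteration (procps 'top'; BSD tops use different flags)
        echo
        ps auxww
} >"$OUT" 2>&1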

Our initial reason for doing this was to be able to identify users with CPU-consuming processes, so we started out only deploying this on our login servers, our general access compute servers (that anyone can log in to at any time), and a few other machines like our general web server. However, over time it became clear that being able to see what was running (and using CPU and RAM) around some time was useful even on servers that aren't user accessible, so we now install the cron job, script, local data directory, and so on on pretty much all of our machines. We don't necessarily look at the information the system captures all that often, but it's a cheap precaution to have in place.

(We also use Unix process accounting on many machines, but that doesn't give you the kind of moment in time snapshot that capturing 'top' and 'ps' output does.)

OurProcessInfoCapturing written at 00:13:17

2020-02-16

With sudo, complex argument validation is best in cover scripts

Suppose, as a not entirely hypothetical case, that you want to allow some people to run 'zfs destroy' to delete only ZFS snapshots (since ZFS cannot delegate this through its own permission system). You can tell ZFS snapshots apart from other ZFS objects because ZFS snapshots all have '@' in their names. There are two approaches to enforcing this restriction on 'zfs destroy' arguments. The first is to write a suitable sudoers rule that carefully constrains the arguments to 'zfs destroy' (see Michael's comment on this entry for one attempt). The second is to write a cover script that takes the snapshot names, validates them itself, and runs 'zfs destroy' on the suitably validated results. My view is that you should generally use cover scripts to do complex argument validation for sudo'd commands, not sudoers.

The reason for this is pretty straightforward and boils down to whitelisting being better than blacklisting. A script is in the position to have minimal arguments and only allow through what it has carefully determined is safe. Using sudoers to only permit some arguments to an underlying general purpose command is usually in the position of trying to blacklist anything bad (sometimes explicitly and sometimes implicitly, as in Michael's match pattern that blocks a nominal snapshot name with a leading '-'). General purpose commands are usually not written so that their command line arguments are easy to filter and limit; instead they often have quite a lot of general arguments that can interact in complex ways. If you only want to have a limited subset of arguments accepted, creating a cover script that only accepts those arguments is the simple approach.
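As an illustration of the whitelisting approach, here is a minimal sketch of such a cover script (simplified and handling one snapshot per invocation; a real version would probably also check that the filesystem part of the name is something the person is allowed to touch). People would be given sudo access to this script instead of to 'zfs destroy' itself:

#!/bin/sh
# Allow deleting only ZFS snapshots, ie names of the form filesystem@snapshot.
if [ "$#" -ne 1 ]; then
        echo "usage: $0 filesystem@snapshot" 1>&2
        exit 1
fi
case "$1" in
        -*)
                echo "$0: refusing an argument that starts with '-'" 1>&2
                exit 1
                ;;
        ?*@?*)          # looks like filesystem@snapshot: allowed
                ;;
        *)
                echo "$0: '$1' is not a snapshot name" 1>&2
                exit 1
                ;;
esac
exec /sbin/zfs destroy "$1"    # adjust the path to 'zfs' for your system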

Cover scripts also have the additional advantage that they can simplify the underlying commands in ways that reduce the chance of errors and make it clearer what you're doing. This is related to the issue of command error distance, although in these cases often your sudo setup is intended to block the dangerous operation in the first place. Still, the principle of fixing low command error distances with cover scripts applies here.

(Of course the downside is that now people have to remember the script instead of the actual command. But if you're extending sudo permissions to people who would not normally use the command at all, you have to train them about it one way or another.)

SudoersAndCoverScripts written at 02:16:35

2020-02-08

Ways that I have lost the source code for installed programs

When I compared Go to Python for small scale sysadmin tools, I said that one useful property for Python was that you couldn't misplace the source code for your deployed programs. This property recently hit home for me when I decided that maybe I should rebuild my private binary of my feed reader for reasons beyond the scope of this entry, and had to try to find the source code that my binary was built from. So here is an incomplete list of the ways to lose (in some sense) the source code for your installed programs, focusing on things that I have actually seen happen either at work or for my personal programs.

I don't think I've ever lost source code by deleting it. This is less because I keep careful track of what source code we need to keep and more because I tend to be a software packrat. Deleting things is work, disk space is cheap, we might as well keep it around (and then at some point it's old enough to be of historical interest). We have deleted configuration files for a now-dead machine that were also being used by another machine, though, so we've come close.

The straightforward ways of losing source code by either forgetting where it is or by moving it from where it originally was to some 'obvious' other spot are two sides of the same coin and can be hard to tell apart from each other when you're looking for the source (although if you run 'strings' on an executable to get some likely paths and then they aren't there, things probably got moved). A dangerous and time-wasting variation on this is to start out with the source code in /some/where/prog, build it, rename the source directory to /some/where/prog-old, and reuse /some/where/prog for a new version of the program that you were working on but didn't install.

A variant of this is to wind up with several different source code directories for the program (with different versions in them) with no clear indication of which directory and version was used to build the installed program. If directories have been renamed, the strings embedded in the executable may not help. If you're lucky you left the .o files and so on sitting around so you can at least match up the date of the installed program with the dates of the .o files to figure out which it was.

Another way to lose source code is to start to change or update the code in your nicely organized source directory without taking some sort of snapshot of its state as of when you built the binary you're using. This one has bitten me repeatedly in the past, when I had the source directory I built from but it was no longer in the same state and I had no good way to go back. There are all sorts of ways to wind up here. You can be in the process of making changes, or you can have decided to merge together several divergent versions from different systems (with different patches and changes) into one all-good master version.

(Merging together disparate versions was especially an issue in the days before distributed version control systems. We had a locally written MTA that was used by multiple groups across multiple systems, and of course we wound up with a whole cloud of copies of source code, of various vintages and with various local changes.)

The final way of 'losing' source code that I've encountered is to have the unaltered source code in a known place that you can find, but for it to no longer build in a current environment. All sorts of things can cause this if you have sufficiently old programs; compilers can change what they accept, header files change, your program only works with an old version of some library where you have the necessary shared library objects but not the headers, the program's (auto)configuration system no longer works, and so on. In the past a big source of this was programs that weren't portable to 64 bit environments, but old code can have all sorts of other issues as well.

LosingSourceCodeWays written at 00:01:03

