Wandering Thoughts

2019-08-09

Turning off DNSSEC in my Unbound instances

I tweeted:

It has been '0' days since DNSSEC caused DNS resolution for perfectly good DNS names to fail on my machine. Time to turn DNSSEC validation off, which I should have done long ago.

I use Unbound on my machines, from the Fedora package, so this is not some questionable local resolver implementation getting things wrong; this is a genuine DNSSEC issue. In my case, it was for www.linuxjournal.com, which is in my sources of news because it's shutting down. When I tried to visit it from my home machine, I couldn't get an answer for its IP address. Turning on verbose Unbound logging gave me a great deal of noise, in which I could barely make out that Unbound was able to obtain A and AAAA records but was then going on to do DNSSEC validation, and clearly something was going wrong there. Turning off DNSSEC fixed it, once I did it in the right way.

NLNet Labs has a Howto on turning off DNSSEC in Unbound that provides a variety of ways to do this, starting from setting 'val-permissive-mode: yes' all the way up to disabling the validator module. My configuration has had permissive mode set to yes for years, but that was apparently not good enough to deal with this situation, so I have now removed the validator module from my Unbound module configuration. In fact I have minimized it compared to the Fedora version.

The Fedora 29 default configuration for Unbound modules is:

module-config: "ipsecmod validator iterator"

I had never heard of 'ipsecmod' before, but it turns out to be 'opportunistic IPSec support', as described in the current documentation for unbound.conf; I will let you read the details there. Although configured as a module in the Fedora version, it is not enabled ('ipsecmod-enabled' is set off); however, I have a low enough opinion of unprompted IPSec to random strangers that I removed the module entirely, just in case. So my new module config is just:

module-config: "iterator"

(Possibly I could take that out too and get better performance.)

In the Fedora Unbound configuration, this can go in a new file in /etc/unbound/local.d. I called my new file 'no-dnssec.conf'.
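
To be concrete, here is a minimal sketch of what the entire contents of such a drop-in file could be; the file name is my own choice, and this assumes (as the Fedora layout suggests) that local.d fragments are read as part of Unbound's server clause:

# /etc/unbound/local.d/no-dnssec.conf (a hypothetical sketch)
# drop the validator (and ipsecmod) modules entirely
module-config: "iterator"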

(There were a variety of frustrating aspects to this experience and I have some opinions on DNSSEC as a whole, but those are for another entry.)

UnboundNoDNSSEC written at 20:56:17

2019-08-01

How not to set up your DNS (part 24)

I'll start with the traditional illustration of DNS results:

; dig +short mx officedepot.se.
50 odmailgate.officedepot.com.
40 officedepot.com.s10b2.psmtp.com.
30 officedepot.com.s10b1.psmtp.com.
20 officedepot.com.s10a2.psmtp.com.
10 officedepot.com.s10a1.psmtp.com.

What I can't easily illustrate is that none of the hostnames under psmtp.com exist. Instead, it seems that officedepot.com has shifted its current mail handling to outlook.com, based on their current MX. While odmailgate.officedepot.com resolves to an IP address, 205.157.110.104, that IP address does not respond on port 25 and may not even be in service any more.
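
The checking itself is just more dig. At the time I looked, queries along these lines came back empty for every one of the psmtp.com names, while odmailgate still resolved to its unresponsive address:

; dig +short a officedepot.com.s10a1.psmtp.com.
; dig +short a odmailgate.officedepot.com.
205.157.110.104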

(It is not Office Depot's problem that we're trying to mail officedepot.se, of course; it is due to a prolific spammer hosted out of scaleway.com that is forging the envelope sender of their spam email as 'bounce@<various domains>', including officedepot.se and 'cloud.scaleway.com'.)

This does point out an interesting risk factor in shifting your mail system handling when you have a lot of domains, possibly handled by different groups of people. In an ideal world you would remember all of the domains that you accept mail for and get in touch with the people who handle their DNS to change everything, but in this world things can fall through the cracks. I suspect it's especially likely to happen for places that have enough domains that adding and removing them has been automated.

(It's been a while since the last installment; for various reasons I don't notice other people's DNS issues very often these days. I actually ran across a DNS issue in 2017 that I was going to post, but I ran into this issue and never finished the entry.)

HowNotToDoDNSXXIV written at 12:16:04

2019-07-21

Why we're going to be using Certbot as our new Let's Encrypt client

We need a new Let's Encrypt client to replace acmetool, and I'm on record as not particularly liking Certbot; it lacks some features that are important to us, it's a pretty big program, and it's quite ornate (and there's the issue of the EFF trying to get you to sign up for their mailing list when you register a Let's Encrypt account with an email address). But despite that, Certbot is going to be our future Let's Encrypt client unless we uncover some fatal problem as we finalize how we're going to operate it.

The reason why is very simple; I never want to go through changing clients again, because changing clients is very disruptive and a lot of work. We're forced to change clients now because our previous client of choice has stopped being maintained and hasn't kept up with Let's Encrypt's changes. Certbot is pretty much the closest thing Let's Encrypt has to an official client, so the odds are very good that it will keep up with any Let's Encrypt changes, and probably also any other changes needed to keep working on popular Linuxes such as various versions of Ubuntu LTS.

(Let's Encrypt officially recommends Certbot and has for some time.)

Certbot is not my ideal Let's Encrypt client. But it is a workable client (and we can make it more workable with a cover script), and it's extremely likely to stay that way for as long as we want to use Let's Encrypt. This is good enough to make it my choice.

(On a pragmatic basis, Certbot also seems to be the closest I can get to acmetool in a client that is written in a way that I'm okay with. In particular, as someone who has dealt with OpenSSL and written things in Bash, my view is that I don't think either are the right foundation for a Let's Encrypt client that I want to entrust our systems to. I admire the spirit of aggressive minimalism that makes people write Let's Encrypt clients with little or no dependencies, but that isn't what's important to us.)

Sidebar: I don't regret picking acmetool way back when

Back when I initially picked acmetool, my usage case was different and Certbot was significantly more work and more intrusive to install than it is today. Carrying over using acmetool when we switched to Let's Encrypt was natural, and it worked well. Also, acmetool is a very simple client to use and in the beginning that was important to us because we weren't sold on the benefits of Let's Encrypt; a complex install and operation process wouldn't have been half as attractive, and we might have kept on using manually obtained TLS certificates (especially after we could get free ones through the university's central IT).

In short, acmetool has worked great for years and was the no hassle client we needed at the start. Especially at the time when we started using it, I don't think there was a better alternative for us.

CertbotWhyOurChoice written at 22:14:44

2019-07-18

Switching Let's Encrypt clients is currently quite disruptive

On Twitter, I said:

At the moment, changing between Let's Encrypt clients appears to be about as disruptive as changing to or from Let's Encrypt and another CA. Certificate paths change, software must be uninstalled and installed, operational practices revised, and nothing can be moved over easily.

I didn't mention that you are probably going to have to get reissued certificates unless you like doing a lot of work, but that's true too. Let's Encrypt makes this easy, but some people may run into rate limits here.

This is on my mind because we're replacing acmetool with something else (almost certainly Certbot for reasons beyond the scope of this entry), so I've been thinking about the mechanics of the switch. Unfortunately there are a lot of them. Acmetool and Certbot do almost everything differently; they put their certificates in different places, they have different command lines and setup procedures, and Certbot needs special handling during some system installs that acmetool doesn't.

So to transition a machine we're going to have to install Certbot (or whatever client), install our Certbot customizations (we need at least a hook script or two), uninstall acmetool to remove its cron job and Apache configuration snippet, set up the Apache configuration snippet that Certbot needs, register a new account, request certificates, and then update the configuration of all of our TLS-using programs to the new certificate locations. Then the setup instructions for the machine need to be revised to perform the Certbot install and setup instead of the current acmetool one. We get to repeat this for every system we have that uses Let's Encrypt. All of this requires manual work; it's not something we can really automate in a sensible amount of time (at least not safely, cf).
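
To give a concrete picture of the shape of this, here is a rough sketch of the per-machine sequence in command terms; the package names, hook script, webroot path, email address, and hostname are all stand-ins for whatever we actually settle on, not the real details:

# out with the old client (this also removes its cron job)
apt-get remove acmetool
# in with the new one, plus our local customizations
apt-get install certbot
cp our-deploy-hook /etc/letsencrypt/renewal-hooks/deploy/
# register a new Let's Encrypt account
certbot register -m sysadmins@example.org --agree-tos --no-eff-email
# request this machine's certificates
certbot certonly --webroot -w /var/www/html -d host.example.org
# ... then point Apache, Exim, Dovecot, etc at the new certificate
# locations and reload them.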

(Then when we need new TLS certificates we'll have to use different commands to get them, and if we run into rate limits we'll have to use different ways to deal with the situation.)

There are multiple causes for this. One of them is simply that clients are different, with different command lines (and Certbot has some very ornate ones, which we'll almost certainly fix with a cover script that provides our standard local options). But a big one is that clients have not standardized even where and how they store data about certificates and Let's Encrypt accounts, much less anything more. As a result, for example, as far as I know there's no official way to import current certificates and accounts into Certbot, or extract them out afterward. Your Let's Encrypt client, whatever it is, is likely to be a hermetically sealed world that assumes you're starting from scratch and you'll never want to leave.

(It would be nice if future clients could use Certbot's /etc/letsencrypt directory structure for storing your TLS certificates and keys. At least then switching clients wouldn't require updating all of the TLS certificate paths in configuration files for things like Apache, Exim, and Dovecot.)
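
For reference, under Certbot those paths look like this, with one directory of symlinks per certificate name ('host.example.org' is just a placeholder here):

/etc/letsencrypt/live/host.example.org/fullchain.pem
/etc/letsencrypt/live/host.example.org/privkey.pem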

LetsEncryptClientChangeHassle written at 23:10:45

2019-07-14

We're going to be separating our redundant resolving DNS servers

We have a number of OpenBSD machines in various roles; they're our firewalls, our resolving DNS servers as well as our public authoritative DNS server, and so on. For pretty much all of these, we actually have two identical servers per role in a hot spare setup, so that we can rapidly recover from various sorts of failures. For our firewalls, switching from one to another takes manual action (we have to change which one is plugged into the live network, although their firewall state is synchronized with pfsync so that a switch is low impact). For our DNS resolvers, we have both on the network and list both addresses in our /etc/resolv.conf, because this works perfectly fine with DNS servers.

(All of our machines list the same resolver first, which we consider a feature for reasons beyond the scope of this entry. Our routing firewalls don't use CARP for various reasons, some of them historical, but in practice it doesn't matter, as we haven't had a firewall hardware failure. When we have switched firewalls, it's been for software reasons.)
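
In concrete terms, every machine winds up with an /etc/resolv.conf along these lines (the addresses are made-up documentation ones, not our real resolvers, and the same resolver is listed first everywhere):

nameserver 192.0.2.10
nameserver 192.0.2.11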

All of this sounds great, except for the bit where I haven't mentioned that these redundant resolving DNS servers are racked next to each other (one on top of the other), plugged into the same rack PDU, and connected to the same leaf switch. We have great protection against server failure, which is what we designed for, but after we discovered that switches can wind up in weird states after power failures it no longer feels quite so sufficient, since working DNS is a crucial component of our environment (as we found out in an earlier power failure).

(Most of our paired redundant servers are racked up this way because it's the most convenient option. They're installed at the same time, generally worked on at the same time, and they need the same network connections. For firewalls, in fact, you need to switch their network cables back and forth to change which is the live one.)

So, as the title of this entry says, we're now going to be separating our resolving DNS servers, both physically and for their network connection, so that the failure of a single rack PDU or leaf switch can't take both of them offline. Unfortunately we can't put one DNS server directly on the same switch as our fileservers; the fileserver switch is a 10G-T switch with a very limited supply of ports.

(Now that I write this entry the obvious question is whether all of our fileservers should be on the same 10G-T switch. Probably it's harmless, because our entire environment will grind to a halt if even a single fileserver drops off the network.)

PS: I suspect that our resolving DNS servers are the only redundant pair that are important to separate this way, but it's clearly something we should think about. We could at least add some extra redundancy for our VPN servers by separating the pairs, and that might be important during a serious problem.

SeparatingOurDNSResolvers written at 22:21:22

2019-07-13

Our switches can wind up in weird states after a power failure

We've had two power failures so far this year, which is two more than we usually have. Each has been a learning experience, because both times around our overall environment failed to come back up afterward. The first time around the problem was DNS, due to a circular dependency that we still don't fully understand. The second time around, what failed was much more interesting.

Three things failed to come back up after the second power failure. The more understandable and less fatal problem was that our OpenBSD external bridging firewall needed some manual attention to deal with a fsck issue. By itself this just cut us off from the external world. Much worse, two of our core switches didn't fully boot up; instead, they stopped in their bootloader, waiting for someone to tell them to continue. Since the switches didn't boot and apply their configuration, they didn't light up their ports and none of our leaf switches could pass traffic around. The net effect was to create little isolated pools of machines, one pool per leaf switch.

(Then naturally most of these pools didn't have access to our DNS servers, so we also had DNS problems. It's always DNS. But no one would have gotten very far even with DNS, because all of our fileservers were isolated on their own little pool on a 10G-T switch.)

We've never seen this happen before (and certainly it didn't happen in prior power outages and scheduled shutdowns), so we've naturally theorized that the power failure wasn't a clean one (either during the loss of power or when it came back) and this did something unusual to the switches. It's more comforting to think that something exceptional happened than that this is a possibility that's always lurking there even in clean power loss and power return situations.

(While we shut down all of our Unix servers in advance for scheduled power shutdowns, we've traditionally left all of our switches powered on and just assumed that they'd come back cleanly afterward. We probably won't change that for the next scheduled power shutdown, but we may start explicitly checking that the core switches are working right before we start bringing servers up the next day.)

That we'd never seen this switch behavior before also complicated our recovery efforts, because we initially didn't recognize what had gone wrong with the switches or even what the problem with our network was. Even once my co-worker recognized that something was anomalous about the switches, it took a bit of time to figure out what the right step to resolve it was (in this case, to tell the switch bootloader to go ahead and boot the main OS).

(The good news is that the next time around we'll be better prepared. We have a console server that we access the switch consoles through, and it supports informational banners when you connect to a particular serial console. The consoles for the switches now have a little banner to the effect of 'if you see this prompt from the switch it's stuck in the bootloader, do the following'.)

PS: What's likely booting here is the switch's management processor. But the actual switching hardware has to be configured by the management processor before it lights up the ports and does anything, so we might as well talk about 'the switch booting up'.

SwitchesAndPowerGlitch written at 23:58:51

2019-07-12

Reflections on almost entirely stopping using my (work) Yubikey

Several years ago (back in 2016), work got Yubikeys for a number of us for reasons beyond the scope of this entry. I got designated as the person to figure out how to work with them, and in my usual way with new shiny things, I started using my Yubikey's SSH key for lots of additional things over and above their initial purpose (and I added things to my environment to make that work well). For a long time since then, I've had a routine of plugging my Yubikey in when I got in to work, before I unlocked my screen the first time. The last time I did that was almost exactly a week ago. At first, I just forgot to plug in the Yubikey when I got in and didn't notice all day. But after I noticed that had happened, I decided that I was more or less done with the whole thing. I'm not throwing the Yubikey away (I still need it for some things), but the days when I defaulted to authenticating SSH with the Yubikey SSH key are over. In fact, I should probably go through and take that key out of various authorized_keys files.

The direct trigger for not needing the Yubikey as much any more and walking away from it is that I used it to authenticate to our OmniOS fileservers, and we took the last one out of service a few weeks ago. But my dissatisfaction has been building for some time for an assortment of reasons. Certainly one part of it is that the big Yubikey security issue significantly dented my trust in the whole security magic of a hardware key, since using a Yubikey actually made me more vulnerable instead of less (well, theoretically more vulnerable).

Another part of it is that for whatever reason, every so often the Fedora SSH agent and the Yubikey would stop talking to each other. When this happened various things would start failing and I would have to manually reset everything, which obviously made relying on Yubikey based SSH authentication far from the transparent experience of things just working that I wanted. At some points, I adopted a ritual of locking and then un-locking my screen before I did anything that I knew required the Yubikey.

Another surprising factor is that I had to change where I plugged in my Yubikey, and the new location made it less convenient. When I first started using my Yubikey I could plug it directly into my keyboard at the time, in a position that made it very easy to see it blinking when it was asking for me to touch it to authenticate something. However I wound up having to replace that keyboard (cf) and my new keyboard has no USB ports, so now I have to plug the Yubikey into the USB port at the edge of one of my Dell monitors. This is more awkward to do, harder to reach and touch the Yubikey's touchpad, and harder to even see it blinking. The shift in where I had to plug it in made everything about dealing with the Yubikey just a bit more annoying, and some bits much more annoying.

(I have a few places where I currently use a touch authenticated SSH key, and these days they almost always require two attempts, with a Yubikey reset in the middle because one of the reliable ways to have the SSH agent stop talking to the Yubikey is not to complete the touch authentication stuff in time. You can imagine how enthused I am about this.)

On the whole, the most important factor has been that using the Yubikey for anything has increasingly felt like a series of hassles. I think Yubikeys are still reasonably secure (although I'm less confident and trusting of them than I used to be), but I'm no longer interested in dealing with the problems of using one unless I absolutely have to. Nifty shiny things are nice when they work transparently; they are not so nice when they don't, and it has surprised me how little it took to tip me over that particular edge.

(It's also surprised me how much happier I feel after having made the decision and carrying it out. There's all sorts of things I don't have to do and deal with and worry about any more, at least until the next occasion when I really need the Yubikey for something.)

YubikeyMostlyDropped written at 01:27:37

2019-07-05

My plan for two-stage usage of Certbot when installing web server hosts

Let me start with our problem. When you request TLS certificates through Certbot, you must choose between standalone authentication, where Certbot runs an internal web server to handle the Let's Encrypt HTTP challenge, and webroot authentication, where Certbot puts files in a magic location under the web server's webroot. You can only choose one, which is awkward if you want a single universal process that works on all your hosts, and this choice is saved in the certificate's configuration; it will automatically be used on renewal by default. The final piece is that Apache refuses to start up if any of its configured TLS certificates are missing.

All of this creates a problem when installing a host that runs Apache. What you would like to do is perform the install (including your specific Apache configuration), request the TLS certificates using standalone authentication since Apache can't start yet, and then start Apache and switch to webroot authentication for certificate renewals (so that Certbot can actually renew things now that Apache is using port 80). This would be trivial if Certbot provided a command to change the configured renewal method for a certificate, but as far as I can see they don't. While you can specify the authentication method when you ask for a certificate renewal, this doesn't by itself update the configuration; instead, Certbot only changes the renewal method when you actually renew the certificate.

This means that one way around this would be to request our TLS certificates with standalone authentication, then once Apache was up and running, immediately renew them using webroot authentication purely for the side effect of updating the certificate's configuration. The problem with this (at least in our environment) is that we risk running into Let's Encrypt rate limits, although perhaps not as much as I thought. However, there is a trick we can play to avoid that, because we don't need the first certificate to be trusted. It only exists to bootstrap Apache, and Apache doesn't validate the certificate chain of your certificates. This means that we can ask Certbot to get test certificates instead of real Let's Encrypt ones (using standalone authentication), start Apache, then immediately 'renew' them as real Let's Encrypt certificates using webroot authentication, which will as a side effect update the certificate's configuration.
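
In Certbot terms, the two stages look roughly like this; the hostname and webroot path are placeholders, and the exact flags may need adjusting against the version of Certbot we wind up deploying:

# stage 1: Apache isn't running yet, so use standalone mode and
# ask for throwaway staging ('test') certificates
certbot certonly --standalone --test-cert -d host.example.org
# Apache now has certificate files to load, so it can start
systemctl start apache2   # or httpd, depending on the OS
# stage 2: immediately 're-issue' as real certificates via the webroot,
# which also rewrites the saved renewal configuration to webroot mode
certbot certonly --webroot -w /var/www/html --force-renewal -d host.example.org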

(Of course in many real situations the actual procedure is 'restore or copy /etc/letsencrypt from the current production machine'.)

This is not as smooth and fluid a process as acmetool offers, and you have to ask for the certificates twice, with different magic command line options. I'm not certain it's worth writing a cover script to simplify this a bit, but perhaps it is, since we also need magic options for registration.

(With appropriate work in the script, you wouldn't even need to list all of the hostnames a second time, just tell it to renew everything as a real certificate now.)

PS: Realizing this trick and working this out makes me feel a fair bit happier about using Certbot. This particular problem was the largest, most tangled obstacle I could see, so I'm glad to have gotten past it.

CertbotTwoStageDeploys written at 22:23:40

2019-06-28

Using Prometheus's statsd exporter to let scripts make metrics updates

One of the things that's very useful about Prometheus is that it's pretty easy to write little ad-hoc scripts or programs that generate metrics and then publish them. You can use either the host agent's 'textfile' collector, which is a good fit for host-specific metrics on a host where you're already running the host agent, or you can have your program publish them through Pushgateway, including by just having your script pipe its output to an appropriate curl command. However, there is one situation that this doesn't cover, and that is when your scripts want to update a metric instead of generating it from scratch. For example, you might have a cron job that periodically processes some variable number of things, and you want a running count of how many things you have processed (instead of a gauge of how many were processed by the last run). To use jargon, your script or program has stateless observations (eg 'I processed three things this time') and you want to convert them into ongoing metrics, which are necessarily stateful.

In short, you want to be able to update metrics, not just create or re-create them from scratch. Ideally you want these updates to be more or less atomic, so that you don't have to worry about 'read modify write' races if you have several instances of your script or program running at once, all trying to make an update.

The good news is that the Prometheus statsd exporter can do this for you, and it is actually very convenient to use. The statsd protocol itself is focused around exactly this sort of incremental update to metrics, and the statsd exporter will turn those updates into Prometheus metrics for us. The protocol is also text-based, so we don't need any special client (especially since the Prometheus exporter speaks a TCP based version). For extra convenience, the Prometheus statsd exporter supports an extended statsd format with tags (also) that will let us directly attach labels and label values, rather than having to configure the statsd exporter to turn some portions of the statsd metric names into Prometheus label values.

The basic use is pretty straightforward. With the statsd exporter running on localhost, you can do:

echo 'our.counter:3|c|#lbl1:val1,lbl2:val2' | nc localhost 9125
echo 'our.counter:2|c|#lbl1:val1,lbl2:val2' | nc localhost 9125

This will create or update a Prometheus counter metric with the .'s turned into Prometheus '_'s and the labels we asked for:

our_counter {lbl1="val1", lbl2="val2"} 5

(You can also use '+3' instead of plain '3' to make things more obvious.)
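
As an illustration of the original cron job case, the entire 'publish metrics' step of such a script can be this sort of thing (the metric name and label here are made up for the example):

count=3    # in real life, however many things this run processed
echo "cron.processed:${count}|c|#job:example" | nc localhost 9125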

For gauges, there is a magic trick, which is that '+<N>' increases the gauge and '-<N>' decreases it, while a plain number just sets the value:

echo 'our.gauge:3|g|#label:val' | nc localhost 9125
echo 'our.gauge:-5|g|#label:val' | nc localhost 9125
echo 'our.gauge:+4|g|#label:val' | nc localhost 9125

The result is a gauge metric:

our_gauge {label="val"} 2

As you can see, gauges can go negative. As is the Prometheus practice, counters can never decrease; the statsd exporter will reject attempts to do so (ie, statsd updates with negative values). These rejections are normally silent, but you can get the exporter to report them at log level 'debug'.

(Since there's no easy way to change the type of a metric after it's created, you want to be a bit careful about what you make something. If you send in a statsd metric with the wrong type, it's rejected.)

The Prometheus statsd exporter can also generate quantiles and histograms from raw observations, which statsd generally calls 'timers'. Due to the statsd protocol, your numbers are assumed to be in milliseconds and the exporter divides the value by a thousand to create seconds-based metrics, as is the usual Prometheus custom. You'll have to scale your numbers appropriately if you don't actually have milliseconds. As covered in the exporter's documentation on statsd timers, the default result without any configuration is a summary with 0.5, 0.9, and 0.99 quantiles; currently these have acceptable error settings of 0.05, 0.01, and 0.001 respectively, although that's not documented and might change. Anything else requires some degree of configuration of the statsd exporter.

(As far as I can tell, you don't need any statsd exporter configuration here unless you either want some histograms or you want to change the quantiles. The statsd exporter supports a TTL for metrics, where they go away if they haven't been updated in long enough, but the not entirely documented default is that there is no TTL and all metrics live forever, as with Pushgateway. See the section on configuring global defaults.)

An example of this is:

echo 'our.summary:500|h|#label:val' | nc localhost 9125
echo 'our.summary:200|h|#label:val' | nc localhost 9125
echo 'our.summary:50|h|#label:val' | nc localhost 9125

With the default configuration, this results in the following Prometheus metrics:

our_summary {label="val", quantile="0.5"} 0.2
our_summary {label="val", quantile="0.9"} 0.5
our_summary {label="val", quantile="0.99"} 0.5
our_summary_sum {label="val"} 0.75
our_summary_count {label="val"} 3

According to the documentation, you can use any of the 'ms', 'h', and 'd' statsd metric types for this. If I was doing this with times, I would probably use 'ms' to try to remind myself that the raw numbers I output had to be in milliseconds instead of seconds. Otherwise I would probably use 'h' and put a comment in the script about why I was multiplying everything by 1000.
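
Going beyond the default quantiles needs a mapping configuration file for the exporter. I haven't actually tried this yet, but my reading of the documentation is that getting a histogram for the example metric plus a global TTL would be something like the following (the exact key names have shifted between exporter versions, so take this as a sketch):

defaults:
  ttl: 12h
mappings:
  - match: "our.summary"
    name: "our_summary"
    timer_type: histogram
    buckets: [0.05, 0.1, 0.5, 1, 5]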

So far I'm assuming that we'll use the statsd exporter purely for this approach for updating Prometheus metrics. If I wanted to both import genuine statsd metrics into Prometheus and use the statsd exporter as a way for scripts to update Prometheus metrics, I think I'd run two instances and configure them independently. You could probably mix the two uses in one instance, but keeping them separate just seems simpler and more straightforward.

(This is where I admit that I haven't actually used the statsd exporter for real yet, since I just discovered this today (also). But I think we have some things that would benefit from this, and so I'm tempted to start running the statsd exporter even with no metrics so that it's easy to add metrics updates to random scripts and programs as I touch them.)

PrometheusStatsdForMetricsUpdates written at 22:42:37

2019-06-21

One of the things a metrics system does is handle state for you

Over on Mastodon, I said:

Belated obvious realization: using a metrics system for alerts instead of hand-rolled checks means that you can outsource handling state to your metrics systems and everything else can be stateless. Want to only alert after a condition has been true for an hour? Your 'check the condition' script doesn't have to worry about that; you can leave it to the metrics system.

This sounds abstract, so let me make it concrete. We have some self-serve registration portals that work on configuration files that are automatically checked into RCS every time the self-serve systems do something. As a safety measure, the automated system refuses to do anything if the file is either locked or has uncommitted changes; if it touched the file anyway, it might collide with whatever else is being done to it. These files can also be hand-edited, for example to remove an entry, and when we do this we don't always remember that we have to commit the file.

(Or we may be distracted, because we are trying to work fast to lock a compromised account as soon as possible.)

Recently, I was planning out how to detect this situation and send out alerts for it. Given that we have a Prometheus based metrics and alerting system, one approach is to have a hand rolled script that generates an 'all is good' or 'we have problems' metric, feed that into Prometheus, let Prometheus grind it through all of the gears of alert rules and so on, and wind up with Alertmanager sending us email. But this seems like a lot of extra work just to send email, and it requires a new alert rule, and so on. Using Prometheus also constrains what additional information we can put in the alert email, because we have to squeeze it all through the narrow channel of Prometheus metrics, the information that an alert rule has readily available, and so on. At first blush, it seemed simpler to just have the hand rolled checking script send the email itself, which would also let the email message be completely specific and informative.

But then I started thinking about that in more detail. We don't want the script to be hair trigger, because it might run while we were in the middle of editing things (or the automated system was making a change); we need to wait a bit to make sure the problem is real. We also don't want to send repeat emails all the time, because it's not that critical (the self-serve registration portals aren't used very frequently). Handling all of this requires state, and that means something has to handle that state. You can handle state in scripts, but it gets complicated. The more I thought about it, the more attractive it was to let Prometheus handle all of that; it already has good mechanisms for 'only trigger an alert if it's been true for X amount of time' and 'only send email every so often' and so on, and it's worried about more corner cases than I have.
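
To make that concrete, the entire Prometheus side of this can be a single alert rule along these lines, with Alertmanager's usual repeat_interval handling 'don't re-send email too often'; the metric name, label, and times here are hypothetical:

groups:
  - name: selfserve
    rules:
      - alert: UncommittedPortalConfig
        expr: selfserve_config_uncommitted > 0
        for: 1h
        annotations:
          summary: "A self-serve portal config file has uncommitted RCS changes"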

The great advantage of feeding 'we have a problem/we have no problem' indications into the grinding maw of Prometheus merely to have it eventually send us alert email is that the metrics system will handle state for us. The extra custom things that we need to write, our highly specific checks and so on, are spared from worrying about all of those issues, which makes them simpler and more straightforward. To use jargon, the metrics system has enabled a separation of concerns.

PS: This isn't specific to Prometheus. Any metrics and alerting system has robust general features to handle most or even all of these issues. And Prometheus itself is not perfect; for example, it's awkward at best to set up alerts that trigger only between certain times of the day or on certain days of the week.

MetricsSystemHandlesState written at 00:21:02
