Wandering Thoughts archives

2020-06-29

How Prometheus Blackbox's TLS certificate metrics would have reacted to AddTrust's root expiry

The last time around I talked about what Blackbox's TLS certificate expiry metrics are checking, but it was all somewhat abstract. The recent AddTrust root expiry provides a great example to make it concrete. As a quick summary, the Blackbox exporter provides two metrics, probe_ssl_earliest_cert_expiry for the earliest expiring certificate and probe_ssl_last_chain_expiry_timestamp_seconds for the latest expiring verified chain of certificates.

If your TLS server included the expiring AddTrust root certificate as one of the chain certificates it was providing to clients, the probe_ssl_earliest_cert_expiry metric would have counted down and your alarms would have gone off, despite the fact that your server certificate itself wasn't necessarily expiring. This would have happened even if the AddTrust certificate wasn't used any more and its inclusion was just a vestige of past practices (for example if you had a 'standard certificate chain set' that everything served). In this case this would have raised a useful alarm, because the mere presence of the AddTrust certificate in your server's provided chain caused problems in some (or many) TLS libraries and clients.

(Browsers were fine, though.)

Even if your TLS server included the AddTrust certificate in its chain and your server certificate could use it for some verified chains, the probe_ssl_last_chain_expiry_timestamp_seconds would not normally have counted down. Most or perhaps all current server certificates could normally be verified through another chain that expired later, which is what matters here. If probe_ssl_last_chain_expiry_timestamp_seconds had counted down too, it would mean that your server certificate could only be verified through the AddTrust certificate for some reason.

Neither metric would have told you if the AddTrust certificate was actually being used by your server certificate through some verified chain of certificates, or if it was now completely unnecessary. Blackbox's TLS metrics don't currently provide any way of knowing that, so if you need to monitor the state of your server certificate chains you'll need another tool.

(There's a third party SSL exporter, but I don't think it does much assessment of chain health, or gives you enough metrics to know whether a server provided chain certificate is unnecessary.)

If you weren't serving the AddTrust root certificate and had a verified chain that didn't use it, but some clients required it to verify your server certificate, neither Blackbox metric would have warned you about this. Because you weren't serving the certificate, probe_ssl_earliest_cert_expiry would not have counted down; it includes only TLS certificates you actually serve, not all of the TLS certificates required to verify all of your currently valid certificate chains. And probe_ssl_last_chain_expiry_timestamp_seconds wouldn't have counted down because there was an additional verified chain besides the one that used the AddTrust root certificate.

(In general it's very difficult to know if some client is going to have a problem with your certificate chains, because there are many variables. Including outright programming bugs, which were part of the problem with AddTrust. If you want to be worried, read Ryan Sleevi's Path Building vs Path Verifying: Implementation Showdown.)

PrometheusBlackboxVsAddTrust written at 22:53:14

2020-06-25

What Prometheus Blackbox's TLS certificate expiry metrics are checking

One of the things that the Prometheus Blackbox exporter can do is connect to services that use TLS and harvest enough certificate information from them to let you monitor and alert on soon to expire TLS certificates. Traditionally, this was a single metric, probe_ssl_earliest_cert_expiry, but in the 0.17.0 release a second one was added, probe_ssl_last_chain_expiry_timestamp_seconds. TLS certificate expiry issues have been on my mind because of the mess from the AddTrust root expiry, and recently I read a pair of articles by Ryan Sleevi on TLS certificate path building and verifying (part 2), which taught me that this issue isn't at all simple. After all this, I wound up wondering exactly what these two Blackbox exporter metrics were checking.

When you connect to a TLS server, it sends one or more certificates to you, generally at least two, in what is commonly called a certificate chain. These server-sent certificates don't include the Certificate Authority's root certificate, because you need to already have that, and they don't actually have to form a single chain or even be related to each other. Normally they should be a chain (and be in a specific order), but people make all sorts of configuration errors and decisions in the certificates that they send. The Blackbox exporter's probe_ssl_earliest_cert_expiry metric is the earliest expiry of any of these server-sent certificates. I'm not certain whether invalid certificates are filtered out, but it definitely doesn't exclude self-signed certificates.

(Specifically, it is the earliest expiry of a certificate in the Go crypto/tls.ConnectionState's PeerCertificates.)
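
To make this concrete, here is a minimal Go sketch of the computation (an illustration, not the Blackbox exporter's actual code; 'example.org' is just a stand-in host):

package main

import (
	"crypto/tls"
	"fmt"
)

// earliestCertExpiry is the earliest NotAfter of all the certificates the
// server sent, which is what probe_ssl_earliest_cert_expiry reports (as a
// Unix timestamp). For a normal client connection, PeerCertificates won't
// be empty.
func earliestCertExpiry(state tls.ConnectionState) int64 {
	earliest := state.PeerCertificates[0].NotAfter
	for _, cert := range state.PeerCertificates[1:] {
		if cert.NotAfter.Before(earliest) {
			earliest = cert.NotAfter
		}
	}
	return earliest.Unix()
}

func main() {
	conn, err := tls.Dial("tcp", "example.org:443", nil)
	if err != nil {
		panic(err)
	}
	defer conn.Close()
	fmt.Println("probe_ssl_earliest_cert_expiry:", earliestCertExpiry(conn.ConnectionState()))
}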

When you actually verify a TLS certificate (if you do it correctly) you wind up with one or more valid paths between the server certificate and some roots of trust; we can call these the verified chains. These chains will use the server certificate that the server sent you, but they may not use all or even any of the other certificates. To work out the probe_ssl_last_chain_expiry_timestamp_seconds metric, the Blackbox exporter first goes over every verified chain to find the earliest expiry time of any certificate in it and then picks the latest such chain expiry time. These verified chains do include the CA root certificates, which don't necessarily expire in practice regardless of their nominal expiry time. If there are no verified chains at all, such as if you're dealing with a self-signed certificate, the Blackbox exporter currently makes this metric an extremely large (and useless) negative number.

(The verified chains come from the Go crypto/tls.ConnectionState's VerifiedChains. If there are no verified chains, the metric is the zero value of Go's time.Time turned into a time in the Unix epoch. Since this zero value is nearly two thousand years before January 1st 1970 UTC, it winds up very negative. This is potentially a bug and may change someday.)
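
Similarly, here is a sketch of the last chain expiry computation over the verified chains (again just an illustration of the logic described above, not the exporter's real code):

package main

import (
	"crypto/tls"
	"fmt"
	"time"
)

// lastChainExpiry: for each verified chain, find the earliest NotAfter of
// any certificate in that chain, then take the latest of those per-chain
// values. With no verified chains at all this returns Go's zero time.Time,
// whose Unix() value is hugely negative, much as described above.
func lastChainExpiry(state tls.ConnectionState) time.Time {
	var last time.Time
	for _, chain := range state.VerifiedChains {
		chainExpiry := chain[0].NotAfter
		for _, cert := range chain[1:] {
			if cert.NotAfter.Before(chainExpiry) {
				chainExpiry = cert.NotAfter
			}
		}
		if chainExpiry.After(last) {
			last = chainExpiry
		}
	}
	return last
}

func main() {
	conn, err := tls.Dial("tcp", "example.org:443", nil)
	if err != nil {
		panic(err)
	}
	defer conn.Close()
	fmt.Println("probe_ssl_last_chain_expiry_timestamp_seconds:",
		lastChainExpiry(conn.ConnectionState()).Unix())
}

(Taking the minimum across chains here instead of the maximum would give you the 'earliest expiring verified chain' figure that I wish for later in this entry.)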

Normally there will always be a verified chain, because otherwise the Blackbox TLS probe would fail entirely. You have to specifically set insecure_skip_verify to true in the Blackbox configuration in order to accept self-signed certificates or other chain problems.

So what do these metrics mean, beyond their technical details? If the earliest certificate expiry is soon, it doesn't necessarily mean that your TLS server certificate itself is about to expire, but it does mean that some TLS certificate your server is providing to people is about to. Either you're serving an unnecessary intermediate TLS certificate, or some number of your users are about to have a problem. Either is an issue that you should fix, especially since an expired certificate that's not necessary may still make many TLS libraries fail to verify your server certificate.

(This is part of what happened with the AddTrust expiry. A surprisingly large number of TLS libraries had to be patched to just skip it.)

The last chain expiry is the point at which you definitely will have problems, because no one at all will be able to build a verified chain for your server certificate. A last chain expiry that's well into the future is not a guarantee that you'll be free of problems until then, unless you know that there's only one valid chain that can be formed from your server certificate. If there are multiple chains, not all clients may be able to use all of them, so some clients could be stuck on chains that might expire earlier. The Blackbox exporter doesn't currently have a metric for the earliest expiring verified chain, but perhaps it should.

(Normally all verified certificate chains will have the same expiry time, because the shortest lifetime certificate on them should be the server's certificate itself. If there are multiple chains and there's a difference between the latest and the earliest chain expiry time, you may be about to have an exciting time (although it's not your fault).)

PrometheusBlackboxTLSExpiry written at 23:52:16

2020-06-20

The additional complications in DNS updates that secondary DNS servers add

I was recently reading Julia Evans' What happens when you update your DNS? (which is a great clear explanation of what it says), and it brought back some painful memories of the old days (which are still the current days for some people), which I might as well share.

Today, most DNS services that people deal with are managed DNS providers. When you enter a DNS update through your DNS provider's API or website, magical things happen behind the scenes in the DNS provider's infrastructure and your update normally goes live more or less immediately on all of the authoritative DNS servers involved in answering queries for your domain. In this environment, where your changes appear on your authoritative DNS servers effectively instantly, the only thing that matters for how fast your changes are visible is how long the various recursive DNS servers on the Internet have cached your existing information, as Julia Evans covers.

However, authoritative DNS servers didn't originally work that way and even today things don't necessarily quite work out that way if you run your own DNS service using straightforward DNS servers like NSD or the venerable Bind. The original DNS server environment had the idea of primary and secondary authoritative DNS servers. The primary DNS servers got all of the data for your zone from files on their disk (or more recently perhaps from a database or some network data source), and the secondary DNS servers got the data for your zone by copying it from a primary DNS server (possibly one that wasn't advertised publicly, which is often called a 'stealth master'), generally with an AXFR. Effectively your secondary authoritative DNS servers were (and are) a cache.

(You could have multiple primary servers, at which point it was up to you to make sure they were all using the same DNS zone data. The very simple way to do this was to rsync the data files around to everyone before having the DNS servers reload zones.)

Any time that you have what is effectively a cache, you should be asking about cache invalidation and refreshing; DNS servers are no exception. The original answer to this is in the specifications of the DNS SOA record, which has (zone) refresh, retry (of a failed refresh), and expire times, and a zone serial number so that secondaries could tell when their copy of the zone was out of date compared to the DNS primary. Every refresh interval, a secondary would check the SOA serial number on its primary and fetch an update if necessary. If it couldn't talk to the primary for long enough, it would declare the zone stale and stop answering queries from its cached data.

This meant that DNS updates had two timers governing their propagation once you made them. First they had to propagate from the primary to all of the secondaries, which was based on the SOA refresh time. Once all secondaries were answering queries using the new DNS data, recursive DNS servers could still have old answers cached for up to the record's TTL. In the worst case, where you made a change just after a refresh and a recursive DNS server queried your last secondary just before its refresh timer went off, your update might not reach everyone until the sum of the record's TTL and the zone's SOA refresh time (for example, with a one hour refresh and a one hour TTL, up to about two hours).

(Adding a new DNS record could have a similar delay, but here the relevant cache time was the zone's SOA minimum value, which in theory set the TTL for negative replies. More or less.)

Having to wait for secondary DNS servers to hit their refresh timers to update has various issues. Obviously it slows down DNS updates, but it also means that there's a potentially significant amount of time when your various authoritative DNS servers are giving different answers to queries. All of this was recognized relatively early on and led to RFC 1996, which created the DNS NOTIFY mechanism, which lets primary servers send a special DNS NOTIFY message to secondaries.

When you update your primary servers, they signal the secondary servers that a zone change has (probably) happened. Generally the secondaries will then immediately try to transfer the updated zone over so they can use it to answer queries. A DNS NOTIFY doesn't guarantee that the secondaries are promptly up to date, but it makes it much more likely, and there is some protection against the NOTIFY being dropped in transit between the primary and the secondaries. In practice this seems to work fairly well, especially in network environments where the primaries and secondaries are close to each other (in network terms). However it's still not guaranteed, so if you have a monitoring system, it's worth having a check that the SOA serial numbers for your zones don't stay out of sync between your primaries and secondaries for too long.

(DNS providers hopefully have similar internal monitoring.)
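
If you wanted to hand-roll such a check, a rough Go sketch using the third party github.com/miekg/dns package might look like the following. The zone and server names here are made up, and a real check would want timeouts, TCP fallback, and to only alert after the serials had disagreed for a while:

package main

import (
	"fmt"
	"log"

	"github.com/miekg/dns"
)

// soaSerial asks a single DNS server for the SOA serial number of a zone.
func soaSerial(zone, server string) (uint32, error) {
	m := new(dns.Msg)
	m.SetQuestion(dns.Fqdn(zone), dns.TypeSOA)
	c := new(dns.Client)
	r, _, err := c.Exchange(m, server+":53")
	if err != nil {
		return 0, err
	}
	for _, rr := range r.Answer {
		if soa, ok := rr.(*dns.SOA); ok {
			return soa.Serial, nil
		}
	}
	return 0, fmt.Errorf("no SOA record in answer from %s", server)
}

func main() {
	zone := "example.org"
	primary := "ns-primary.example.org"
	secondaries := []string{"ns1.example.org", "ns2.example.org"}

	want, err := soaSerial(zone, primary)
	if err != nil {
		log.Fatalf("primary %s: %v", primary, err)
	}
	for _, sec := range secondaries {
		got, err := soaSerial(zone, sec)
		if err != nil {
			fmt.Printf("%s: query failed: %v\n", sec, err)
		} else if got != want {
			fmt.Printf("%s: serial %d, but primary has %d\n", sec, got, want)
		}
	}
}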

Normally your primary DNS server software will automatically send out DNS NOTIFY messages to appropriate secondary servers if you tell it to reload things. You can generally manually trigger sending them even without a zone change or reload; one use of this is making sure that a particular secondary (or all of them) gets a little prod to try doing an update.

PS: Since we run our DNS ourselves here, this whole area remains an issue that we have to think about and remember some aspects of. But that's another entry.

PPS: Usually secondary servers have restrictions on who they'll accept DNS NOTIFY messages from, and I believe the messages can optionally be authenticated in some way these days.

DNSUpdatesAndSecondaries written at 19:19:05

2020-06-12

Dual displays contrasting with virtual screens (aka multiple desktops)

At work, I have dual displays on my office desktop, specifically two Dell U2412M monitors (which are 24" diagonal with 1920 x 1200 resolution). This gives me a lot of space to work in, and lets me do things like have a full sized Grafana dashboard on the left one while carpeting the right one with windows that are investigating the problems shown on the dashboard. Of course, given world and local events I'm not at work, I'm working from home. At home I have a nice HiDPI display, but it's a Dell P2715Q which means it's only 27" diagonal (and a 16:9 display compared to the 16:10 of the dual monitors). This is not anywhere near as much space as two displays, and the space doesn't split naturally or as nicely.

One of the things that my window manager supports is what is variously called virtual screens or multiple desktops. I have multiple virtual screens set up on my desktop at work as well as at home, but at work I've generally not used them very often or for much. Generally I would switch virtual screens only if I was interrupted in the middle of something and so needed a whole new set of windows on top of the set that I already had. Otherwise, I did everything on my primary virtual screen, because it had enough room.

This isn't really the case with working from home. Now I'm routinely out of what I consider enough space, and so my work sprawls across multiple virtual screens. Sometimes this is different parts of my work; I might be running virtual machines on one virtual screen and looking at a Grafana dashboard on another. This sort of split across virtual screens is okay, and some people would find it an improvement over putting everything on the primary screen, although I'm not sure I do (having everything iconified in one spot is convenient). However, sometimes my single screen and lack of as much space forces me to split one thing between two virtual screens. The most common case is looking at Grafana dashboards, which really want to be full screen on my display. A full screen dashboard leaves me no room for other windows to investigate things, so I often wind up flipping back and forth between a virtual screen with a Grafana dashboard and a virtual screen where I'm doing something about what the dashboard is telling me. This is, naturally, not the best experience; I can't see both things at once and I lose some context and flow as I flip back and forth.

Even with different parts of my work, it's not infrequently a bit more annoying to switch virtual screens than to have one set of things on one display and another set of things on the other. One area this especially comes up in is reading email as it comes in. At work, my email client de-iconifies on the left side of my right display (more or less in the center of where I look), and I tend to first use the left display for things like terminal windows and work, which means that there's space left for the email client to open up, for me to write replies to email, and so on. At home, the de-iconified email client is competing for space with all sorts of other things, so if email comes in while I'm working I'll often switch to another clean virtual screen to read it. This is more of an interruption than it is on my work dual display.

At the same time, the clean virtual screen that I get at home is in its own way a nicer thing. I can't deny that there's clutter and a bunch of distractions on my primary virtual screen at work, both passive ones (things I could do) and active ones (things I'm currently doing). A forced switch to a different virtual screen at home wipes away all of that and gives me a clean, low distraction slate (at least until I start cluttering up the second virtual screen). The very lack of space that I don't like pushes me to switch virtual screens more often and thus to get that new, uncluttered, lower distraction experience more often.

My current feelings are that virtual screens at home don't make up for not having dual displays. I can get my work done, but it's not as nice an experience as it is at work, and not as flowing (for lack of a better term). I'm cramming too much into too little space, and my virtual screens are mostly a method of trying to get more space (as opposed to, say, trying to keep things organized).

(Some people like using virtual screens to separate various things from each other, but my current view is that I don't want to do that for various reasons beyond the scope of this entry.)

DualDisplayVsMultiDesktop written at 00:11:12

2020-06-06

Why sysadmins don't like changing things, illustrated

System administrators are famously reluctant to change anything unless they have to; once a system works, they like to freeze it that way and not touch it. This is sometimes written off as irrational over-concern, and to be honest sometimes it is; you can make a fetish out of anything. However, it isn't just superstition and fetish. We can say general things like on good systems, you control stability by controlling changes and note that harmless changes aren't always actually harmless, but surely if you take appropriate care you can monitor your systems while applying controlled changes, promptly detect and understand any problems, and either fix them or roll back.

Well, let me tell you a story about that, and about spooky subtle action at a distance. (A story that I mentioned in passing recently.)

We have a Prometheus based monitoring and alerting system that, among other things, sends out alert notifications, which come from a Prometheus component called the Alertmanager. Those alert notifications include the start and end times of the alerts (for good reasons), and since we generally deal in local time, these are in local time. Or at least they're supposed to be. Quite recently a co-worker noticed that these times were wrong; after a short investigation, it was obvious that they were in UTC. Further investigation showed that they hadn't always been in UTC time; ever since we started with Prometheus in late 2018 they'd been in local time, as we expected, and then early in May they'd changed to UTC.

We have reasonably good records of what we've changed on our systems, so I could go back to what we'd changed on the day when the alert times switched from local time to UTC, and I could also look at the current state of the system. What I expected to find was one of four changes: the system switching timezones for some reason, an Ubuntu package update of a time related package, an update to Alertmanager itself (with a change related to this behavior), or that the systemd service for Alertmanager was putting it into UTC time. I found none of them. Instead, the cause of the timezone shift in our alert messages was an update to the Prometheus daemon, and the actual change in Prometheus was not even in its release notes (I found it only by searching Git commit logs, which led me here).

Here is an undesirable change in overall system behavior that we didn't notice for some time and that was ultimately caused by us upgrading something that wasn't obviously related to the issue. The actual cause of the behavior change was considered so minor that it didn't appear in the release notes, so even reading them (both before and after the upgrade) didn't give us any sign of problems.

This shows us, once again, that in practice you can't notice all changes in behavior immediately, that you can't predict them in advance from due diligence like reading release notes and trying things out on test systems, and that they aren't always from things that you expect; a change in one place can cause spooky action at a distance. Our alert time stamps are formatted in Alertmanager when it generates alerts, but it turned out through a long chain of actions that a minor detail of how they were created inside Prometheus made a difference in our setup.

ChangeSubtleDangerExample written at 23:46:56

2020-06-05

Why we put alert start and end times in our Prometheus alert messages

As I mentioned in Formatting alert start and end times in Alertmanager messages, we put the alert start times and if applicable the alert end times in the (email) alert messages that we send out. Generally these look like one of these two:

  • for a current alert
    (alert started at 15:16:02 EDT 2020-06-05, likely detected ~90s earlier)

  • for an alert that has ended
    (alert active from 15:16:02 EDT 2020-06-05 to 15:22:02 EDT 2020-06-05)

(The 'likely detected ..' bit is there because most of our Prometheus alert rules have a 'for:' clause, so the alert condition becomes true somewhat before the alert itself starts.)

At the beginning of life with Prometheus and Alertmanager, it may not be obvious why this is useful and sometimes even necessary; after all, the alert message itself already has a time when it was emailed, posted to your communication channel, or whatever.

The lesser reason we do this, especially for alert end times, is that it's convenient to have this information in one place when we're going back through email. If we have a 'this alert is resolved' email, we don't have to search back to see when it started; the information is right there in the final message. There's a similar but smaller convenience with email about the start of single alerts, since you can just directly read off the start time from the text of the message without looking back to however your mail client is displaying the email's sending time.

The larger reason is how Alertmanager works with grouped alerts (which is almost all of our alerts). Alertmanager's core model is that rather than sending you new alerts or resolved alerts (or both), it will send you the entire current state of the group's alerts any time that state changes. What this means is that if at first alert A is raised, then somewhat later alert B, then finally alert C, you will get an email listing 'alert A is active', then one saying 'alerts A and B are active', then a third saying 'alerts A, B, and C are active'.

When you get these emails, you generally want to know what alerts are new and what alerts are existing older alerts. You're probably already looking at the existing alerts, but the new alerts may be for new extra problems that you also need to look at, and they may be a sign that things are getting worse. And this is why you want the alert start times, because they let you tell which alerts are more recent (and more likely to be new ones you haven't seen before) and which ones are older. It's not as good as being clearly told which alerts are new in this message, but it's as good as we can get in the Alertmanager model of the world.

(I don't know if Alertmanager puts the alerts in these messages in any particular order. Even if it does so today, there's no documentation about it so it's not an official feature and may change in the future. It would be nice if Alertmanager used a documented and useful order, or let you sort the alerts based on start and end times.)

PrometheusAlertsWhyTimes written at 22:37:09

2020-06-04

Formatting alert start and end times in Prometheus Alertmanager messages

One of the things that you'll generally wind up doing with Prometheus's Alertmanager is creating custom alert messages (in whatever medium you send alerts, whether that's email or something else). Alertmanager comes with some default templates for various alerting destinations, but they're very generic and not all that useful in many situations. If you're customizing alert messages, one thing you might want to put in is when the alerts started (and possibly when they ended, if you send messages when alerts are 'resolved', ie stop). For instance, you might want a part of the message that looks like this:

  • some text about the alert
    (alert active from 17:52:49 2018-10-21 to 17:58:19 2018-10-21)

In Prometheus alerts, there are two places you can format things, with somewhat different features available in them; you can format data through templating in alert annotations, or through templating in the actual Alertmanager message templates. Both of these use Go templating, but they have somewhat different sets of formatting functions available (generally alert annotations have a richer set of formatting functions).

The start time and possibly the end time of alerts are available in Alertmanager templating as StartsAt and EndsAt names on each alert object (which are in turn accessed through, eg, Alerts). These are Go time.Time values, and so are formatted through their .Format() method. You do this formatting like the following (the range is for illustrative purposes):

{{ range .Alerts.Firing }}
   alert started at {{ .StartsAt.Format "15:04:05 2006-01-02" }}
{{ end }}

There are two things wrong with this formatting example, one of which is visible in the example message above. First, it doesn't tell you what time zone the formatted time is in. Second, it doesn't force this time zone to be anything specific, which matters because Go time.Time values have time zones associated with them that change how a given absolute time is presented. Since we're not forcing a time zone and not displaying the StartsAt's time zone in the formatted time, we have no idea what time this really is. In my example message above, it's '17:52:49' in some time zone, but we have no idea which one.

To deal with both issues, the correct way in Alertmanager templates to format a time is:

{{ .StartsAt.Local.Format "15:04:05 MST 2006-01-02" }}

This will give you a formatted time like '22:16:02 EDT 2020-06-04', which is clear and explicit (if you want, you can be more explicit in various ways; see the Go time formatting documentation).

(Change the .Local to .UTC if you want your alert times in UTC time instead of Alertmanager's local time zone, whatever you've set that to.)
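
Putting the pieces together, a stripped down fragment of a custom email template might look something like this (an illustration, not our actual template text):

{{ range .Alerts.Firing }}
  (alert started at {{ .StartsAt.Local.Format "15:04:05 MST 2006-01-02" }})
{{ end }}
{{ range .Alerts.Resolved }}
  (alert active from {{ .StartsAt.Local.Format "15:04:05 MST 2006-01-02" }} to {{ .EndsAt.Local.Format "15:04:05 MST 2006-01-02" }})
{{ end }}

Both cases use .Local, so everything comes out in Alertmanager's local time zone and carries an explicit time zone name.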

Incidentally, we were previously using the first form, and we had our formatted alert time stamps in our alert email silently change from local time to UTC after we upgraded Prometheus (not Alertmanager) from 2.17.2 to 2.18.1 back in early May. This is because of this commit for issue 7066, which was considered so small that it didn't even appear in the Prometheus release notes. Without the time zone explicitly named in our messages, it took some time before an alert co-worker noticed that the times looked odd (and working out what was going on involved some head scratching).

PrometheusAlertTimeFormatting written at 23:34:10

2020-06-01

Watching the recent AddTrust root CA certificate expiry has been humbling

The news of the recent past is that the old 'AddTrust External CA Root' root certificate expired on May 30th (at 10:48:48 UTC). Before this happened, I confidently told more than one person that one reason I was sure our TLS certificate environment wasn't affected by this was that our Prometheus based monitoring system specifically looks at all of the TLS certificates in a certificate validation chain, not just the first ('leaf') certificate for the server itself, and reports the lowest expiry time. Since our alerts had not been going off, we didn't have the AddTrust CA root in our certificate chains. Although we had no problems ourselves, in retrospect this looks naive and has exposed a real issue to think about with TLS certificate monitoring.

Generally what broke because of the AddTrust root expiring is (and was) not current browsers or even current monitoring things like Prometheus (at least when run on reasonably current systems). Instead, it was older software, such as OpenSSL 1.0.1 (via), and older systems using old root certificate bundles. These systems either only had the AddTrust root to rely on, without the modern roots that have supplanted it, or had programming issues (ie, bugs) that caused them to not fall back to try additional certificate chains when they hit the expired AddTrust root CA certificate.

What this points out is that whether TLS validation works can depend on the client (and the client's environment), especially in the face of expired or invalidated certificates somewhere in the chain. So far, we haven't been considering this in monitoring and testing. Our monitoring has been tacitly assuming that if Prometheus' Blackbox checks liked our TLS certificate chains, everything was good for all clients everywhere. So has our testing, more or less; if we're testing a new HTTPS web server or whatever, we'll point a browser at it, see if the browser is happy, and then call it done.

(This is especially questionable because browsers go way out of their way to try to make TLS certificate chains work; they'll use cached intermediate certificates and sometimes even fetch them on the fly.)

This monitoring and testing is very likely safe for all modern client software (browsers, IMAP clients, and also programming tools and environments). But it's likely not universally safe for us. We can have old programs on old operating systems, and we can have client programs where we've needed to specifically configure a certificate chain for some reason. Those and similar things may well fail in the face of an issue similar to this AddTrust one, and without our monitoring and testing flagging it.
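
One thing we could do (although we haven't) is to explicitly check a server's certificates against a specific, older CA root bundle, to approximate what a client with an outdated trust store will see. Here is a rough Go sketch of the idea, with a made up host name and bundle file:

package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"log"
	"os"
)

func main() {
	const host = "www.example.org"

	// Load the CA root bundle we want to test against (for example one
	// harvested from an old system), instead of this machine's current roots.
	pemBytes, err := os.ReadFile("old-ca-bundle.pem")
	if err != nil {
		log.Fatal(err)
	}
	roots := x509.NewCertPool()
	if !roots.AppendCertsFromPEM(pemBytes) {
		log.Fatal("no usable CA certificates in bundle")
	}

	// Connect without verification so we get the served certificates even
	// if they wouldn't verify, then verify them ourselves against our roots.
	conn, err := tls.Dial("tcp", host+":443", &tls.Config{InsecureSkipVerify: true})
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	certs := conn.ConnectionState().PeerCertificates
	intermediates := x509.NewCertPool()
	for _, c := range certs[1:] {
		intermediates.AddCert(c)
	}
	_, err = certs[0].Verify(x509.VerifyOptions{
		DNSName:       host,
		Roots:         roots,
		Intermediates: intermediates,
	})
	fmt.Println("verification against the old bundle:", err)
}

Of course this only covers the trust store side of things; it can't reproduce the path building and verification bugs in older TLS libraries, which were a good part of the AddTrust problem.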

I don't have any particular answers for this. For web servers, we often use the SSL Server Test, which reports results for a variety of older browsers as well as current ones (although I'm not certain that that covers certificate chain issues for them or if it's just ciphers). For IMAP servers or the like, well, we'd have to wait for problem reports from people with old clients or something.

(Since we're using Let's Encrypt for everything today, with automatically built certificate chains, it's probably not worth setting up monitoring to look for unnecessary intermediate certificates or badly formed certificate chains. Neither should happen short of a catastrophic malfunction in Certbot or Let's Encrypt.)

CertExpiryHandlingVariety written at 20:53:39
