Wandering Thoughts

2019-06-21

One of the things a metrics system does is handle state for you

Over on Mastodon, I said:

Belated obvious realization: using a metrics system for alerts instead of hand-rolled checks means that you can outsource handling state to your metrics systems and everything else can be stateless. Want to only alert after a condition has been true for an hour? Your 'check the condition' script doesn't have to worry about that; you can leave it to the metrics system.

This sounds abstract, so let me make it concrete. We have some self-serve registration portals that work on configuration files that are automatically checked into RCS every time the self-serve systems do something. As a safety measure, the automated system refuses to do anything if the file is either locked or has uncommitted changes; if it touched the file in that state, it might collide with whatever else is being done to the file. These files can also be hand-edited, for example to remove an entry, and when we do this we don't always remember that we have to commit the file.

(Or we may be distracted, because we are trying to work fast to lock a compromised account as soon as possible.)
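As an aside, the heart of a safety check like this can be put together from standard RCS commands. Here's a minimal sketch of the general idea (the file path is made up, and this isn't our actual code):

    cf=/some/portal/config-file
    # 'rlog -L -R' prints the RCS file's name only if some revision is locked.
    if [ -n "$(rlog -L -R "$cf")" ]; then
        echo "$cf is locked, not proceeding" >&2; exit 1
    fi
    # 'rcsdiff -q' exits non-zero if the working file differs from the
    # latest checked-in revision, ie there are uncommitted changes.
    if ! rcsdiff -q "$cf" >/dev/null 2>&1; then
        echo "$cf has uncommitted changes, not proceeding" >&2; exit 1
    fi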

Recently, I was planning out how to detect this situation and send out alerts for it. Given that we have a Prometheus-based metrics and alerting system, one approach is to have a hand-rolled script that generates an 'all is good' or 'we have problems' metric, feed that into Prometheus, let Prometheus grind it through all of the gears of alert rules and so on, and wind up with Alertmanager sending us email. But this seems like a lot of extra work just to send email, and it requires a new alert rule, and so on. Using Prometheus also constrains what additional information we can put in the alert email, because we have to squeeze it all through the narrow channel of Prometheus metrics, the information that an alert rule has readily available, and so on. At first blush, it seemed simpler to just have the hand-rolled checking script send the email itself, which would also let the email message be completely specific and informative.

But then I started thinking about that in more detail. We don't want the script to be hair-trigger, because it might run while we're in the middle of editing things (or while the automated system is making a change); we need to wait a bit to make sure the problem is real. We also don't want to send repeat emails all the time, because it's not that critical (the self-serve registration portals aren't used very frequently). Handling all of this requires state, and that means something has to manage that state. You can handle state in scripts, but it gets complicated. The more I thought about it, the more attractive it became to let Prometheus handle all of that; it already has good mechanisms for 'only trigger an alert if it's been true for X amount of time' and 'only send email every so often' and so on, and it worries about more corner cases than I would.
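To make the Prometheus side concrete, here's a rough sketch of what the alert rule could look like. This assumes a hypothetical 'portal_config_ok' metric (1 if everything is fine, 0 if not) that the checking script publishes, for example through the node_exporter textfile collector; the rule name and the one hour 'for' delay are made up for illustration:

    groups:
      - name: portal
        rules:
          - alert: PortalConfigNotCommitted
            expr: portal_config_ok == 0
            # Prometheus keeps track of how long this has been true.
            for: 1h
            annotations:
              summary: "Self-serve portal config is locked or has uncommitted changes"

The 'don't re-email us all the time' half is then Alertmanager's job, through its repeat_interval setting.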

The great advantage of feeding 'we have a problem/we have no problem' indications into the grinding maw of Prometheus merely to have it eventually send us alert email is that the metrics system will handle state for us. The extra custom things that we need to write, our highly specific checks and so on, are spared from worrying about all of those issues, which makes them simpler and more straightforward. To use jargon, the metrics system has enabled a separation of concerns.

PS: This isn't specific to Prometheus. Any decent metrics and alerting system will have robust, general features for handling most or even all of these issues. And Prometheus itself is not perfect; for example, it's awkward at best to set up alerts that trigger only between certain times of the day or on certain days of the week.

MetricsSystemHandlesState written at 00:21:02

2019-06-19

A Let's Encrypt client feature I always want for easy standard deployment

On Twitter, I said:

It bums me out that Certbot (the 'official' Let's Encrypt client) does not have a built-in option to combine trying a standalone HTTP server with a webroot if the standalone HTTP server can't start.

(As far as I can see.)

For Let's Encrypt authentication, 'try a standalone server, then fall back to webroot' lets you create a single setup that works in a huge number of cases, including on initial installs before Apache/etc has its certificates and is running.

In straightforward setups, the easiest way to prove your control of a domain to Let's Encrypt is generally their HTTP authentication method, which requires the host (or something standing in for it) to serve a file under a specific URL. To do this, you need a suitably configured web server.

Like most Let's Encrypt clients, Certbot supports both putting the magic files for Let's Encrypt in a directory of your choice (which is assumed to already be configured in some web server you're running) and temporarily running its own little web server to serve them itself. But it doesn't support trying both at once, and this leaves you with a problem if you want to deploy a single standard Certbot configuration to all of your machines, some of which run a web server and some of which don't. And on machines that do run a web server, it is not necessarily running when you get the initial TLS certificates, because at least some web servers refuse to start at all if you have HTTPS configured and the TLS certificates are missing (because, you know, you haven't gotten them yet).

Acmetool, my favorite Let's Encrypt client, supports exactly this dual-mode operation and it is marvelously convenient. You can run one acmetool command no matter how the system is configured, and it works. If acmetool can bind to port 80, it runs its own web server; if it can't, it assumes that your webroot setting is good. But, unfortunately, we need a new Let's Encrypt client.
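For what it's worth, the single command in question is acmetool's 'want' subcommand; the hostname here is just an example:

    $ acmetool want www.example.org

acmetool then decides for itself whether to run its own listener on port 80 or to drop the challenge file into its configured webroot.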

For Certbot, I can imagine a complicated scheme of additional software and pre-request and post-request hooks to make this work; you could start a little static-file-only web server if there wasn't already something on port 80, then stop it afterward. But that requires additional software and is annoyingly complicated (and I can imagine failure modes). For extra annoyance, it appears that Certbot does not have convenient commands to change the authentication method configured for any particular certificate (which will be used when certbot auto-renews it, unless you hard-code some method in your cron job). Perhaps I am missing something in the Certbot documentation.
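To sketch the kind of workaround I mean (and this is an untested sketch with made-up paths and domains, not something we actually run), a wrapper could pick the authentication method itself based on whether anything is already listening on port 80:

    #!/bin/sh
    domain=www.example.org
    webroot=/var/www/letsencrypt

    # If something is already listening on port 80, assume it serves the
    # webroot; otherwise let certbot run its own standalone server.
    if ss -ltn 'sport = :80' | grep -q LISTEN; then
        certbot certonly --webroot -w "$webroot" -d "$domain"
    else
        certbot certonly --standalone -d "$domain"
    fi

This still leaves the renewal question, since whichever method was used at issue time is what gets recorded in the certificate's renewal configuration.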

(This is such an obvious and convenient feature that I'm quite surprised that Certbot, the gigantic featureful LE client that it is, doesn't support it already.)

LetsEncryptEasyDeployWant written at 01:13:03

2019-06-17

Sometimes, the problem is in a system's BIOS

We have quite a number of donated Dell C6220 blade servers, each of which is a dual-socket machine with Xeon E5-2680s. Each E5-2680 is an 8-core CPU with HyperThreads, so before we turned SMT off, these machines reported as having 32 CPUs; that drops to 16 if you either turn SMT off or have to disable one socket, and to 8 if you have to do both. These days, most of these machines have been put into a SLURM-based scheduling system that provides people with exclusive access to compute servers.

Once upon a time recently, we noticed that the central SLURM scheduler was reporting that one of these machines had two CPUs, not (then) 32. When we investigated, this turned out not to be some glitch in the SLURM daemons or a configuration mistake, but actually what the Linux kernel was seeing. Specifically, as far as the kernel could see, the system was a dual socket system with each socket having exactly one CPU (still Xeon E5-2680s, though). Although we don't know exactly how it happened, this was ultimately due to BIOS settings; when my co-worker went into the BIOS to check things, he found that the BIOS was set to turn off both SMT and all extra cores on each socket. Turning on the relevant BIOS options restored the system to its full expected 32-CPU count.

(We don't know how this happened, but based on information from our Prometheus metrics system it started immediately after our power failure; we just didn't notice for about a month and a half. Apparently the BIOS default is to enable everything, so this is not as simple as a reversion to defaults.)
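As an aside, lscpu is a quick way to check what the kernel currently sees on a machine like this. A fully enabled machine of this sort should report something like the following (the output here is illustrative, not captured from one of ours):

    $ lscpu | egrep 'Socket|Core|Thread|^CPU\(s\)'
    CPU(s):                32
    Thread(s) per core:    2
    Core(s) per socket:    8
    Socket(s):             2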

If nothing else, this is a useful reminder to me that BIOSes can do weird things and can be set in weird ways. If nothing else makes sense, well, it might be in the BIOS. It's worth checking, at least.

(We already knew this about Dell BIOSes, of course, because our Dell R210s and R310s came set with the BIOS disabling all but the first drive. When you normally use mirrored system disks, this is first mysterious and then rather irritating.)

BIOSCoresShutdown written at 23:16:37

2019-06-14

Intel's MDS issues have now made some old servers almost completely useless to us

Over on Twitter, I said:

One unpleasant effect of MDS is that old Intel-based machines (ones with CPUs that will not get microcode updates) are now effectively useless to us, unlike before, because it's been decided that the security risks are too high for almost everything we use machines for.

If Intel releases all of the MDS microcode updates they've promised to do (sometime), this will have only a small impact on our available servers. If they decide not to update some older CPUs they're currently promising updates for, we could lose a significant number of servers.

To have MDS mitigated at all on Intel CPUs, you need updated microcode, an updated kernel, and hyperthreading turned off (see eg Ubuntu's MDS page). According to people who have carefully gone through the CPU lists in the PDFs that are linked from Intel's MDS advisory page, Intel is definitely not releasing microcode updates for anything older than the 2nd generation 'Core' "Sandy Bridge" architecture, which started appearing in 2011 or so (so in theory my 2011 home machine could receive a microcode update and be fixed against MDS). In theory they are going to release microcode updates for everything since then.
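On kernels with the MDS mitigation work, you can check the resulting status directly through sysfs. The exact wording varies with the kernel version and the CPU, but it looks something like this (illustrative output from a machine with updated microcode and SMT turned off):

    $ cat /sys/devices/system/cpu/vulnerabilities/mds
    Mitigation: Clear CPU buffers; SMT disabled

A machine whose CPU will never get updated microcode reports itself as 'Vulnerable' here no matter what the kernel does.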

For reasons beyond the scope of this entry, we have decided that we have almost no roles where we are comfortable deploying an unpatchable machine that is vulnerable to MDS. In normal places this might not be an issue, since they would long since have turned over old server hardware. We are not a normal place; we run hardware into the ground, which means that we often have very old servers. For example, up until we reached this decision about MDS, we were still using Dell 1950s as scratch and test machines.

In theory, the direct effects of MDS on our stock of live servers are modest but present; we're losing a few machines, including one quite nice one (a Westmere Xeon with 512 GB of RAM), and some machines that were on the shelf are now junk. However, at the moment we have a significant number of servers that are based around Sandy Bridge Xeons. Intel has promised microcode updates for these CPUs but has not yet delivered them. If Intel never delivers updated microcode, we'll lose a significant number of machines and pretty much decimate the compute infrastructure we provide to the department.

(A great many of our current compute machines are donated Dell C6220 blades with Xeon E5-2680s, specifically CPUID 206D7 ones. Don't ask how much work it took to get the raw CPUID values that Intel puts in their PDF.)
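For the record, you can reconstruct that raw CPUID signature from the family, model, and stepping fields in /proc/cpuinfo instead of digging it out of other tools. Here's a sketch that works for family values under 15, which covers these Xeons:

    $ awk -F': *' '/^cpu family/ {f=$2} /^model[ \t]*:/ {m=$2}
                   /^stepping/ {s=$2} /^$/ {exit}
                   END {printf "%X\n", int(m/16)*65536 + f*256 + (m%16)*16 + s}' /proc/cpuinfo
    206D7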

If further significant MDS related attacks get discovered and Intel is more restricted with microcode updates, this situation will get worse. We have a significant number of in-service Dell R210 IIs and R310 IIs, and they are almost as old as these C6220s (although some of them have Ivy Bridge generation CPUs instead of Sandy Bridge ones). Losing these would really hurt.

(In general we are not very happy with Intel right now, and we are starting to deploy AMD-based machines where we can. I would be happy if someone started offering decent basic 1U or 2U AMD-based servers at competitive prices.)

IntelMDSKillsOldServers written at 01:13:04

2019-06-10

Keeping your past checklists doesn't help unless you can find them again

I mentioned recently that the one remaining filesystem we need to migrate from our remaining OmniOS fileserver to our Linux fileservers is our central administrative filesystem. This filesystem is a challenge to move because everything depends on it and a number of systems are tangled into it (for example, our fileservers all rsync a partial copy of it from its master fileserver, which means that they need to know and agree on which fileserver is the master, and the master fileserver had better not try to replicate a copy from itself). With so many things depending on this filesystem and with it so tangled into our systems, we clearly needed to carefully consider and plan the migration.

Of course, this is not the first time we've moved this filesystem between ZFS fileservers; we did it before when we moved from our first generation Solaris fileservers to our OmniOS fileservers. Since I'm a big fan of checklists and I have theoretically learned a lesson about explicitly saving a copy after things are over, I was pretty sure that I had written up a checklist for the first migration and then kept it. Somewhere. In my large, assorted collection of files, in several directories, under some name.

As you might expect from how I wrote that, my initial attempts to find a copy of our 2015-ish checklist for this migration did not go well. I looked in our worklog system, where ideally it should have been mailed in after the migration was done, and I hunted around in several areas where I would have expected to keep a personal copy. Nothing came up in the searches that I attempted, and I found myself wondering if we had even done a checklist (as crazy as that seemed). If we had done one, it seemed we had lost it since then.

Today, freshly back from vacation, I resorted to a last ditch attempt at brute force. I used 'zpool history' to determine that we had probably migrated the filesystem in mid-February of 2015, and then I went back to the archives of both our worklog system and our sysadmin mailing list (where we coordinate and communicate with each other), and at least scanned through everything from the start of that February. This finally turned up an email with the checklist (in our sysadmin mailing list archives, which fortunately we keep), and once I had that, I could re-search all of my files for a tell-tale phrase from it. And there the checklist was, in a file called 'work/Old/cssite-mail-migration'.

(It had that name because back in 2015 we migrated a bunch of administrative filesystems all at once at the end of things, including our central /var/mail. This time around, we migrated /var/mail very early because we knew 10G instead of 1G would make a real difference for it.)
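Condensed down, the brute force stage was roughly the following; the pool name, filesystem name, and archive location here are made-up stand-ins, not our real ones:

    # When did ZFS operations touch the filesystem in question?
    $ zpool history tank | grep adminfs | less
    # ... which pointed at mid-February 2015, and then:
    $ grep -rli adminfs /archive/sysadmin-list/2015-02/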

Sadly, I'm not sure right now how I could have done much better than this round-about way. Explicitly sending the checklist to our worklog system would have helped a bit, but even then I would have had to stumble on the right search terms to find it. Both taxonomy and searching are hard (human) problems; with my searching, I was basically playing a guessing game about what specific terms, commands, or whatever would have been used in the checklist, and evidently I guessed wrong. One possible improvement might be to make a storage directory specifically for checklists, which would at least narrow my searching down (there are a lot of things in 'work/Old').

(Things like the name of the filesystem plus 'migration' are not useful search terms, because this filesystem shows up in every filesystem migration we do; it's where we put central data about NFS mounts and so on.)

FindableChecklists written at 21:54:30

2019-06-08

Our current approach for updating things like build instructions

At work here, we have a strong culture of documenting everything we do in email, in something that we call 'worklogs'. Worklog messages are sent to everyone in my group, and they are also stored in a private, searchable web archive. We also have systematic build instructions for our systems, and unsurprisingly they are also worklog messages. However, they are unusual worklog messages, because they record not only what was done but also what you should do to recreate the system. This means that when we modify a system covered by build instructions, we have to update these build instructions and re-mail them to the worklog system.

For a long time, that was literally how things worked. If and when you modified such a system, part of your job was to go to the worklog archives, search through them, find the most recent build instructions for the system, make a copy, modify the copy, and then mail it back in. If you were making a modification that you weren't sure was final or that we'd want to keep, you had to make an additional note to do this whole update process when the dust settled. If you were in a rush or had other things to do too or weren't certain, it was pretty tempting to postpone all of this work until some convenient later time. Sometimes that didn't happen, or at least didn't happen before a co-worker also modified the machine (with various sorts of confusion possible in the aftermath). A cautious person who wanted to build a copy of a machine for a new Ubuntu version would invariably wind up trawling through our worklog archive to check for additional changes on top of the latest build instructions, and then perhaps have to sort out if we wanted to keep some of them.

At one point, all of this reached a critical mass and we decided that something had to change; build instructions needed to be more reliably up to date. We decided to make a simple change to enable easier updates; we would commit to keeping the current copy of every set of build instructions in a file in a known spot in our central administrative filesystem, as well as mailing them to the worklog system. That way, we could cut a number of steps off the update process to reduce the friction involved; rather than hunting for the latest version, you could just go edit it, commit it to RCS, and then mail it in to the worklog system.
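The resulting update cycle is short enough to show in full. This is only a sketch; the directory, file name, and worklog address are invented stand-ins for our real ones:

    $ cd /our/admin/filesystem/build-docs
    $ vi somehost-install
    $ ci -l -m'note the new package we now install' somehost-install
    $ mail -s 'somehost build instructions (updated)' worklog@our.domain <somehost-install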

(The version in the worklog system remains the canonical reference, in part because my co-workers keep printed out copies of the build instructions for various critical systems. An update doesn't really exist until it's been emailed in.)

This modest change is now a couple of years old and I think it's safe to say that it's been a smashing success. Our build instructions are now almost always up to date and it takes much less work to keep them that way. What was a pain in the rear before is now only a couple of minutes of work, often quite simple work. In the common case, you can copy the commands necessary from your existing email message about 'I made the following change to system X', since we always write those when we make changes. As an additional benefit, we don't have to worry about line-wrapping and other mangling happening when we copy email messages around and cut & paste them from the web archive system and so on; the 'real' build instructions live in a text file and never get mangled by any mail-related thing.

In general and in theory I know the power of small changes that reduce friction, but pretty much every time I run into one in practice it surprises me all over again. This is one of the times; I certainly hoped for a change and an improvement, but I didn't have any real idea how large of one it would be.

PS: We also have various 'how-tos' and 'how-this-works' and so on documentation that we keep in the same directory and update in the same way. Basically, any email in our worklog archive that serves as the canonical instructions or explanations for something is a candidate to be enrolled in this system, not just system build instructions.

UpdatingDocumentationApproach written at 00:18:07

2019-06-02

Exploring the start time of Prometheus alerts via ALERTS_FOR_STATE

In Prometheus, active alerts are exposed through two metrics, the reasonably documented ALERTS and the under-documented new metric ALERTS_FOR_STATE. Both metrics have all of the labels of the alert (although not its annotations), and also an 'alertname' label; the ALERTS metric also has an additional 'alertstate' label. The value of the ALERTS metric is always '1', while the value of ALERTS_FOR_STATE is the Unix timestamp of when the alert rule expression started being true; for rules with 'for' delays, this means that it is the timestamp when they started being 'pending', not when they became 'firing' (see this rundown of the timeline of an alert).

(The ALERTS_FOR_STATE metric is an internal one added in 2.4.0 to persist the state of alerts so that 'for' delays work over Prometheus restarts. See "Persist 'for' State of Alerts" for more details, and also Prometheus issue #422. Because of this, it's not exported from the local Prometheus and may not be useful to you in clustered or federated setups.)

The ALERTS_FOR_STATE metric is quite useful if you want to know the start time of an alert, because this information is otherwise pretty much unavailable through PromQL. The necessary information is sort of in Prometheus's time series database, but PromQL does not provide any functions to extract it. Also, unfortunately there is no good way to see when an alert ends even with ALERTS_FOR_STATE.

(In both cases the core problem is that alerts that are not firing don't exist as metrics at all. There are some things you can do with missing metrics, but there is no good way to see in general when a metric appears or disappears. In some cases you can look at the results of manually evaluating the underlying alert rule expression, but in other cases even this will return no value when the alert is not active.)

We can do some nice things with ALERTS_FOR_STATE, though. To start with, we can calculate how long each current alert has been active, which is just the current time minus when it started:

time() - ALERTS_FOR_STATE

If we want to restrict this to alerts that are actually firing at the moment, instead of just being pending, we can write it as:

    (time() - ALERTS_FOR_STATE)
      and ignoring(alertstate) ALERTS{alertstate="firing"}

(We must ignore the 'alertstate' label because the ALERTS_FOR_STATE metric doesn't have it.)

You might use this in a dashboard where you want to see which alerts are new and which are old.

A more involved query is one to tell us the longest amount of time that a firing alert has been active over the past time interval. The full version of this is:

max_over_time( ( (time() - ALERTS_FOR_STATE)
                  and ignoring(alertstate)
                         ALERTS{alertstate="firing"}
               )[7d:] )

The core of this is the expression we already saw, and we evaluate it over the past 7 days, but until I thought about things it wasn't clear why this gives us the longest amount of time for any particular alert. What is going on is that while an alert is active, ALERTS_FOR_STATE's value stays constant while time() is counting up, because it is evaluated at each step of the subquery. The maximum value of 'time() - ALERTS_FOR_STATE' happens right before the alert ceases to be active and its ALERTS_FOR_STATE metric disappears. Using max_over_time captures this maximum value for us.

(If the same alert is active several times over the past seven days, we only get the longest single time. There is no good way to see how long each individual incident lasted.)

We can exploit the fact that ALERTS_FOR_STATE has a different value each time an alert activates to count how many different alerts activated over the course of some range. The simplest way to do this is:

changes( ALERTS_FOR_STATE[7d] ) + 1

We have to add one because going from not existing to existing is not counted as a change in value for the purpose of changes(), so an alert that only fired once will be reported as having 0 changes in its ALERTS_FOR_STATE value. I will leave it as an exercise to the reader to extend this to only counting how many times alerts fired, ignoring alerts that only became pending and then went away again (as might happen repeatedly if you have alerts with deliberately long 'for' delays).

(This entry was sparked by a recent prometheus-users thread, especially Julien Pivotto's suggestion.)

PrometheusAlertStartTimeStuff written at 22:15:26

2019-05-24

The problem of paying too much attention to our dashboards

On Mastodon, I said:

Our Grafana dashboards are quite shiny, at least to me (since I built them), but I really should start resisting the compulsive urge to take a look at them all the time just to see what's going on and look at the pretty zigzagging lines.

I have a bad habit of looking at shiny things that I've put together, and dashboards are extremely shiny (even if some of them are almost all text). There are two problems with this, the obvious and the somewhat subtle.

The obvious problem is that, well, I'm spending my time staring somewhat mindlessly at pretty pictures. It's interesting to watch lines wiggle around or to look over collections of numbers, but it's generally not informative. It's especially not informative for our systems because our systems spend almost all of their time working fine, which means that there is no actual relevant information to be had from all of these dashboards. In terms of what I spend (some) time on, I would be better off if we had one dashboard with one box that said 'all is fine'.

This is a general issue with dashboards for healthy environments; if things are fine, your dashboards are probably telling you nothing or at least nothing that is of general interest and importance.

(Your dashboards may be telling you details and at some point you may want access to those details, like how many email messages you typically have in your mailer queues, but they are not generally important.)

The more subtle problem is the general problem of metrics, which is a variant of Goodhart's law. Once you have a metric and you pay attention to the metric, you start to focus on the metric. If you have a dashboard of metrics, it's natural to pay attention to the metrics and to exceptions in the metrics, whether or not they actually matter. It may or may not matter that a machine has an unusually high load average, but if it's visible, you're probably going to focus on it and maybe dig into it. Perhaps there is a problem, but often there isn't, especially if you're surfacing a lot of things on your dashboards because they could be useful.

(One of the things behind this is that all measures have some amount of noise and natural variation, but as human beings we have a very strong drive to uncover patterns and meaning in what we see. If you think you see some exceptional pattern, it may or may not be real but you can easily spend a bunch of time trying to find out and testing theories.)

My overall conclusion from my own experiences with our new dashboards and metrics system is that if you have good alerts, you (or at least I) would be better off only looking at dashboards if there is some indication of actual problems, or if you have specific questions you'd like to answer. In practice, trawling for 'is there anything interesting' in our dashboards is a great way to spend some time and divert myself down any number of alleyways, most of them not useful ones.

(In a way the worst times are the times when looking at our dashboards actually is useful, because that just encourages me to do it more.)

PS: This is not the first time I've seen the effects of something like this; I wrote about an earlier occasion way back in Metrics considered dangerous.

DashboardAttentionProblem written at 23:15:38

2019-05-20

Understanding how to pull in labels from other metrics in Prometheus

Brian Brazil recently wrote Analyse a metric by kernel version, where he shows how to analyze a particular metric in a new way by, essentially, adding a label from another metric to the metric, in this case the kernel version. His example is a neat trick, but it's also reasonably tricky to understand how it works, so today I'm going to break it down (partly so that I can remember this in six months or a year from now, when my PromQL knowledge has inevitably rusted).

The query example is:

avg without (instance)(
    node_sockstat_TCP_tw 
  * on(instance) group_left(release)
    node_uname_info
)

The simple version of what's happening here is that because node_uname_info's value is always 1, we're using '*' as a do-nothing arithmetic operator so we can essentially do a join between node_sockstat_TCP_tw and node_uname_info to grab a label from the latter. We have to go to these lengths because PromQL does not have an explicit 'just do a join' operator that can be used with group_left.

There are several things in here. Let's start with the basic one, which is the '* on(instance)' portion. This is one to one vector matching with a restriction on what label is being used to match up pairs of entries; we're implicitly restricting the multiplication to pairs of entries with matching 'instance' labels. Normally 'instance' will be the same for all metrics scraped from a single host's node_exporter, so it makes a good label for finding the node_uname_info metric that corresponds to a particular host's node_sockstat_TCP_tw metric.

(We have to use 'on (...)' because not all labels match. After all, we're pulling in the 'release' label from the node_uname_info metric; if it was already available as a label on node_sockstat_TCP_tw, we wouldn't need to do this work at all.)

Next is the group_left, which is being used here for its side effect of incorporating the 'release' label from node_uname_info in the label set of the results. I wrote about the basics of group_left's operation in Using group_* vector matching for database lookups, where I used group_left basically as a database join between a disk space usage metric and an alert level metric that also carried an additional label we wanted to include for who should get alerted. Brian Brazil's overall query here is similar to my case, except that here we don't care about the value that the node_uname_info metric has; we are only interested in its 'release' label.

In an ideal world, we could express this directly in PromQL to say 'match between these two metrics based on instance and then copy over the release label from the secondary one'. In the real world, unfortunately, group_left and group_right have the limitation that they can only be used with arithmetic and comparison operators. In my earlier entry this wasn't a problem because we already wanted to compare the values of the two metrics. Here, we don't care about the value of node_uname_info at all. Since we need an arithmetic or comparison operator in order to use group_left and we want to ignore the value of node_uname_info, we need an operator that will leave node_sockstat_TCP_tw's value unchanged. Because the value of node_uname_info is always 1, we can simply use '*', as multiplying by one will do nothing here.

(In theory we could instead use a comparison operator, which would naturally leave node_sockstat_TCP_tw's value unchanged (more or less). However, in practice it's often tricky to find a comparison operator that will always be true. You might not have any sockets in TIME_WAIT so a '>=' could be false here, for example. Using an arithmetic operator that will have no effect is simpler.)
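To make that concrete, the comparison version would look something like the following, and it silently drops any host that happens to have no TIME_WAIT sockets at the moment you evaluate it (since 0 >= 1 is false):

      node_sockstat_TCP_tw
    >= on(instance) group_left(release)
      node_uname_info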

The case of a secondary metric that's always 1 is the easy case, as we've seen. What about a secondary metric with a label you want that isn't necessarily always 1, and in fact may have an arbitrary value? Fortunately, Brian Brazil has provided the answer to that too. The simple but clever trick is to multiply the metric by zero and then add it:

  node_sockstat_TCP_tw
+ on(instance) group_left(release)
  (node_uname_info * 0)

This works with arbitrary values; multiplying by zero turns the value for the right side to 0, and then adding 0 has no effect on node_sockstat_TCP_tw's value.

As a side note, this illustrates a good reason to have '1' be the value of any metric that exists to publish its labels, as is the case for node_uname_info or metrics that publish, say, the version of your program. The value these metrics have is arbitrary in one sense, but '1' is both conventional and convenient.

PrometheusPullingInLabels written at 22:16:32

2019-05-17

My new favorite tool for looking at TLS things is certigo

For a long time I've used the OpenSSL command line tools to do things like looking at certificates and chasing certificate chains (although OpenSSL is no longer what you want to use to make self-signed certificates). This works, and is in many ways the canonical and most complete way to do this sort of stuff, but if you've ever used the openssl command and its many sub-options you know that it's kind of a pain in the rear. As a result of this, for some years now I've been using Square's certigo command instead.

Certigo has two main uses. My most common case is to connect to some TLS-using service to see what its active certificate and certificate chain are (and to try to verify them), as well as some TLS connection details:

$ certigo connect www.cs.toronto.edu:https
** TLS Connection **
Version: TLS 1.2
Cipher Suite: ECDHE_RSA key exchange, AES_128_GCM_SHA256 cipher

** CERTIFICATE 1 **
Valid: 2018-04-17 00:00 UTC to 2020-04-16 23:59 UTC
Subject:
[...]

Certigo will attempt to verify the certificate's OCSP status, but some OCSP responders seem to dislike its queries. In particular, I've never seen it succeed with Let's Encrypt certificates; it appears to always report 'ocsp: error from server: unauthorized'.

(Some digging suggests that Certigo is getting this 'unauthorized' response when it queries the OCSP status of the intermediate Let's Encrypt certificate.)

Certigo can connect to things that need STARTTLS using a variety of protocols, including SMTP but unfortunately not (yet) IMAP. For example:

$ certigo connect -t smtp smtp.cs.toronto.edu:smtp

(Fortunately IMAP servers usually also listen on imaps, port 993, which is TLS from the start.)
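If you do need to inspect a STARTTLS-only IMAP server, plain openssl can still do it; this is standard s_client usage, not a Certigo feature, and the hostname here is just an example:

    $ openssl s_client -connect imap.example.org:143 -starttls imap -showcerts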

My other and less frequent use of Certigo is to dump the details of a particular certificate that I have sitting around on disk, with 'certigo dump ...'. If you're dumping a certificate that's in anything except PEM format, you may have to tell Certigo what format it's in.

Certigo also has a 'certigo verify' operation that will attempt to verify a certificate chain that you provide it (against a particular host name). I don't find myself using this very much, because it's not necessarily representative of what either browsers or other sorts of clients are going to do (partly because it uses your local OS's root certificate store, which is not necessarily anything like what other programs will use). Generally if I want to see a client-based view of how a HTTPS server's certificate chain looks, I turn to the SSL server test from Qualys SSL Labs.

All Certigo sub-commands take a '-v' argument to make them report more detailed things. Their normal output is relatively minimal, although not completely so.

Certigo is written in Go and uses Go's standard libraries for TLS, which means that it's limited to the TLS ciphers that Go supports. As a result I tend to not pay too much attention to the initial connection report unless it claims something pretty unusual.

(It also turns out that you can get internal errors in Certigo if you compile it with the very latest development version of Go, which may have added TLS ciphers that Certigo doesn't yet have names for. The moral here is likely that if you compile anything with bleeding edge, not yet released Go versions, you get to keep both pieces if something breaks.)

InspectingTLSWithCertigo written at 22:53:21
