Easy configuration for lots of Prometheus Blackbox checks
Suppose, not entirely hypothetically, that you want to do a lot of Prometheus Blackbox checks, and worse, these are all sorts of different checks (not just the same check against a lot of different hosts). Since the only way to specify a lot of Blackbox check parameters is with different Blackbox modules, this means that you need a bunch of different Blackbox modules. The examples of configuring Prometheus Blackbox probes that you'll find online all set the Blackbox module as part of the scrape configuration; for example, straight from the Blackbox README, we have this in their example:
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]
    [...]
You can do this for each of the separate modules you need to use, but that means many separate scrape configurations and for each separate scrape configuration you're going to need those standard seven lines of relabeling configuration. This is annoying and verbose, and it doesn't take too many of these before your Prometheus configuration file is so overgrown with many Blackbox scrapes that it's hard to see anything else.
(It would be great if Prometheus could somehow macro-ize these or
include them from a separate file or otherwise avoid repeating
everything for each scrape configuration, but so far, no such luck.
You can't even move some of your scrape configurations into a
separate included file; they all have to go in the main
prometheus.yml.)
Fortunately, with some cleverness in our relabeling configuration
we can actually embed the name of the module we want to use into
our Blackbox target specification, letting us use one Blackbox
scrape configuration for a whole bunch of different modules. The
trick is that what's necessary for Blackbox checks is that by the
end of setting up a particular scrape, the module parameter is in the
__param_module label. Normally it winds up there because
we set it in the
params section of the scrape configuration, but
we can also explicitly put it there through relabeling (just as we
set __address__ by hand through relabeling).
So, let's start with nominal declared targets that look like this:
  - ssh_banner,somehost:22
  - http_2xx,http://somewhere/url
This encodes the Blackbox module before the comma and the actual Blackbox target after it (you can use any suitable separator; I picked comma for how it looks).
Our first job with relabeling is to split this apart into the
module and target URL parameters, which are the magic
__param_module and __param_target labels:
  relabel_configs:
    - source_labels: [__address__]
      regex: ([^,]*),(.*)
      replacement: $1
      target_label: __param_module
    - source_labels: [__address__]
      regex: ([^,]*),(.*)
      replacement: $2
      target_label: __param_target
(It's a pity that there's no way to do multiple targets and replacements in one rule, or we could make this much more compact. But I'm probably far from the first person to observe that Prometheus relabeling configurations are very verbose. Presumably Prometheus people don't expect you to be doing very much of it.)
Since we're doing all of our Blackbox checks through a single scrape
configuration, we won't normally be able to easily tell which module
(and thus which check) failed. To make life easier, we explicitly
save the Blackbox module as a new label, which I've called probe:
    - source_labels: [__param_module]
      target_label: probe
Now the rest of our relabeling is essentially standard; we save the
Blackbox target as the
instance label and set the actual address
of our Blackbox exporter:
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: 127.0.0.1:9115
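To see how these relabeling rules fit together, here is a minimal Python sketch that mimics them for a single declared target. This is not Prometheus code. The `relabel` function is a hypothetical name, and Python's `re.fullmatch` stands in for the way Prometheus anchors relabeling regular expressions at both ends.

```python
import re

# Address of the Blackbox exporter, as used in the relabeling rules.
BLACKBOX_ADDR = "127.0.0.1:9115"

# Same pattern as the relabel rules: module before the first comma,
# everything else is the real Blackbox target.
SPLIT = re.compile(r"([^,]*),(.*)")


def relabel(declared_target):
    """Mimic the relabeling steps for one 'module,target' entry (illustrative only)."""
    labels = {"__address__": declared_target}
    m = SPLIT.fullmatch(labels["__address__"])
    if m:
        # A rule whose regex does not match leaves its target label unset.
        labels["__param_module"] = m.group(1)
        labels["__param_target"] = m.group(2)
    # Save the module and the target as ordinary, visible labels.
    if "__param_module" in labels:
        labels["probe"] = labels["__param_module"]
    if "__param_target" in labels:
        labels["instance"] = labels["__param_target"]
    # Finally point the scrape at the Blackbox exporter itself.
    labels["__address__"] = BLACKBOX_ADDR
    return labels


if __name__ == "__main__":
    for name, value in sorted(relabel("http_2xx,http://somewhere/url").items()):
        print(name, "=", value)
```

Because `[^,]*` stops at the first comma, only that comma acts as the separator. A target URL that itself contains commas should still come through intact after the module name.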
All of this works fine, but there turns out to be one drawback of putting all or a lot of your Blackbox checks in a single scrape configuration, which is that you can't set the Blackbox check interval on a per-target or per-module basis. If you need or want to vary the check interval for different checks (ie, different Blackbox modules) or even different targets, you'll need to use separate scrape configurations, even with all of the extra verbosity that that requires.
(As you might suspect, I've decided that I'm mostly fine with a lot of our Blackbox checks having the same frequency. I did pull ICMP ping checks out into a separate scrape configuration so that we can do them a lot more frequently.)
PS: If you wanted to, you could go further than this in relabeling;
for instance, you could automatically add the :22 port specification
on the end of hostnames for SSH banner checks. But it's my view
that there's a relatively low limit on how much of this sort of
rewriting one should do. Rewriting to avoid having a massive
prometheus.yml is within my comfort limit here; rewriting just
to avoid putting a ':22' on hostnames is not. There is real merit to
being straightforward and sticking as close to normal Prometheus
practice as possible, without extra magic.
(I think that the 'module,real-target' format of target names I've adopted here is relatively easy to see and understand even if you don't know how it works, but I'm biased and may be wrong.)
The needs of Version Control Systems conflict with capturing all metadata
In a comment on my entry Metadata that you can't commit into a VCS is a mistake (for file based websites), Andrew Reilly put forward a position that I find myself in some sympathy with:
Doesn't it strike you that if your VCS isn't faithfully recording and tracking the metadata associated with the contents of your files, then it's broken?
Certainly I've wished for VCSes to capture more metadata than they do. But, unfortunately, I've come to believe that there are practical issues for VCS usage that conflict with capturing and restoring metadata, especially once you get into advanced cases such as file attributes. In short, what most users of a VCS want is actively in conflict with the VCS being a complete and faithful backup and restore system, especially in practice (ie, with limited programming resources to build and maintain the VCS).
The obvious issue is file modification times. Restoring file
modification time on checkout can cause many build systems (starting
with make) to not rebuild things if you check out an old version
after working on a recent version. More advanced build systems that
don't trust file modification timestamps won't be misled by this,
but not everything uses them (and not everything should have to).
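As a rough illustration of why restored timestamps mislead timestamp-based builds, here is a small Python sketch of a make-style staleness check. The `needs_rebuild` function and the numeric times are hypothetical. Real build tools have more rules than this, so treat it as a simplified model.

```python
def needs_rebuild(target_mtime, source_mtimes):
    """Simplified make-style rule: rebuild if the target is missing
    or any source is newer than the target."""
    if target_mtime is None:
        return True
    return any(src > target_mtime for src in source_mtimes)


# Hypothetical timeline: an object file was built at time 1000 from the
# recent version of a source file. We then check out an old version.
object_built_at = 1000

# The VCS restores the old version's original modification time (500):
# the source looks older than the object, so nothing is rebuilt.
print(needs_rebuild(object_built_at, [500]))

# The VCS stamps the file with the checkout time (1100) instead:
# the source looks newer, so the object is rebuilt.
print(needs_rebuild(object_built_at, [1100]))
```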
More generally, metadata has the problem that much of it isn't portable. Non-portable metadata raises multiple issues. First, you need system-specific code to capture and restore it. Then you need to decide how to represent it in your VCS (for instance, do you represent it as essentially opaque blobs, or do you try to translate it to some common format for its type of metadata). Finally, you have to decide what to do if you can't restore a particular piece of metadata on checkout (either because it's not supported on this system or because of various potential errors).
(Capturing certain sorts of metadata can also be surprisingly expensive and strongly influence certain sorts of things about your storage format. Consider the challenges of dealing with Unix hardlinks, for example.)
You can come up with answers for all of these, but the fundamental problem is that the answers are not universal; different use cases will have different answers (and some of these answers may actually conflict with each other; for instance, whether on Unix systems you should store UIDs and GIDs as numbers or as names). VCSes are not designed or built to be comprehensive backup systems, partly because that's a very hard job (especially if you demand cross system portability of the result, which people do very much want for VCSes). Instead they're designed to capture what's important for version controlling things and as such they deliberately exclude things that they think aren't necessary, aren't important, or are problematic. This is a perfectly sensible decision for what they're aimed at, in line with how current VCSes don't do well at handling various sorts of encoded data (starting with JSON blobs and moving up to, say, word processor documents).
Would it be nice to have a perfect VCS, one that captured everything, could restore everything if you asked for it, and knew how to give you useful differences even between things like word processor documents? Sure. But I can't claim with a straight face that not being perfect makes a VCS broken. Current VCSes explicitly make the tradeoff that they are focused on plain text files in situations where only some sorts of metadata are important. If you need to go outside their bounds, you'll need additional tooling on top of them (or instead of them).
(Or, the short version, VCSes are not backup systems and have never claimed to be ones. If you need to capture everything about your filesystem hierarchy, you need a carefully selected, system specific backup program. Pragmatically, you'd better test it to make sure it really does back up and restore unusual metadata, such as file attributes.)
OpenSSH 7.9's new key revocation support is welcome but can't be a full fix
I was reading the OpenSSH 7.9 release notes, as one does, when I ran across a very interesting little new feature (or combination of features):
- sshd(8), ssh-keygen(1): allow key revocation lists (KRLs) to revoke keys specified by SHA256 hash.
- ssh-keygen(1): allow creation of key revocation lists directly from base64-encoded SHA256 fingerprints. This supports revoking keys using only the information contained in sshd(8) authentication log messages.
Any decent security system designed around Certificate Authorities needs a way of revoking CA-signed keys to make them no longer valid. In a disturbingly large number of these systems as people actually design and implement them, you need a fairly decent amount of information about a signed key in order to revoke it (for instance, its full public key). In theory, of course you'll have this information in your CA system's audit records because you'll capture all of it in your audit system when you sign a key. In practice there are many things that can go wrong even if you haven't been compromised.
Fortunately, OpenSSH was never one of these systems; as covered in ssh-keygen(1)'s 'Key Revocation Lists', you could specify keys in a variety of ways that didn't require a full copy of the key's certificate (by serial number or serial number range, by 'key id', or by its SHA1 hash). What's new in OpenSSH 7.9 is that they've reduced the amount of things you need to know in practice, as now you can revoke a key given only the information in your ordinary log messages. This includes but isn't limited to CA-signed SSH keys (as I noticed recently).
(This took both the OpenSSH 7.9 change and an earlier change to log the SHA256 of keys, which happened in OpenSSH 6.8.)
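As a sketch of what going from log messages to a revocation list might look like, the following Python pulls SHA256 fingerprints out of sshd authentication log lines and turns them into `hash:` lines for a KRL specification file. The exact log line layout in the regular expression is an assumption based on typical sshd messages, and `krl_spec_lines` is a hypothetical helper name, so check both against your own logs.

```python
import re

# Assumed layout of an sshd "Accepted publickey" message; for certificates
# this picks up the first fingerprint on the line (the user's key).
FINGERPRINT_RE = re.compile(
    r"Accepted publickey for \S+ from \S+ port \d+ ssh2: "
    r"\S+ (SHA256:[A-Za-z0-9+/]+)"
)


def krl_spec_lines(log_lines):
    """Return one 'hash: SHA256:...' spec line per distinct fingerprint seen."""
    seen = []
    for line in log_lines:
        m = FINGERPRINT_RE.search(line)
        if m and m.group(1) not in seen:
            seen.append(m.group(1))
    return ["hash: " + fp for fp in seen]
```

You would normally filter the log down to the bad logins first, then write the resulting lines to a file and hand that file to `ssh-keygen -k -f` to build the KRL that sshd's `RevokedKeys` setting points at.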
This new OpenSSH 7.9 feature is a very welcome change; it's now
much easier to go from a log message about a bad login to blocking
all future use of that key, including and especially if that key
is a CA-signed key and so you don't (possibly) have a handy copy
of the full public key in someone's authorized_keys file.
However, this isn't and can't be a full fix for the tradeoff of
having a local CA. The tradeoff is still there,
it's just somewhat easier to deal with either a compromised signed
key or the disaster scenario of a compromised CA (or a potentially
compromised one).
With a compromised key, you can immediately push it into your system for distributing revocation lists (and you should definitely build such a system if you're going to use a local CA); you don't have to go to your CA audit records first to fish out the full key and other information. With a potentially compromised CA, it buys you some time to roll over your CA certificate, distribute the new one, re-issue keys, and so on, without being in a panic situation where you can't do anything but revoke the CA certificate immediately and invalidate everyone's keys. Of course, you may want to do that anyway and deal with the fallout, but at least now you have more options.
(If you believe that your attacker was courteous enough to use unique serial numbers, you can also do the brute force approach of revoking every serial number range except the ones that you're using for known, currently valid keys. Whether or not you want to use consecutive serial numbers or random ones is a good question, though, and if you use random ones, this probably isn't too feasible.)
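For the brute force approach with consecutive serial numbers, a Python sketch of computing the ranges to revoke might look like this. The `revoke_all_but` helper is hypothetical. The default bounds assume serial numbers are non-zero 64-bit values, and the output assumes the `serial: start-end` form of KRL specification lines, so verify both against ssh-keygen(1) for your version.

```python
def revoke_all_but(valid_serials, lo=1, hi=2**64 - 1):
    """Return 'serial:' spec lines covering every serial number in
    [lo, hi] that is NOT in valid_serials."""

    def fmt(start, end):
        return f"serial: {start}" if start == end else f"serial: {start}-{end}"

    lines = []
    start = lo  # first serial number not yet covered
    for serial in sorted(set(valid_serials)):
        if serial < lo or serial > hi:
            continue
        if serial > start:
            # Revoke the gap below this valid serial number.
            lines.append(fmt(start, serial - 1))
        start = serial + 1
    if start <= hi:
        # Revoke everything above the last valid serial number.
        lines.append(fmt(start, hi))
    return lines


if __name__ == "__main__":
    # Hypothetical example: only serials 3, 4 and 10 are known to be valid.
    for line in revoke_all_but([3, 4, 10], hi=20):
        print(line)
```

With random serial numbers the valid set is scattered across the whole range, so you get roughly one range per valid key and have to trust that your list of valid serials is complete.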
PS: I continue to believe that if you use a local CA, you should be doing some sort of (offline) auditing to look for use of signed keys or certificates that are not in your CA audit log. You don't even have to be worried that your CA has been compromised, because CA software (and hardware) can have bugs, and you want to detect them. Auditing used keys against issued keys is a useful precaution, and it shouldn't need to be expensive at most people's scale.
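At its simplest, this sort of audit is a set difference between the key fingerprints seen in your authentication logs and the fingerprints your CA recorded when signing. Here is a minimal Python sketch; `unissued_keys` is a hypothetical name, and how you collect the two lists depends entirely on your logging and CA setup.

```python
def unissued_keys(used_fingerprints, issued_fingerprints):
    """Return fingerprints that were used to log in but that do not
    appear in the CA's record of issued keys, in sorted order."""
    return sorted(set(used_fingerprints) - set(issued_fingerprints))
```

Anything this returns deserves a look. It may be an ordinary non-CA key, a gap in the audit log, or a key that something signed without your knowledge.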