We've migrated from Yubikey 2FA to the university's MFA
We have a sensitive host that absolutely has to be protected with multi-factor authentication. When we first set it up in late 2016, the second factor we chose was touch-required SSH keys held on Yubikeys. Recently, we have been switching this host over to the university's institutional multi-factor authentication. The university's MFA uses Duo, so our sensitive host is set up to use Duo's PAM module.
(Integrating Duo with OpenSSH led me to explore what combinations of authentication methods we could and couldn't support with PAM-based MFA.)
Relying on the institutional MFA has some failure modes that using Yubikeys doesn't, since verifying Yubikey SSH key authentication is entirely contained on the host. However, we decided that we could live with these additional failure modes because using the institutional MFA had some significant advantages for us. I can boil these down to three general areas: availability, usability, and manageability for the people involved.
In terms of availability, institutional MFA has the advantage that people already have to routinely use it for all sorts of things, so they had it set up, working, and ready to hand. Our Yubikeys were only ever used for this host, so if you didn't log in to the host for a while, they could wind up not so available and ready. And in general our Yubikeys were yet another thing for people to manage and keep track of, like an extra and rarely used key.
In terms of usability, the institutional MFA is a lot easier to get going and work with for SSH logins, because all it demands from your SSH login session is that you type extra text. Yubikeys required a USB connection and appropriate software to connect to the Yubikey in either your SSH client or your SSH agent. Not all things that can run a SSH client even have USB, and of the computers that do, often the software was an issue.
(In the future, OpenSSH's new(ish) support for FIDO/U2F may help some of this, but only for things that wind up running OpenSSH 8.2 or better. In practice this means it will be years before Windows, macOS, iOS, and Android SSH clients can all reliably take advantage of it.)
In terms of manageability, the institutional MFA has other people who handle all aspects of enrolling people, managing their devices, recording the necessary data for server side authentication, and so on. With Yubikeys, all of that was on us and it wasn't necessarily a smooth and easy process. In fact it was so friction prone that we would never have scaled it up beyond the small group of people who needed to have access to this sensitive server.
The Yubikey solution was simpler, theoretically more reliable, and potentially more secure (and certainly more under our control) than the institutional MFA system is. But in practice both the ease of using and the ease of managing whatever we used for MFA turned out to matter quite a bit, and Yubikeys weren't really good at either of these. Institutional MFA is good enough, it's officially blessed by the university, and it's much easier for everyone to deal with, so it wins in practice.
(I admit that the Yubikey SSH key generation security issue soured me on really trusting some parts of the theoretical Yubikey advantages and shifted my views on where I should generate keys, as well as making me kind of unhappy with Yubikeys in general.)
The OpenSSH server has limits on what user authentication you can use
We've recently deployed multi-factor authentication for SSH access to some especially sensitive machines, which has caused me to become much more familiar with how OpenSSH's sshd and MFA interact with each other and the limits on that. Unfortunately these limits mean that some combinations of authentication methods are not really available (or not available at all). For example, in many situations you can't require people to use their Unix password and then either a public key or MFA authentication.
The simple way to describe the issue is that OpenSSH's public key authentication is done completely separately from PAM (on Linux systems), and OpenSSH has no way to selectively ask PAM to do various things. Roughly speaking, all interaction with PAM is a black box to the OpenSSH server that either passes or fails as the "keyboard-interactive" or "password" authentication method.
There are broadly two ways of doing MFA authentication with OpenSSH; you can do it as people log in to your server (although this gets annoying), or you can do it beforehand in some way. The usual approach for the latter is to have an OpenSSH certificate authority that issues short term certificates only if you authenticate to it properly (through, for example, a web front end). Unfortunately the certificate approach is generally much more difficult to deploy, because you need a bunch of moving parts to exist on all of the clients. If you want simple deployment, you need to do all of your MFA at login time. OpenSSH itself has no built in support for MFA, so if you take the 'at login time' path you must put your MFA authentication into your PAM stack.
OpenSSH allows you to require multiple authentication methods but, as mentioned, your entire PAM stack is one authentication method. Since your MFA goes in your PAM stack, it's intrinsically coupled to whatever else the PAM stack does (which almost certainly includes asking for the Unix password). Then since SSH public key authentication isn't exposed through PAM, you can't use PAM's own (rather complicated) conditional features to require public key authentication only in some situations, or to skip only parts of the PAM stack if the user has public key authentication.
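To make the limitation concrete, here is a small toy model in Python of how sshd's AuthenticationMethods setting is satisfied. It is only a sketch of the documented behaviour (every method in at least one list must be completed, in order); the function names are made up for illustration and nothing here talks to a real sshd.

```python
def parse_authentication_methods(value):
    """Split an AuthenticationMethods value into lists of method names."""
    return [entry.split(",") for entry in value.split()]


def is_authenticated(method_lists, completed):
    """True if the completed methods, in order, match at least one full list."""
    return any(methods == list(completed) for methods in method_lists)


# The whole PAM stack (Unix password plus MFA) is one opaque method,
# "keyboard-interactive"; sshd cannot ask for only part of it.
lists = parse_authentication_methods(
    "publickey,keyboard-interactive keyboard-interactive"
)

print(is_authenticated(lists, ["publickey"]))
print(is_authenticated(lists, ["keyboard-interactive"]))
print(is_authenticated(lists, ["publickey", "keyboard-interactive"]))
```

In this model there is no way to write "the Unix password, then either a public key or MFA", because the password and the MFA step both live inside the single "keyboard-interactive" method.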
Another consequence of your MFA authentication being bolted to the rest
of your PAM authentication is that it's relatively difficult to allow
people using passwords to skip MFA under some circumstances (for example
if they're logging in from your highly secure emergency access server).
If MFA was a separate access method, you could configure this through
Match blocks in your
sshd_config. As it is, perhaps there's some
way to do this reliably through PAM, but it's going to be harder and
less obvious to configure.
In theory all of this could be dealt with if OpenSSH supported an
additional authentication method, or really class of them, that I
will call "pam:<service>". This would do a keyboard interactive
PAM authentication using the given service name. Then you could put
your MFA PAM configuration into a separate
service and set your SSH server to require, say, 'publickey,password'
or 'password,pam:mfa' (and just 'password' from your emergency
access server, or perhaps 'hostbased,password' for extra security).
Moving averages (and rates) for metrics in Prometheus (and Grafana)
Today, I wanted to check if our external mail gateway (which uses a TLS certificate from Let's Encrypt) was seeing a drop in incoming traffic, since there are various problems cropping up as a result of a Let's Encrypt related TLS root certificate expiry. We extract metrics from our Exim logs using mtail, so the raw information was available in metrics, but we get a sufficiently modest amount of external email that our natural external mail arrival rate is relatively variable on a minute to minute basis.
When I do metrics graphs of rates or averages (either in Grafana
for dashboards or in Prometheus to explore things), I normally set
the time interval for things like rate() to the step interval. In
Grafana I use $__interval; in Prometheus I will see what it reports the step interval as and
then copy that into the time interval. This means that my graphs
are (theoretically) continuous but without oversampling. However,
this doesn't work well with metrics like our email arrival rate.
If you look at a relatively moderate time range like a few hours (such
as '8 am to now'), you have a small time interval and a spiky graph
that's hard to see long term trends in; if you look at a long time
range (a few days), moderate duration trends can disappear into the
mush of long time intervals.
So I had what felt like a clever idea: why not use a time interval that was significantly longer than the step interval for once (here 30 minutes or an hour), which I felt would smooth out short term variability but reveal longer term trends more clearly. When I tried it out, it gave me a more or less readable graph that suggested no particularly visible drops in incoming email volume.
I had just used a moving average, which is a well known statistical technique to "smooth out short-term fluctuations and highlight longer-term trends or cycles", to quote from Wikipedia. Specifically I had a simple moving average, in the form that is sometimes called a trailing average (since it only looks backward from your starting point, instead of looking on either side of it).
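As an illustration of what a trailing simple moving average does, here is a minimal Python sketch. The per-minute arrival counts are invented for the example, and the function name is my own; this is not how Prometheus computes anything internally.

```python
from collections import deque


def trailing_average(values, window):
    """Yield the mean of the most recent `window` values at each point.

    Only looks backward, so early points average over fewer values.
    """
    recent = deque(maxlen=window)
    total = 0.0
    for value in values:
        if len(recent) == window:
            total -= recent[0]  # this value is about to fall out of the window
        recent.append(value)
        total += value
        yield total / len(recent)


# Made-up, spiky per-minute message counts.
arrivals = [0, 10, 0, 10, 0, 10]
print(list(trailing_average(arrivals, 2)))
```

With a window of two points the alternating 0 and 10 values flatten out to a steady 5 after the first point, which is the smoothing effect described above.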
A trailing average is easy to do in Prometheus graphs as you're
exploring things; just set your time interval to either some
reasonable value for your purpose or to a value enough higher than
your step interval. If your time interval starts approaching your
step interval (or becomes smaller than it), you've probably zoomed
out to a too large time range. This applies to both explicit averages
over time and to
rate(), which is effectively a per second average
over the time range.
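The effect of the time interval on a rate can be sketched in Python as follows. This is a deliberately simplified stand-in for rate(): it just divides the counter increase by the elapsed time between the first and last sample in the window, and it ignores the counter reset handling and extrapolation that the real function does. The samples are invented.

```python
def simple_rate(samples, end, window):
    """Per-second average increase of a counter over [end - window, end].

    `samples` is a time-sorted list of (timestamp_seconds, counter_value).
    Returns None if fewer than two samples fall inside the window.
    """
    in_range = [(t, v) for t, v in samples if end - window <= t <= end]
    if len(in_range) < 2:
        return None
    (t0, v0), (t1, v1) = in_range[0], in_range[-1]
    return (v1 - v0) / (t1 - t0)


# Made-up counter scraped every 60 seconds, with bursty increases.
samples = [(0, 0), (60, 30), (120, 30), (180, 120), (240, 120)]

for window in (60, 120, 240):
    print(window, simple_rate(samples, end=240, window=window))
```

The 60 second window sees no traffic at all at this instant, while the longer windows average the bursts into a steadier number.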
I'm not sure how to do good moving averages in Grafana for dashboards
that you want to work over broad ranges in time. If you set a fixed
time interval, it will be too small if people expand the dashboard's
time interval far enough. Grafana's special interval variable
doesn't expand anywhere near wide enough, and as far as I know you
can't do math on Grafana variables like
$__interval or get a
minimum of it and something else. My overall conclusion is that if
I use moving averages in a Grafana graph, I'll probably have to say
that it only works well for time ranges under some value (and I'll
have to poke around to find out what that value is, since the step
interval Grafana will use depends on several factors).
(In general moving averages now feel like something I should pay more attention to and make more use of, although I don't want to get too enthusiastic right now and add too many moving average graphs to our dashboards.)
The shifting view of two factor authentication at the university
Back in 2016, I wrote an entry about how we'd never be able to have everyone use two factor authentication. Recently, the university (and thus necessarily my department) is moving to make two factor authentication required. Given this, you might wonder what happened between 2016 and now, five years later. The simple answer is changing attitudes on smartphones, or if you prefer, a view that two factor authentication is sufficiently important that the administration is willing to force people to go through pain.
Now as in 2016, the fundamental political problem of two factor authentication is that you need a device to be the second factor and this device has to belong to someone and be provided by someone. If the organization has to provide a device to any significant number of people, life gets expensive fast. In 2016 I wrote about this, as an aside:
(We cannot assume that all people have smartphones, either, and delegate our 2FA authentication to smartphone apps.)
The political answer here in 2021 is that we are now assuming that people like graduate students have personal smartphones and can be required to install an authenticator app on those smartphones. For most people, if they don't have a personal smartphone (that they're willing to use for this), the university's answer will not be to provide a hardware two factor token at the university's expense, it will be to shrug and tell them to get one, even if it's a cheap old one that will only be used on wifi networks.
It's my guess that this political answer is feasible because as a practical matter people mostly have smartphones and mostly have become acclimatized to using them for non-personal purposes like stuff involved with being a graduate (or undergraduate) student. Many people read their university email on personal devices, for example. By comparison to some entanglements, an authenticator app is nothing special, and smartphones actually have a decent reputation for limiting what apps can do (much more so than regular computers).
There's also the technical answer that the various things enabled by not having two factor authentication are becoming more and more damaging in various ways. Since 2016, computer security has utterly failed to significantly prevent or mitigate the damage of password compromises in any other way than making passwords by themselves useless. Forced two factor authentication is not the best solution, it's simply the last solution standing, and something has to be done.
(As a necessary disclaimer, I have no specific information on the university administration's decision making process here. I do know that we're increasingly concerned about the damage that phishing and so on is doing.)
It's probably not the hardware, a sysadmin lesson
We just deployed a new OpenBSD 6.9 machine the other day, and after it was deployed we discovered that it seemed to have serious problems with keeping time properly. The OpenBSD NTP daemon would periodically declare that the clock was unsynchronized, when it was adjusting the clock it was frequently adjusting it by what seemed to be very large amounts (by NTP standards), reporting numbers like '-0.244090s', and most seriously every so often the time would wind up completely off by tens of minutes or more. Nothing like this has happened on any of our other OpenBSD machines, especially the drastic clock jumps.
Once we noticed this, we flailed around looking at various things and wound up reforming the machine's NTP setup to be more standard (it was different for historical reasons). But nothing cured the problem, and last night its clock wound up seriously off again. After all of this we started suspecting that there was something wrong with the machine's hardware, or perhaps with its BIOS settings (I theorized wildly that the BIOS was setting it to go into a low power mode that OpenBSD's timekeeping didn't cope with).
Well, here's a spoiler: it wasn't the hardware, or at least the drastic time jumps aren't the hardware. Although we'll only know for sure in a few days, we're pretty sure we've identified their cause, and it's due to some of our management scripts (that are doing things well outside the scope of this entry).
When we have a mysterious problem and we just can't understand it despite all our attempts to investigate things, it's tempting to decide that it's a hardware problem. And sometimes it actually is. But a lot of the time it's actually software, just as a lot of the time what you think has to be a compiler bug is a bug in your code.
(If it's a hardware problem it's not something you can fix, so you can stop spending your time digging and digging into software while getting nowhere and frustrating yourself. This is also the appeal of it being a compiler bug, instead of your bug; if it's your bug, you need to keep on with that frustrating digging to find it.)
Why we care about being able to (efficiently) reproduce machines
One of the broad meta-goals of system administration over the past few decades has been working to be able to reliably reproduce machines, ideally efficiently. It wasn't always this way (why is outside the scope of this entry), but conditions have changed enough so that this became a concern for increasingly many people. As a result, it's a somewhat quiet driver of any number of things in modern (Unix) system administration.
There are a number of reasons why you would want to reproduce a machine. The current machine could have failed (whether it's hardware or virtual) so you need to reconstruct it. You might want to duplicate the machine to support more load or add more redundancy, or to do testing or experimentation. Sometimes you might be reproducing a variant of the machine, such as to upgrade the version of the base Linux or other Unix it uses. The more you can reproduce your machines, the more flexibility you can have with all of these, as well as the more confidence you can have that you understand your machine and what went into it.
One way of reproducing a machine is to take careful notes of everything you ever do to the machine, from the initial installation onward. Then, when you want to reproduce the machine, you just repeat everything you ever did to it. However, this suffers from the same problem as replaying a database's log on startup in order to restore its state; replaying all changes isn't very efficient, and it gets less efficient as time goes on and more changes build up.
(You may also find that some of the resources you used are no longer available or have changed their location or the like.)
The goal of being able to efficiently reproduce machines has led system administration to a number of technologies. One obvious broad area is systems where you express the machine's desired end state and the system makes whatever changes are necessary to get there. If you need to reproduce the machine, you can immediately jump from the initial state to your current final one without having to go through every intermediate step.
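The difference between replaying every change and jumping to a desired end state can be shown with a toy Python model. The machine "state" here is just a dictionary and the settings are hypothetical; real configuration management systems are of course far more involved.

```python
def replay(initial, change_log):
    """Rebuild state by reapplying every recorded change in order."""
    state = dict(initial)
    steps = 0
    for key, value in change_log:
        state[key] = value
        steps += 1
    return state, steps


def converge(initial, desired):
    """Apply only the changes needed to reach the desired end state."""
    state = dict(initial)
    steps = 0
    for key, value in desired.items():
        if state.get(key) != value:
            state[key] = value
            steps += 1
    return state, steps


# Hypothetical history in which the same settings changed several times.
initial = {"sshd_port": 22}
change_log = [
    ("ntp_server", "ntp1"),
    ("ntp_server", "ntp2"),
    ("sshd_port", 2222),
    ("sshd_port", 22),
    ("ntp_server", "ntp3"),
]
desired = {"sshd_port": 22, "ntp_server": "ntp3"}

replayed, replay_steps = replay(initial, change_log)
converged, converge_steps = converge(initial, desired)
print(replayed == converged, replay_steps, converge_steps)
```

Both approaches end in the same state, but the replay cost grows with the length of the history while the end state approach only pays for the differences that still matter.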
(The virtual machine approach where VMs are immutable once created can be seen as another take on this. By forbidding post-creation changes, you fix and limit how much work you may need to "replay".)
There are two important and interrelated ways of making reproducing a machine easier (and often more efficient). The first is to decide to allow some degree of difference between the original version and the reproduction; you might decide that you don't need exactly the same versions of every package or to have every old kernel. The second is to systematically work out what you care about on the machine and then only exactly reproduce that, allowing other aspects of the machine to vary within some acceptable range.
(In practice you'll eventually need to do the second because you're almost certain to need to change the machine in some ways, such as to apply security updates to packages that are relevant to you. And when you "reproduce" the machine using a new version of the base Unix, you very much need to know what you really care about on the machine.)
How many Prometheus metrics a typical host here generates
When I think about new metrics sources for our Prometheus setup, often one of the things on my mind is the potential for a metrics explosion if I add metrics data that might have high cardinality. Now, sometimes high cardinality data is very worth it and sometimes data that might be high cardinality won't actually be, so this doesn't necessarily stop me. But in all of this, I haven't really developed an intuition for what is a lot of metrics (or time series) and what isn't. Recently it struck me that one relative measuring stick for this is how many metrics (ie time series) a typical host generates here.
Currently, most hosts only run the host agent, although its standard metrics are augmented with locally generated data. On machines that have our full set of NFS mounts, a major metrics source is a local set of metrics for Linux's NFS mountstats. A machine generating these metrics has anywhere between 6,000 and 8,000 odd time series. An Ubuntu Linux machine that doesn't generate these metrics generally has around 1,300 time series.
(Our modern OpenBSD machines, which also support the host agent, have around 150 time series.)
Our valuable disk space usage metrics
run from 7,400 time series on NFS fileservers where almost every user has some files,
such as the fileserver hosting our /var/mail, to under 2,000 on
other fileservers. Some fileservers have significantly fewer, down to
just over 300 on our newest and least used fileserver. Having these
numbers gives me a new perspective on how "high cardinality" these
metrics really are; at most, the metrics from one fileserver are roughly
equivalent to adding another Ubuntu server with all our NFS mounts.
More often they're equivalent to a standalone Ubuntu server.
This equivalence matters to me for thinking about new metrics because I add monitoring for new Ubuntu servers without thinking about it. If a new metrics source is equivalent to another Ubuntu server, I don't really need to think about it either (unless I'm going to do something like add it to each existing server, effectively doubling their metrics load). However, significantly raising the number of host equivalents that we monitor would be an issue, since currently the host agent is collectively our single largest source of metrics by far.
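The "host equivalents" measuring stick is simple arithmetic, sketched here in Python using the figures quoted in this entry. The 7,000 baseline for an NFS-mounting machine is my assumed midpoint of the 6,000 to 8,000 range, and the function name is invented.

```python
STANDALONE_UBUNTU = 1300  # typical time series for an Ubuntu host without NFS metrics
NFS_UBUNTU = 7000  # assumed midpoint of the 6,000 to 8,000 range


def host_equivalents(series, per_host=STANDALONE_UBUNTU):
    """Express a time series count as a multiple of one typical host."""
    return series / per_host


# Disk space usage metrics on the busiest fileserver, against both baselines.
print(round(host_equivalents(7400), 2))
print(round(host_equivalents(7400, NFS_UBUNTU), 2))

# The rough eBPF exporter estimate for one fileserver.
print(round(host_equivalents(2400), 2))
```

Anything that comes out at around one host equivalent is in "don't need to think about it" territory by the reasoning above; adding it to every existing server is a different matter.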
One interesting metrics source is Cloudflare's Linux eBPF exporter, which can be used to get things like detailed histograms of disk read and write IO times. I happen to be doing this on my office workstation, where it generates about 500 time series that cover two SATA SSDs and two NVMe drives. This suggests that it would be entirely feasible to add it to machines of interest, even our NFS fileservers, where at a very rough ballpark I might see about 2,400 new time series per server (each has 18 disks).
(For how to calculate this sort of thing for your own Prometheus setup, see my entry on how big our Prometheus setup is.)
Adding a "host" label to all of your per-host Prometheus metrics
One of the things I've come to believe about labels on Prometheus
metrics is that all metrics for a particular host should have a
label for its hostname. I tend to call this label "host" in my entries, but when we built
our setup I actually called it
"cshost", with a prefix, to guard against the possibility that
some metrics source would have its own "host" label.
The purpose of this label is to have something that can be used to
straightforwardly group and join across different metrics sources
for the host. Often this will make it convenient to reuse it in
alert messages. Prometheus will generally give each metrics source
its own unique combination of "job" and "instance" labels, but
the "instance" label often has inconvenient extras in it, like
port numbers. Taking all of those extra things out and creating a
unique label for each host makes it much easier to do various things
across all metrics for a host, regardless of their source.
(As part of this, if you send host specific things to Pushgateway from a host, you should make sure it also adds the host label to what it sends in one way or another. What is host specific may depend on what use you want to make of the metrics you push.)
If you automatically generate your list of targets, you can probably
just specify the value for your "host" label along side each generated
target. Otherwise, you'll want to use relabeling
to create these labels from information you already have. For
example, here is our relabeling rule for our host agent job,
which just takes off the port number on the address to create
"cshost":

    relabel_configs:
      - source_labels: [__address__]
        regex: (.*):9100
        replacement: $1
        target_label: cshost
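To see what that relabeling rule does to an address, here is a rough Python emulation. It assumes only the basics of Prometheus relabeling: the regex is anchored at both ends (hence re.fullmatch), and the target label is left alone when the regex doesn't match (modelled here as returning None). Python writes the capture group as \1 where Prometheus writes $1, and the host names are hypothetical.

```python
import re


def relabel(address, regex=r"(.*):9100", replacement=r"\1"):
    """Return the new label value, or None if the rule does not apply."""
    match = re.fullmatch(regex, address)
    if match is None:
        return None
    return match.expand(replacement)


print(relabel("apps0:9100"))
print(relabel("apps0.example.org:9100"))
print(relabel("apps0:9115"))
```

The second call shows how a fully qualified target name flows straight through into the label value, which is how mixed naming can sneak in.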
When you set this up, I have a small suggestion from our somewhat
painful experience: don't mix fully qualified and unqualified
host names in your "host" labels for the same machines. Our agent
jobs (for node_exporter
and some other per-host agents run on specific hosts) use unqualified
host names, but all of our Blackbox checks use fully
qualified host names; this difference is then passed through to our
"cshost" label values. We fix this up in Alertmanager relabeling so that Alertmanager always sees a
consistent "cshost" (for our own hosts) and uses this in alert
messages and grouping, but we should have gotten this right from the start
in the metrics themselves.
(The morally right choice is probably to use fully qualified host names everywhere, even if this makes life more annoying.)
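If you generate your target lists or label values yourself, one way to avoid the mix is to normalize names in one place before they become labels. The following Python sketch is an assumption about how you might do that, not what we run; "example.org" is a placeholder domain, and any name that already contains a dot is treated as fully qualified.

```python
def to_fqdn(name, domain="example.org"):
    """Return a lower-cased fully qualified name for a host."""
    name = name.rstrip(".").lower()
    return name if "." in name else f"{name}.{domain}"


def to_short(name):
    """Return the lower-cased unqualified (first label only) host name."""
    return name.rstrip(".").lower().split(".", 1)[0]


for raw in ("apps0", "Apps0.example.org", "apps0.example.org."):
    print(raw, to_fqdn(raw), to_short(raw))
```

Whichever direction you pick, applying the same function to every metrics source means the "host" label values for one machine always compare equal.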
How I try out and test new versions of Grafana
For much of the time that we've been running our Prometheus and Grafana setup, upgrading our Grafana to keep up with the latest version has been unexciting. Then Grafana embarked on a steady process of deprecating and eventually removing old panel types, replacing them with new and only theoretically equivalent and fully functional new implementations. I've generally not been a fan, even before Grafana 8.0 (also), and so this created a need to actually try out and test new versions of Grafana so I would have some idea if they made our dashboards explode. Fortunately this is fairly easy in our current environment through some lucky setup choices we made.
(If you're considering such an upgrade and you have old SingleStat panels, you probably want to install the 'grafana-singlestat-panel' plugin, which keeps it available as a plugin. This extremely helpful tip comes from matt abrams on Twitter.)
We run both Prometheus and Grafana on the same server, with both behind an Apache reverse proxy, and we do our alerts through Prometheus Alertmanager, not Grafana. When I first configured Grafana, I could have set it to talk directly to Prometheus as 'localhost:9090', but for various reasons I set it up to go through the Apache reverse proxy using the machine's official web name. One useful effect of this is that I can easily bring up another instance of our Grafana setup on a second machine; if I copy our configuration and Grafana database, it has our normal dashboards and will automatically pull data from our live Prometheus. I can then readily see if everything looks right and directly compare the appearance of our production Grafana and the different version.
(We have standard install instructions for our core metrics server; I do the relevant Apache and Grafana sections on a test virtual machine. If I'm smart, I snapshot the initial post-installation state of the VM before I start playing around with a new Grafana version, so I can revert to a clean setup without reinstalling from scratch.)
With a separate
grafana.db I can then experiment with updating panels
and dashboards for the new version. If it doesn't work out I can revert
to the initial setup by taking another copy of our live grafana.db
(and under some circumstances consider copying it the other way to save
work). All sorts of experiments are possible.
I could still do a version of this if I had set our Grafana to talk directly to Prometheus (provided that our Prometheus was accessible from outside the machine, which it is); I'd just have to edit the Grafana datasource. I don't think this currently has any other effects on your dashboards, and that's probably not going to change in the future.
If we did alerts through Grafana, I would have to disable them in the test Grafana to avoid potential duplicate alerts. In my view, this is a good reason to have your alerts handled in a separate component that you can selectively disable or omit. Alerts have observable side effects, so you have to be careful when testing them; dashboards generally don't, so you have an easier time.
Using our metrics system when I test systems before deployment
Years ago I wrote that I should document my test plans for our systems and their results, and I've somewhat managed to actually do that (and then the documentation's been used later, for example). Recently it struck me that our metrics system has a role to play in this.
To start with, if I add my test system to our metrics system (even with a hack), our system will faithfully capture all sorts of performance information for it over the test period. This information isn't necessarily as fine-grained as I could gather (it doesn't go down to second by second data), but it's far more broad and comprehensive than I would gather by hand. If I have questions about some aspect of the system's performance when I write up test plan results, it's quite likely that I can get answers for them on the spot by looking in Prometheus (without having to re-run tests while keeping an eye on the metric I've realized is interesting).
(As a corollary of this, looking at metrics provides an opportunity to see if anything is glaringly wrong, such as a surprisingly slow disk.)
In addition, if I'm testing a new replacement for an existing server, having metrics from both systems gives me some opportunity to compare the performance of the two systems. This comparison will always be somewhat artificial (the test system is not under real load, and I may have to do some artificial things to the production system as part of testing), but it can at least tell me about relatively obvious things, and it's easy to look at graphs and make comparisons.
Our current setup keeps metrics for as long as possible (and doesn't downsample them, which I maintain is a good thing). To the extent that we can keep on doing this, having metrics from the servers when I was testing them will let us compare their performance in testing to their performance when they (or some version of them) are in production. This might turn up anomalies, and generally I'd expect it to teach us about what to look for in the next round of testing.
To get all of this, it's not enough to just add test systems to our
metrics setup (although that's a necessary prerequisite). I'll also
need to document things so we can find them later in the metrics
system. At a minimum I'll need the name used for the test system
and the dates it was in testing while being monitored. Ideally I'll
also have information on the dates and times when I ran various
tests, so I don't have to stare at graphs of metrics and reverse
engineer what I was doing at the time. A certain amount of this is
information that I should already be capturing in my notes, but I
should be more systematic about recording timestamps from '
and so on.
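One low-effort way to be systematic about test timestamps is to have the test driver record them for you. This is a minimal Python sketch under the assumption that tests are run from a script; the timed_test name and the in-memory log list are hypothetical, and in practice you would probably append to a notes file instead.

```python
import contextlib
from datetime import datetime, timezone


@contextlib.contextmanager
def timed_test(name, log):
    """Record UTC start and end timestamps around a block of test work."""
    start = datetime.now(timezone.utc)
    try:
        yield
    finally:
        end = datetime.now(timezone.utc)
        log.append(
            (
                name,
                start.isoformat(timespec="seconds"),
                end.isoformat(timespec="seconds"),
            )
        )


log = []
with timed_test("sequential disk write test", log):
    pass  # run the actual test here

for name, started, ended in log:
    print(f"{name}: {started} to {ended}")
```

Recording in UTC with an explicit offset makes it easier to line the entries up with the time range you later select in Prometheus or Grafana.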