Wandering Thoughts

2017-12-09

You don't have to authorize a machine for Let's Encrypt from the machine

A commenter on yesterday's entry brought up the issue of authorizing internal-only machines, ones that are in DNS but that aren't otherwise reachable from the Internet. Although we haven't actually done this, in general it's possible to do Let's Encrypt's authorization for a particular machine on an entirely different machine, even without using the DNS-based authorization method. All you need is for HTTP requests from the Internet for that name to wind up somewhere you control, where you can handle them.

If the internal host has a public IP, this is going to take a firewall with some redirection rules (and a suitable other host). But you probably have that already. If the internal host has a private IP address, you probably have 'split horizon' DNS, so in your Internet-visible DNS you can assign it a public IP that goes to the suitable other host. As far as I know, most Let's Encrypt clients are perfectly happy in this situation; they don't try to check that the host you're running them on is the host <X> that you're requesting a certificate for.
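To make this concrete, here's a rough sketch of the sort of redirection rule involved, assuming a Linux firewall with iptables and using made-up documentation addresses (203.0.113.10 as the internal host's public IP, 192.0.2.5 as the suitable other host); the details will obviously vary with your firewall:

    # Send inbound HTTP for the internal host's public IP to the host
    # that will actually answer the Let's Encrypt challenge.
    iptables -t nat -A PREROUTING -d 203.0.113.10 -p tcp --dport 80 \
        -j DNAT --to-destination 192.0.2.5:80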

(If you're unlucky enough to have private IP addresses in public DNS (which can happen for odd reasons), well, then you're out of luck for that host.)

This does leave you with the job of transporting the new TLS certificate to the internal host and handling any daemon notifications needed there, but there are lots of solutions for that. 'Propagate file to host <X> and do something if it's changed' is not hard to automate, and there are a lot of already mature solutions for it (some of which you may already be using). Some Let's Encrypt clients let you run custom scripts on 'certificate updated' events, so you could use this to immediately push the new certificate to the target host.

In the specific case of acmetool, you have a lot of options if you're willing to do some scripting. Acmetool supports running scripts to handle both challenges and 'certificate updated' events. If you want to run acmetool on your internal host, you could have it push the HTTP challenge files to the bastion host that will expose them to Let's Encrypt; if you want to run it on the bastion host, you could have it propagate the new TLS certificates to the internal host either directly or indirectly (by storing them into some internal data store, which the internal system then pulls from).
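As a rough and untested illustration of the second approach, here's the sort of acmetool hook script I have in mind. It assumes acmetool's 'live-updated' hook event (which, as I understand it, gets the names of updated certificates on standard input) and acmetool's usual /var/lib/acme/live layout; the internal host name, destination path, and daemon reload are placeholders you'd adjust:

    #!/bin/sh
    # acmetool invokes hooks with the event name as the first argument.
    [ "$1" = "live-updated" ] || exit 0
    while read host; do
        case "$host" in
        internal.example.org)
            # Push the renewed certificate and key to the internal host,
            # then have its daemon reload them.
            rsync -a /var/lib/acme/live/"$host"/ root@"$host":/etc/ssl/acme/
            ssh root@"$host" systemctl reload dovecot
            ;;
        esac
    done
    exit 0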

Sidebar: Clever tricks with the ACME protocol

As I found out, Let's Encrypt's ACME protocol splits up authorizing machines from issuing certificates. This means that it's technically possible to authorize a host from one machine (say, your bastion machine or your DNS server) and then later obtain a certificate for that host from a second machine (say, the internal machine itself, provided it can talk to the Let's Encrypt servers). The two machines involved have to use a common Let's Encrypt account in order to share the authorization, but that's just a matter of having the same account information and private keys on both (although this has some security implications).

However, as far as I know clients don't generally support performing these steps separately, either doing only the authorization and then stopping, or requesting a certificate and aborting if Let's Encrypt says that authorization is required. An ideal client for this would also track authorization and certificate timeouts separately, so your bastion host or DNS server could run something to make sure that all authorizations were current and your internal hosts would never wind up reporting 'need authorization' errors.

(You might also want to associate different authorizations with different Let's Encrypt accounts and keys, to limit your exposure if an internal host is compromised. With the bastion host, well, you're on your own unless you build something really complicated.)

LetsEncryptIndirectAuthorization written at 18:03:28

We've switched over to using Let's Encrypt as much as possible

Over the years, we've used a whole collection of different TLS CAs. We've preferred free ones where we could, for good reasons, which meant that we've used both ipsCA (until they exploded) and StartSSL (aka StartCom), but we've also paid for TLS certificates when we had to; modern TLS certificates are pretty affordable even for us if we don't go crazy. And these days we even have access to free TLS certificates through the university's central IT. However, we've now switched over to using Let's Encrypt as much as possible; basically it's the first CA we attempt to use, and if it doesn't work for some reason we'd probably turn to the free TLS certificates from central IT (both because they're free and because the process of getting one isn't too painful).

Our main reason for switching to Let's Encrypt isn't that it's free (it's not our only current source of free certificates); instead, as with my personal use, it's become all about the great automation. With Let's Encrypt, getting an initial certificate just requires running a command line program, and once we've worked out how to handle any particular program (since LE's good for more than web servers), we can completely stop worrying about certificate renewals. It just quietly happens and everything works and we don't notice a thing. The LE client that we've wound up using all of the time is Hugo Landau's acmetool, which is what I settled on myself. Acmetool has proven to be reliable and easy to tweak so it supports various programs like Dovecot and Exim.

(Our current approach to satisfying Let's Encrypt challenges is to let HTTP from the Internet through to any machine that needs a TLS certificate, whether or not it normally runs a web server. Acmetool will automatically run its own web server while a challenge is active, if necessary.)

Using acmetool or any other suitable Let's Encrypt client is not the only way of automating TLS certificate updates, but it has the great advantage for us that it comes basically ready to go. In our environment there's almost nothing to build to support new TLS-using programs and almost nothing special to do to set acmetool up on any particular machine (and we have canned directions for the few steps required). People with existing modern automation infrastructure may already have this solved, and so may find Let's Encrypt less compelling than we do.

Almost two years ago I wrote about how we couldn't use Let's Encrypt for production due to rate limits. What's changed since then is that Let's Encrypt's current rate limits specifically exempt certificate renewals from their 'certificates per registered domain' limit. This means that if we can get an initial certificate for a host, we're basically sure to be able to renew it, which is the important thing for us. If the initial issuance fails, that's when we can turn to alternate CAs (but for the names we want it almost never does).

PS: Since automation is such a big motivation for us, what sold us is not Let's Encrypt by itself but acmetool. In a real sense, we're indifferent to which TLS certificate provider is behind acmetool, and if we could get free certificates from central IT just as easily (perhaps literally using LE's ACME protocol), we'd be happy to do just that. But at least for now, Let's Encrypt and ACME are effectively conjoined.

LetsEncryptSwitchover written at 00:27:29

2017-12-01

I'm basically giving up on syslog priorities

I was recently writing a program where I was logging things to syslog, because that's our default way of collecting and handling logs. For reasons beyond the scope of this entry I was writing my program in Go, and unfortunately Go's standard syslog package makes it relatively awkward to deal with varying syslog priorities. My first pass at the program dutifully slogged through the messy hoops to send various messages at different priorities, going from info for routine events, to err for reporting significant but expected issues, and ending up at alert for things like 'a configuration file is broken and I can't do anything'. After staring at the resulting code for a while with increasingly unhappy feelings, I ripped all of it out in favour of a much simpler use of basic Go logging that syslogged everything at priority info.
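To illustrate, here's a minimal sketch of the simpler version, using Go's standard log and log/syslog packages; the syslog tag and the messages are invented for the example:

    package main

    import (
        "log"
        "log/syslog"
    )

    func main() {
        // Send everything to syslog at one priority (info) on the daemon
        // facility instead of picking a different priority per message.
        w, err := syslog.New(syslog.LOG_INFO|syslog.LOG_DAEMON, "myprog")
        if err != nil {
            log.Fatal(err)
        }
        log.SetOutput(w)
        log.SetFlags(0) // syslog adds its own timestamps

        log.Printf("did my thing with client machine %s", "somehost")
        log.Printf("configuration file %s is broken, can't do anything", "/etc/myprog.conf")
    }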

At a theoretical level, this is clearly morally wrong. Syslog priorities have meanings and the various sorts of messages my program can generate are definitely of different importance to us; for example, we care far more about 'a configuration file is broken' than 'I did my thing with client machine <X>'. At a practical level, though, syslog priorities have become irrelevant and thus unimportant. For a start, we make almost no attempt to have our central syslog server split messages up based on their priority. The most we ever look at is different syslog facilities, and that's only because it helps reduce the number of messages to sift through. We have one file that just gets everything (we call it allmessages), and often we just look or search there for whatever we're interested in.

In my view there are two pragmatic reasons we've wound up in this situation. First, the priority that a particular message of interest is logged at is something we'd have to actively remember in order for it to be of use. Carefully separating out the priorities into different files only actually helps us if we can remember that we want to look at, say, all.alert for important messages from our programs. In practice we can barely remember which syslog facility most things use, which is one reason we often just look at allmessages.

More importantly, we're mostly looking at syslog messages from software we didn't write, and it turns out that the syslog priorities that get used are both unpredictable and fairly random. Some programs dump things we want to know all the way down at priority debug; others spray unimportant issues (or what we consider unimportant) over nominally high priorities like err or even crit. This effectively contaminates most syslog priorities with a mixture of messages we care about and messages we don't, and also makes it very hard to predict what priority we should look at. We're basically down to trying to remember that program <X> probably logs the things we care about at priority <Y>. There are a bunch of program <X>s and in practice it's not worth trying to remember how they all behave (and they can change their minds from version to version, and we may have both versions on our servers on different OSes).

(There is a similar but somewhat smaller issue with syslog facilities, which is one reason we use allmessages so much. A good illustration of this is trying to predict or remember which messages from which programs will wind up in facility auth and which wind up in authpriv.)

This whole muddle of syslog priority usage is unfortunate but probably inevitable. The end result is that syslog priorities have become relatively meaningless and so there's no real harm in me giving up on them and logging everything at one level. It's much more important to capture useful information that we'll want for troubleshooting than to worry about what exact priority it should be recorded at.

(There's also an argument that fine-grained priority levels are the wrong approach anyway and that you have maybe three or four real priority levels at most. Some people would say even fewer, but I'm a sysadmin and biased.)

SyslogPrioritiesGivingUp written at 23:23:03

2017-11-27

The dig program now needs some additional options for useful DNS server testing

I've been using the venerable dig program for a long time as my primary tool to diagnose odd name server behavior. Recently, I've discovered that I need to start using some additional options in order for it to make useful tests, where by 'useful tests' I mean that dig's results correspond to results I would get through a real DNS server such as Unbound.

(Generally my first test with DNS issues is just to query my local Unbound server, but if I want to figure out why that failed I need some tool that will let me find out specific details about what didn't work.)

For some time now I've known that some nameservers reject your queries if you ask for recursive lookups, so I try to use +norecurs. In exploring an issue today, I discovered that some nameservers also don't respond if you ask for some EDNS options, which, it turns out, dig now sets by default. Specifically, they don't respond to DNS queries that include an EDNS COOKIE option, although they will respond to queries that are merely EDNS without the COOKIE option or any other options.

(Some experimentation with dig suggests that including any EDNS option causes these DNS servers to not respond. I tried both +nocookie +nsid and +nocookie +expire, and neither got a response.)

This means that for testing I now want to use 'dig +norecurs +nocookie', at least. It's possible that I want to go all the way to 'dig +norecurs +noedns', although that may be sufficiently different from what modern DNS servers send that I'll get failures when a real DNS server would succeed. I expect that I'm going to want to wrap all of this in a script, because otherwise I'll never remember to set all of the switches all of the time and I'll sometimes get mysterious failures.

(Some experimentation suggests that my Unbound setup sends EDNS0 queries with the 'DNSSEC Okay' bit set and no EDNS options, which would be 'dig +norecurs +nocookie +dnssec' if I'm understanding the dig manpage correctly. These options appear to produce DNS queries that the balky DNS server will respond to. With three options, I definitely want to wrap this in a script.)
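The wrapper script itself would be close to trivial; here's a sketch (the exact option set is still subject to tinkering):

    #!/bin/sh
    # Query a DNS server roughly the way a real resolver would: no
    # recursion desired, no EDNS COOKIE, but with the 'DNSSEC Okay' bit.
    exec dig +norecurs +nocookie +dnssec "$@"

(Then I'd run it as, say, 'digq www.example.org A @some.dns.server'.)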

What this suggests to me in general is that dig is not going to be the best tool for this sort of thing in the future. The people behind dig clearly feel free to change its default behavior, and in ways that don't necessarily match what DNS servers do; future versions may include more such changes, causing more silent failures or behavior differences until I notice and carefully read the manpage to find out what to turn off in the new version.

(A casual search turns up drill, which is another thing from NLNet Labs, the authors of Unbound and NSD. Like dig, it defaults to 'recursion allowed' queries, but that's probably going to be a common behavior. Drill does have an interesting -T option to do a full trace from the root nameservers on down, bypassing whatever your local DNS resolver may have cached. Unfortunately it doesn't have an option to report the IP address of the DNS server it gets each set of answers from; you have to go all the way to dumping the full queries with -V 5.)

DigOptionsForUsefulTests written at 02:20:26

2017-11-17

When you should run an NTP daemon on your servers

Yesterday I made the case for why you mostly shouldn't run an NTP daemon and should instead synchronize time on your servers through periodic use of ntpdate or similar programs. However, mostly is not all of the time and I think that there are times when running an NTP daemon is the right answer.

So here is a list of when I'd run an NTP daemon:

  • Your servers need accurate time, where they're always within a few milliseconds or less of true time (for some definition of true time).

    If you need this, you'll want to carefully design your local NTP setup, including both your server hierarchy and your external time sources (you want stratum 1 time sources, ideally ones where you have stable network paths to them). In extreme cases you'll want to set up your own stratum 1 server based on, for example, GPS time.

  • Your servers need synchronized time, where they're always within a few milliseconds or less of each other.

    The canonical case for tightly synchronized time is a set of NFS fileservers, where clients write files to multiple fileservers and want them all to have the same timestamps (with high precision). Synchronized time is less demanding than accurate time; you just need an NTP server hierarchy where everyone synchronizes to a core set of NTP servers and those core servers get good enough time from the outside world.

    (Another case is synchronized timestamps across multiple machines.)

  • You absolutely can't have time go backwards, even a little bit, even if the server's time is badly off. A stronger version is that you can never have forward clock jumps either, even large ones; the clock must always slew slowly, even if it takes a long time to adjust to true time.

    Modern versions of ntpdate, sntp, systemd-timesyncd and so on can slew the clock for modest adjustments. However, for large adjustments this takes sufficiently long that you need an always running daemon to supervise and fine-tune the process.

    (Note that the famous Cloudflare incident was not a case of time going backward due to time synchronization and wouldn't have been prevented by running an NTP daemon. If anything, it might have been caused by running one.)

  • You need ongoing monitoring of the clock state on your servers.

    An NTP daemon on each server is a good way to keep tabs on the clock state of all of them. Querying the full NTP time source parameters will give you additional warning markers, such as unusual delays or dispersions for your NTP time sources.

  • Your servers have absolutely terrible local clocks that drift rapidly; for example, they might be off by a second in the span of ten minutes. An always-running NTP daemon is a better way of reining this in than running ntpdate or another client every few minutes.

    (Basically the NTP daemon is going to be slewing the clock either frequently or all the time, just to keep it under control.)

  • You have only a few servers, they can talk to the Internet, and you want a hassle-free way of giving them decent time.

    The preferred NTP daemon on modern Unixes generally comes with a sensible default configuration that will get time from some suitable pool of NTP servers out on the Internet. Depending on what pool it's pointed at, you may even get NTP servers that are close to you (in a network sense). Install the daemon, make sure it's enabled, and almost all of the time your servers will wind up with perfectly good time. If you're not sure what the time state of a server is, query the local NTP daemon and it'll tell you.

    (Of course, your Linux may already be doing this with something like systemd-timesyncd, in which case you don't need to do anything.)

(There are probably additional cases I'm not thinking of right now.)

Finally, you may want to have an NTP daemon on one or two (or three) servers to act as the local source of NTP time that everyone synchronizes to through ntpdate cron entries or the like. I don't consider this the same sort of thing as the above cases, because you're using the NTP daemons as infrastructure instead of directly as time sources for servers; that they also maintain time on their host servers is kind of an incidental side effect (although a useful one).

As you might suspect from this list, our fileservers run NTP daemons so that we have coherent, synchronized time across our NFS environment. Nothing else does (apart from the NTP daemons that act as local time sources).

NTPDaemonWhen written at 01:41:42

2017-11-16

I think you should mostly not run NTP daemons on your machines

In my entry on switching from ntpd to chrony, I mentioned that we don't have many machines that run full time NTP daemons. In reaction, Sotiris Tsimbonis asked in his comment:

You mean you don't have many machines that run full time NTP daemons and service others as a time source, right?

How do you keep time synchronized in your systems if not by running ntpd? an ntpdate cronjob?

This brings up a heretical position of mine.

I'm a professed time maven. Not only do I run NTP daemons on my workstations, but I tinker with their configuration and server lists and enjoy checking in on their NTP synchronization status (it's fun in various ways, honest). Despite all of my enthusiasm for NTP and good time, I think that you should not run NTP daemons on your servers, especially in anything resembling a common default configuration, unless you have special needs and know what you're doing. Instead you should have almost all of your machines set their time from a trusted upstream source on boot and every so often afterward (once an hour is often convenient). This is what we do, and not just because it's easier.

In most situations, the most important thing for server time is that all of your servers are pretty close to each other. It is better that they all be wrong together than some of them be right and others be wrong, and if a server is out of sync you want it to be corrected right away rather than be slowly guided back to correct time. And you want this to happen reliably, without needing monitoring and remediation.

(If you think you're going to monitor and remediate time issues across your server fleet, ask yourself what you'll do if you detect an out of sync server. If the answer is 'reset its time', then you might as well automate that.)

An NTP daemon is usually not the best way to achieve this. NTP daemons are normally biased toward being cautious about trusting upstream time sources and toward changing the system clock slowly, without abrupt jumps; this famously leads to various problems if your system winds up with its clock significantly out (some NTP daemons have historically given up entirely in that case). Even once you've configured your NTP daemon not to have these problems, you still need to worry about what happens if the daemon dies or stops doing anything.

(The normal biases of NTP daemons make sense in an environment where you're talking to a random collection of time sources outside of your control, some of which may be broken or even vaguely malicious.)

Modern servers in good operating condition in a decent environment don't have their time drift very much over the course of an hour (our typical adjustment is under a millisecond). Cron is reliable (and if it dies you have bigger problems than time synchronization) and it's straightforward to write a little script that force-sets the server's time from a local NTP server (your OS may already come with one). If you're worried about the NTP server being a single point of failure, run two or three. You're still going to want to monitor the health and time synchronization of your NTP server (or servers), but at least you only have a few of them.
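For illustration, here's a sketch of the sort of script and cron entry involved; the NTP server names are invented and the ntpdate path and options may vary on your systems:

    #!/bin/sh
    # /usr/local/sbin/settime: force-set the clock from our local NTP
    # servers. -u uses an unprivileged source port, which tends to get
    # through firewalls more easily.
    /usr/sbin/ntpdate -u ntp1.example.com ntp2.example.com >/dev/null 2>&1

    # /etc/cron.d/settime: resync once an hour, at a fixed minute.
    17 * * * *  root  /usr/local/sbin/settime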

There are situations where you need better time than this and you understand why (and how it has to be better). That's when you turn to running an NTP daemon on every server involved (among other things, like carefully considering where you're ultimately getting your NTP time from). Not before then.

NTPDaemonWhyAvoid written at 01:03:41

2017-11-15

I've switched from ntpd to chrony as my NTP daemon

I've been running some version of NTP(D) on my machines for a very long time now. In the early 1990s the University of Toronto was lucky enough to have Dennis Fergusson, who was very interested in timekeeping and wrote a version of ntpd; I caught my interest in NTP from the general UofT Unix sysadmin environment at the time and have kept it ever since.

When I first vaguely noticed chrony, it was on my Fedora laptop; way back in Fedora 11, Fedora switched to chrony by default. The release notes at the time made it sound like chrony was just a client and was focused on laptops and other frequently disconnected machines, so I didn't pay much attention to it. I let a Fedora upgrade switch my laptop over, because why not, but otherwise I kept on running ntpd without thinking twice. Over time this got a little more annoying on my desktop machines, because Fedora kept trying to switch them over and I kept having to reverse that and block chrony every few Fedora version upgrades so that I could keep running my old faithful setup.

I'm not sure what caused me to take an actual look at chrony, but in late September I did just that. This time around I read the chrony web pages and thus discovered that chrony is a full featured NTP daemon, just like ntpd. That definitely made me look at chrony in a new light, as did chrony's comparison page. I'm occasionally given to sudden impulses, so I decided to switch over more or less on the spot; my logs say that I shut down ntpd and started chrony on my office workstation shortly before noon on September 25th (and then on my home machine the next morning). This turned out to be interesting timing, as shortly afterward the Core Infrastructure Initiative released Securing Network Time, where chrony came out by far the best of the three NTP implementations that were evaluated.

The CII article indirectly explains why I was willing to consider switching. There's a quiet schism going on in the NTP world, with a group of people forking the main NTP code to develop 'NTPSec'; infosec people whose views I respect are quite down on the result, and I haven't been terribly impressed by what I've read about the project. At the same time, the NTP code itself is acknowledged to be old and crusty, which is not a great thing for either security or its long term future. Once I found out that chrony was a full featured NTP daemon written from scratch, with modern code and active maintenance, switching seemed like not a bad idea.

(I'd previously checked out Poul-Henning Kamp's quite interesting Ntimed as another potential ntpd replacement, but sadly it went dormant.)

I'm broadly pleased with the result of switching. Chrony has been easier to configure and the result mostly works the way I want. The daemon seems to work just as well as ntpd and my time stays synchronized, just as before. There are some things from ntpq that I miss, especially the ability to easily see what my time sources are themselves synchronized to, but I'll survive. On the positive side, chrony has some useful additional features for my home machine, such as the explicit ability to tell the daemon that we're about to go offline.
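(For what it's worth, the offline handling is just a couple of chronyc commands, run as root or with suitable command access:

    chronyc offline    # before taking the network link down
    chronyc online     # once connectivity is back

Chrony then stops polling its time sources until it's told that they're reachable again.)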

We don't have many machines that run full time NTP daemons, but in the future I'm going to propose setting up such machines with chrony instead of ntpd if chrony is packaged for their OS. At this point, sadly I have a lot more trust in ongoing maintenance and support for chrony than I do for NTP.

There's a part of me that's a little bit sad about this because, as mentioned, I have been running ntpd for a very long time. Even though I'm still keeping up with time keeping, switching to something else feels like the end of an era. It's one more link to history quietly slipping away.

NtpdToChrony written at 01:53:41

2017-11-09

Why I'm not enthused about live patching kernels and systems

So I said a thing on Twitter:

Unpopular Unix sysadmin opinion: long system uptimes are generally a sign of multiple problems (including lack of security updates).

(This is a drum that I've beaten before, more or less.)

In response a number of people raised the possibility of long uptimes through kernel live patching (eg), and I was rather unenthused about the idea and the technology involved. My overall view is that live patching really needs to be designed into the system from the ground up in order to work well; otherwise it is a hack to avoid having to reboot. This hack may be justified under some exceptional circumstances, but I'm not enthused about it being used routinely.

Live patching has two broad pragmatic problems as generally implemented. First, the result of live patching a server is a machine where the system in memory is not the same as the system that you'd get after a reboot. You certainly hope that they're very similar, similar enough that you can assume 'it works now' means 'it'll work after a reboot', but you can't be sure. This is the same fundamental problem we have on a larger scale with servers that were installed a while ago and then updated in place, where they're not the same as what you'd get if you rebuilt them from scratch now (although you sure hope that they're close enough). The pragmatic problems with this have increasingly driven people to designs involving immutable system installs (in one way or another), where you never try to update things in place and always rebuild from scratch.

The second is that live patching becomes a difficult technical problem when data structures in memory change between versions of the code, for example by being reshaped or by having the semantics of fields change (including such things as new flag values in flag fields). To really deal with this you need to disallow such changes entirely, do flawless on-the-fly translation of data structures between their old and new versions, or have data structures be versioned and have the code be able to deal with all versions (which is basically on-the-fly translation implemented in the OS code instead of the patcher). This problem is sufficiently hard that many live patching systems simply punt and explicitly can't be used if data structures have changed (it looks like all of the Linux kernel live patching systems take this approach).

(Not supporting live patching if data structures have changed semantics in any way does open up the interesting question of how you detect if this has happened. Do you rely on people to carefully read the code changes by hand and hope that they never make a mistake?)

You can build a system that's designed to make live patching fully reliable even in the face of these issues. But I don't think you can add live patching to an existing system (well) after the fact and get this; you have to build in handling both issues from the ground up, and this is going to affect your system design in a lot of ways. For instance, I suspect it drives you towards a lot of modularity and explicitly codified (and enforced) internal APIs and message formats. No current widely used Unix system has been designed this way, and so all kernel live patching systems for them are hacks.

(Live patching a kernel is a technically very neat hack, as is automatically analyzing source code diffs in order to determine what to patch where in the generated binary code. I do admire the whole technology from a safe distance.)

LivePatchingWhyNot written at 00:15:48

2017-10-25

Having different commands on different systems does matter

In a comment on yesterday's entry about our frustrations with OmniOS lacking a lot of normal system commands, opk wrote:

As Chris mentions, there are alternatives and on a Solaris system, that means prstat and snoop. And for the vast majority of cases, they are more than just enough. Maybe they don't have quite so many options but it's unfair to criticize the lack of top and tcpdump if it is just that you're used to typing those command names on Linux.

I'm afraid that I have to disagree; I think it absolutely is fair to criticize a modern Unix system for being different in this way. In fact, I think such gratuitous differences should be criticized regularly.

If you operate in a homogeneous environment (in this case, all Illumos-based) then yes, this doesn't really matter; at most you have an initial learning process as you come into the environment. But if you operate in a heterogeneous environment, every divergence between different Unixes is a point of friction and an overhead. It's yet another thing to remember (or to rediscover), and although we might think otherwise, we have only a finite capacity to remember and keep track of this sort of thing. It is simpler and better to have tcpdump and top on every system than to have to keep track of how each different system does these things.

(I don't object if a system wants to have extra commands, including ones that exist because they do more than the now-standard ones; that's up to it. I just want the standard set too.)

Today, pragmatically, merely having equivalent commands under different names is not good enough. Unixes that do this create annoyance and irritation in sysadmins who have to use them in heterogeneous environments (and a lot of people do). You can say that it shouldn't be this way, but that's not solving the real problem. In fact, people adopted many of these tools in the first place not just because they were good but also because they created uniformity across your systems, even if you had to get there by compiling things yourself.

(To be clear, this is independent of either the features of the commands or what command line arguments they take, although differences there don't help matters. Merely renaming tcpdump to networkdump is enough to make life irritating.)

CommandDifferencesMatter written at 01:12:55

2017-10-17

My current grumpy view on key generation for hardware crypto keys

I tweeted:

My lesson learned from the Infineon HSM issue is to never trust a HSM to generate keys, just to store them. Generate keys on a real machine.

In my usual manner, this is perhaps overstated for Twitter. So let's elaborate on it a bit, starting with the background.

When I first heard about the Infineon TPM key generation issue (see also the technical blog article), I wasn't very concerned, since we don't have sophisticated crypto smartcards or electronic ID cards or the like. Then I found out that some Yubikeys are affected and got grumpy. When I set up SSH keys on my Yubikey 4, I had the Yubikey itself generate the RSA key involved. After all, why not? That way the key was never exposed on my Linux machine, even if the practical risks were very low. Unfortunately, this Infineon issue now shows the problem in that approach.

In theory, a hardware key like the Yubikey is a highly secure physical object that just works. In practice they are little chunks of inexpensive hardware that run some software, and there's nothing magical about that software; like all software, it's subject to bugs and oversights. This means that in practice, there is a tradeoff about where you generate your keys. If you generate them inside the HSM instead of on your machine, you don't have to worry about your machine being compromised or the quality of your software, but you do have to worry about the quality of the HSM's software (and related to that, the quality of the random numbers that the HSM can generate).

(Another way to put this is that an HSM is just a little computer that you can't get at, running its own collection of software on some hardware that's often pretty tiny and limited.)

As a practical matter, the software I'd use for key generation on my Linux machine is far more scrutinized (especially these days) and thus almost certainly much more trustworthy than the opaque proprietary software inside an HSM. The same is true for /dev/urandom on a physical Linux machine such as a desktop or a laptop. It's possible that an HSM could do a better job on both fronts, but it's extremely likely that my Linux machine is good enough on both. That leaves machine compromise, which is a very low probability issue for most people. And if you're a bit worried, there are also mitigation strategies for the cautious, starting with disconnecting from the network, turning off swap, generating keys into a tmpfs, and then rebooting your machine afterward.
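For the cautious, here's a sketch of what that might look like on a Linux machine; the paths and key parameters are only examples, and you'd want to do this at the console rather than over the network you've just unplugged:

    # with the network disconnected:
    swapoff -a                    # keep key material out of swap
    mkdir -p /mnt/keytmp
    mount -t tmpfs -o size=16m,mode=0700 tmpfs /mnt/keytmp
    ssh-keygen -t rsa -b 2048 -f /mnt/keytmp/yubikey-rsa
    # ... load the new key into the HSM with its own import tool ...
    umount /mnt/keytmp
    reboot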

Once upon a time (only a year ago), I thought that the balance of risks made it perfectly okay to generate RSA keys in the Yubikey HSM. It turns out that I was wrong in practice, and now I believe that I was wrong in general, at least for me and most people. I now feel that the balance of risks strongly favours trusting the HSM as little as possible, which means only trusting it to hold keys securely and perhaps to limit their use to when the HSM is unlocked or the key usage is approved.

(This is actually giving past me too much credit. Past me didn't even think about the risk that the Yubikey software could have bugs; past me just assumed that of course it didn't and was therefore axiomatically better than generating keys on the local machine and moving them into the HSM. After all, who would sell an HSM that didn't have very carefully audited and checked software? I really should have known better, because the answer is 'nearly everyone'.)

PS: If you have a compliance mandate that keys can never be created on a general-purpose machine in any situation where they might make it to the outside world, you have two solutions (at least). One of them involves hope and then perhaps strong failure, as here with Infineon, and one of them involves a bunch of work, some persuasion, and perhaps physically destroying some hardware afterward if you're really cautious.

KeyGenerationAndHSMs written at 00:17:55
