Wandering Thoughts


The shutdown command is a relic of BSD's historical origins

Most of the remaining Unixes these days have shutdown commands that all look much the same; the FreeBSD manpage is fairly typical, and you can compare it to the Linux manpage. Even Illumos has a version with broadly similar features. I suspect that a bunch of people have used shutdown periodically without thinking about the actual command very much. But if you look at it, shutdown is an interesting relic of BSD's origins as a timesharing system.

I call shutdown a relic here because of its behavior if you tell it to shut the system down at some point in the future, say fifteen minutes from now. If you've never run shutdown this way, what it does is start broadcasting repeated messages about the impending shutdown via wall (or an internal equivalent), disable further logins to the machine shortly before the reboot or shutdown time, and then do the reboot at that time. All of this makes perfect sense in a timesharing environment where the major use of the system is from a bunch of people who are logged in via terminals (whether real serial ones or over the network).

However, this way of doing an extended shutdown doesn't make much sense outside of that sort of timesharing setup, which we can see by what it doesn't have. First, it has no way of notifying people other than sending wall messages to terminal sessions that are visible in utmp. Back in the day, of course, you pretty much couldn't be logged in without being in utmp; these days, you might be logged in but not even running a terminal emulator program (never mind whether or not you're paying attention to it instead of your mail client). A modern take on shutdown would probably be built using a more general (and more complicated) notification scheme.

Second, there's no mechanism to tell running servers and daemons and the like about the impending shutdown, or even ask them to stop taking on new work at the same point that shutdown locks out new regular user logins. The BSD shutdown simply assumes that daemons are unimportant, don't need any advance notice, and can be abruptly dealt with when the shutdown time is reached. This was more or less true on the original Vaxen running 4.x BSD in the 1980s, but is definitely no longer true today on many servers. Almost no one logs into our web server or our mail server or our IMAP server, but they're all in constant use. In those environments, 'shutdown -r 18:00 "some nice message"' is more or less equivalent to 'echo reboot | at 18:00' as far as the net effects go (ie, at 6pm the services all abruptly vanish).

That shutdown is a relic of BSD's origins isn't bad as such, not as far as I'm concerned. It's just interesting. I have a peculiar affection for these historical oddities and lingering remnants of Unix's past.

(Part of it is that it's a reminder that Unix came from deep roots and hasn't had them all carefully scrubbed away and hidden like awkward relatives.)

unix/ShutdownBSDTimesharingRelic written at 01:09:13; Add Comment


I like the Python 3 string .translate() method

Suppose, hypothetically, that you wanted to escape the & character in text as a HTML entity:

txt = txt.replace('&', '&amp;')

Okay, maybe there's a character or two more:

txt = txt.replace('<', '&lt;')

And so it goes. The .replace() string method is an obvious and long-standing hammer, and I've used it to do any number of single-character replacements over the years (as well as some more complicated multi-character ones, such as replacing \r\n with \n).

Recently I was working on my Exim attachment type logger, and more specifically I was fixing its handling of odd characters in the messages that it logged as part of making it work in Python 3. My Python 2 approach to this was basically to throw repr() at the problem and forget about it, but using repr() for this is a hack (especially in Python 3). As part of thinking about just what I actually wanted, I decided that I wanted control characters to be explicitly turned into some sort of clear representation of themselves. This required explicitly remapping and replacing them, and I needed to do this to a fair number of characters.

At first I thought that I would have to do this with .replace() (somehow) or a regular expression with a complicated substitution or something equally ugly, but then I ran across the Python 3 str.translate() method. In Python 2 this method is clearly very optimized but also only useful for simple things, since you can only replace a character with a single other character. In Python 3, .translate() has become much more general; it takes a dictionary of translations and the values in the dictionary don't have to be single characters.
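As a small illustration of that generality (the table and test string here are just made-up examples, not code from my program), a Python 3 translation table can map a character's ordinal straight to a multi-character replacement string:

# Python 3: translation values can be whole strings (or None to delete).
table = {ord("&"): "&amp;", ord("<"): "&lt;", ord(">"): "&gt;"}
print("a < b && c".translate(table))
# prints: a &lt; b &amp;&amp; c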

So here's what my handling of control characters now looks like:

# ctrl-<chr> -> \xNN escape
ctrldict = {c: "\\x%02x" % c for c in range(0, 32)}
ctrldict[127] = "\\x7f"
# A few special characters get special escapes
ctrldict[ord("\n")] = "\\n"
ctrldict[ord("\r")] = "\\r"
ctrldict[ord("\t")] = "\\t"
ctrldict[ord("\\")] = "\\\\"

def dectrl(msg):
    return msg.translate(ctrldict)

That was quite easy to put together, it's pretty straightforward to understand, and it works. The only tricky bit was having to read up on how the keys for the translation dictionaries are not characters but the (byte) ordinal of each character (or the Unicode codepoint ordinal if you want to be precise). Once I found .translate(), the whole exercise was much less annoying than I expected.

Python 2's string .translate() still leaves me mostly unenthused, but now that I've found it, Python 3's has become an all purpose tool that I'm looking forward to making more use of. I have any number of habitual uses of .replace() that should probably become .translate() in Python 3 code. That you can replace a single character by multiple characters makes .translate() much more versatile and useful, and the simplified calling sequence is nice.

(Python 3's version merges the Python 2 deletechars into the translation map, since you can just map characters to None to delete them.)

PS: Having read the documentation a bit, I now see that str.maketrans() is the simple way to get around the whole ord() stuff that I'm doing in my code. Oh well, the original code is already written. But I'll have to remember maketrans() for the future.
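For what it's worth, here's a quick sketch of what using str.maketrans() would look like here; the particular mappings are just examples. It takes a dictionary keyed by characters (doing the ord() conversion for you), and mapping a character to None deletes it:

ctrlmap = str.maketrans({"\n": "\\n", "\r": "\\r", "\t": "\\t", "\x00": None})
print("a\tb\r\nc\x00".translate(ctrlmap))
# prints: a\tb\r\nc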

(The performance and readability of .replace() versus .translate() is something that can be measured (for performance) and debated (for readability). I haven't made any performance measurements and I don't really care for most of my code. As far as readability, probably I'll conclude that .translate() wins if I'm doing more than one or two substitutions.)

python/Python3StringTranslatePraise written at 00:07:24; Add Comment


Writing in Python 3 has been a positive experience so far

As I've mentioned in passing before, I have a little Python thing to log MIME attachment type information from Exim and I converted it from Python 2 to Python 3 about a month and a half ago. The direct reason for doing this was a relatively small one: the Python 3 version of the zipfile module could automatically handle ZIP archives (and tar files) that used XZ compression, which people were sending us. My indirect reason was that I haven't really done anything with Python 3 yet and it's clear to me that it's the future, so if I had an excuse it felt like it was time for me to get some experience.

The whole conversion went surprisingly smoothly. In large part this was because I'd written relatively clean modern Python code and I wasn't interpreting raw bytes off disk (or out of the network). Since I was mostly calling modules I could rely on them to do all of the hard work of thinking about Unicode; the actual change wound up being very small as a result. Of course that first pass involved some decisions that were kind of a hack, so I wound up having to do some real Unicode handling. I actually feel that this was a positive change for the code in general, since it forced me to think carefully about what I really wanted to do here instead of brushing it under the carpet of 'let's just spray random raw bytes at syslog and standard output'.

(Part of it is that doing the Python 3 equivalent of what I usually do here in Python 2 was just that little extra bit too much of a hack for me to accept.)

Python 3 character encoding issues were not painless and did force me to do some digging. In an ideal world this would be better documented, but on the other hand I don't entirely mind having to do some code reading (but that's just me, other people could wind up more irritated). I feel, perhaps incorrectly, that wrestling with these issues here has made me better prepared to deal with similar ones in the future.

So overall I'd call this a complete success. Moving to Python 3 got me real benefits, caused me to clean up the code somewhat, and wasn't particularly painful. I now feel much more positive about doing more substantial work in Python 3 at some point, and just generally working with/in Python 3 in the future.

(I'm still unlikely to convert any of our existing Python code over to Python 3 unless I get some clear benefit from it, the way I did here. I'm not yet that attracted to Python 3, and besides my co-workers would rather that I left well enough alone.)

python/Python3PositiveExperience written at 00:48:27; Add Comment


Making my Yubikey work reasonably with my X screen locking

When I moved from unencrypted SSH keys to encrypted SSH keys held in a running ssh-agent process, I arranged things so that the keys would be removed when I locked my screen (which I do frequently) and then unlocked and added again when I unlocked my screen; I wrote this up as part of this entry. Soon after I started playing around with having SSH keys in my Yubikey, it became clear to me that I needed to do the same thing with the Yubikey's SSH keys. More specifically, I needed to automatically re-add the Yubikey's keys when I unlocked the screen, which means (automatically) providing the Yubikey's PIN code to ssh-add instead of being constantly prompted for it every time I unlocked my screen. Typing two passwords at screen unlock time is just a bit too irritating for me; inevitably it would discourage me from routinely using the Yubikey.

(Removing the Yubikey keys from ssh-agent happens automatically when I run 'ssh-add -D' as part of starting the screen locker, although I've also added an explicit removal of the PKCS#11 SSH agent stuff. You actually want to do this because otherwise the PKCS#11 SSH agent stuff gets into a weird state where it's non-functional but loaded, so you can't just do 'ssh-add -s' to get it going again.)

As I sort of mentioned in passing in my entry on how I set up SSH keys on my Yubikey, the Yubikey's PIN code allows more or less full alphanumerics, so in theory I could just make the PIN code the same as my regular SSH key password and then use the obvious extension of the Perl script from this entry to also feed it to ssh-add when I re-enable PKCS#11 stuff. However, after thinking about it I decided that I wasn't entirely comfortable with that; too many tools for dealing with the Yubikey are just a little bit too casual with the PIN code for me to make it something as powerful and dangerous as my regular password.

(For example, a number of them want the PIN provided in plain text on the command line. I'm not doing that with my regular password.)

This left me with the problem of going from my regular password to the Yubikey PIN. The obvious answer is to encrypt a file with the PIN in it with my regular password, then decrypt it on the fly in order to feed it to ssh-add. After some searching I settled on doing this with ccrypt, which is packaged for Fedora and which has an especially convenient mode where you can feed it the key as the first line of input, with the encrypted file following immediately afterwards.

So now I have a little script that takes my regular password on standard input (fed from the Perl script I run via xlock's -pipepassCmd argument) and uses it to decrypt the PIN file and feed it to ssh-add. It looks like this:

# drop PKCS#11 stuff; required to re-add it
ssh-add -e /usr/lib64/opensc-pkcs11.so >/dev/null 2>&1
# $CRYPTLOC is the ccrypt-encrypted file holding the Yubikey PIN.
# The first line of our standard input is my regular password; ccat
# uses it as the decryption key for the PIN file that follows it.
# give ssh-add no way to ask us for the passphrase
(sed 1q; cat "$CRYPTLOC") | ccat -k - | \
   notty ssh-add -s /usr/lib64/opensc-pkcs11.so

The one peculiar bit is notty, which is a little utility program that runs another program without a controlling terminal. If you run ssh-add this way, it reads the PKCS#11 PIN from standard input, which is just what I want here. The reason I need notty at all is that the Perl script runs this script via (Perl) Expect, which means that it's running with a pty.

(There are alternate ways to arrange things here, but right now I prefer this approach.)

(See my first Yubikey entry for a discussion of when you need to remove and re-add the PKCS#11 SSH agent stuff. The short version is any time that you remove and reinsert the Yubikey, drop SSH keys with 'ssh-add -D' (as we're doing during screen locking), or run various commands to poke at the Yubikey directly.)

PS: I've come around to doing 'ssh-add -e' before almost any attempt to do 'ssh-add -s'. It's a hack and in an ideal world it wouldn't be necessary, but there's just too many situations where ssh-agent can wind up with PKCS#11 stuff loaded but non-functional and the best (and sometimes only) way to clean this up is to remove it and theoretically start from scratch again. Maybe someday all of this will be handled better. (Perhaps gpg-agent is better here.)

linux/YubikeyAndScreenLocking written at 23:10:52; Add Comment


Why we care about long uptimes

Here's a question: why should we care about long uptimes, especially if we have to get these long uptimes in somewhat artificial situations like not applying updates?

(I mean, sysadmins like boasting about long uptimes, but this is just boasting. And we shouldn't make long uptimes a fetish.)

One answer is certainly 'keeping your system up avoids disrupting users'. Of course there are many other ways to achieve this, such as redundancy and failure-resistant environments. The whole pets versus cattle movement is in part about making single machine uptime unimportant; you achieve your user visible uptime by a resilient environment that can deal with all sorts of failures, instead of heroic (and artificial) efforts to keep single machines from rebooting or single services from restarting.

(Note that not all environments can work this way, although ours may be an extreme case.)

My answer is that long uptimes demonstrate that our systems are fundamentally stable. If you can keep a system up and stable for a long time, you've shown that (in your usage) it doesn't have issues like memory leaks, fragmentation, lurking counter rollover problems, and so on. Even very small issues here can destabilize your system over a span of months or years, so a multi-year uptime is a fairly strong demonstration that you don't have these problems. And this matters because it means that any instability problems in the environment are introduced by us, and that means we can control them and schedule them and so on.

A system that lacks this stability is one where at a minimum you're forced to schedule regular service restarts (or system reboots) in order to avoid unplanned or unpleasant outages when the accumulated slow problems grow too big. At the worst, you have unplanned outages or service/system restarts when the system runs itself into the ground. You can certainly deal with this with things like auto-restarted programs and services, deadman timers to force automated reboots, and so on, but it's less than ideal. We'd like fundamentally stable systems because they provide a strong base to build on top of.

So when I say 'our iSCSI backends have been up for almost two years', what I'm really saying is 'we've clearly managed to build an extremely stable base for our fileserver environment'. And that's a good thing (and not always the case).

sysadmin/LongUptimesImportance written at 23:55:29; Add Comment

How I managed to shoot myself in the foot with my local DNS resolver

I have my home machine's Twitter client configured so that it opens links in my always-running Firefox, and in fact there's a whole complicated lashup of shell scripting surrounding this in an attempt to do the right thing with various sorts of links. For the past little while, clicking on some of those links has often (although not always) been very slow to take effect; I'd click a link and it'd be several seconds before I got my new browser window. In the beginning I wrote this off as just Twitter being slow (which it sometimes is) and didn't think too much about it. Today this got irritating enough that I decided to investigate a bit, so I ran Dave Cheney's httpstat against twitter.com, expecting to see that all the delay was in either connecting to Twitter or in getting content back.

(To be honest, I expected that this was something to do with IPv6, as has happened before. My home IPv6 routing periodically breaks or malfunctions even when my IPv4 routing is fine.)

To my surprise, httpstat reported that it'd spent just over 5000 milliseconds in DNS lookup. So much for blaming anyone else; DNS lookup delays are pretty much all my fault, since I run a local caching resolver. I promptly started looking at my configuration and soon found the problem, which comes in two parts.

First, I had (and have) my /etc/resolv.conf configured with a non-zero ndots setting and several search (sub)domains. This is for good historical reasons, since it lets me do things like 'ssh apps0.cs' instead of having to always specify the long fully qualified domain. However, this means that every reasonably short website name, like twitter.com, was being checked to see if it was actually a university host like twitter.com.utoronto.ca. Of course it isn't, but that means that I was querying our DNS servers quite a lot, even for lookups that I conceptually thought of having nothing to do with the university.
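To make the mechanics concrete, here's a rough sketch of the order a resolver tries names in; the search domains and ndots value here are made up and real resolvers have more wrinkles, but the basic pattern is this:

# Not the real resolver algorithm, just an illustration of the idea.
search = ["cs.example.edu", "example.edu"]   # assumed 'search' list
ndots = 2                                    # assumed 'options ndots:' value

def candidates(name):
    # Names with fewer dots than ndots get the search list tried first,
    # before the name as given; otherwise the order is reversed.
    expanded = ["%s.%s" % (name, dom) for dom in search]
    if name.count(".") < ndots:
        return expanded + [name]
    return [name] + expanded

print(candidates("twitter.com"))
# ['twitter.com.cs.example.edu', 'twitter.com.example.edu', 'twitter.com']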

Second, my home Unbound setup is basically a copy of my work Unbound setup, and when I set it up (and copied it) I deliberately configured explicit Unbound stub zones for the university's top level domain that pointed to our nameservers. At work, the intent of this was to be able to resolve in-university hostnames even if our Internet link went down. At home, well, I was copying the work configuration because that was easy and what was the harm in short-cutting lookups this way?

In case you are ever tempted to do this, the answer is that you have to be careful to keep your list of stub zone nameservers up to date, and of course I hadn't. As long as my configuration didn't break spectacularly I didn't give it any thought, and it turned out that one of the IP addresses I had listed as a stub-addr server doesn't respond to me at all any more (and some of the others may not have been entirely happy with me). If Unbound decided to send a query for twitter.com.utoronto.ca to that IP, well, it was going to be waiting for a timeout. No wonder I periodically saw odd delays like this (and stalls when I was trying to pull from or check github.com, and so on).

(Twitter makes this much more likely by having an extremely short TTL on their A records, so they fell out of Unbound's cache on a regular basis and had to be re-queried.)

I don't know if short-cut stub zones for the university's forward and reverse DNS is still a sensible configuration for my office workstation's Unbound, but it definitely isn't for home usage. If the university's Internet link is down, well, I'm outside it at home; I'm not reaching any internal servers for either DNS lookups or connections. So I've wound up taking it out of my home configuration and looking utoronto.ca names up just like any other domain.

(This elaborates on a Tweet of mine.)

Sidebar: The situation gets more mysterious

It's possible that this is actually a symptom of more than me just setting up a questionable caching DNS configuration and then failing to maintain and update it. In the process of writing this entry I decided to take another look at various university DNS data, and it turns out that the non-responding IP address I had in my Unbound configuration is listed as an official NS record for various university subdomains (including some that should be well maintained). So it's possible that something in the university's DNS infrastructure has fallen over or become incorrect without having been noticed.

(I wouldn't say that my Unbound DNS configuration was 'right', at least at home, but it does mean that my configuration might have kept working smoothly if not for this broader issue.)

sysadmin/LocalDNSConfigurationFumble written at 02:17:43; Add Comment


ZFS's 'panic on on-disk corruption' behavior is a serious flaw

Here's a Twitter conversation from today:

@aderixon: For a final encore at 4pm today, I used a corrupted zpool to kill an entire Solaris database cluster, node by node. #sysadmin

@thatcks: Putting the 'fail' in 'failover'?

@aderixon: Panic-as-a-service. Srsly, "zpool import" probably shouldn't do that.

@thatcks: Sadly, that's one of the unattractive sides of ZFS. 'Robust recovery from high-level on-disk metadata errors' is not a feature.

@aderixon: Just discovering this from bug reports. There will be pressure to go back to VXVM now. :-(

Let me say this really loudly:

Panicking the system is not an error-recovery strategy.

That ZFS is all too willing to resort to system panics instead of having real error handling or recovery for high level metadata corruption is a significant blemish. Here we see a case where this behavior has had a real impact on a real user, and may cause people to give up on ZFS entirely. They are not necessarily wrong to do so, either, because they've clearly hit a situation where ZFS can seriously damage their availability.

In my jaundiced sysadmin view, OS panics are for temporary situations where the entire system is sufficiently corrupt or unrecoverable that there is no way out. When ZFS panics on things that are recoverable with more work, it's simply being lazy and arrogant. When the issue is with a single pool, ZFS panicking converts a single-pool issue into an entire-server issue, and servers may have multiple pools and all sorts of activities.

Panicking due to on-disk corruption is even worse, as it converts lack of error recovery into permanent unavailability (often for the entire system). A temporary situation at least might clear itself when you panic the system and reboot, as you can hope that a corrupted in-memory data structure will be rebuilt in non-corrupted form when the system comes back up. But a persistent condition like on-disk corruption will never go away just because you reboot the server, so there is very little hope that ZFS's panic has worked around the problem. At the best, it's still lurking there like a landmine waiting to blow your system up later. At the worst, in single server situations you can easily get the system locked into a reboot loop, where it boots and starts an import and panics again. In clustering or failover environments, you can wind up taking the entire cluster down (along with all of its services) as the pool with corruption successively poisons every server that tries to recover it.

Unfortunately none of this is likely to change any time soon, at least in the open source version of ZFS. ZFS has been like this from the start and no one appears to care enough to fund the significant amount of work that would be necessary to fix its error handling.

(It's possible that Oracle will wind up caring enough about this to do the work in a future Solaris version, but I'm dubious even of that. And if they do, it's not like we can use it.)

(I had my own experience with this sort of thing years ago; see this 2008 entry. As far as I can tell, very little has changed in how ZFS reacts to such problems since then.)

solaris/ZFSPanicOnCorruptionFlaw written at 02:08:26; Add Comment


Watch out for web server configurations that 'cross over' between sites

We have a long-standing departmental web server that dates back to the days when it wasn't obvious that the web was going to be a big thing. Naturally, one of the things that it has is old-style user home pages, in the classical old Apache UserDir style using /~<user>/ URLs. Some of these are plain HTML pages in directories, some reverse proxy to user run web servers, and some have suexec CGIs. The same physical server and Apache install also hosts a number of other virtual hosts, some for users and some for us, such as our support site.

Recently we noticed a configuration problem: UserDirs were active on all of the sites hosted by Apache, not just our main site. Well, they were partially active. On all of the other virtual hosts, you only got the bare files for a /~<user>/ URL; CGIs didn't run (instead you got the contents of the CGI file itself) and no reverse proxies were in effect. We had what I'll call a 'crossover' configuration setting, where something that was supposed to apply only to a single virtual host had leaked over into others.

Such crossover configuration leaks are unfortunately not that hard to wind up with in Apache, and I think I've managed to do this in Lighttpd as well. Generally this happens when you set up some configuration item without being careful to explicitly scope it; that's what Ubuntu's /etc/apache2/mods-available/userdir.conf configuration file did (and does), as it has the following settings:

<IfModule mod_userdir.c>
  UserDir public_html
  UserDir disabled root
</IfModule>

(This is actually a tough problem to solve without very finely split apart configuration files. Presumably Ubuntu wants 'a2enmod userdir' to automatically enable userdirs on at least your default site, not just turn the module on and require you to add explicit UserDir settings yourself.)
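For contrast, here's a sketch of an explicitly scoped version (the virtual host here is made up); putting the UserDir directives inside a specific <VirtualHost> keeps them from leaking into every other site the server hosts:

<VirtualHost *:80>
    ServerName www.example.org
    # Userdirs only for this site, not globally.
    <IfModule mod_userdir.c>
        UserDir public_html
        UserDir disabled root
    </IfModule>
</VirtualHost>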

In Lighttpd you can get this if you put any number of things outside a carefully set up host-specific stanza:

$HTTP["host"] == "..." {
   # everything had better be here
# oops:
alias.url += ( "/some" => "/fs/thing" )

And it's not like a web server necessarily wants to absolutely forbid global settings. For instance, in my Lighttpd setup I have the following general stanza:

# Everyone shares the same Acme/Let's Encrypt
# challenge area for convenience.
alias.url += ( "/.well-known/acme-challenge/" => "/var/run/acme/acme-challenge/" )

This is quite handy, because it means the tool I use needs no website-specific configuration; regardless of what website name it's trying to verify, it can just stick files in /var/run/acme/acme-challenge. And that makes it trivial to get a LE certificate for another name, which definitely encourages me to do so.

I do wish that web servers at least made it harder to do this sort of 'crossover' global setting by accident. Perhaps web servers should require you to explicitly label configuration settings with their scope, even if it's global. You might still do it, but at least it would be clearer that you're setting something that will affect all sites you serve.

(In the mean time, I guess I have another rainy day project. I have to admit that 'audit all global Apache configuration settings' is not too thrilling or compelling, so it may be quite some time before it gets done. If ever.)

web/ApacheSiteConfigurationCrossover written at 02:10:38; Add Comment


How I've set up SSH keys on my Yubikey 4 (so far)

There are a fair number of guides out on the Internet for how to set up a Yubikey that holds a SSH key for you, like this one. For me, the drawback of these is that they're all focused on doing this through GPG, a full set of PGP keys, and gpg-agent. I don't want any of that. I have no interest in PGP, I'd rather not switch away from ssh-agent to gpg-agent, and I definitely don't want to get a whole set of PGP keys that I have to manage and worry about. I just want the Yubikey to hold a SSH key or two.

Fortunately this turns out to be quite possible and not all that complicated. I wound up finding and using two main references for this, Wikimedia's Yubikey-SSH documentation and then Thomas Habets' Yubikey 4 for SSH with physical presence proof. Also useful is Yubico's official documentation on this. I'm doing things somewhat differently than all of these, and I'm going to go through why I'm making the choices I am.

(I've done all of this on Fedora 24 with a collection of packages installed. You need the Yubico tools and OpenSC; I believe both of those are widely available for at least various Linux flavours. FreeBSD appears to have the necessary Yubico PIV tool in their ports, presumably along with the software you need to talk to the actual Yubikey.)

The first step is to change the default Yubikey PIN, PUK, and management key. You won't be using the PUK and management key very much so you might as well randomly generate them, as Wikimedia advises, but you'll be using the PIN reasonably frequently so you should come up with an 8-character alphanumeric password that you can remember.

# In a shell, I did:
key=$(dd if=/dev/urandom bs=1 count=24 2>/dev/null | hexdump -v -e '/1 "%02X"')
puk=$(dd if=/dev/urandom bs=1 count=6 2>/dev/null | hexdump -v -e '/1 "%u"'|cut -c1-8)
pin=[come up with one]
# Record all three values somewhere

yubico-piv-tool -a set-mgm-key -n $key
yubico-piv-tool -a change-pin -P 123456 -N $pin
yubico-piv-tool -a change-puk -P 12345678 -N $puk

Changing the management key is probably not absolutely required, because I don't think an attacker can use knowledge of the management key to compromise things. Even increasing the retry counters requires more than just the management key. I may wind up resetting my Yubikey's management key back to the default value for convenience.

(We don't need to do anything with ykpersonalize, because current Yubikeys come from the factory with all their operating modes already turned on.)

Next we'll create two SSH keys, one ordinary one and one that requires you to touch the Yubikey 4's button to approve every use. Both will require an initial PIN entry in order to use them; you'll normally do this when you load them into your ssh-agent.

SSH keypair creation goes like this:

  • Tell the Yubikey to generate the special touch-always-required key.

    yubico-piv-tool -k $key -a generate --pin-policy=once --touch-policy=always -s 9a -o public.pem

    Note that the PIN and touch policies can only be set when the key is generated. If you get them wrong, you get to clear out the slot and start all over again. The default key type on the Yubikey 4 is 2048-bit RSA keys, and I decided that this is good enough for me for SSH purposes.

    A Yubikey 4 has four slots that we can use for SSH keys; these are the PIV standard slots 9a, 9c, 9d, and 9e. In theory the slots have special standardized meanings, but in practice we can mostly ignore that. I chose slot 9a here because that's what the Wikimedia example uses.

    (A Yubikey 4 also has a bunch of additional slots that we can set keys and certificates in, those being 82 through 95. However I've been completely unable to get the Fedora OpenSSH and PKCS#11 infrastructure to interact with them. It would be nice to be able to use these slots for SSH keys and leave the standard slots for their official purposes, but it's not possible right now.)

  • Use our new key to make a self-signed certificate. Because we told the Yubikey to require touch authentication when we use the key, you have to touch the Yubikey during the self-signing process to approve it.

    yubico-piv-tool -a verify-pin -P $pin -a selfsign-certificate -s 9a -S '/CN=touch SSH key/' --valid-days=1000 -i public.pem -o cert.pem

    I don't know if the Yubikey does anything special once the self-signed certificate expires, but I didn't feel like finding out any time soon. SSH keypair rollover is kind of a pain in the rear at the best of times.

    (We don't really need a self-signed certificate, since we only care about the keypair. But apparently making a certificate is required in order to make the public key fully usable for PKCS#11 and OpenSSH stuff.)

  • Load our now-generated self-signed certificate back into the Yubikey.

    yubico-piv-tool -k $key -a import-certificate -s 9a -i cert.pem

  • Finally, we need to get the SSH public key in its normal form and in the process verify that OpenSSH can talk to the Yubikey. The shared library path here is for 64-bit Fedora 24.

    ssh-keygen -D /usr/lib64/opensc-pkcs11.so -e

    This will spit out a 'ssh-rsa ...' line that is the public key in the usual format, suitable for adding to authorized_keys and so on.

    (Also, yes, configuring how to do PKCS#11 things by specifying a shared library is, well, it's something.)

The process for creating and setting up our more ordinary key is almost the same thing. We'll set a different touch policy and we'll extract the SSH public key from the public.pem file instead of using ssh-keygen -e, because ssh-keygen gives you no sign of which key is which once you have more than one key on the Yubikey. We'll use slot 9c for this second key. You could probably use any of the other three slots, but 9c is the slot I happened to have used to test all of this so I know it works.

yubico-piv-tool -k $key -a generate --pin-policy=once --touch-policy=never -s 9c -o public.pem
yubico-piv-tool -a verify-pin -P $pin -a selfsign-certificate -s 9c -S '/CN=basic SSH key/' --valid-days=1000 -i public.pem -o cert.pem
yubico-piv-tool -k $key -a import-certificate -s 9c -i cert.pem

ssh-keygen -i -m PKCS8 -f public.pem

Note that you absolutely do not want to omit the --pin-policy bit here. Otherwise you'll inherit the default PIN policy for this slot and things will wind up going terribly wrong when you try to use this key through ssh-agent.

The ssh-keygen invocation here came from this Stackexchange answer, which also has what you need to extract this information from a full certificate. This is a useful thing to know, because you can retrieve specific certificates from the Yubikey with eg 'yubico-piv-tool -a read-certificate -s SLOT', and you can see slot information with 'yubico-piv-tool -a status' (this includes the CN data we set up above, so it's useful to make it distinct).

With all of this set up, you can now add your Yubikey keys to ssh-agent with:

ssh-add -s /usr/lib64/opensc-pkcs11.so

You'll be prompted for your PIN. After it's accepted, you can use the basic Yubikey SSH key just as you would any other SSH key loaded into ssh-agent. The touch-required key is also used normally, except that you have to remember to touch the Yubikey while it's flashing to get your attention (fortunately the default timeout is rather long).

In an ideal world, everything would now be smooth sailing with ssh-agent. Unfortunately this is not an ideal world. The first problem is that you currently have to remove and re-add the PKCS#11 SSH agent stuff every time you remove and reinsert the Yubikey or purge your ssh-agent keys. More significantly, various other things can also break ssh-agent's connection to the Yubikey, forcing you to go through the same thing. One of these things is using yubico-piv-tool to do anything with the Yubikey, even getting its status. So if you do a SSH thing and it reports:

sign_and_send_pubkey: signing failed: agent refused operation

What this means is 'remove and re-add the PKCS#11 stuff again'. Some of the time, doing a SSH operation that requires your PIN such as:

ssh -I /usr/lib64/opensc-pkcs11.so <somewhere that needs it>

will reset things without the whole rigmarole.

You don't have to use the Yubikey keys through ssh-agent, of course; you can use them directly, either with 'ssh -I /usr/lib64/opensc-pkcs11.so' or by setting 'PKCS11Provider /usr/lib64/opensc-pkcs11.so' in your .ssh/config (perhaps only for specific hosts or domains). However, the drawback of this is that you'll be challenged for your Yubikey PIN every time you use a Yubikey-hosted SSH key (this happens regardless of the --pin-policy setting). Using an agent means that you're only challenged once every so often. Of course, in some circumstances being challenged for your PIN on every use may be a feature.

(I have a theory about what's going on and going wrong in OpenSC, but it's for another entry. Ssh-agent has its own bugs here too, and it's possible that using gpg-agent instead would make things significantly nicer here. I have no personal experience with using gpg-agent as a SSH agent and not much interest in experimenting right now.)

While I haven't tested more than two SSH keys, I believe you could fill up all four slots with four different SSH keys just in case you wanted to segment things that much. Note that in general there's no simple way to tell which specific SSH key you're being requested to authorize; all keys share the same PIN and if you have more than one key set to require touch, you can't tell which key you're doing touch-to-approve for.

(Also, as far as I know the PKCS#11 stuff will make all keys available whenever it's used, including for ssh-agent. You can control which keys will be offered to what hosts by using IdentitiesOnly, but that's a limit imposed purely by the SSH client itself. If you absolutely want to strongly control use of certain keys while others are a bit more casual, you probably need multiple Yubikeys.)

Sidebar: working with PKCS#11 keys and IdentitiesOnly

The ssh_config manpage is very specific here: if you set IdentitiesOnly, keys from ssh-agent and even keys that come from an explicit PKCS11Provider directive will be ignored unless you have an IdentityFile directive for them. Which normally you can't have, because the Yubikey won't give you the private key. Fortunately there is a way around this; you can use IdentityFile with only the public key file. This is a rare case where doing this makes perfect sense and is the only way to get what you want if you want to combine Yubikey-hosted keys with selective identity offering.
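A sketch of what such a .ssh/config stanza might look like (the host name and public key path here are made up; the PKCS#11 library path is the Fedora 24 one from earlier):

Host somehost.example.org
    IdentitiesOnly yes
    PKCS11Provider /usr/lib64/opensc-pkcs11.so
    # Only the public key half lives on disk, but that's enough for
    # ssh to know to offer the matching Yubikey-hosted key here.
    IdentityFile ~/.ssh/yubikey-basic.pub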

sysadmin/Yubikey4ForSSHKeys written at 01:58:15; Add Comment


I have yet to start using any smartphone two-factor authentication

Now that I have a smartphone, in theory I could start using two-factor authentication to improve my security. In practice I have yet to set up my phone for this for anything (although I did download an app for it). There turn out to be several reasons for this.

First, the whole area is fairly confusing and partly populated by people that I don't really trust (hi, Google). Perhaps I am looking in the wrong places, but when I went looking at least the first time around there was a paucity of documentation on what is actually going on in the whole process, how it worked, what to expect, and so on. What I could find was mostly glossy copy and 'then some magic happens'. I'm a sysadmin; I don't like magic.

(The confusing clutter of apps didn't help things either, although I suspect that people who know what they're doing here have an easier time cutting through the marketing copy everyone has.)

Then, well, it's early days with my smartphone and I'm nervous about really committing to it for something as crucial as authentication. Pretty much everything I've read on 2FA contains scary warnings about what happens if your phone evaporates; at the least it's a big hassle. Switching on 2FA this early feels alarmingly like jumping into the deep end. Certainly it doesn't seem like something to do casually or simply as an experiment.

(Probably there's a good way to play around with 2FA to just try it out, but I have no idea what it would be. Scratch accounts on various services? Right now I'd have to commit to 2FA on something just to find out how the apps look and work. I suspect that other people have a background clutter of less important accounts that they can use to experiment with stuff like this.)

Finally is the big, blunt issue for me: I just don't have very many accounts out there (especially on websites) that I both feel strongly about and that I'm willing to make harder to use by adding 2FA authentication. Most of my accounts are casual things, even on big-ticket sites like Facebook, and on potentially somewhat more important sites like Github I'm not very enthused about throwing roadblocks in the way of, say, pushing commits up to my public repos.

(Part of this is that I'm usually not logged in to places. And obviously things would be quite different if I worked with any important Github repos.)

All of this feels vaguely embarrassing, since after all I'm supposed to care about security and I now have this marvelous possibility for completely free two-factor authentication, yet I'm not taking advantage of it. But I've already established that I have limits on how much I care about security.

tech/TwoFactorPhoneDisuse written at 02:25:04; Add Comment
