Wandering Thoughts

2017-12-13

Our Apache file serving problem on our general purpose web server

One of the servers we run for our department is an old-fashioned general purpose web server that hosts things like people's home pages and the web pages for (some) research groups. In terms of content, we have a mix of static files, old-fashioned CGIs (run through suexec), and reverse proxies to user run web servers. One of the things people here do with this web server is use it to share research data files and datasets, generally through their personal home page because that's the easy way to go. Some of these files are pretty large.

When you share data, people download it; sometimes a lot of people, because sometimes computer scientists share hot research results. This is no problem from a bandwidth perspective; we (the department and the university) have lots of bandwidth (it's not like the old days) and we'd love to see it used. However, some number of the people asking for this data are on relatively slow connections, and some of these data files are large. When you combine these two, you get very slow downloads and thus client HTTP connections that stick around for quite a long time.

(Since 6am this morning, we've seen 27 requests that took more than an hour to complete, 265 that took more than ten minutes, and over 7,500 that took more than a minute.)

For historical reasons we're using the 'prefork' Apache MPM, and perhaps you now see the problem. Each low-bandwidth client that's downloading a big file occupies a whole worker process for what is a very long time (by web server standards). We feel we can only configure so many worker processes, mostly because each of them eats a certain amount of the machine's finite memory, and we've repeatedly had all our worker processes eaten up by these slow clients, locking out all other requests for other URLs for a while. The clients come and go, for reasons we're not certain of; perhaps someone is posting a link somewhere, or maybe a classroom of people are being directed to download some sample data or the like. It's honestly kind of mysterious to us.

(In theory we could also worry about how many worker processes we allow because each worker process could someday be a CGI that's running at the same time as other CGIs, and if we run too many CGIs at once the web server explodes. In practice we've already configured so many worker processes in an attempt to keep some request slots open during these 'slow clients, popular file' situations that our web server would likely explode if even half of the current worker processes were running CGIs at once.)

Right now we're resorting to using mod_qos to try to limit access to currently popular things, but this isn't ideal for several reasons. What we really want is a hybrid web serving model, where just pushing files out to clients is done with a lightweight, highly scalable method that's basically free, while Apache continues to handle CGIs in something like the traditional model. Ideally we could then turn down the 'CGI workers' count, since they'd no longer also have to be 'file workers'.

Changing web servers away from Apache isn't an option and neither is splitting the static files off to another server entirely. Based on my reading so far, trying to switch to the event MPM looks like our most promising option; in fact in theory the event MPM sounds very close to our ideal setup. I'm not certain how it interacts with CGIs, though; the Apache documentation suggests that we might need or want to switch to mod_cgid, and that's going to require testing (the documentation claims it's basically a drop-in replacement, but I'm not sure I trust that).

(Setting suitable configuration parameters for a thread-based MPM is going to be a new and somewhat exciting area for us, too. It seems likely that ThreadsPerChild is the important tuning knob, but I have no idea what the tradeoffs are. Perhaps we should take the default Ubuntu 16.04 settings for everything except MaxRequestWorkers and perhaps AsyncRequestWorkerFactor, which we might want to tune up if we expect lots of waiting connections.)
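To make this concrete, here is a rough sketch of what an event MPM configuration along these lines might look like. Every number below is an illustrative placeholder rather than a tested value; the real settings would have to be worked out against our actual memory and traffic.

# Hypothetical mpm_event settings; all numbers are placeholders.
<IfModule mpm_event_module>
    ServerLimit              16
    ThreadsPerChild          64
    MaxRequestWorkers        1024   # must be <= ServerLimit * ThreadsPerChild
    AsyncRequestWorkerFactor 4      # allow more waiting connections per process
    MaxConnectionsPerChild   0
</IfModule>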

web/ApacheFileServingOurProblem written at 23:42:29; Add Comment

2017-12-12

Some notes on systemd-resolved, the systemd DNS resolver

My office workstation's upgrade to Fedora 27 resulted in a little incident with NetworkManager, which I complained about on Twitter; the resulting Twitter conversation brought systemd-resolved to my attention. My initial views weren't all that positive (because I'm biased here; systemd's recent inventions have often not been good things) but I didn't fully understand its state on my systems, so I wound up doing some digging. I'm still not too enthused, but I've wound up less grumpy than I was before and I'm not going to be forcefully blocking systemd-resolved from running at all just yet.

Systemd-resolved is systemd's DNS resolver. It has three interfaces:

  • A DBus API that's exposed at /org/freedesktop/resolve1. I don't know how many things use this API (or at least try to use it).

  • A local caching DNS resolver at 127.0.0.53 (IPv4 only) that clients can query to specifically talk to systemd-resolved, even if you have another local caching DNS server at 127.0.0.1.

  • glibc's getaddrinfo() and friends, which would send all normal hostname lookups off to systemd-resolved. Importantly, this is sanely implemented as an NSS module. If you don't have resolve in your hosts: line in /etc/nsswitch.conf, systemd-resolved is not involved in normal hostname resolution.

All of my Fedora machines have systemd-resolved installed as part of systemd but none of them appear to have the NSS resolve module enabled, so none of them are using systemd-resolved as part of normal hostname resolution. They do appear to enable the DBus service (as far as I can sort out the chain of DBus stuff that leads to unit activation). The systemd-resolved daemon itself is not normally running, and there doesn't seem to be any systemd socket stuff that would activate it if you sent a DNS query to port 53 on 127.0.0.53, so on my Fedora machines it appears the only way it will ever start is if something makes an explicit DBus query.
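If you want to check the same things on your own machine, the quick inspections I'm talking about look something like this (ordinary commands; exact paths and output will vary by distribution):

# Is the 'resolve' NSS module in the hosts: line?
grep '^hosts:' /etc/nsswitch.conf

# Is systemd-resolved itself currently running?
systemctl is-active systemd-resolved.service

# Is there a DBus activation file that could start it on demand?
ls /usr/share/dbus-1/system-services/ | grep -i resolve1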

However, once activated resolved has some behaviors that I don't think I'm fond of (apart from the security bugs and the regular bugs). I'm especially not enthused about its default use of LLMNR, which will normally see it broadcasting certain DNS queries out on all of my active interfaces. I consider LLMNR somewhere between useless and an active risk of various things, depending on what sort of network I'm connected to.
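If you do end up letting resolved run but don't want LLMNR, my understanding is that it can be turned off in resolved.conf with something like this minimal sketch:

# /etc/systemd/resolved.conf
[Resolve]
LLMNR=no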

Resolved will also make queries to multiple DNS servers in parallel if you have more than one of them available through various paths, but here I think that's a reasonable approach to handling DNS resolution in the face of things like VPNs, which otherwise more or less requires awkward hand configuration. It's unfortunate that this behavior can harm people who know what they're doing and who want their local DNS resolver (or resolv.conf) to always override the DNS resolver settings they're getting from some random network's DHCP.

Since resolved doesn't actually shove itself in the way of anyone who didn't actively ask for it (via DBus or querying 127.0.0.53), I currently feel it's unobjectionable enough to leave unmasked and thus potentially activated via DBus. Assuming that I'm understanding and using journalctl correctly, it never seems to have been activated on either of my primary Fedora machines (and they have journalctl logs that go a long way back).

linux/SystemdResolvedNotes written at 19:47:22; Add Comment

2017-12-11

Some things about booting with UEFI that are different from MBR booting

If you don't dig into it, a PC that boots with UEFI seems basically the same as one that uses BIOS MBR booting, even if you have multiple OSes installed (for example, Linux and Windows 10). In either case, with Linux you boot into a GRUB boot menu with entries for Linux kernels and also Windows, and you can go on to boot either. However, under the hood this is an illusion and there are some important differences, as I learned in a recent UEFI adventure.

In BIOS MBR booting, there's a single bootloader per disk (loaded from the MBR). You only ever boot this bootloader; if it goes on to boot an entire alternate OS, it's often doing tricky magic to make that OS think it was booted from the MBR. If you call up the BIOS boot menu, what it offers you is a choice of which disk to load the MBR bootloader from. When you install a bootloader on a disk, for example when your Linux distribution's installer sets up GRUB, it overwrites any previous bootloader present; in order to keep booting other things, they have to be in the configuration for your new bootloader. Since there's only one bootloader on a disk, loss or corruption of this bootloader is fatal for booting from the disk, even if you have an alternate OS there.

In UEFI booting, there isn't a single bootloader per disk the way there is with MBR booting. Instead, the UEFI firmware itself may have multiple boot entries; if you installed multiple OSes, it almost certainly does (with one entry per OS). The UEFI boot manager tries these boot entries in whatever order it's been set to, passing control to the first one that successfully loads. This UEFI bootloader can then do whatever it wants to; in GRUB's case, it will normally display its boot menu and then go on to boot the default entry. If you call up the UEFI firmware boot menu, what you see is these UEFI boot entries, probably augmented with any additional disks that have an EFI system partition with an EFI/BOOT/BOOTX64.EFI file on them (this is the default UEFI bootloader name for 64-bit x86 systems). This may reveal UEFI boot entries that you didn't realize were (still) there, such as a UEFI Windows boot entry or a lingering Linux one.

(If you have multiple fixed disks with EFI system partitions, I believe that you can have UEFI boot entries that refer to different disks. So in a mirrored system disk setup, in theory you could have a UEFI boot entry for the EFI system partition on each system disk.)

The possibility of multiple UEFI boot entries means that your machine can boot an alternate OS that has a UEFI boot entry even if your normal primary (UEFI) bootloader is damaged, for example if it has a corrupted or missing configuration file. Under some situations your machine may transparently fall back to such an additional UEFI boot entry, which can be pretty puzzling if you're used to the normal BIOS MBR situation where either your normal bootloader comes up or the BIOS reports 'cannot boot from this disk'. It's also possible to have two UEFI boot entries for the same OS, one of which works and one of which doesn't (or, for a non-hypothetical example, one which only works when Secure Boot is off because it uses an unsigned UEFI bootloader).

A UEFI bootloader that wants to boot an alternate OS has more options than a BIOS MBR bootloader does. Often the simplest way is to use UEFI firmware services to load the UEFI bootloader for the other OS and transfer control to it. For instance, in GRUB:

chainloader /EFI/Microsoft/Boot/bootmgfw.efi

This is starting exactly the same Windows UEFI bootloader that my Windows UEFI boot entry uses. I'm not sure that Windows notices any difference between being booted directly from its UEFI boot entry and being chainloaded this way. However, such chainloading doesn't require that there still be a UEFI boot entry for the alternate OS, just that the UEFI bootloader .EFI file still be present and working. Similarly, you can have UEFI boot entries for alternate OSes that aren't present in your GRUB menu; the two systems are normally decoupled from each other.
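For context, a complete GRUB menu entry that chainloads Windows this way usually looks something like the following sketch; the filesystem UUID is a placeholder for whatever your EFI system partition's UUID actually is:

menuentry "Windows Boot Manager (chainloaded)" {
    insmod part_gpt
    insmod fat
    insmod chain
    search --no-floppy --fs-uuid --set=root XXXX-XXXX
    chainloader /EFI/Microsoft/Boot/bootmgfw.efi
}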

(You could have a UEFI bootloader that read all the UEFI boot entries and added menu entries for any additional ones, but I don't believe that GRUB does this. You could also have a grub.cfg menu builder that used efibootmgr to automatically discover such additional entries.)

A UEFI bootloader is not obliged to have a boot menu or support booting alternate OSes (or even alternate installs of its own OS), because in theory that's what additional UEFI boot entries are for. The Windows 10 UEFI bootloader normally boots straight into Windows, for example. Linux UEFI bootloaders will usually have an option for a boot menu, though, because in Linux you typically want to have more than one kernel as an option (if only so you can fall back to the previous kernel if a new one has problems).

(In theory you could probably implement multiple kernels as multiple UEFI boot entries, but this gets complicated, there's only so many of them (I believe five), and apparently UEFI firmware is often happier if you change its NVRAM variables as little as possible.)

Sidebar: UEFI multi-OS food fights

In the BIOS MBR world, installing multiple OSes could result in each new OS overwriting the MBR bootloader with its own bootloader, possibly locking you out of the other OSes. In the UEFI world there's no single bootloader any more, so you can't directly get this sort of food fight; each OS should normally only modify its own UEFI boot entry and not touch other ones (although if you run out of empty ones, who knows what will happen). However, UEFI does have the idea of a user-modifiable order for these boot entries, so an OS (new or existing) can decide that its UEFI boot entry should of course go at the front of that list, so it's the default thing that gets booted by the machine.

I suspect that newly installed OSes will almost always try to put themselves in as the first and thus default UEFI boot entry. Existing OSes may or may not do this routinely, but I wouldn't be surprised if they did it should you tell them to check for boot problems and repair anything they find. Probably this is a feature.

tech/UEFIBootThings written at 22:22:18; Add Comment

2017-12-10

Let's Encrypt and a TLS monoculture

Make no mistake, Let's Encrypt is great and I love them. I probably wouldn't currently have TLS certificates on my personal websites without them (since the free options have mostly dried up), and we've switched over to them at work, primarily because of the automation. However, there's something that I worry about from time to time with Let's Encrypt, and that's how their success may create something of a TLS monoculture.

In general it's clear that Let's Encrypt accounts for a large and steadily growing number of TLS certificates out there in the wild. Some recent reports I could find suggest that it may now be the largest single CA, at 37% of observed certificates (eg nettrack.info). Let's Encrypt's plans for 2018 call for doubling their active certificates and unique domains, and if this comes to pass their dominance is only going to grow. Some of this, as with us, will come from Let's Encrypt displacing certificates from other CAs on existing HTTPS sites, but probably LE hopes for a lot of it to come from more and more sites adopting HTTPS (with LE certificates).

This increasing concentration of TLS certificates from a single source has two obvious effects. The first effect is that it makes Let's Encrypt itself an increasingly crucial piece of the overall HTTPS infrastructure. If Let's Encrypt ever has problems, it will affect a whole lot of sites, and if it ever has security issues, it seems very likely that browsers will be even less prepared than usual to do much about it. That Let's Encrypt certificates only last for 90 days also seems likely to magnify any operational issues or scaling problems, since it increases the certificate issuance rate required to support any given number of active certificates.

(As far as security goes, fortunately increasingly mandatory certificate transparency makes it harder for an attacker to hide security exploits against a CA.)

Beyond security issues, though, this implies that any Let's Encrypt policies on who can or can't get TLS certificates (and under what circumstances) may have significant and disproportionate impact. Let's Encrypt is currently fairly unrestricted there, as far as I know, but this may not be under their control under all circumstances; for example, legal judgements might force them to restrict or block issuance of certificates to some groups, network areas, or countries.

The second effect is that HTTPS TLS certificate practices are likely to increasingly become dominated and defined by whatever Let's Encrypt does (and doesn't). When LE issues the majority of the active certificates in the world, your code and systems had better accept their practices and their certificates. If LE certificates include some field, you'd better be able to handle it; if they don't, you're not going to be able to require it. Of course, this gives Let's Encrypt political influence over TLS standards and operational practices, and this means that persuading Let's Encrypt about something is valuable and thus likely to be something people pursue. None of this is surprising; it's always been the case that a dominant vendor creates a de facto standard.

(The effects of Let's Encrypt on client TLS code are fortunately limited because there are plenty of extremely important HTTPS websites that are very unlikely to switch over to Let's Encrypt certificates. Google (including Youtube), Microsoft, Facebook, Twitter, Amazon, Apple, etc, are all major web destinations and all of them are likely to keep using non-LE certificates.)

web/LetsEncryptMonoculture written at 22:11:09; Add Comment

2017-12-09

You don't have to authorize a machine for Let's Encrypt from the machine

A commentator on yesterday's entry brought up the issue of authorizing internal-only machines, ones that are in DNS but that aren't otherwise reachable from the Internet. Although we haven't actually done this, in general it's possible to do Let's Encrypt's authorization for a particular machine on an entirely different machine, even without using the DNS-based authorization method. All you need is that HTTP requests from the Internet go somewhere where you can handle them in something you control.

If the internal host has a public IP, this is going to take a firewall with some redirection rules (and a suitable other host). But you probably have that already. If the internal host has a private IP address, you probably have 'split horizon' DNS so in your Internet-visible DNS you can assign it a public IP that goes to the suitable other host. As far as I know, most Let's Encrypt clients are perfectly happy in this situation; they don't try to check that the host you're running them on is the host <X> that you're requesting a certificate for.

(If you're unlucky enough to have private IP addresses in public DNS (which can happen for odd reasons), well, then you're out of luck for that host.)

This does leave you with the job of transporting the new TLS certificate to the internal host and handling any daemon notifications needed there, but there are lots of solutions for that. 'Propagate file to host <X> and do something if it's changed' is not hard to automate and generally there's a lot of already mature solutions for it (some of which you may already be using). Some Let's Encrypt clients let you run custom scripts on 'certificate updated' events, so you could use this to immediately push the new certificate to the target host.

In the specific case of acmetool, you have a lot of options if you're willing to do some scripting. Acmetool supports running scripts to handle both challenges and 'certificate updated' events. If you want to run acmetool on your internal host, you could have it push the HTTP challenge files to the bastion host that will expose them to Let's Encrypt; if you want to run it on the bastion host, you could have it propagate the new TLS certificates to the internal host either directly or indirectly (by storing them into some internal data store, which the internal system then pulls from).
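As an illustration of the 'push on update' approach, here is a rough sketch of what such an acmetool hook could look like. The hook path, the 'live-updated' event name, the /var/lib/acme/live location, and the internal host name are all things to check against your acmetool version and environment, not guaranteed specifics.

#!/bin/sh
# Hypothetical acmetool hook, e.g. /etc/acme/hooks/push-internal.
# Acmetool invokes hooks with the event name as the first argument.
set -e

INTERNAL="internal.example.com"    # placeholder internal host

if [ "$1" = "live-updated" ]; then
    # Copy the live certificates to the internal host and poke the
    # daemon there that uses them (whatever that daemon actually is).
    rsync -a /var/lib/acme/live/ "root@$INTERNAL:/var/lib/acme/live/"
    ssh "root@$INTERNAL" systemctl reload dovecot.service
fi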

Sidebar: Clever tricks with the ACME protocol

As I found out, Let's Encrypt's ACME protocol splits up authorizing machines from issuing certificates. This means that it's technically possible to authorize a host from one machine (say, your bastion machine or your DNS server) and then later obtain a certificate for that host from a second machine (say, the internal machine itself, provided it can talk to the Let's Encrypt servers). The two machines involved have to use a common Let's Encrypt account in order to share the authorization, but that's just a matter of having the same account information and private keys on both (although this has some security implications).

However, as far as I know clients don't generally support performing these steps separately, either doing only authorization and then stopping or doing certificate requests and aborting if Let's Encrypt tells you that it requires authorization. An ideal client for this would also track authorization and certificate timeouts separately, so your bastion host or DNS server could run something to make sure that all authorizations were current and then internal hosts would never wind up reporting 'need authorization' errors.

(You might also want to associate different authorizations with different Let's Encrypt accounts and keys, to limit your exposure if an internal host is compromised. With the bastion host, well, you're on your own unless you build something really complicated.)

sysadmin/LetsEncryptIndirectAuthorization written at 18:03:28; Add Comment

We've switched over to using Let's Encrypt as much as possible

Over the years, we've used a whole collection of different TLS CAs. We've preferred free ones where we could, for good reasons, which meant that we've used both ipsCA (until they exploded) and StartSSL (aka StartCom), but we've also paid for TLS certificates when we had to; modern TLS certificates are pretty affordable even for us if we don't go crazy. And these days we even have access to free TLS certificates through the university's central IT. However, we've now switched over to using Let's Encrypt as much as possible; basically it's the first CA we attempt to use, and if it doesn't work for some reason we'd probably turn to the free TLS certificates from central IT (both because they're free and because the process of getting one isn't too painful).

Our main reason for switching to Let's Encrypt isn't that it's free (it's not our only current source of free certificates); instead, as with my personal use, it's become all about the great automation. With Let's Encrypt, getting an initial certificate just requires running a command line program, and once we've worked out how to handle any particular program (since LE's good for more than web servers), we can completely stop worrying about certificate renewals. It just quietly happens and everything works and we don't notice a thing. The LE client that we've wound up using all of the time is Hugo Landau's acmetool, which is what I settled on myself. Acmetool has proven to be reliable and easy to tweak so it supports various programs like Dovecot and Exim.

(Our current approach to satisfying Let's Encrypt challenges is to let HTTP from the Internet through to any machine that needs a TLS certificate, whether or not it normally runs a web server. Acmetool will automatically run its own web server while a challenge is active, if necessary.)
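For the record, the day-to-day acmetool workflow that makes this so painless is roughly the following; the hostname is a placeholder, and you should check your acmetool version's documentation for the exact commands and flags:

# One-time interactive setup on a machine (account, challenge method, etc).
acmetool quickstart

# Ask for a certificate for a name; acmetool then keeps it renewed.
acmetool want www.example.com

# What a cron job or timer periodically runs to renew everything due.
acmetool --batch reconcile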

Using acmetool or any other suitable Let's Encrypt client is not the only way of automating TLS certificate updates, but it has the great advantage for us that it comes basically ready to go. In our environment there's almost nothing to build to support new TLS-using programs and almost nothing special to do to set acmetool up on any particular machine (and we have canned directions for the few steps required). People with existing modern automation infrastructure may already have this solved, and so may find Let's Encrypt less compelling than we do.

Almost two years ago I wrote about how we couldn't use Let's Encrypt for production due to rate limits. What's changed since then is that Let's Encrypt's current rate limits specifically exempt certificate renewals from their 'certificates per registered domain' limit. This means that if we can get an initial certificate for a host, we're basically sure to be able to renew it, which is the important thing for us. If the initial issuance fails, that's when we can turn to alternate CAs (but for the names we want it almost never does).

PS: Since automation is such a big motivation for us, what sold us is not Let's Encrypt by itself but acmetool. In a real sense, we're indifferent to what TLS certificate provider is behind acmetool, and if we could get free certificates from central IT just as easily (perhaps literally using LE's ACME protocol), we'd be happy to do just that. But at least for now, Let's Encrypt itself and ACME are conjoined together.

sysadmin/LetsEncryptSwitchover written at 00:27:29; Add Comment

2017-12-08

Some thoughts on what StartCom's shutdown means in general

I wrote a couple of weeks ago about StartCom giving up its Certificate Authority business, and then I was reminded of it more recently when they sent my StartSSL contact address an email message about it. Perhaps unsurprisingly, that email was grumpier than their public mozilla.dev.security.policy message; I believe it was similar to what they posted on their own website (I saved it, but I can't be bothered to look at it now). Partly as a result of this, I've been thinking about what StartCom's shutdown means about the current state of the CA world.

Once upon a time, owning a CA certificate was a license to print money unless you completely fumbled it. Given that StartCom was willing to completely give up on what was once a valuable asset, it seems clear that those days are over now, but I think they're over from two sides at once. On the income side, free certificates from Let's Encrypt and other sources seem to be taking an increasingly large chunk out of everyone else's business. There are still people who pay for basic TLS certificates, but it's increasingly hard to see why. Or at least the number of such people is going to keep shrinking.

(Well, one reason is if automatic provisioning is such a pain that you're willing to throw money at certificates that last a year or more. But sooner or later people and software are going to get over that.)

However, I think that's not the only issue. It seems very likely that it's increasingly costly to operate a CA in a way that browsers like, with sufficient security, business processes, adherence to various standards, and so on. It's clear that CAs used to be able to get away with a lot of sloppy behaviors and casual practices, because we've seen some of those surface in, for example, mis-issued test certificates for real domains. That doesn't fly any more, so running a CA requires more work and more costs, especially if something goes badly wrong and you have to pass a strong audit to get back into people's good graces.

(In StartCom's case, I suspect that one reason their CA certificate became effectively worthless is that getting it re-accepted by Chrome and Mozilla would have required about as much work as starting from scratch with a new certificate and business. Starting from scratch might even be easier, since you wouldn't be tainted by StartCom's past. Thus I suspect StartCom couldn't find any buyers for their CA business and certificates.)

Both of these factors seem very likely to get worse. Free TLS certificates will only pick up momentum from here (Let's Encrypt is going to offer wildcard certificates soon, for example), and browsers are cranking up the restrictions on CAs. Chrome is especially moving forward, with future requirements such as Certificate Transparency for all TLS certificates.

(It seems likely that part of the expense of running a modern commercial CA is having people on staff who can participate usefully in places like the CA/Browser forum, because as a CA you clearly have to care about what gets decided in those places.)

web/StartComShutdownThoughts written at 00:08:40; Add Comment

2017-12-06

My upgrade to Fedora 27, Secure Boot, and a mistake made somewhere

I'm usually slow about updating to new versions of Fedora; I like to let other people find the problems and then it's generally a hassle in various ways, so I keep putting it off. This week I decided that I'd been sitting on the Fedora 27 upgrade for long enough (or too long), and today it was the turn of my work laptop. It didn't entirely go well, but after the dust settled I think it's due to an innocent looking mistake I made and my specific laptop configuration.

This is a new laptop, a Dell XPS 13, and this is the first Fedora upgrade I've done on it (I installed Fedora 26 when we got it in mid-August). As I usually do, I did the Fedora 26 to 27 upgrade with the officially unsupported method of a live upgrade with dnf based on the traditional documentation for it, which I've been doing on multiple machines for many years. After I finished the upgrade process, I rebooted and the laptop failed to come up in Linux; instead it booted into the Windows 10 installation that I have on the other half of its drive. My Linux install (now with Fedora 27) was intact, but it wouldn't boot at all.

I will start with the summary. If your system boots using UEFI, you almost certainly shouldn't ever run grub2-install. Some portions of the Fedora wiki (like the Fedora page on Grub2) will tell you this pretty loudly, but the 'upgrade with package manager' page still says to use grub2-install without any qualifications, and that's what I did during my Fedora 27 upgrade.

What caused my issue is that I have Secure Boot enabled on my laptop, and at some point during the upgrade my Fedora UEFI boot entry wound up pointing to the EFI image EFI/fedora/grubx64.efi, which isn't correctly signed and so won't boot under Secure Boot. The XPS UEFI firmware doesn't report any error message when this happens; instead it silently goes on to the next UEFI boot entry (if there is one), which in my case was Windows' standard entry. In order to boot my laptop with Secure Boot enabled, the UEFI boot entry for Fedora 27 needs to point to EFI/fedora/shimx64.efi instead of grubx64.efi. This shim loader is signed and passes the UEFI firmware's Secure Boot verification, and once it starts it hands things off to grubx64.efi for regular GRUB2 UEFI booting.

(If I disabled Secure Boot, I could use the grubx64.efi UEFI boot entry. Otherwise, only the shimx64.efi entry worked.)

At this point I don't know what my Fedora 26 UEFI boot entry looked like, but I suspect that it pointed to the Fedora 26 version of the shim (which appears to be called EFI/fedora/shim.efi). My best guess for what happened during my Fedora 27 upgrade is that when I did the grub2-install at the end, one of the things it did was run efibootmgr and reset where the 'fedora' UEFI boot entry pointed. I don't remember seeing any message reporting this, but I didn't run grub2-install with any flag to make it verbose and the code to run efibootmgr appears to be in the Grub2 source.

(And changing the UEFI boot entry is sort of reasonable. After all, I told Grub2 to install itself, and that logically includes making the UEFI boot entry point to it, just as grub2-install on a non-UEFI system will update the MBR boot record to point to itself.)

PS: I consider all of this a valuable learning experience, since I got to shoot myself in the foot and learn a bunch of things about UEFI on a machine I could live without. I'm planning to set up my future desktops as pure UEFI machines, and making this mistake on one of them would have been much more painful. For that matter, simply knowing how to set up UEFI boot entries is going to come in handy when I migrate my current disks over to the new machines.

(I'm up in the air about whether or not I'll use Secure Boot on the desktops. If they come that way, well, maybe.)

Sidebar: How I fixed this

In theory you can boot a Fedora 27 live image from a USB stick and fiddle around with efibootmgr. In practice, I went into the laptop's UEFI 'BIOS' interface and told it to add another UEFI boot entry, because this had a reasonably simple and obvious interface. The resulting entry is a bit different from what I think efibootmgr would make, but it works (as well it should, since it was set up by the very thing that's interpreting it).
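For completeness, the efibootmgr version of this fix would be roughly the following; the disk, partition number, and label are placeholders for your actual EFI system partition and preferences:

# List the current UEFI boot entries and what they point to.
efibootmgr -v

# Create a 'Fedora' entry that points at the signed shim loader.
efibootmgr -c -d /dev/nvme0n1 -p 1 -L Fedora -l '\EFI\fedora\shimx64.efi'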

(In the course of this experience I was not pleased to discover that the Dell XPS 13's UEFI interface will let you delete UEFI boot entries with immediate effect and no confirmation or saving needed. Click the wrong button at the wrong time, and your entry is irretrievably gone on the spot.)

linux/Fedora27SecureBootMistake written at 23:59:38; Add Comment

In practice, Go's slices are two different data structures in one

As I've seen them in Go code and used them myself, Go's slices are generally used in two pretty distinctly separate situations. As a result, I believe that many people have two different conceptual models of slices and their behavior, depending on which situation they're using slices in.

The first model and use of slices is as views into a concrete array (or string) that you already have in your code. You're taking an efficient reference to some portion of the array and saying 'here, deal with this chunk' to some piece of code. This is the use of slices that is initially presented in A Tour of Go here and that is implicitly used in, for example, io.Reader and io.Writer, both of which are given a reference to an underlying concrete byte array.

The second model and use of slices is as dynamically resizable arrays. This is the usage where, for example, you start with 'var accum []string', and then add things to it with 'accum = append(accum, ...)'. In general, any code using append() is using slices this way, as is code that uses explicit slice literals ('[]string{a, b, ..}'). Dynamically resizable arrays are a very convenient thing, so this sort of slice shows up in lots of Go code.

(Part of this is that Go's type system strongly encourages you to use slices instead of arrays, especially in arguments and return values.)

Slices as dynamically resizable arrays actually have an anonymous backing store behind them, but you don't normally think about it; it's materialized, managed, and deallocated for you by the runtime and you can't get a direct reference to it. As we've seen, it's easy to not remember that the second usage of slices is actually a funny, GC-driven special case of the first sort of use. This can lead to leaking memory or corrupting other slices.

(It's not quite fair to call the anonymous backing array an implementation detail, because Go explicitly documents it in the language specification. But I think people are often going to wind up working that way, with the slice as the real thing they deal with and the backing array just an implementation detail. This is especially tempting since it works almost all of the time.)

This distinct split in usage and conceptual model (and the glitches that result at the edges of it) is why I've wound up feeling that in practice, Go's slices are two different data structures in one. The two concepts may be implemented with the same language features and runtime mechanisms, but people treat them differently and have different expectations and beliefs about them.

programming/GoSlicesTwoViews written at 00:26:37; Add Comment

2017-12-04

Some notes on using Go to check and verify SSH host keys

For reasons beyond the scope of this entry, I recently wrote a Go program to verify the SSH host keys of remote machines, using the golang.org/x/crypto/ssh package. In the process of doing this, I found a number of things in the package's documentation to be unclear or worth noting, so here are some notes about it.

In general, you check the server's host key by setting your own HostKeyCallback function in your ClientConfig structure. If you only want to verify a single host key, you can use FixedHostKey(), but if you want to check the server key against a number of them, you'll need to roll your own callback function. This includes the case where you have both an RSA key and an ed25519 key for the remote server and you don't necessarily know which one you'll wind up verifying against.

(You can and should set your preferred order of key types in HostKeyAlgorithms in your ClientConfig, but you may or may not wish to accept multiple key types if you have them. There are potential security considerations because of how SSH host key verification works, and unless you go well out of your way you'll only verify one server host key.)

Although it's not documented that I can see, the way you compare two host keys to see if they're the same is to .Marshal() them to bytes and then compare the bytes. This is what the code for FixedHostKey() does, so I consider it official:

type fixedHostKey struct {
  key PublicKey
}

func (f *fixedHostKey) check(hostname string, remote net.Addr, key PublicKey) error {
  if f.key == nil {
    return fmt.Errorf("ssh: required host key was nil")
  }
  if !bytes.Equal(key.Marshal(), f.key.Marshal()) {
    return fmt.Errorf("ssh: host key mismatch"
  }
  return nil
}

In a pleasing display of sanity, your HostKeyCallback function is only called after the crypto/ssh package has verified that the server can authenticate itself with the asserted host key (ie, that the server knows the corresponding private key).

Unsurprisingly but a bit unfortunately, crypto/ssh does not separate out the process of using the SSH transport protocol to authenticate the server's host keys and create the encrypted connection from then trying to use that encrypted connection to authenticate as a particular user. This generally means that when you call ssh.NewClientConn() or ssh.Dial(), it's going to fail even if the server's host key is valid. As a result, you need your HostKeyCallback function to save the status of host key verification somewhere where you can recover it afterward, so you can distinguish between the two errors of 'server had a bad host key' and 'server did not let us authenticate with the "none" authentication method'.

(However, you may run into a server that does let you authenticate and so your call will actually succeed. In that case, remember to call .Close() on the SSH Conn that you wind up with in order to shut things down neatly; otherwise you'll have some resource leaks in your Go code.)

Also, note that it's possible for your SSH connection to the server to fail before it gets to host key authentication and thus to never have your HostKeyCallback function get called. For example, the server might not offer any key types that you've put in your HostKeyAlgorithms. As a result, you probably want your HostKeyCallback function to have to affirmatively set something to signal 'server's keys passed verification', instead of having it set a 'server's keys failed verification' flag.

(I almost made this mistake in my own code, which is why I'm bothering to mention it.)
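Putting those last two points together, the sort of callback I mean might look like the following sketch; the hostChecker type and its fields are my own invented illustration, not anything from the package, and it assumes the usual bytes, fmt, net, and crypto/ssh imports:

// hostChecker verifies the server key against a set of acceptable keys
// and affirmatively records success, separately from whatever error the
// overall connection attempt ends up returning.
type hostChecker struct {
  keys     []ssh.PublicKey // acceptable host keys for this server
  verified bool            // set only if verification actually passed
}

func (h *hostChecker) check(hostname string, remote net.Addr, key ssh.PublicKey) error {
  wire := key.Marshal()
  for _, k := range h.keys {
    if bytes.Equal(wire, k.Marshal()) {
      h.verified = true
      return nil
    }
  }
  return fmt.Errorf("host key mismatch for %s", hostname)
}

You then set HostKeyCallback to the checker's check method (along with your HostKeyAlgorithms preference) in the ClientConfig and look at the verified field after the connection attempt fails.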

As a cautious sysadmin, it's my view that you shouldn't use ssh.Dial() but should instead net.Dial() the net.Conn yourself and then use ssh.NewClientConn(). The problem with relying on ssh.Dial() is that you can't set any sort of timeout for the SSH authentication process; all you have control over is the timeout of the initial TCP connection. You probably don't want your check of SSH host keys to hang if the remote server's SSH daemon is having a bad day, which does happen from time to time. To avoid this, you need to call .SetDeadline() with an appropriate timeout value on the net.Conn after it's connected but before you let the crypto/ssh code take it over.
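In outline that looks something like this sketch; the timeout values are arbitrary placeholders, and it assumes the hostChecker callback from above plus the usual net, time, and crypto/ssh imports:

// checkHost dials addr ('host:22'), runs the SSH handshake, and relies on
// the HostKeyCallback in config to record whether verification passed.
func checkHost(addr string, config *ssh.ClientConfig) error {
  conn, err := net.DialTimeout("tcp", addr, 10*time.Second)
  if err != nil {
    return err
  }
  // The deadline covers the whole SSH handshake, not just the TCP dial.
  conn.SetDeadline(time.Now().Add(30 * time.Second))

  sconn, chans, reqs, err := ssh.NewClientConn(conn, addr, config)
  if err != nil {
    conn.Close()
    // If our callback recorded success, this error is probably just the
    // 'none' authentication being rejected, which is what we expect.
    return err
  }
  // The rare case: 'none' authentication actually worked.
  _, _ = chans, reqs
  return sconn.Close()
}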

The crypto/ssh package has a convenient function for iteratively parsing a known_hosts file, ssh.ParseKnownHosts(). Unfortunately this function is not suitable for production use by itself, because it completely gives up the moment it encounters even a single significant error in your known_hosts file. This is not how OpenSSH ssh behaves, for example; by and large ssh will parse all valid lines and ignore lines with errors. If you want to duplicate this behavior, you'll need to split your known_hosts file up into lines with bytes.Split(), then feed each non-blank, non-comment line to ParseKnownHosts (if you get an io.EOF error here, it means 'this line isn't formatted like a SSH known hosts line'). You'll want to think about what you do about errors; I accumulate them all, report up to the first N of them, and then only abort if there's been too many.

(In our case we want to keep going if it looks like we've only made a mistake in a line or two, but if it looks like things are badly wrong we're better off giving up entirely.)
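A rough version of that parsing loop, with the error accumulation trimmed out, might look like this (again assuming bytes and crypto/ssh imports):

// parseKnownHosts pulls all parseable public keys out of known_hosts
// data, skipping blank lines, comments, and malformed lines much the way
// OpenSSH does. A sketch; real code should count and report the errors.
func parseKnownHosts(data []byte) []ssh.PublicKey {
  var keys []ssh.PublicKey
  for _, line := range bytes.Split(data, []byte("\n")) {
    line = bytes.TrimSpace(line)
    if len(line) == 0 || line[0] == '#' {
      continue
    }
    // An io.EOF error here means 'not formatted like a known_hosts line'.
    _, _, key, _, _, err := ssh.ParseKnownHosts(line)
    if err != nil {
      continue
    }
    keys = append(keys, key)
  }
  return keys
}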

Sidebar: Collecting SSH server host keys

If all you want to do is collect SSH server host keys for hosts, you need a relatively straightforward variation of this process. You'll repeatedly connect to the server with a different single key type in HostKeyAlgorithms each time, and your HostKeyCallback function will save the host key it gets called with. If I was doing this, I'd save the host key in its []byte marshalled form, but that's probably overkill.

programming/GoSSHHostKeyCheckingNotes written at 23:41:41; Add Comment
