Wandering Thoughts archives

2017-03-31

I quite like the simplification of having OpenSSH canonicalize hostnames

Some time ago I wrote up some notes on OpenSSH's optional hostname canonicalization. At the time I had just cautiously switched my OpenSSH setup on my workstation over to canonicalizing my hostnames, and I half expected it to go wrong somehow. It's been just over a year since then, and not only has nothing blown up, but I now actively prefer having OpenSSH canonicalize the hostnames I use. In fact, just today I switched my OpenSSH setup on our login servers over to doing this as well and cleared out my ~/.ssh/known_hosts file to start over from scratch.

That latter bit is a big part of why I've come to like hostname canonicalization. We have a long history of using multiple forms of shortened hostnames for convenience, plus sometimes we wind up using a host's fully qualified name. When I didn't canonicalize names, my known_hosts file wound up increasingly cluttered with multiple entries for what was actually the same host, some of them with the IP address and some without. After canonicalization, all of this goes away; every host has one entry and that's it. Since we already maintain a system-wide set of SSH known hosts (partly for our custom NFS mount authentication system), my own known_hosts file now doesn't even accumulate very many entries.

(I should probably install our global SSH known hosts file even on my workstation, which is deliberately independent from our overall infrastructure; this would let me drastically reduce my known_hosts file there too.)

The other significant reason to like hostname canonicalization is the one I mentioned in my original entry: it allows me to use much simpler Host matching rules in my ~/.ssh/config while only offering my SSH keys to hosts that should actually accept them (instead of to everyone, which can have various consequences). This has become especially relevant lately, as some of our recently deployed hosts appear to have reduced the number of authentication attempts they'll accept (and each keypair you offer counts as one attempt). And in general I just like having my SSH client configuration say what I actually want, instead of having to flail around with 'Host *' matches and so on because there was no simple way to say 'all of our hosts'. With canonical hostnames, now there is.
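
As a concrete sketch of what the result looks like in ~/.ssh/config (the domain and identity file names here are made up, not our real ones):

CanonicalizeHostname yes
CanonicalDomains cs.example.edu example.edu
CanonicalizeMaxDots 0

Host *.cs.example.edu
  IdentityFile ~/.ssh/identities/ourhosts
  IdentitiesOnly yes

With this, a plain 'ssh somehost' gets canonicalized to somehost.cs.example.edu (or somehost.example.edu, if that's what resolves) and then picks up its identity settings from the single Host stanza, instead of needing a pile of broader matches.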

As far as DNS reliability for resolving CNAMEs goes, we haven't had any DNS problems in the past year (or if we have, I failed to notice them amidst greater problems). We might someday, but in general DNS issues are going to cause me problems no matter what, since my ssh has to look up at least the IP addresses of hosts in DNS. If it happens I'll do something about it, but in the meantime I've stopped worrying about the possibility.

SSHCanonHostnamesWin written at 22:07:59

2017-03-29

The work of safely raising our local /etc/group line length limit

My department has now been running our Unix computing environment for a very long time (which has some interesting consequences). When you run a Unix environment over the long term, old historical practices slowly build up and get carried forward from generation to generation of the overall system, because you've probably never restarted everything from complete scratch. All of this is an elaborate route to say that as part of our local password propagation infrastructure, we have a program that checks /etc/passwd and /etc/group to make sure they look good, and this program puts a 512-byte limit on the size of lines in /etc/group. If it finds a group line longer than that, it complains and aborts and you get to fix it.

(Don't ask what our workaround is for groups with large memberships. I'll just say that it raises some philosophical and practical questions about what group membership means.)
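
For illustration, the core of such a check is just a line length test. A one-line sketch of its shape (not our actual checker, which does considerably more) is:

awk -F: 'length($0) > 512 { print "overlong group line: " $1; bad = 1 } END { exit bad }' /etc/group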

We would like to remove this limit; it makes our life more complicated in a number of ways, causes problems periodically, and we're pretty sure that it's no longer needed and probably hasn't been needed for years. So we should just take that bit of code out, or at least change the '> 512' to '> 4096', right?

Not so fast, please. We're pretty sure that doing so is harmless, but we're not certain. And we would like not to blow up some part of our local environment by mistake if it turns out that there actually is still something around here that has heartburn over long /etc/group lines. So in order to remove the limit we need to test to make sure everything still works, and one of the things this has meant is sitting down and trying to think of all of the places in our environment where something could go wrong with a long group line. It turns out that there are a number of these places:

  • Linux could fail to properly recognize group membership for people in long groups. I rated this as unlikely, since the glibc people are good at avoiding small limits and relatively long group lines are an obvious thing to think about.

  • OmniOS on our fileservers could fail to recognize group membership. Probably unlikely too; the days when people put 512-byte buffers or the like into getgrent() and friends are likely to be long over by now.

    (Hopefully those days were long over by, say, 2000.)

  • Our Samba server might do something special with group handling and so fail to properly deal with a long group, causing it to think that someone wasn't a member or to deny them access to a group-protected file.

  • The tools we use to build an Apache format group file from our /etc/group could blow up on long lines. I thought that this was unlikely too; awk and sed and so on generally don't have line length limitations these days.

    (They did in the past, in one form or another, which is probably part of why we had this /etc/group line length check in the first place.)

  • Apache's own group authorization checking could fail on long lines, either completely or just for logins at the end of the line.

  • Even if they handled regular group membership fine, perhaps our OmniOS fileservers would have a problem with NFS permission checks if you were in more than 16 groups and one of your extra groups was a long group, because this case causes the NFS server to do some additional group handling. I thought this was unlikely, since the code should be using standard OmniOS C library routines and by that point I would already have verified that those worked, but given how important NFS permissions are for our users I felt I had to be sure.

(I was already confident that our local tools that dealt with /etc/group would have no problems; for the most part they're written in Python and so don't have any particular line length or field count limitations.)

It's probably worth explicitly testing Linux tools like useradd and groupadd to make sure that they have no problems manipulating group membership in the presence of long /etc/group lines. I can't imagine them failing (just as I didn't expect the C library to have any problems), but that just means it would be really embarrassing if they turned out to have some issue and I hadn't checked.
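
Most of this testing is mechanical. A rough sketch of the sort of thing involved, run on a scratch machine with made-up group and user names (and assuming test accounts user1 through user300 already exist), looks like:

# build a deliberately long group line and see how long it got
groupadd longgrp
for i in $(seq 1 300); do gpasswd -a "user$i" longgrp; done
awk -F: '$1 == "longgrp" { print length($0) }' /etc/group

# is membership still recognized?
id user1 | grep -q longgrp && echo "id sees the group"
su - user1 -c 'touch /tmp/lgtest && chgrp longgrp /tmp/lgtest && ls -l /tmp/lgtest'

The Samba, Apache, and NFS checks are the same idea: put a test login at the end of a long group line and then see whether that login can still get at something protected by the group.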

All of this goes to show that getting rid of bits of the past can be much more work and hassle than you'd like. And it's not particularly interesting work, either; it's all dotting i's and crossing t's just in case, testing things that you fully expect to just work (and that have just worked so far). But we've got to do this sometime, or we'll spend another decade with /etc/group lines limited to 512 bytes or less.

(System administration life is often not particularly exciting.)

GroupSizeIncreaseWorries written at 01:57:29

2017-03-26

Your exposure from retaining Let's Encrypt account keys

In a comment on my entry on how I think you have lots of Let's Encrypt accounts, Aristotle Pagaltzis asked a good question:

Taking this logic to its logical conclusion: as long as you can arrange to prove your control of a domain under some ACME challenge at any time, should you not immediately delete an account after obtaining a certificate through it?

(Granted – in practice, there is the small matter that deleting accounts appears unimplemented, as per your other entry…)

Let's take the last bit first: for security purposes, it's sufficient to destroy your account's private key. This leaves dangling registration data on Let's Encrypt's servers, but that's not your problem; with your private key destroyed, no one can use your authorized account to get any further certificates.

(If they can, either you or the entire world of cryptography have much bigger problems.)
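
Mechanically, destroying an account key is a small job. As a sketch, if you use the official certbot client (whose account layout I'm assuming here; other clients keep their keys elsewhere), it's something like:

# certbot-style layout; list the account directories first, then
# destroy the private key of the account you no longer want
ls -d /etc/letsencrypt/accounts/*/directory/*
shred -u /etc/letsencrypt/accounts/*/directory/*/private_key.json

(The glob in the shred line wipes every account key on the machine, which may be more than you actually want; narrow it down to the specific account directory first.)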

For the broader issue: yes, in theory it's somewhat more secure to immediately destroy your private key the moment you have successfully obtained a certificate. However, there is a limit to how much security you get this way, because someone with unrestricted access to your machine can get their own authorization for it with an account of their own. If I have root access to your machine and you normally run a Let's Encrypt authorization process from it, I can just use my own client to do the same thing and get my own authorized account. I can then take that account's private key off the machine and later use it to get my own certificates for your machine.

(I can also reuse an account I already have and merely pass the authorization check, but in practice I might as well get a new account to go with it.)

The real exposure for existing authorized accounts is when it's easier to get at the account's private key than it is to get unrestricted access to the machine itself. If you keep the key on the machine and only accessible to root, well, I won't say you have no additional exposure at all, but in practice your exposure is probably fairly low; there are a lot of reasonably sensitive secrets that are protected this way and we don't consider it a problem (machine SSH host keys, for example). So in my opinion your real exposure starts going up when you transport the account key off the machine, for example to reuse the same account on multiple machines or over machine reinstalls.

As a compromise you might want to destroy account keys every so often, say once a year or every six months. This limits your long-term exposure to quietly compromised keys while not filling up Let's Encrypt's database with too many accounts.

As a corollary to this and the available Let's Encrypt challenge methods, someone who has compromised your DNS infrastructure can obtain their own Let's Encrypt authorizations (for any account) for any arbitrary host in your domain. If they issue a certificate for it immediately you can detect this through certificate transparency monitoring, but if they sit on their authorization for a while I don't think you can tell. As far as I know, LE provides no way to report on accounts that are authorized for things in your domain (or any domain), so you can't monitor this in advance of certificates being issued.

For some organizations, compromising your DNS infrastructure is about as difficult as getting general root access (this is roughly the case for us). However, for people who use outside DNS providers, such a compromise may only require gaining access to one of your authorized API keys for their services. And if you have some system that allows people to add arbitrary TXT records to your DNS with relatively little access control, congratulations, you now have a pretty big exposure there.

LetsEncryptAccountExposure written at 01:22:08

2017-03-22

Setting the root login's 'full name' to identify the machine that sent email

Yesterday I wrote about making sure you can identify what machine sent you a status email, and in the comments Sotiris Tsimbonis shared a brilliant yet simple solution to this problem:

We change the gecos info for this purpose.

chfn -f "$HOSTNAME root" root

Take it from me; this is beautiful genius (so much so that both we and another group here immediately adopted it). It's so simple yet still extremely effective, because almost everything that sends email does so using programs like mail that will fill out the From: header using the login's GECOS full name from /etc/passwd. You get email that looks like:

From: root@<your-domain> (hebikera root)

This does exactly what we want by immediately showing the machine that the email is from. In fact many mail clients these days will show you only the 'real name' from the From: header by default, not the actual email address (I'm old-fashioned, so I see the traditional full From: header).

This likely works with any mail-sending program that doesn't require completely filled out email headers. It definitely works in the Postfix sendmail cover program for 'sendmail -t' (as well as the CentOS 6 and 7 mailx, which supplies the standard mail command).

(As an obvious corollary, you can also use this trick for any other machine-specific accounts that send email; just give them an appropriate GECOS 'full name' as well.)

There are two perhaps obvious cautions here. First, if you ever rename machines, you have to remember to re-chfn the root login (and any other such logins) so that they have the correct hostname in them. It's probably worth creating an officially documented procedure for renaming machines, since there are other things you'll want to update as well (you might even script it). Second, if you have some sort of password synchronization system, you need it to leave root's GECOS full name alone (although it can still update root's password). Fortunately ours already does this.
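
If you do script the rename, the chfn step itself is tiny; a sketch that's also suitable for running at install time is simply:

# set root's GECOS 'full name' from the machine's current short hostname
chfn -f "$(hostname -s) root" root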

IdentifyMachineEmailByRootName written at 23:49:10

Making sure you can identify what machine sent you a status email

I wrote before about making sure that system email works, so that machines can do important things like tell you that their RAID array has lost redundancy and you should do something about that. In a comment on that entry, -dsr- brought up an important point, which is that you want to be able to easily tell which machine sent you email.

In an ideal world, everything on every machine that sends out email reports would put the machine's hostname in, say, the Subject: header. This would give you reports like:

Subject: SMART error (FailedOpenDevice) detected on host: urd

In the real world you also get helpful emails like this:

Subject: Health

Device: /dev/sdn [SAT], FAILED SMART self-check. BACK UP DATA NOW!

The only way for us to tell which machine this came from was to look at either the Received: headers or the Message-ID, which is annoying.

There are at least two ways to achieve this. The first approach is what -dsr- said in the comment, which is to make every machine send its email to a unique alias on your system. This unfortunately has at least two limitations. The first is that it somewhat clashes with a true 'null client' setup, where your machines dump absolutely all of their email on the server. A straightforward null client does no local rewriting of email at all, so to get this you need a smarter local mailer (and then you may need per-machine setup, hopefully automated). The second limitation is that there's no guarantee that all of the machine's email will be sent to root (and thus be subject to simple rewriting). It's at least likely, but machines have been known to send status email to all sorts of addresses.

(I'm going to assume that you can arrange for the unique destination alias to be visible in the To: header.)

You can somewhat get around this by doing some of the rewriting on your central mail handler machine (assuming that you can tell the machine email apart from regular user email, which you probably want to do anyways). This needs a relatively sophisticated configuration, but it probably can be done in something like Exim (which has quite powerful rewrite rules).

However, if you're going to do this sort of magic in your central mail handler machine, you might as well do somewhat different magic and alter the Subject: header of such email to include the hostname. For instance, you might just add a general rule to your mailer so that all email from root that's going to root will have its Subject: altered to add the sending machine's hostname, e.g. 'Subject: [$HOSTNAME] ....'. Your central mail handler already knows what machine it received the email from (the information went into the Received: header, for example). You could be more selective, for instance if you know that certain machines are problem sources (like the CentOS 7 machine that generated my second example) while others use software that already puts the hostname in (such as the Ubuntu machine that generated my first example).
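
As a crude illustration of the rewrite itself (this is not actual mailer configuration; $SRCHOST stands in for however your mail handler knows the sending machine), the transformation is just a prefix on the Subject: line in the headers:

# only touch the header block, i.e. everything before the first blank line
sed "1,/^\$/ s/^Subject: /Subject: [$SRCHOST] /" <message >message.rewritten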

I'm actually more attracted to the second approach than the first one. Sure, it's a big hammer and a bit crude, but it creates the easy-to-see marker of the source machine that I want (and it's a change we only have to make to one central machine). I'd feel differently if we routinely got status emails from various machines that we just filed away (in which case the alias-based approach would give us easy per-machine filing), but in practice our machines only email us occasionally and it's always going to be something that goes to our inboxes and probably needs to be dealt with.

IdentifyingStatusEmailSource written at 01:11:32

2017-03-13

OpenSSH's IdentityFile directive only ever adds identity files (as of 7.4)

In some complicated scenarios (especially with 2FA devices), even IdentitiesOnly can potentially give you too many identities between relatively generic Host ... entries and host-specific ones. Since there is only so far it's sensible to push Host ... entries with negated hostnames before you wind up with a terrible mess, there are situations where it would be nice to be able to say something like:

Host *.ourdomain
  IdentityFile ...
  IdentityFile ...
  [...]

Host something-picky.ourdomain
  IdentityFile NONE
  IdentityFile /u/cks/.ssh/identities/specific
  IdentitiesOnly yes
  [...]

Here, you want to offer a collection of identities from various sources to most hosts, but there are some hosts that both require very specific identities and will cut your connection off if you offer too many identities (as mentioned back here).

I have said in the past that 'as far as I knew' IdentityFile directives were purely cumulative (e.g. in comments on this entry). This held out a small sliver of hope that there was some way of doing this that I either couldn't see in the manpages or that just wasn't documented. As it happens, I recently decided to look at the OpenSSH source code for 7.4 (the latest officially released version) to put this to rest once and for all, and the bad news is that I have to stop qualifying my words. As far as I can tell from the source code, there is absolutely no way of wiping out existing IdentityFile directives that have been added by various matching Host stanzas. There's an array of identities (up to the maximum of 100 that's allowed), and the code only ever adds identities to it. Nothing removes entries or resets the number of valid entries in the array.

Oh well. It would have been nice, and maybe someday the OpenSSH people will add some sort of feature for this.
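
In the meantime, you can at least see exactly what has piled up for any particular host, because ssh can dump the configuration it would actually use:

# 'ssh -G' (OpenSSH 6.8 and later) prints the fully resolved client configuration
ssh -G something-picky.ourdomain | grep -i '^identityfile'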

In the process of reading bits of the OpenSSH code, I ran across an interesting comment in sshconnect2.c's pubkey_prepare():

/*
 * try keys in the following order:
 * 	1. certificates listed in the config file
 * 	2. other input certificates
 *	3. agent keys that are found in the config file
 *	4. other agent keys
 *	5. keys that are only listed in the config file
 */

(IdentitiesOnly does not appear to affect this order, it merely causes some keys to be excluded.)

To add to an earlier entry of mine, keys supplied with -i fall into the 'in the config file' case, because what that actually means is 'keys from -i, from the user's configuration file, and from the system configuration file, in that order'. They all get added to the list of keys with the same function, add_identity_file(), but -i is processed first.

(This means that my earlier writeup of the SSH identity offering order is a bit incomplete, but at this point I'm sufficiently tired of wrestling with this particular undocumented SSH mess that I'm not going to carefully do a whole bunch of tests to verify what the code comment says here. Having skimmed the code, I believe the comment.)

SSHNoIdentityFileOverride written at 23:05:53

2017-03-05

Why I (as a sysadmin) reflexively dislike various remote file access tools for editors

I somewhat recently ran across this irreal.org entry (because it refers to my entry on staying with Emacs for code editing), and in a side note it mentions this:

[Chris] does a lot of sysadmin work and prefers Vim for that (although I think Tramp would go a long way towards meeting the needs that he thinks Vim resolves).

This is partly in reference to my entry on Why vi has become my sysadmin's editor, and at the end of that entry I dismissed remote file access things like Tramp. I think it's worth spending a little bit of time talking about why I reflexively don't like them, at least with my sysadmin hat on.

There are several reasons for this. Let's start with the mechanics of remote access to files. If you're a sysadmin, it's quite common for you to be editing files that require root permissions to write to, and sometimes even to read. This presents two issues for a Tramp-like system. The first is that you must either arrange passwordless access to root-privileged files or provide your root-access password at some point during the process. The former is very alarming and the latter requires a great deal of trust in Tramp or other code to handle the situation securely. The additional issue for root access is that best practice today is not to log in or scp in directly as root but instead to log in as yourself and then use su or sudo to gain root access. Perhaps you can make your remote file access system of choice do this, but it's extremely unlikely to be simple, because this is not at all a common usage case for such systems. Almost all of them are built by developers to allow them to access their own files remotely; indirect access to privileged contexts is a side feature at best.

(Let's set aside issues of, say, two-factor authentication.)

All of this means that I would have to build access to an extremely sensitive context on uncertain foundations that require me to have a great deal of trust in both the system's security (when it probably wasn't built with high security worries in the first place) and that it will always do exactly and only the right thing, because once I give it root permissions one slip or accident could be extremely destructive.

But wait, there's more. Merely writing sensitive files is a dangerous and somewhat complicated process, one where it's actually not clear what you should always do and some of the ordinary rules don't always apply. For instance, in a sysadmin context if a file has hardlinks, you generally want to overwrite it in place so that those hardlinks stay. And you absolutely have to get the permissions and even ownership correct (yes, sysadmins may use root permissions to write to files that are owned by someone other than root, and that ownership had better stay). Again, it's possible for a remote file access system to get this right (or be capable of being set up that way), but it's probably not something that the system's developers have had as a high priority because it's not a common usage case. And I have to trust that this is all going to work, all the time.
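
The hardlink issue alone is easy to demonstrate; overwriting in place preserves the link, while the common 'write a new file and rename it over' approach silently severs it:

$ echo one >conf; ln conf conf.hardlink
$ echo two >conf                  # overwrite in place: both names see the change
$ cat conf.hardlink
two
$ echo three >conf.new && mv conf.new conf
$ cat conf.hardlink               # the other name is now a separate, stale file
two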

Finally, often editing a file is only part of what I'm doing as root. I hopefully want to commit that file to version control and also perhaps (re)start daemons or run additional commands to make my change take effect and do something. Perhaps a remote file editing system even has support for this, even running as a privileged user through some additional access path, but frankly this is starting to strain my ability to trust this system to get everything right (and actually do this well). Of course I don't have to use the remote access system for this, since I can just get root privileges directly and do all of this by hand, but if I'm going to be setting up a root session anyways to do additional work, why not go the small extra step to run vi in it? That way I know exactly what I'm getting and I don't have to extend a great deal of trust that a great deal of magic will do the right thing and not blow up in my face.

(And if the magic blows up, it's not just my face that's affected.)

Ultimately, directly editing files with vim as root (or the appropriate user) on the target system is straightforward, simple, and basically guaranteed to work. It has very few moving parts and they are mostly simple ones that are amenable to inspection and understanding. All of this is something that sysadmins generally value quite a bit, because we have enough complexity in our jobs as it is.

WhyRemoteFileWriteDislike written at 23:25:24

2017-03-04

Should you add MX entries for hosts in your (public) DNS?

For a long time, whenever we added a new server to our public DNS, we also added an MX entry for it (directing inbound email to our general external MX gateway). This was essentially historical habit, and I believe it came about because a very long time ago there were far too many programs that would send out email with From: and even To: addresses of '<user>@<host>.<our-dom>'. Adding MX entries made all of that email work.
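
In zone file terms, this habit amounted to giving every new host a pair of records along these lines (the names and address here are made up):

newhost    IN A    192.0.2.10
newhost    IN MX   10 mail-gateway.example.org.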

In the past few years, I have been advocating (mostly successfully) for moving away from this model of automatically adding MX entries, because I've come to believe that in practice it leads to problems over the long term. The basic problem is this: once an email address leaks out into public use, you're often stuck supporting it for a significant amount of time. Once those <host>.<our-dom> DNS names start showing up in email, they start getting saved in people's address books, wind up in mailing list subscriptions, and so on and so forth. Once that happens, you've created a usage of the name that may vastly outlast the actual machine itself; this means that over time, you may well have to accumulate a whole collection of lingering MX entries for now-obsolete machines that no longer exist.

These days, it's not hard to configure both machines and mail programs to use the canonical '<user>@<our-dom>' addresses that you want and to not generate those problematic '<user>@<host>.<our-dom>' addresses. If programs (or people) generate such addresses anyway, your next step is to fix them up in your outgoing mail gateway, forcefully rewriting them to your canonical form. Once you get all of this going you no longer need to support the <host>.<our-dom> form of addresses at all in order to make people's email work, and so in my view you're better off arranging for such addresses to never work by omitting MX entries for them (and refusing them on your external mail gateway). That way, if people do mistakenly (or deliberately) generate such addresses and let them escape to the outside world, they break immediately and people can fix things right away. You aren't stuck with addresses that will work for a while and then either impose long-term burdens on you or break someday.
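
The rewriting side is also not much work these days. As one hedged example (the domain here is a placeholder, and your mailer of choice will have its own mechanism), Postfix has a setting aimed at exactly this:

# main.cf on the outgoing mail gateway: strip the host part off
# <user>@<host>.example.org addresses in headers and envelope senders
masquerade_domains = example.org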

The obvious exception here is cases where you actually do want a hostname to work in email and be accepted in general; perhaps you want, say, '<user>@support.<our-dom>' to work, or '<user>@www.<our-dom>', or the like. But then you can actively choose to add an MX entry (and any other special processing you may need), and you can always defer doing this until the actual need materializes instead of doing it in anticipation when you set up the machine's other DNS.

(If you want only some local usernames to be accepted for such hostnames, say only 'support@support.<our-dom>', you'll obviously need to do more work in your email system. We haven't gone to this extent so far; all local addresses are accepted for all hostname and domain name variants that we accept as 'us'.)

AvoidingMXEntriesForHosts written at 23:48:13

