2017-03-31
I quite like the simplification of having OpenSSH canonicalize hostnames
Some time ago I wrote up some notes on OpenSSH's optional hostname
canonicalization. At the time I had just
cautiously switched over to having my OpenSSH setup on my workstation
canonicalize my hostnames, and I half expected it to go wrong
somehow. It's been just over a year since then and not only has
nothing blown up, I now actively prefer having OpenSSH canonicalize
the hostnames that I use and I've just today switched my OpenSSH
setup on our login servers over to do this and cleared out my
~/.ssh/known_hosts
file to start it over from scratch.
That latter bit is a big part of why I've come to like hostname
canonicalization. We have a long
history of using multiple forms of shortened hostnames for convenience,
plus sometimes we wind up using a host's fully qualified name. When
I didn't canonicalize names, my known_hosts
file wound up
increasingly cluttered with multiple entries for what was actually
the same host, some of them with the IP address and some without.
After canonicalization, all of this goes away; every host has one
entry and that's it. Since we already maintain a system-wide set
of SSH known hosts (partly for our custom NFS mount authentication
system), my own known_hosts
file now doesn't even accumulate very many entries.
(I should probably install our global SSH known hosts file even on
my workstation, which is deliberately independent from our overall
infrastructure; this would let me drastically reduce my known_hosts
file there too.)
The other significant reason to like hostname canonicalization is
the reason I mentioned in my original entry,
which is that it allows me to use much simpler Host
matching rules
in my ~/.ssh/config
while only offering my SSH keys to hosts that
should actually accept them (instead of to everyone, which can
have various consequences). This seems
to have become especially relevant lately, as some of our recently
deployed hosts seem to have reduced the number of authentication
attempts they'll accept (and each keypair you offer counts as one
attempt). And in general I just like having my SSH client configuration
saying what I actually want, instead of having to flail around with
'Host *' matches and so on because there was no simple way to say
'all of our hosts'. With canonical hostnames, now there is.
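For concreteness, here is roughly what this can look like in ~/.ssh/config; the domain names and the key path are invented stand-ins, not our real setup:

```
# Canonicalize short names like 'apps0' into fully qualified ones.
CanonicalizeHostname yes
CanonicalDomains cs.example.com example.com

# With canonical names, one simple Host block covers 'all of our hosts'
# and only offers keys to machines that should accept them.
Host *.cs.example.com
    IdentityFile ~/.ssh/identities/our-hosts
    IdentitiesOnly yes
```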
As far as DNS reliability for resolving CNAMEs goes, we haven't had
any DNS problems in the past year (or if we have, I failed to notice
them amidst greater problems). We might someday, but in general DNS
issues are going to cause me problems no matter what, since my ssh
has to look up at least IP addresses in DNS. If it happens I'll do
something, but in the mean time I've stopped worrying about the
possibility.
2017-03-29
The work of safely raising our local /etc/group line length limit
My department has now been running our Unix computing environment
for a very long time (which has some interesting consequences). When you run a Unix environment over the
long term, old historical practices slowly build up and get carried
forward from generation to generation of the overall system, because
you've probably never restarted everything from complete scratch.
All of this is an elaborate route to say that as part of our local
password propagation infrastructure, we
have a program that checks /etc/passwd
and /etc/group
to make
sure they look good, and this program puts a 512 byte limit on the
size of lines in /etc/group. If it finds a group line longer than
that, it complains and aborts and you get to fix it.
(Don't ask what our workaround is for groups with large memberships. I'll just say that it raises some philosophical and practical questions about what group membership means.)
We would like to remove this limit; it makes our life more complicated
in a number of ways, causes problems periodically, and we're pretty
sure that it's no longer needed and probably hasn't been needed for
years. So we should just take that bit of code out, or at least
change the '> 512' to '> 4096', right?
Not so fast, please. We're pretty sure that doing so is harmless,
but we're not certain. And we would like to not blow up some part
of our local environment by mistake if it turns out that actually
there is still something around here that has heartburn on long
/etc/group
lines. So in order to remove the limit we need to
test to make sure everything still works, and one of the things that
this has meant is sitting down and trying to think of all of the
places in our environment where something could go wrong with a
long group line. It's turned out that there were a number of these
places:
- Linux could fail to properly recognize group membership for people
in long groups. I rated this as unlikely, since the glibc people
are good at avoiding small limits and relatively long group lines
are an obvious thing to think about.
- OmniOS on our fileservers could
fail to recognize group membership. Probably unlikely too; the days
when people put 512-byte buffers or the like into
getgrent()
and friends are likely to be long over by now. (Hopefully those days were long over by, say, 2000.)
- Our Samba server might do something special with group handling and
so fail to properly deal with a long group, causing it to think that
someone wasn't a member or denying them access to group-protected files.
- The tools we use to build an Apache format group file
from our
/etc/group
could blow up on long lines. I thought that this was unlikely too; awk
and sed and so on generally don't have line length limitations these
days. (They did in the past, in one form or another, which is probably
part of why we had this /etc/group line length check in the first place.)
- Apache's own group authorization checking could fail on long
lines, either completely or just for logins at the end of the
line.
- Even if they handled regular group membership fine, perhaps our OmniOS fileservers would have a problem with NFS permission checks if you were in more than 16 groups and one of your extra groups was a long group, because this case causes the NFS server to do some additional group handling. I thought this was unlikely, since the code should be using standard OmniOS C library routines and I would have verified that those worked already, but given how important NFS permissions are for our users I felt I had to be sure.
(I was already confident that our local tools that dealt with
/etc/group
would have no problems; for the most part they're
written in Python and so don't have any particular line length or
field count limitations.)
It's probably worth explicitly testing Linux tools like useradd
and groupadd
to make sure that they have no problems manipulating
group membership in the presence of long /etc/group
lines. I can't
imagine them failing (just as I didn't expect the C library to have
any problems), but that just means it would be really embarrassing
if they turned out to have some issue and I hadn't checked.
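As a sketch of the core check involved (our real verification program does considerably more, and the function name and output format here are invented), flagging over-long /etc/group lines comes down to a one-liner in awk:

```shell
#!/bin/sh
# Sketch: flag group file lines longer than a byte limit, the same sort
# of check our passwd/group verification program performs.
check_group_lines() {
    # $1: group file to check, $2: maximum line length in bytes
    awk -v max="$2" 'length($0) > max {
        split($0, f, ":")
        printf "group %s: line %d is %d bytes\n", f[1], NR, length($0)
        bad = 1
    } END { exit bad }' "$1"
}

# Example invocation; the function exits non-zero if any line is too long.
check_group_lines /etc/group 512 || echo "long /etc/group lines found"
```

Raising the limit is then just a matter of changing the 512; as discussed above, the real work is in verifying that nothing else around the environment cares.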
All of this goes to show that getting rid of bits of the past can
be much more work and hassle than you'd like. And it's not particularly
interesting work, either; it's all dotting i's and crossing t's
just in case, testing things that you fully expect to just work
(and that have just worked so far). But we've got to do this sometime,
or we'll spend another decade with /etc/group
lines limited to
512 bytes or less.
(System administration life is often not particularly exciting.)
2017-03-26
Your exposure from retaining Let's Encrypt account keys
In a comment on my entry on how I think you have lots of Let's Encrypt accounts, Aristotle Pagaltzis asked a good question:
Taking this logic to its logical conclusion: as long as you can arrange to prove your control of a domain under some ACME challenge at any time, should you not immediately delete an account after obtaining a certificate through it?
(Granted – in practice, there is the small matter that deleting accounts appears unimplemented, as per your other entry…)
Let's take the last bit first: for security purposes, it's sufficient to destroy your account's private key. This leaves dangling registration data on Let's Encrypt's servers, but that's not your problem; with your private key destroyed, no one can use your authorized account to get any further certificates.
(If they can, either you or the entire world of cryptography have much bigger problems.)
For the broader issue: yes, in theory it's somewhat more secure to immediately destroy your private key the moment you have successfully obtained a certificate. However, there is a limit to how much security you get this way, because someone with unrestricted access to your machine can get their own authorization for it with an account of their own. If I have root access to your machine and you normally run a Let's Encrypt authorization process from it, I can just use my own client to do the same and get my own authorized account. I can then take the private key off the machine and later use it to get my own certificates for your machine.
(I can also reuse an account I already have and merely pass the authorization check, but in practice I might as well get a new account to go with it.)
The real exposure for existing authorized accounts is when it's easier to get at the account's private key than it is to get unrestricted access to the machine itself. If you keep the key on the machine and only accessible to root, well, I won't say you have no additional exposure at all, but in practice your exposure is probably fairly low; there are a lot of reasonably sensitive secrets that are protected this way and we don't consider it a problem (machine SSH host keys, for example). So in my opinion your real exposure starts going up when you transport the account key off the machine, for example to reuse the same account on multiple machines or over machine reinstalls.
As a compromise you might want to destroy account keys every so often, say once a year or every six months. This limits your long-term exposure to quietly compromised keys while not filling up Let's Encrypt's database with too many accounts.
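A sketch of what that periodic destruction could look like. This assumes certbot's on-disk layout, where each account directory holds a private_key.json; the path and the 180-day cutoff are illustrative, not a recommendation, and other ACME clients store keys elsewhere:

```shell
#!/bin/sh
# Sketch: destroy Let's Encrypt account private keys past a certain age.
# The private_key.json name and the accounts directory are certbot's
# layout; adjust both for your own ACME client.
rotate_account_keys() {
    # $1: accounts directory, $2: age in days beyond which keys are destroyed
    find "$1" -name private_key.json -mtime +"$2" -print -exec shred -u {} \;
}

if [ -d /etc/letsencrypt/accounts ]; then
    rotate_account_keys /etc/letsencrypt/accounts 180
fi
```

Your client will then register a fresh account the next time it needs one, leaving only dangling registration data behind on Let's Encrypt's side.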
As a corollary to this and the available Let's Encrypt challenge methods, someone who has compromised your DNS infrastructure can obtain their own Let's Encrypt authorizations (for any account) for any arbitrary host in your domain. If they issue a certificate for it immediately you can detect this through certificate transparency monitoring, but if they sit on their authorization for a while I don't think you can tell. As far as I know, LE provides no way to report on accounts that are authorized for things in your domain (or any domain), so you can't monitor this in advance of certificates being issued.
For some organizations, compromising your DNS infrastructure is
about as difficult as getting general root access (this is roughly
the case for us). However, for people who use outside DNS providers,
such a compromise may only require gaining access to one of your
authorized API keys for their services. And if you have some system
that allows people to add arbitrary TXT
records to your DNS with
relatively little access control, congratulations, you now have a
pretty big exposure there.
2017-03-22
Setting the root login's 'full name' to identify the machine that sent email
Yesterday I wrote about making sure you can identify what machine sent you a status email, and in the comments Sotiris Tsimbonis shared a brilliant yet simple solution to this problem:
We change the gecos info for this purpose.
chfn -f "$HOSTNAME root" root
Take it from me; this is beautiful genius (so much so that both we
and another group here immediately adopted it). It's so simple yet
still extremely effective, because almost everything that sends
email does so using programs like mail
that will fill out the
From:
header using the login's GECOS full name from /etc/passwd.
You get email that looks like:
From: root@<your-domain> (hebikera root)
This does exactly what we want by immediately showing the machine
that the email is from. In fact many mail clients these days will
show you only the 'real name' from the From:
header by default,
not the actual email address (I'm old-fashioned, so
I see the traditional full From:
header).
This likely works with any mail-sending program that doesn't require
completely filled out email headers. It definitely works in the
Postfix sendmail
cover program for 'sendmail -t' (as well as
the CentOS 6 and 7 mailx, which supplies the standard mail
command).
(As an obvious corollary, you can also use this trick for any other machine-specific accounts that send email; just give them an appropriate GECOS 'full name' as well.)
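If you script this (say, as part of machine setup or a rename procedure), it might look something like the following sketch; the helper name is made up, and you would extend the login list with whatever machine-specific senders you actually have:

```shell
#!/bin/sh
# Sketch: stamp the short hostname into the GECOS 'full name' of logins
# that send status email, so their From: headers identify the machine.
set_sender_gecos() {
    # $1: login to update
    chfn -f "$(hostname -s) $1" "$1"
}

# chfn needs root; 'root' is the obvious login, add your own others.
if [ "$(id -u)" -eq 0 ]; then
    for login in root; do
        set_sender_gecos "$login"
    done
fi
```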
There are two perhaps obvious cautions here. First, if you ever rename
machines you have to remember to re-chfn
the root
login and any
other such logins to have the correct hostname in them. It's probably
worth creating an officially documented procedure for renaming machines, since there are
other things you'll want to update as well (you might even script
it). Second, if you have some sort of password synchronization
system you need it to leave root's GECOS full name alone (although
it can update root's password). Fortunately
ours already does this.
Making sure you can identify what machine sent you a status email
I wrote before about making sure that system email works, so that machines can do important things like tell you that their RAID array has lost redundancy and you should do something about that. In a comment on that entry, -dsr- brought up an important point, which is you want to be able to easily tell which machine sent you email.
In an ideal world, everything on every machine that sends out email
reports would put the machine's hostname in, say, the Subject:
header. This would give you reports like:
Subject: SMART error (FailedOpenDevice) detected on host: urd
In the real world you also get helpful emails like this:
Subject: Health
Device: /dev/sdn [SAT], FAILED SMART self-check. BACK UP DATA NOW!
The only way for us to tell which machine this came from was to
look at either the Received:
headers or the Message-ID, which
is annoying.
There are at least two ways to achieve this. The first approach is
what -dsr- said in the comment, which is to make every machine
send its email to a unique alias on your system. This unfortunately
has at least two limitations. The first is that it somewhat clashes
with a true 'null client' setup, where your machines dump absolutely
all of their email on the server. A straightforward null client
does no local rewriting of email at all, so to get this you need a
smarter local mailer (and then you may need per-machine setup,
hopefully automated). The second limitation is that there's no
guarantee that all of the machine's email will be sent to root
(and thus be subject to simple rewriting). It's at least likely,
but machines have been known to send status email to all sorts of
addresses.
(I'm going to assume that you can arrange for the unique destination
alias to be visible in the To:
header.)
You can somewhat get around this by doing some of the rewriting on your central mail handler machine (assuming that you can tell the machine email apart from regular user email, which you probably want to do anyways). This needs a relatively sophisticated configuration, but it probably can be done in something like Exim (which has quite powerful rewrite rules).
However, if you're going to do this sort of magic in your central
mail handler machine, you might as well do somewhat different magic
and alter the Subject:
header of such email to include the host
name. For instance, you might just add a general rule to your mailer
so that all email from root
that's going to root
will have its
Subject:
altered to add the sending machine's hostname, e.g.
'Subject: [$HOSTNAME] ....'. Your central mail handler already
knows what machine it received the email from (the information went
into the Received
header, for example). You could be more selective,
for instance if you know that certain machines are problem sources
(like the CentOS 7 machine that generated my second example) while
others use software that already puts the hostname in (such as the
Ubuntu machine that generated my first example).
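The real version of this belongs in your central mailer's rewriting machinery (Exim rewrite rules, Postfix header checks, or the like), but the transformation itself is simple enough to sketch as a plain pipeline; the function name is invented:

```shell
#!/bin/sh
# Sketch: prefix the Subject: header of a message with the sending
# machine's hostname, leaving the body alone.
tag_subject() {
    # $1: sending hostname; message on stdin, tagged message on stdout.
    # The address range 1,/^$/ restricts sed to the headers.
    sed "1,/^\$/ s/^Subject: /Subject: [$1] /"
}
```

Fed the CentOS example above, this turns 'Subject: Health' into 'Subject: [somehost] Health', which is exactly the easy-to-see marker we want.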
I'm actually more attracted to the second approach than the first one. Sure, it's a big hammer and a bit crude, but it creates the easy to see marker of the source machine that I want (and it's a change we only have to make to one central machine). I'd feel differently if we routinely got status emails from various machines that we just filed away (in which case the alias-based approach would give us easy per-machine filing), but in practice our machines only email us occasionally and it's always going to be something that goes to our inboxes and probably needs to be dealt with.
2017-03-13
OpenSSH's IdentityFile directive only ever adds identity files (as of 7.4)
In some complicated scenarios (especially with 2FA devices), even IdentitiesOnly
can potentially give
you too many identities between relatively generic Host ...
entries and host-specific ones. Since there is only so far it's
sensible to push Host ...
entries with negated hostnames before
you wind up with a terrible mess, there are situations where it
would be nice to be able to say something like:
Host *.ourdomain
    IdentityFile ...
    IdentityFile ...
    [...]

Host something-picky.ourdomain
    IdentityFile NONE
    IdentityFile /u/cks/.ssh/identities/specific
    IdentitiesOnly yes
    [...]
Here, you want to offer a collection of identities from various sources to most hosts, but there are some hosts that both require very specific identities and will cut your connection off if you offer too many identities (as mentioned back here).
I have in the past said that 'as far as I knew' IdentityFile
directives were purely cumulative (e.g. in comments on this entry). This held out a small sliver of hope that
there was some way of doing this that I either couldn't see in the
manpages or that just wasn't documented. As it happens, I recently
decided to look at the OpenSSH source code for 7.4 (the latest
officially released version) to put this to rest once and for all,
and the bad news is that I have to stop qualifying my words. As far
as I can tell from the source code, there is absolutely no way of
wiping out existing IdentityFile
directives that have been added
by various matching Host
stanzas.
There's an array of identities (up to the maximum 100 that's allowed),
and the code only ever adds identities to it. Nothing removes entries
or resets the number of valid entries in the array.
Oh well. It would have been nice, and maybe someday the OpenSSH people will add some sort of feature for this.
In the process of reading bits of the OpenSSH code, I ran across an interesting comment in sshconnect2.c's pubkey_prepare():
/*
 * try keys in the following order:
 * 1. certificates listed in the config file
 * 2. other input certificates
 * 3. agent keys that are found in the config file
 * 4. other agent keys
 * 5. keys that are only listed in the config file
 */
(IdentitiesOnly
does not appear to affect this order; it merely
causes some keys to be excluded.)
To add to an earlier entry of mine, keys
supplied with -i
fall into the 'in the config file' case, because
what that actually means is 'keys from -i
, from the user's
configuration file, and from the system configuration file, in that
order'. They all get added to the list of keys with the same function,
add_identity_file(), but -i
is processed first.
(This means that my earlier writeup of the SSH identity offering order is a bit incomplete, but at this point I'm sufficiently tired of wrestling with this particular undocumented SSH mess that I'm not going to carefully do a whole bunch of tests to verify what the code comment says here. Having skimmed the code, I believe the comment.)
2017-03-05
Why I (as a sysadmin) reflexively dislike various remote file access tools for editors
I somewhat recently ran across this irreal.org entry (because it refers to my entry on staying with Emacs for code editing), and in a side note it mentions this:
[Chris] does a lot of sysadmin work and prefers Vim for that (although I think Tramp would go a long way towards meeting the needs that he thinks Vim resolves).
This is partly in reference to my entry on Why vi
has become
my sysadmin's editor, and at the end of that
entry I dismissed remote file access things like Tramp. I think
it's worth spending a little bit of time talking about why I
reflexively don't like them, at least with my sysadmin hat on.
There are several reasons for this. Let's start with the mechanics
of remote access to files. If you're a sysadmin, it's quite common
for you to be editing files that require root permissions to write
to, and sometimes to even read. This presents two issues for a Tramp
like system. The first is that either you arrange passwordless
access to root-privileged files or at some point during this
process you provide your root-access password. The first is very
alarming and the second requires a great deal of trust in Tramp or
other code to securely handle the situation.
The additional issue for root access is that best practice today
is to not log in or scp in directly as root but instead to log in
as yourself and then use su
or sudo
to gain root access. Perhaps
you can make the remote file access system of choice do this, but
it's extremely unlikely to be simple because this is not at all a
common usage case for them. Almost all of these systems are built
by developers to allow them to access their own files remotely;
indirect access to privileged contexts is a side feature at best.
(Let's set aside issues of, say, two-factor authentication.)
All of this means that I would have to build access to an extremely sensitive context on uncertain foundations that require me to have a great deal of trust in both the system's security (when it probably wasn't built with high security worries in the first place) and that it will always do exactly and only the right thing, because once I give it root permissions one slip or accident could be extremely destructive.
But wait, there's more. Merely writing sensitive files is a dangerous and somewhat complicated process, one where it's actually not clear what you should always do and some of the ordinary rules don't always apply. For instance, in a sysadmin context if a file has hardlinks, you generally want to overwrite it in place so that those hardlinks stay. And you absolutely have to get the permissions and even ownership correct (yes, sysadmins may use root permissions to write to files that are owned by someone other than root, and that ownership had better stay). Again, it's possible for a remote file access system to get this right (or be capable of being set up that way), but it's probably not something that the system's developers have had as a high priority because it's not a common usage case. And I have to trust that this is all going to work, all the time.
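The hardlink point is easy to demonstrate: overwriting in place (a plain redirect) preserves the target's inode, and with it the hardlinks, ownership, and permissions, while the write-new-file-and-rename approach that many editors use does not. A minimal sketch:

```shell
#!/bin/sh
# Sketch: overwrite a file in place so its inode (and thus its hardlinks,
# ownership, and permissions) survives, unlike a rename-into-place update.
overwrite_in_place() {
    # $1: file with the new contents, $2: target to overwrite
    cat -- "$1" > "$2"
}
```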
Finally, often editing a file is only part of what I'm doing as
root. I hopefully want to commit that file to version control and also perhaps (re)start daemons
or run additional commands to make my change take effect and do
something. Perhaps a remote file editing system even has support
for this, even running as a privileged user through some additional
access path, but frankly this is starting to strain my ability to
trust this system to get everything right (and actually do this
well). Of course I don't have to use the remote access system for
this, since I can just get root privileges directly and do all of
this by hand, but if I'm going to be setting up a root session
anyways to do additional work, why not go the small extra step to
run vi
in it? That way I know exactly what I'm getting and I don't
have to extend a great deal of trust that a great deal of magic
will do the right thing and not blow up in my face.
(And if the magic blows up, it's not just my face that's affected.)
Ultimately, directly editing files with vim
as root (or the
appropriate user) on the target system is straightforward, simple,
and basically guaranteed to work. It has very few moving parts and
they are mostly simple ones that are amenable to inspection and
understanding. All of this is something that sysadmins generally
value quite a bit, because we have enough complexity in our jobs
as it is.
2017-03-04
Should you add MX entries for hosts in your (public) DNS?
For a long time, whenever we added a new server to our public DNS,
we also added an MX entry for it (directing inbound email to our
general external MX gateway). This was essentially historical habit,
and I believe it came about because a very long time ago there were
far too many programs that would send out email with From:
and
even To:
addresses of '<user>@<host>.<our domain>'. Adding MX
entries made all of that email work.
In the past few years, I have been advocating (mostly successfully) for moving away from this model of automatically adding MX entries, because I've come to believe that in practice it leads to problems over the long term. The basic problem is this: once an email address leaks into public, you're often stuck supporting it for a significant amount of time. Once those <host>.<our-dom> DNS names start showing up in email, they start getting saved in people's address books, wind up in mailing list subscriptions, and so on and so forth. Once that happens, you've created a usage of the name that may vastly outlast the actual machine itself; this means that over time, you may well have to accumulate a whole collection of lingering MX entries for now-obsolete machines that no longer exist.
These days, it's not hard to configure both machines and mail programs to use the canonical '<user>@<our-dom>' addresses that you want and to not generate those problematic '<user>@<host>.<our-dom>' addresses. If programs (or people) do generate such addresses anyways, your next step is to fix them up in your outgoing mail gateway, forcefully rewriting them to your canonical form. Once you get all of this going you no longer need to support the <host>.<our-dom> form of addresses at all in order to make people's email work, and so in my view you're better off arranging for such addresses to never work by omitting MX entries for them (and refusing them on your external mail gateway). That way if people do mistakenly (or deliberately) generate such addresses and let them escape to the outside world, they break immediately and people can immediately fix things. You aren't stuck with addresses that will work for a while and then either impose long-term burdens on you or break someday.
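If your outgoing mail gateway happens to be Postfix, one concrete way to do this rewriting is its address masquerading support (example.com stands in for your real domain; check postconf(5) for the details and defaults):

```
# main.cf fragment: rewrite user@host.example.com to user@example.com
# on mail passing through this gateway.
masquerade_domains = example.com
# Optionally exempt some logins from masquerading; root is a common
# choice if you still want machine-identifying root addresses.
masquerade_exceptions = root
```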
The obvious exception here is cases where you actually do want a hostname to work in email and be accepted in general; perhaps you want, say, '<user>@support.<our-dom>' to work, or '<user>@www.<our-dom>', or the like. But then you can actively choose to add an MX entry (and any other special processing you may need), and you can always defer doing this until the actual need materializes instead of doing it in anticipation when you set up the machine's other DNS.
(If you want only some local usernames to be accepted for such hostnames, say only 'support@support.<our-dom>', you'll obviously need to do more work in your email system. We haven't gone to this extent so far; all local addresses are accepted for all hostname and domain name variants that we accept as 'us'.)