The work of safely raising our local
/etc/group line length limit
My department has now been running our Unix computing environment
for a very long time (which has some interesting consequences). When you run a Unix environment over the
long term, old historical practices slowly build up and get carried
forward from generation to generation of the overall system, because
you've probably never restarted everything from complete scratch.
All of this is an elaborate route to say that as part of our local
password propagation infrastructure, we have a program that checks
/etc/group to make sure it looks good, and this program puts a
512-byte limit on the size of lines in /etc/group. If it finds a
group line longer than that, it complains and aborts, and you get
to fix it.
(Don't ask what our workaround is for groups with large memberships. I'll just say that it raises some philosophical and practical questions about what group membership means.)
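(For concreteness, here's a minimal sketch of the sort of check involved. The 512-byte limit is the real one from this entry; the file handling and reporting are purely illustrative, not our actual checker.)

#!/usr/bin/python3
# Minimal sketch of a line length check over /etc/group.
# The 512 byte limit matches what this entry describes; everything
# else here is illustrative rather than our actual checker.
import sys

LIMIT = 512

def check_group_file(path="/etc/group"):
    ok = True
    with open(path, "rb") as f:
        for num, line in enumerate(f, 1):
            length = len(line.rstrip(b"\n"))
            if length > LIMIT:
                group = line.split(b":", 1)[0].decode("ascii", "replace")
                print("%s line %d (group %s): %d bytes, over the %d byte limit"
                      % (path, num, group, length, LIMIT), file=sys.stderr)
                ok = False
    return ok

if __name__ == "__main__":
    sys.exit(0 if check_group_file() else 1)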
We would like to remove this limit; it makes our life more complicated
in a number of ways, causes problems periodically, and we're pretty
sure that it's no longer needed and probably hasn't been needed for
years. So we should just take that bit of code out, or at least
change the '
> 512' to '
> 4096', right?
Not so fast, please. We're pretty sure that doing so is harmless,
but we're not certain. And we would like to not blow up some part
of our local environment by mistake if it turns out that actually
there is still something around here that has heartburn on long
/etc/group lines. So in order to remove the limit we need to
test to make sure everything still works, and one of the things that
this has meant is sitting down and trying to think of all of the
places in our environment where something could go wrong with a
long group line. It's turned out that there were a number of these:
- Linux could fail to properly recognize group membership for people
in long groups. I rated this as unlikely, since the glibc people
are good at avoiding small limits and relatively long group lines
are an obvious thing to think about.
- OmniOS on our fileservers could fail to recognize group membership.
Probably unlikely too; the days when people put 512-byte buffers or
the like into getgrent() and friends are likely to be long over by now.
(Hopefully those days were long over by, say, 2000.)
- Our Samba server might do something special with group handling and
so fail to properly deal with a long group, causing it to think that
someone wasn't a member or denying them access to a group-protected file.
- The tools we use to build an Apache format group file from /etc/group
could blow up on long lines. I thought that this was unlikely too;
sed and so on generally don't have line length limitations these days.
(They did in the past, in one form or another, which is probably part
of why we had this /etc/group line length check in the first place.)
- Apache's own group authorization checking could fail on long
lines, either completely or just for logins at the end of the line.
- Even if they handled regular group membership fine, perhaps our OmniOS fileservers would have a problem with NFS permission checks if you were in more than 16 groups and one of your extra groups was a long group, because this case causes the NFS server to do some additional group handling. I thought this was unlikely, since the code should be using standard OmniOS C library routines and I would have verified that those worked already, but given how important NFS permissions are for our users I felt I had to be sure.
(I was already confident that our local tools that dealt with
/etc/group would have no problems; for the most part they're
written in Python and so don't have any particular line length or
field count limitations.)
It's probably worth explicitly testing Linux tools like
groupadd to make sure that they have no problems manipulating
group membership in the presence of long
/etc/group lines. I can't
imagine them failing (just as I didn't expect the C library to have
any problems), but that just means it would be really embarrassing
if they turned out to have some issue and I hadn't checked.
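(To give a concrete flavour of the testing, here is a rough sketch of one small piece of it: checking that the C library reports the same membership for a deliberately long test group that /etc/group literally contains. The test group name is made up, and in practice you'd also be checking things like 'id', Samba, NFS, and Apache, not just this.)

#!/usr/bin/python3
# Sketch: compare the C library's view of a (deliberately long) group
# against what /etc/group literally says. The group name is hypothetical.
import grp
import sys

def file_members(name, path="/etc/group"):
    with open(path) as f:
        for line in f:
            fields = line.rstrip("\n").split(":")
            if fields[0] == name:
                members = fields[3] if len(fields) > 3 else ""
                return set(m for m in members.split(",") if m)
    return None

def check(name):
    want = file_members(name)
    if want is None:
        sys.exit("no such group in /etc/group: %s" % name)
    got = set(grp.getgrnam(name).gr_mem)
    if want != got:
        print("%s: mismatch; missing %s, extra %s"
              % (name, sorted(want - got), sorted(got - want)))
    else:
        print("%s: %d members, the C library agrees with the file" % (name, len(got)))

if __name__ == "__main__":
    check(sys.argv[1] if len(sys.argv) > 1 else "testlonggrp")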
All of this goes to show that getting rid of bits of the past can
be much more work and hassle than you'd like. And it's not particularly
interesting work, either; it's all dotting i's and crossing t's
just in case, testing things that you fully expect to just work
(and that have just worked so far). But we've got to do this sometime,
or we'll spend another decade with
/etc/group lines limited to
512 bytes or less.
(System administration life is often not particularly exciting.)
Your exposure from retaining Let's Encrypt account keys
In a comment on my entry on how I think you have lots of Let's Encrypt accounts, Aristotle Pagaltzis asked a good question:
Taking this logic to its logical conclusion: as long as you can arrange to prove your control of a domain under some ACME challenge at any time, should you not immediately delete an account after obtaining a certificate through it?
(Granted – in practice, there is the small matter that deleting accounts appears unimplemented, as per your other entry…)
Let's take the last bit first: for security purposes, it's sufficient to destroy your account's private key. This leaves dangling registration data on Let's Encrypt's servers, but that's not your problem; with your private key destroyed, no one can use your authorized account to get any further certificates.
(If they can, either you or the entire world of cryptography have much bigger problems.)
For the broader issue: yes, in theory it's somewhat more secure to immediately destroy your private key the moment you have successfully obtained a certificate. However, there is a limit to how much security you get this way, because someone with unrestricted access to your machine can get their own authorization for it with an account of their own. If I have root access to your machine and you normally run a Let's Encrypt authorization process from it, I can just use my own client to do the same and get my own authorized account. I can then take the private key off the machine and later use it to get my own certificates for your machine.
(I can also reuse an account I already have and merely pass the authorization check, but in practice I might as well get a new account to go with it.)
The real exposure for existing authorized accounts is when it's easier to get at the account's private key than it is to get unrestricted access to the machine itself. If you keep the key on the machine and only accessible to root, well, I won't say you have no additional exposure at all, but in practice your exposure is probably fairly low; there are a lot of reasonably sensitive secrets that are protected this way and we don't consider it a problem (machine SSH host keys, for example). So in my opinion your real exposure starts going up when you transport the account key off the machine, for example to reuse the same account on multiple machines or over machine reinstalls.
As a compromise you might want to destroy account keys every so often, say once a year or every six months. This limits your long-term exposure to quietly compromised keys while not filling up Let's Encrypt's database with too many accounts.
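(If you wanted to automate something like this, a sketch might look like the following. The directory and file name here are assumptions about a certbot-style layout, not something from this entry; check your actual client first, and note that this deliberately only reports old keys rather than destroying anything.)

#!/usr/bin/python3
# Sketch: report ACME account keys older than roughly six months.
# The path and the 'private_key.json' name are assumptions about a
# certbot-style layout; adjust for whatever client you actually use.
import os
import time

ACCOUNT_DIR = "/etc/letsencrypt/accounts"   # assumed location
MAX_AGE = 180 * 24 * 3600                    # roughly six months

def old_account_keys(top=ACCOUNT_DIR, max_age=MAX_AGE):
    now = time.time()
    for dirpath, dirnames, filenames in os.walk(top):
        for fn in filenames:
            if fn == "private_key.json":
                full = os.path.join(dirpath, fn)
                age = now - os.stat(full).st_mtime
                if age > max_age:
                    yield full, age / 86400.0

if __name__ == "__main__":
    for path, days in old_account_keys():
        print("%s is %.0f days old; consider destroying it and re-registering" % (path, days))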
As a corollary to this and the available Let's Encrypt challenge methods, someone who has compromised your DNS infrastructure can obtain their own Let's Encrypt authorizations (for any account) for any arbitrary host in your domain. If they issue a certificate for it immediately you can detect this through certificate transparency monitoring, but if they sit on their authorization for a while I don't think you can tell. As far as I know, LE provides no way to report on accounts that are authorized for things in your domain (or any domain), so you can't monitor this in advance of certificates being issued.
For some organizations, compromising your DNS infrastructure is
about as difficult as getting general root access (this is roughly
the case for us). However, for people who use outside DNS providers,
such a compromise may only require gaining access to one of your
authorized API keys for their services. And if you have some system
that allows people to add arbitrary
TXT records to your DNS with
relatively little access control, congratulations, you now have a
pretty big exposure there.
Using the root login's 'full name' to identify the machine that sent email
Yesterday I wrote about making sure you can identify what machine sent you a status email, and in the comments Sotiris Tsimbonis shared a brilliant yet simple solution to this problem:
We change the gecos info for this purpose.
chfn -f "$HOSTNAME root" root
Take it from me; this is beautiful genius (so much so that both we
and another group here immediately adopted it). It's so simple yet
still extremely effective, because almost everything that sends
email does so using standard mail programs, which fill out the
From: header using the login's GECOS full name from /etc/passwd.
You get email that looks like:
From: root@<your-domain> (hebikera root)
This does exactly what we want by immediately showing the machine
that the email is from. In fact many mail clients these days will
show you only the 'real name' from the From: header by default,
not the actual email address (I'm old-fashioned, so I see the
traditional full From: header).
This likely works with any mail-sending program that doesn't require
completely filled out email headers. It definitely works in the
sendmail cover program for 'sendmail -t' (as well as the CentOS 6
and 7 mailx, which supplies the standard From: header itself).
(As an obvious corollary, you can also use this trick for any other machine-specific accounts that send email; just give them an appropriate GECOS 'full name' as well.)
There are two perhaps obvious cautions here. First, if you ever rename
machines you have to remember to re-chfn the root login and any
other such logins to have the correct hostname in them. It's probably
worth creating an officially documented procedure for renaming machines, since there are
other things you'll want to update as well (you might even script
it). Second, if you have some sort of password synchronization
system, you need it to leave root's full name alone (although it can
update root's password). Fortunately ours already does this.
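(Since you might script it, here's a sketch of the sort of check that could go in cron or your configuration management: complain if root's GECOS full name no longer matches the machine's current hostname. The '<hostname> root' format follows the chfn example above; using the short hostname is just an assumption.)

#!/usr/bin/python3
# Sketch: warn if root's GECOS full name doesn't match the current hostname.
# Assumes the '<hostname> root' format from the chfn example; uses the
# short hostname, which may or may not match how you actually run chfn.
import pwd
import socket
import sys

def check_root_gecos():
    host = socket.gethostname().split(".")[0]
    expected = "%s root" % host
    gecos = pwd.getpwnam("root").pw_gecos
    if gecos != expected:
        print("root's GECOS is %r, expected %r; time to re-run chfn" % (gecos, expected),
              file=sys.stderr)
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(check_root_gecos())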
Making sure you can identify what machine sent you a status email
I wrote before about making sure that system email works, so that machines can do important things like tell you that their RAID array has lost redundancy and you should do something about that. In a comment on that entry, -dsr- brought up an important point, which is you want to be able to easily tell which machine sent you email.
In an ideal world, everything on every machine that sends out email
reports would put the machine's hostname in, say, the Subject:
header. This would give you reports like:
Subject: SMART error (FailedOpenDevice) detected on host: urd
In the real world you also get helpful emails like this:
Device: /dev/sdn [SAT], FAILED SMART self-check. BACK UP DATA NOW!
The only way for us to tell which machine this came from was to
look at things like the Received: headers.
There are at least two ways to achieve this. The first approach is
what -dsr- said in the comment, which is to make every machine
send its email to a unique alias on your system. This unfortunately
has at least two limitations. The first is that it somewhat clashes
with a true 'null client' setup, where your machines dump absolutely
all of their email on the server. A straightforward null client
does no local rewriting of email at all, so to get this you need a
smarter local mailer (and then you may need per-machine setup,
hopefully automated). The second limitation is that there's no
guarantee that all of the machine's email will be sent to a single
standard address (and thus be subject to simple rewriting). It's at
least likely, but machines have been known to send status email to
all sorts of addresses.
(I'm going to assume that you can arrange for the unique destination
alias to be visible in the headers of the email as you finally receive it.)
You can somewhat get around this by doing some of the rewriting on your central mail handler machine (assuming that you can tell the machine email apart from regular user email, which you probably want to do anyways). This needs a relatively sophisticated configuration, but it probably can be done in something like Exim (which has quite powerful rewrite rules).
However, if you're going to do this sort of magic in your central
mail handler machine, you might as well do somewhat different magic
and alter the
Subject: header of such email to include the host
name. For instance, you might just add a general rule to your mailer
so that all email from
root that's going to
root will have its
Subject: altered to add the sending machine's hostname, eg
'Subject: [$HOSTNAME] ....'. Your central mail handler already
knows what machine it received the email from (the information went
into the Received: header it added, for example). You could be more selective,
for instance if you know that certain machines are problem sources
(like the CentOS 7 machine that generated my second example) while
others use software that already puts the hostname in (such as the
Ubuntu machine that generated my first example).
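(To make the transformation concrete, here's a sketch of it in Python, operating on a message read from standard input. In real life you'd express this in your mailer's own rewriting machinery rather than an external script, and the 'from <host>' parsing of the Received: header here is deliberately crude.)

#!/usr/bin/python3
# Sketch: prefix the Subject: of a message on stdin with the host it
# came from, pulled crudely from the first Received: header. A real
# version of this would live in the central mailer's rewrite rules.
import re
import sys
from email import message_from_string

def tag_subject(text):
    msg = message_from_string(text)
    received = msg.get("Received", "")
    m = re.search(r"from\s+(\S+)", received)
    host = m.group(1) if m else "unknown-host"
    subject = msg.get("Subject", "")
    del msg["Subject"]
    msg["Subject"] = "[%s] %s" % (host, subject)
    return msg.as_string()

if __name__ == "__main__":
    sys.stdout.write(tag_subject(sys.stdin.read()))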
I'm actually more attracted to the second approach than the first one. Sure, it's a big hammer and a bit crude, but it creates the easy to see marker of the source machine that I want (and it's a change we only have to make to one central machine). I'd feel differently if we routinely got status emails from various machines that we just filed away (in which case the alias-based approach would give us easy per-machine filing), but in practice our machines only email us occasionally and it's always going to be something that goes to our inboxes and probably needs to be dealt with.
OpenSSH's IdentityFile directive only ever adds identity files (as of 7.4)
In some complicated scenarios (especially with 2FA devices), even
IdentitiesOnly can potentially give
you too many identities between relatively generic Host * entries
and host-specific ones. Since there is only so far it's
sensible to push
Host ... entries with negated hostnames before
you wind up with a terrible mess, there are situations where it
would be nice to be able to say something like:
Host *.ourdomain
    IdentityFile ...
    IdentityFile ...
    [...]

Host something-picky.ourdomain
    IdentityFile NONE
    IdentityFile /u/cks/.ssh/identities/specific
    IdentitiesOnly yes
    [...]
Here, you want to offer a collection of identities from various sources to most hosts, but there are some hosts that both require very specific identities and will cut your connection off if you offer too many identities (as mentioned back here).
I have in the past said that 'as far as I knew' IdentityFile
directives were purely cumulative (eg in comments on this entry). This held out a small sliver of hope that
there was some way of doing this that I either couldn't see in the
manpages or that just wasn't documented. As it happens, I recently
decided to look at the OpenSSH source code for 7.4 (the latest
officially released version) to put this to rest once and for all,
and the bad news is that I have to stop qualifying my words. As far
as I can tell from the source code, there is absolutely no way of
wiping out existing
IdentityFile directives that have been added
by various matching Host entries.
There's an array of identities (up to the maximum of 100 that's allowed),
and the code only ever adds identities to it. Nothing removes entries
or resets the number of valid entries in the array.
Oh well. It would have been nice, and maybe someday the OpenSSH people will add some sort of feature for this.
In the process of reading bits of the OpenSSH code, I ran across an interesting comment in sshconnect2.c's pubkey_prepare():
/*
 * try keys in the following order:
 *      1. certificates listed in the config file
 *      2. other input certificates
 *      3. agent keys that are found in the config file
 *      4. other agent keys
 *      5. keys that are only listed in the config file
 */
(IdentitiesOnly does not appear to affect this order; it merely
causes some keys to be excluded.)
To add to an earlier entry of mine, keys specified with -i fall
into the 'in the config file' case, because what that actually
means is 'keys from -i, from the user's configuration file, and
from the system configuration file, in that order'. They all get
added to the list of keys with the same function, and -i is
processed first.
(This means that my earlier writeup of the SSH identity offering order is a bit incomplete, but at this point I'm sufficiently tired of wrestling with this particular undocumented SSH mess that I'm not going to carefully do a whole bunch of tests to verify what the code comment says here. Having skimmed the code, I believe the comment.)
Why I (as a sysadmin) reflexively dislike various remote file access tools for editors
[Chris] does a lot of sysadmin work and prefers Vim for that (although I think Tramp would go a long way towards meeting the needs that he thinks Vim resolves).
This is partly in reference to my entry on Why
vi has become
my sysadmin's editor, and at the end of that
entry I dismissed remote file access things like Tramp. I think
it's worth spending a little bit of time talking about why I
reflexively don't like them, at least with my sysadmin hat on.
There are several reasons for this. Let's start with the mechanics
of remote access to files. If you're a sysadmin, it's quite common
for you to be editing files that require root permissions to write
to, and sometimes even to read. This presents two issues for a
Tramp-like system. The first is that either you arrange passwordless
access to root-privileged files or at some point during this
process you provide your root-access password. The first is very
alarming and the second requires a great deal of trust in Tramp or
other code to securely handle the situation.
The additional issue for root access is that best practice today
is to not log in or scp in directly as root, but instead to log in
as yourself and then use sudo to gain root access. Perhaps
you can make the remote file access system of choice do this, but
it's extremely unlikely to be simple because this is not at all a
common usage case for them. Almost all of these systems are built
by developers to allow them to access their own files remotely;
indirect access to privileged contexts is a side feature at best.
(Let's set aside issues of, say, two-factor authentication.)
All of this means that I would have to build access to an extremely sensitive context on uncertain foundations that require me to have a great deal of trust in both the system's security (when it probably wasn't built with high security worries in the first place) and that it will always do exactly and only the right thing, because once I give it root permissions one slip or accident could be extremely destructive.
But wait, there's more. Merely writing sensitive files is a dangerous and somewhat complicated process, one where it's actually not clear what you should always do and some of the ordinary rules don't always apply. For instance, in a sysadmin context if a file has hardlinks, you generally want to overwrite it in place so that those hardlinks stay. And you absolutely have to get the permissions and even ownership correct (yes, sysadmins may use root permissions to write to files that are owned by someone other than root, and that ownership had better stay). Again, it's possible for a remote file access system to get this right (or be capable of being set up that way), but it's probably not something that the system's developers have had as a high priority because it's not a common usage case. And I have to trust that this is all going to work, all the time.
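(As a sketch of what 'getting it right' can mean here, this is roughly what careful in-place replacement looks like; a remote file access system would need to do the equivalent. Overwriting the existing inode preserves hardlinks, ownership, and permissions, at the cost of not being atomic, which is part of why there's no single obviously correct answer.)

#!/usr/bin/python3
# Sketch: overwrite a file's contents in place so that hardlinks,
# ownership, and permissions are preserved (the same inode is reused).
# 'data' is bytes. This is deliberately not an atomic replacement.
import os

def overwrite_in_place(path, data):
    st = os.stat(path)
    fd = os.open(path, os.O_WRONLY)    # no O_CREAT: the file must already exist
    try:
        os.ftruncate(fd, 0)
        os.write(fd, data)
        # Ownership and mode live on the inode and so are untouched,
        # but restoring them explicitly guards against surprises.
        os.fchown(fd, st.st_uid, st.st_gid)
        os.fchmod(fd, st.st_mode & 0o7777)
    finally:
        os.close(fd)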
Finally, often editing a file is only part of what I'm doing as
root. I hopefully want to commit that file to version control and also perhaps (re)start daemons
or run additional commands to make my change take effect and do
something. Perhaps a remote file editing system even has support
for this, even running as a privileged user through some additional
access path, but frankly this is starting to strain my ability to
trust this system to get everything right (and actually do this
well). Of course I don't have to use the remote access system for
this, since I can just get root privileges directly and do all of
this by hand, but if I'm going to be setting up a root session
anyways to do additional work, why not go the small extra step of
running vi in it? That way I know exactly what I'm getting and I don't
have to extend a great deal of trust that a great deal of magic
will do the right thing and not blow up in my face.
(And if the magic blows up, it's not just my face that's affected.)
Ultimately, directly editing files with
vim as root (or the
appropriate user) on the target system is straightforward, simple,
and basically guaranteed to work. It has very few moving parts and
they are mostly simple ones that are amenable to inspection and
understanding. All of this is something that sysadmins generally
value quite a bit, because we have enough complexity in our jobs
as it is.
Should you add MX entries for hosts in your (public) DNS?
For a long time, whenever we added a new server to our public DNS,
we also added an MX entry for it (directing inbound email to our
general external MX gateway). This was essentially historical habit,
and I believe it came about because a very long time ago there were
far too many programs that would send out email with
To: addresses of '<user>@<host>.<our domain>'. Adding MX
entries made all of that email work.
In the past few years, I have been advocating (mostly successfully) for moving away from this model of automatically adding MX entries, because I've come to believe that in practice it leads to problems over the long term. The basic problem is this: once an email address leaks into public, you're often stuck supporting it for a significant amount of time. Once those <host>.<our-dom> DNS names start showing up in email, they start getting saved in people's address books, wind up in mailing list subscriptions, and so on and so forth. Once that happens, you've created a usage of the name that may vastly outlast the actual machine itself; this means that over time, you may well have to accumulate a whole collection of lingering MX entries for now-obsolete machines that no longer exist.
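(If you're wondering how many of these lingering MX entries you've accumulated, here is a sketch of a check using the third-party dnspython module, which is an assumption on my part rather than anything we actually use. Feed it the hostnames you care about and it reports which ones still have MX entries.)

#!/usr/bin/python3
# Sketch: report which of the given hostnames still have MX entries.
# Requires the third-party dnspython package (older versions use
# dns.resolver.query instead of dns.resolver.resolve).
import sys
import dns.resolver

def report_mx(hostnames):
    for name in hostnames:
        try:
            answers = dns.resolver.resolve(name, "MX")
        except (dns.resolver.NoAnswer, dns.resolver.NXDOMAIN):
            continue
        for rr in answers:
            print("%s MX %d %s" % (name, rr.preference, rr.exchange))

if __name__ == "__main__":
    report_mx(sys.argv[1:])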
These days, it's not hard to configure both machines and mail programs to use the canonical '<user>@<our-dom>' addresses that you want and to not generate those problematic '<user>@<host>.<our-dom>' addresses. If programs (or people) do generate such addresses anyways, your next step is to fix them up in your outgoing mail gateway, forcefully rewriting them to your canonical form. Once you get all of this going you no longer need to support the <host>.<our-dom> form of addresses at all in order to make people's email work, and so in my view you're better off arranging for such addresses to never work by omitting MX entries for them (and refusing them on your external mail gateway). That way if people do mistakenly (or deliberately) generate such addresses and let them escape to the outside world, they break immediately and people can immediately fix things. You aren't stuck with addresses that will work for a while and then either impose long-term burdens on you or break someday.
The obvious exception here is cases where you actually do want a hostname to work in email and be accepted in general; perhaps you want, say, '<user>@support.<our-dom>' to work, or '<user>@www.<our-dom>', or the like. But then you can actively choose to add an MX entry (and any other special processing you may need), and you can always defer doing this until the actual need materializes instead of doing it in anticipation when you set up the machine's other DNS.
(If you want only some local usernames to be accepted for such hostnames, say only 'support@support.<our-dom>', you'll obviously need to do more work in your email system. We haven't gone to this extent so far; all local addresses are accepted for all hostname and domain name variants that we accept as 'us'.)
Using Certificate Transparency to monitor your organization's TLS activity
One of the obvious things that you can do with Certificate Transparency is to monitor the CT logs for bad people somehow getting a certificate for one of your websites. If you're PayPal or Google or the University of Toronto, and you see a CT log entry for a 'www.utoronto.ca' certificate that isn't yours, you can ring lots of alarms. You can do this with actual infrastructure (perhaps based on the actual logs), or you can do this on a manual and ad-hoc basis through one of the websites that let you query the CT logs, such as Google's or crt.sh, or maybe Facebook's (which of course requires a Facebook login because Facebook, that's why).
But there's another use for it, and that is looking for people in your own organization who are getting properly issued certificates. Perhaps I'm biased by working in a university, but around here there's no central point that really controls TLS certificates; if you can persuade a TLS certificate provider to give you a certificate, people will. And these days, the existence of Let's Encrypt means that if you have control over your own hosts, you can probably get certificates for them. If you are in such an organization, monitoring Certificate Transparency logs is one way to keep track of who is doing roughly what with TLS, perhaps discover interesting services you want to know about, and so on.
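(As a sketch of the ad-hoc version of this, crt.sh can be asked for JSON output. The query parameters and field names here reflect crt.sh's interface as I understand it and could change, so treat this as illustrative.)

#!/usr/bin/python3
# Sketch: list certificates that crt.sh knows about under a domain.
# The query format and JSON field names are crt.sh's and may change.
import json
import sys
from urllib.parse import quote
from urllib.request import urlopen

def crtsh_entries(domain):
    url = "https://crt.sh/?q=%s&output=json" % quote("%." + domain)
    with urlopen(url) as resp:
        entries = json.loads(resp.read().decode("utf-8"))
    for e in entries:
        names = e.get("name_value", "?").replace("\n", " ")
        print(e.get("not_before", "?"), e.get("issuer_name", "?"), names)

if __name__ == "__main__":
    crtsh_entries(sys.argv[1])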
(Perhaps you are saying 'we control who gets to run TLS services because we control the perimeter firewall'. Do you control DNS too, so that people can't point off to things they're hosting in AWS? You probably don't want to go that far, by the way, because the alternative is for people to buy their own domain names too and then they won't even show up in your CT monitoring.)
You don't even have to be at the top of an organization to find this interesting, because sometimes there are subgroups all the way down. Some of our graduate students run machines that can be reached from the outside world, and I'm sure that sooner or later some of them will want a TLS certificate and discover Let's Encrypt. It's reassuring to know that when this happens we have at least some chance of finding out about it.
(Not an entirely great chance, because sometimes professors set up new domain names for graduate student projects and don't tell us about them.)
PS: Of course, as a bystander in your (overall) organization you can also use CT logs to satisfy your curiosity about things like how common Let's Encrypt certificates are, and how broadly spread they are across your organization. Is your group one of the few areas actively experimenting with them, or are a whole lot of people using them all over the place?
(All of this is probably pretty obvious, but I feel like writing it down.)
Sometimes it can be hard to tell one cause of failure from another
I mentioned recently how a firmware update fixed a 3ware controller so that it worked. As it happens, my experiences with this machine nicely illustrate the idea that sometimes it can be hard to tell one failure from another, or to put it another way, when you have a failure it can be hard to tell what the actual cause is. So let me tell the story of trying to install this machine.
Like many places within universities, we don't have a lot of money, but we do have a large collection of old, used hardware. Rather than throw eg five year old hardware away because it's beyond its nominal service life, we instead keep around anything that's not actively broken (or at least that doesn't seem broken) and press it into use again in sufficiently low-priority situations. One of the things that we have as a result of this is an assorted collection of various sizes of SATA HDs. We've switched over to SSDs for most servers, but we don't really have enough money to use SSDs for everything, especially when we're reconditioning an inherited machine under unusual circumstances.
Or in other words, we have a big box of 250 GB Seagate SATA HDs that have been previously used somewhere (probably as SunFire X2x00 system disks), all of which had passed basic tests when they were put into the box some time ago. When I wanted a pair of system disks for this machine I turned to that box. Things did not go well from there.
One of the disks from the first pair had really slow IO problems, which of course manifested as a far too slow Ubuntu 16.04 install. After replacing the slow drive, the second install attempt ended with the original 'good' drive dropping off the controller entirely, apparently dead. The replacement for that drive turned out to also be excessively slow, which took me up to four 250 GB SATA drives, of which one might be good (and three slow failed attempts to bring up one of our Ubuntu 16.04 installs). At that point I gave up and used some SSDs that we had relatively strong confidence in, because I wasn't sure if our 250 GB SATA drives were terrible or if the machine was eating disks. The SSDs worked.
Before we did the 3ware firmware upgrade and it made other things work great, I would have confidently told you that our 250 GB SATA disks had started rotting and could no longer be trusted. Now, well, I'm not so sure. I'm perfectly willing to believe bad things about those old drives, but were my problems because of the drives, the 3ware controller's issues, or some combination of both? My guess now is on a combination of both, but I don't really know and that shows the problem nicely.
(It's not really worth finding out, either, since testing disks for slow performance is kind of a pain and we've already spent enough time on this issue. I did try the 'dead' disk in a USB disk docking station and it worked in light testing.)
Sometimes, firmware updates can be a good thing to do
There are probably places that routinely apply firmware updates to every piece of hardware they have. Oh, sure, with a delay and in stages (rushing into new firmware is foolish), but it's always in the schedule. We are not such a place. We have a long history of trying to do as few firmware updates as possible, for the usual reason; usually we don't even consider it unless we can identify a specific issue we're having that new firmware (theoretically) fixes. And if we're having hardware problems, 'update the firmware in the hope that it will fix things' is usually last on our list of troubleshooting steps; we tacitly consider it down around the level of 'maybe rebooting will fix things'.
I mentioned the other day that we've inherited a 16-drive machine with a 3ware controller card. As far as we know, this machine worked fine for the previous owners in a hardware (controller) RAID-6 configuration across all the drives, but we've had real problems getting it stable for us in a JBOD configuration (we much prefer to use software RAID; among other things, we already know how to monitor and manage that with Ubuntu tools). We had system lockups, problems installing Ubuntu, and under load such as trying to scan a 14-disk RAID-6 array, the system would periodically report errors such as:
sd 2:0:0:0: WARNING: (0x06:0x002C): Command (0x2a) timed out, resetting card.
(This isn't even for a disk in the RAID-6 array; sd 2:0:0:0 is one of the mirrored system disks.)
Some Internet searches turned up people saying 'upgrade the firmware'. That felt like a stab in the dark to me, especially if the system had been working okay for the previous owners, but I was getting annoyed with the hardware and the latest firmware release notes did talk about some other things we might want (like support for disks over 2 TB). So I figured out how to do a firmware update and applied the 'latest' firmware (which for our controller dates from 2012).
(Unsurprisingly the controller's original firmware was significantly out of date.)
I can't say that the firmware update has definitely fixed our problems with the controller, but the omens are good so far. I've been hammering on the system for more than 12 hours without a single problem report or hiccup, which is far better than it ever managed before, and some things that had been problems before seem to work fine now.
All of this goes to show that sometimes my reflexive caution about firmware updates is misplaced. I don't think I'm ready to apply all available firmware updates before something goes into production, not even long-standing ones, but I'm certainly now more ready to consider them than I was before (in cases where there's no clear reason to do so). Perhaps I should be willing to consider firmware updates as a reasonably early troubleshooting step if I'm dealing with otherwise mysterious failures.