2011-07-29
Deciding the meaning of 'disabling' an account (and the value of procedures)
One of the things that the challenge of disabling an account has driven home to me is that it's far from clear just what disabling an account means to different people. I think we can all agree that the user of a disabled account shouldn't be able to use any of your authenticated services any more, but as alluded to in my first entry there are a number of other things where it's not clear what should happen.
For example: should the disabled account still get email? Should their files remain visible and accessible to other users? Should their personal home page remain visible? Should they still be visible in your user directory, and if so should they be marked in some way? If they're the responsible person for local mailing lists, what happens to those?
(I'm separating this from the issue of completely disabling someone's
access to your services even in the face of passwordless ssh access
and the like.)
There's no set right answer to these questions. A lot depends on your specific environment and what generally happens after you disable an account. For example, if disabling an account is often reversed (so you're actually suspending it temporarily), you'll likely want a different set of answers than if disabling an account is almost always a prelude to deleting it entirely. However, you do want to have answers for these questions when you're disabling an account, in part because different answers mean that you do different things on a technical level.
(For example, if you want a user's email to still work you probably can't rename their login, even though that's a great way of disabling any crontabs and at jobs that they have sitting around.)
This brings up the value of either automating the process of disabling accounts or documenting a procedure for it. Doing either of these is going to cause you to confront these questions and come up with answers for them (well, doing either of these thoroughly). Even if the answers are 'it depends, we have to decide on a case-by-case basis', at least you now have written down that you need to ask the questions, come up with an answer, and take certain steps based on what the answer is. What results is both awareness and consistency; you know that these issues exist, you've thought about what the right thing to do is, and you're probably going to do the same thing every time.
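To illustrate, even a skeletal disable-account script forces these questions into the open. This is a sketch; every command, path, and decision in it is a hypothetical placeholder, not a recommendation for your environment:

    #!/bin/sh
    # Skeleton of a disabling script; all specifics are placeholders.
    user="$1"
    passwd -l "$user"                                  # cut off password logins
    chsh -s /usr/local/sbin/suspended-shell "$user"    # block passwordless ssh
    # Policy questions that now have to be answered explicitly:
    # - does their email keep working, or do we start bouncing it?
    # - do their files and home page stay visible to other people?
    # - do they stay (marked somehow?) in the user directory?
    # - who takes over mailing lists they're responsible for?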
You can still get awareness of these issues without having a set procedure for disabling accounts, but now you're at the mercy of remembering them all on the fly. If you forget, your 'answers' to these questions now only happen as side effects of whatever else you're doing to disable the account, and they may or may not be what you actually want to happen. And if you disable accounts in different ways at different times, you can get different outcomes without intending it, eg you thought that locking the password was equivalent to expiring the account but it turns out that they have different side effects.
(This is of course yet another example of where checklists are a good idea.)
2011-07-28
A directory service doesn't make it easy to disable user accounts
A typical reaction on Reddit to my earlier entry on the complexity of disabling accounts is this:
Couldn't this be solved by moving to an LDAP based (or AD for windows environments) login?
Unfortunately, the answer is no; a directory service doesn't make disabling users much easier, not by itself. The problem is inherently complex.
Let's imagine that we have some directory service; it stores user information, including multiple passwords, and it has a 'disabled' flag. What has to pay attention to the 'disabled' flag in an environment like ours?
- a disabled account should fail all password validation; this handles
simple logins, Samba access, IMAP, and things like a VPN.
(The directory service may handle this for you, or you can explicitly invalidate the passwords as well when you set the disabled flag.)
- sshd must refuse to authenticate any disabled account, with or without passwords, even when it's not even running a program through the user's shell (as Dan Astoorian noted in a comment on the last entry).
- crond and atd must ignore crontabs and at jobs for disabled users.
- your MTA must refuse to run programs for disabled users; this handles pipes in the user's .forward et al.
- the web server must refuse to run CGIs for disabled users.
(Really it needs to refuse to run any code for disabled users, including in-server code in languages such as PHP. But let's assume that you don't allow users to do that; all of their code has to run under their UID in one way or another.)
(Because many of these daemons explicitly run things using /bin/sh,
simply changing the user's shell to an invalid shell won't achieve
this.)
- your Unix systems need something that kills the processes of (newly) disabled users; this will handle both current logins and background processes that the user's left lying around.
- your network (including VPNs and wireless authentication systems)
needs something to terminate the sessions of (newly) disabled users.
- if you have authenticated web services that use cookie-based sessions,
they need to invalidate the sessions of now-disabled users.
(You can do this either preemptively or on the fly as you check the session during an HTTP request, but you have to do it.)
- your DHCP server (or its database) needs to track the user associated with a machine and ignore the machine if it was registered by a disabled user. (If you allow users to register machines you probably should allow such machines to be re-registered by another user, which opens up interesting issues.)
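To make the scale of this concrete, each of those daemons and systems would need to grow something like the following hypothetical check. The accountDisabled attribute name, and doing the check with ldapsearch at all, are assumptions for illustration only:

    #!/bin/sh
    # Refuse to do anything for a user whose directory entry is marked
    # disabled. Assumes a default search base configured in ldap.conf.
    user="$1"
    if ldapsearch -x -LLL "(&(uid=$user)(accountDisabled=TRUE))" uid |
        grep -q '^uid:'; then
      echo "refusing to act for disabled user $user" >&2
      exit 1
    fi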
So here's the question: today, how many of your daemons and systems
actually support doing this as they come out of the box? The answer
is almost certainly 'none' or 'very few' (and the example of sshd
shows that support for this may be ad-hoc, inconsistent, and
incomplete). Where they do not have out of the box support for this, you
need to either add it or handle these cases by hand (eg checking for
user crontabs, disabling the user's .forward).
Disabling a user using a directory system is only simple if you do not have services that do things on behalf of the user and the user cannot have lingering activity on your systems. This is the case in some environments (such as Windows desktop environments), but it's often not the case in a Unix environment. Where you do have 'on behalf of' services, they have to know to not do things for disabled users; where you have lingering or ongoing activity, something has to know to terminate it. Today this is generally not an integrated feature in anything (at least on Unix; Windows may have better integration for this with AD).
Some but not all of these cases get easier if you can hide (or delete) the disabled user's entry in your directory service. But it isn't a complete solution and it has (probably) undesirable side effects, and the Limoncelli test specifically talked about disabling a user, not deleting them (and hiding a user's entry is much closer to deleting them than simply disabling them).
2011-07-27
Why not YP, er, NIS
A commentator on the last entry asked:
Any particular reason you don't like NIS and/or LDAP?
The answer for NIS is relatively easy. Shorn of various bits and pieces, NIS is just a file distribution mechanism. Well, we have one of those, and ours is simpler, far more flexible, more powerful, and much more transparent and thus easier to understand and reason about. There is nothing particularly unique about our mechanism; these days there are a great many ways to distribute files around (and then do things on the remote end).
(Many of these ways are better than what we have.)
The only advantage NIS has in a modern environment is that things can update slightly faster. In exchange you have to live with a pile of complexity, fragility, and opaqueness. This tradeoff is almost never worth it.
NIS itself is a creation of an era when almost none of this was true.
Back in those days there were no good tools for file replication,
networks were drastically slower, central servers were so wimpy
that distributing files to a bunch of clients at once would do bad
things, and things like /etc/passwd and all of the other files were
sufficiently large (especially for decent sized sites) that you simply
did not want them sitting on every machine's disk chewing up space (and
making various lookups in the files take longer). But that era is long
gone, and NIS should have gone with it.
(I assume that NIS lives on because it is the canned solution for file 'replication' for various important system files.)
2011-07-26
Disabling an account can be kind of complex
Question 31 on Tom Limoncelli's sysadmin test is:
G.31: Can a user's account be disabled on all systems in 1 hour?
This may be more complex than you think. I'll use our local environment as an illustration.
We have a central password distribution system.
It'd take me about thirty seconds to edit the master /etc/shadow
in order to lock someone's password in it (and then a couple of
minutes for this to get propagated around). Some people would
confidently stop here, and for many situations and many users it's
actually good enough.
(I know that there's some Linux command that will lock the password for
me, but it would honestly take me longer to figure out what it was and
look up the option I need than it does to edit /etc/shadow directly.)
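For reference, on a typical Linux machine the commands are along these lines:

    passwd -l someuser     # lock: prefixes the hash in /etc/shadow with '!'
    passwd -u someuser     # and unlock again
    usermod -L someuser    # usermod's version of locking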
But if a user has some form of passwordless ssh set up, they can still log in even with a locked password. So I'd better remember to set the user's shell to an administrative shell that just prints a 'this account is suspended' message. (Do you have such a shell ready to go?)
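A minimal sketch of such a shell (a real one probably wants logging and a local contact address):

    #!/bin/sh
    # Deliberately dumb 'account suspended' shell: ignore all arguments
    # (including ssh's '-c command' invocation) and just report.
    echo "This account is currently suspended." >&2
    echo "Please contact your system administrators." >&2
    exit 1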
But wait. Samba doesn't use the Unix password file; it has its own
password file and we have magic set up to update the Samba password
store at the same time as we update the normal Unix passwords. Simply
editing the Unix /etc/shadow does nothing at all to the Samba password
file, so a user with a locked password could still get access through
Samba (some of our users might not even notice that they couldn't log
in to our Unix machines). The simplest fix for this is to not lock the
user's password in /etc/shadow but to use passwd to scramble it to a
random value so that the Samba password store is updated too.
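For what it's worth, in a stock Samba setup (without our local magic) the direct way to deal with the Samba password store is to disable the account there as well:

    smbpasswd -d someuser    # mark the account disabled in Samba's store
    smbpasswd -e someuser    # later, re-enable it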
However, this isn't good enough, since there are a number of ways for
users to run commands without actually logging in. Leaving these intact
probably doesn't count as really disabling the account, especially if
you're concerned that the user might be malicious and so might have
left themselves back doors. In our environment, a user could have a
crontab file or at jobs on any number of machines, they could have a
pipe to a command in their .forward (or in a simple mailing list that they own), and they could even have
CGIs on our web server (or worse, an entire user-managed web server). They might also have running processes on
any number of our machines. To fix this I'd have to check every machine
and decommission or kill anything that I found.
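That sweep might look something like the following sketch, assuming root ssh access everywhere and some list of machines. All of the specifics are placeholders, and a real version wants much more care and error handling:

    #!/bin/sh
    user="$1"
    for host in $(cat /our/list/of/machines); do
      ssh "$host" "
        crontab -r -u '$user' 2>/dev/null   # remove any crontab
        pkill -u '$user'                    # kill lingering processes
        # at jobs, .forward pipes, and CGIs all need their own passes
      "
    done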
After all of this, I've managed to thoroughly disable a user's account on our Unix systems. But users have additional access in our environment in several ways, and if we're being thorough I need to get rid of them as well. First off, users may have VPN accounts; these are completely separate from their regular Unix account (with a different password). A disabled user should probably lose VPN access, so I'd better remember to edit that password file too.
Next, users may have registered specific machines on our general access network for laptops and other user-run machines. We probably don't want a disabled user to be able to get any network access through our networks, so I'd have to look up all of their machines in the DHCP database and remove or disable the entries.
(Can you find all of the machines registered to a user and remove their access? What about your wireless environment; how much access does the user have by still knowing your SSID and WEP or WPA wireless key?)
I could probably do all of this in an hour (ignoring delays in how long some things take to propagate around). But I'd certainly want to think about it carefully to make sure that I wasn't missing something in the corner, and if I was doing this in a hurry and under pressure I'd probably have missed something.
(Reversibly suspending an account is harder, assuming that you need to
be this thorough. Deleting an account entirely instead of just disabling
it is probably easier, because you don't have to worry about a lot of
these things if you remove the user's /etc/passwd entry and delete
their files.)
It's also worth noting that the idea of 'disabling' an account is imprecise. For example, should a disabled account still be able to receive email, or should we start bouncing email to it (and if so, should we use a special error message)? Should a disabled account's personal web page still be visible? The answers may depend on just why an account is being disabled and in any case these are policy issues, not technical ones.
(However, they interact with the technical issues. If a disabled account
should continue to receive email, what do you do if the account's normal
mail handling involves a piped program in the user's .forward?)
2011-07-20
Why I would like my mailer to have a real programming language (part 2)
In illustrated form, to go with the previous explanation of this.
The actual configuration change that I just made, amounting to part of one line:
< require_files = $local_part:$home/.forward
> require_files = <; $local_part;$home/.forward;${if !IS_SPAM {!$home/.forward-nonspam}}
(The '<' line is the original and the '>' line is the changed version.)
The amount of added comments necessary to explain this configuration: 17 lines, not counting an entire previous entry (with an antecedent for background). Part of this is to explain the logic and what is going on, and part of this is to explain a necessary workaround because of interactions due to how Exim has chosen to do various sorts of string expansions.
(There are three separate sorts of string interpretation going on in this one line. It's fun.)
Don't ask how long this small change took to develop and test, despite the logic being simple and easily expressed when written down in plain language.
Sidebar: the levels of string interpretation here
Because someone someday may ask, here are the three levels that Exim is doing:
- a purely textual, configuration file level macro substitution that expands IS_SPAM into an Exim string expansion condition.
- splitting require_files on the list separator boundaries, either ':' (original line) or ';' (changed line).
- string expanding the ${if ...} clause.
The separator has to change because (wait for it) IS_SPAM expands to
something that has :'s in it. This fooled me during debugging for some
time, because the pre-macro-substitution version does not have any :'s
so it looks safe from step 2.
A decently designed programming language would be a lot cleaner here. Unfortunately, Exim is probably trying to avoid being a Lisp instead.
2011-07-08
My view on iSCSI performance troubleshooting
I've been asked about troubleshooting iSCSI performance issues (ie, slow IO over iSCSI) a few times now, so here are my views. Now I need a disclaimer: this is somewhat theoretical, as we haven't really had to troubleshoot iSCSI performance issues yet in our environment (and to the extent that we've had slow filesystem IO, we haven't nailed down a clear cause among all of the moving parts of our setup).
As always, the first thing you need to do is nail down just what IO is slow and when; the big questions are generally random IO vs sequential IO and reads vs writes (and also latency vs bandwidth). The easy situation is that the slow IO happens all of the time and you can reproduce it on demand. The difficult situation, well, that can get quite tricky; sometimes a large part of the challenge is trying to figure out just why your system is slow and what's going on when it is.
(Also, one should not forget that the filesystem may be doing things that look like IO performance problems.)
Past that, my general rule of iSCSI troubleshooting is to remember that iSCSI is a bunch of disks coupled to a network (specifically, to a TCP stream). This means that before looking at iSCSI itself, you want to look at each of the parts separately to see if they're working fine and delivering the performance you expect.
(Doing so is much easier if you have full access to the targets and can run general programs on them. Closed target appliances make this much more challenging.)
Modern servers and gigabit switches and so on should reliably deliver gigabit wire speeds in both directions at once with basically zero packet loss (and negligible latency, at least for LANs). If your initiator and target cannot do this over your iSCSI network fabric, you have a network performance problem that you need to fix. Note that you do not need jumbo frames to saturate a gigabit network (even with iSCSI, cf) and I think that you should turn them off to make life simpler. There are lots of programs to measure all sorts of aspects of network performance, but for streaming TCP I just use ttcp.
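For what it's worth, classic ttcp usage looks like this (option letters vary somewhat among the various ttcp descendants, so check yours):

    ttcp -r -s                # on the receiver: sink incoming data
    ttcp -t -s otherhost      # on the transmitter: source pattern data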
(If you are working with 10G networking you absolutely want to do your own networking performance measurements and performance tuning before you even start looking at the iSCSI layer. But you did all of that before you decided to spend all of that money on 10G networking hardware, right?)
Modern disk systems vary tremendously based on what technology you're using, so there is no real substitute for measuring your system. My rule of thumb is that a modern SATA drive will do 60 to 100 Mbytes/sec of streaming IO (read or write doesn't seem to make much of a difference) and somewhat over 100 seeks a second, but drives that have quietly gone bad can perform much worse, IO to multiple drives at once may slow this down, RAID implementations can be slower than you think, and so on. When checking this stuff I prefer to start out measuring things as close to the raw hardware as possible and then move up the target's software stack if I think there's a need.
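The crudest useful streaming check is dd against the raw device, as close to the hardware as you can get. The device name here is a placeholder, and be very careful about which argument is which:

    # stream roughly 4 GBytes straight off the disk, bypassing the filesystem
    dd if=/dev/sdb of=/dev/null bs=1M count=4096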
Once you've verified that the disks and the network are both fine, it's time to move up to iSCSI. Unfortunately this is where I start waving my hands vaguely through lack of hard experience; I can only suggest obvious things like getting network traces to see how long various iSCSI operations are taking. Given what I've read about iSCSI and its tuning parameters, my current opinion is that iSCSI tuning parameters aren't likely to make a significant performance difference in normal circumstances unless they're catastrophically mis-set.
(In an ideal world your iSCSI initiator and target would both support iSCSI-specific performance statistics, either built in or through tracing hooks. I'm not holding my breath on that.)
At least some iSCSI target software can create dummy targets, ones
that have no physical disk behind them. Such targets can be useful for
testing the iSCSI protocol overhead introduced by the initiator and the
target, although you need a test environment for this. Sometimes you
can put real test data on the dummy target, and sometimes it's just the
iSCSI equivalent of /dev/null and /dev/zero combined together; the
former is obviously more useful.
(In theory the iSCSI target software could have some mismatch between it and the real disk drivers that introduces extra overhead only when you're talking to real disks. Testing this may need some sort of ultra-fast disk such as an SSD.)
On a side note, one thing that may be different between iSCSI and local disk IO (and between filesystems and raw iSCSI IO) is the presence of write barriers. If you see fast local writes and slow remote writes, this is one possible cause of the difference. Since there are quite a lot of moving parts involved in generating real write barriers over iSCSI, it's possible for software updates to suddenly cause them to be generated (or not be generated, but that's harder to notice).
2011-07-07
An interesting gotcha with Exim and .forward processing
Yesterday I described how Exim implements traditional .forward semantics where putting your own address in your .forward means 'deliver it to me, bypassing my .forward'. Because Exim is a mailer construction kit, this isn't a specific feature for .forward handling; it's a generic feature that happens to give you this result.
So far, so good. Now, let's talk about our .forward-nonspam feature. In the abstract, this is just another .forward-style router that reads a different file and only triggers under some conditions. In concrete terms, we need several routers in sequence, each of them doing one step of the processing logic:
- if .forward-nonspam exists and the message is not spam, expand .forward-nonspam
- if the message is spam, .forward-nonspam exists, and .forward does not exist, discard the message
- if .forward exists, expand .forward
If you have both a .forward-nonspam and a .forward, the third rule will only be triggered for spam messages because your .forward-nonspam skims off non-spam messages first.
Well. Mostly. You see, although all three of these routers are conceptually a single block of .forward processing, Exim doesn't know this; as far as Exim is concerned, they are three separate and completely unrelated routers. Now suppose you put your own address into .forward-nonspam and also have a .forward, as you might do to create a simple 'put all non-spam email into my regular inbox and all spam mail into a file' system, and you get a non-spam message. Exim processes things until it reaches the first router, expands your .forward-nonspam, gets your address and restarts routing it, gets to the first router again, sees that the router has already handled this address, and only skips that router, not all three .forward-processing routers. So your address falls through to the third router, which says 'sure, you have a .forward, I'll handle this' and dumps the non-spam message into the file for spam email.
Oops.
The fix for this is to split the third router into two routers, one for the case where you do have a .forward-nonspam (where it would only handle messages that are explicitly spam-tagged) and a second one for the case where you have no .forward-nonspam (where it would handle everything). However, this requires an annoying level of repetition in the Exim configuration file.
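In Exim router terms, that split looks something like this sketch. This is not our actual configuration; the router names and options like check_local_user are illustrative, and IS_SPAM is the macro from the 2011-07-20 entry above, which expands to an Exim condition. You can see the repetition directly:

    # router 3a: users with a .forward-nonspam only get .forward
    # processing for messages that are tagged as spam
    forward_spam_case:
      driver = redirect
      check_local_user
      require_files = $local_part:$home/.forward:$home/.forward-nonspam
      condition = ${if IS_SPAM}
      file = $home/.forward

    # router 3b: users without a .forward-nonspam get plain .forward
    # processing for everything
    forward_plain_case:
      driver = redirect
      check_local_user
      require_files = $local_part:$home/.forward:!$home/.forward-nonspam
      file = $home/.forward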
(For technical reasons I think that you can't combine this into a single condition on a single router that works exactly right.)
Sidebar: the technical reasons
The condition you need is 'if .forward exists and either
.forward-nonspam doesn't exist or the message is non-spam'. Exim has
special support for securely and correctly checking for file existence
over NFS, but this support is only available in the require_files
router condition. However, we need to use a condition check with a
'${if ...}' string expansion to check 'is non-spam'. You can't 'or'
together separate router conditions (they are all implicitly 'and'd
together instead), and the does-file-exist check that's available in
a ${if ...} expansion doesn't work the right way over NFS.
In theory you could get around this with various evil hacks involving Exim string expansion, maybe.
(Talking to myself: one could rephrase the condition as 'if .forward
exists and, if the message is non-spam, .forward-nonspam doesn't exist'
and then write this as a single require_files condition with a
conditional string expansion in it.)
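Concretely, that works out to the one-line require_files change shown in the 2011-07-20 entry above:

    require_files = <; $local_part;$home/.forward;${if !IS_SPAM {!$home/.forward-nonspam}}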
How Exim makes traditional .forward semantics work
Traditional .forward semantics allow you to put your own address in your .forward; this means 'deliver to me, bypassing my .forward'. As a mailer construction kit, Exim doesn't have any specific support for handling .forwards; it has some generic features that you can build .forward handling out of. As a consequence of this, it doesn't have any specific handling for this odd bit of .forward semantics and instead supports it in a generic way. I've mentioned this before in an entry on the power of Exim routers but I just pointed to the official Exim documentation for details, and the official documentation is a little bit opaque.
Each message that Exim handles starts out with some number of top level addresses, each of which is routed separately. In the process of doing this, individual routers may replace the current address with one or more new addresses (through, for example, expanding a .forward). Exim then normally tries to recursively route these new addresses just as if they were top level addresses, although it keeps track of the fact that they are 'children' of some address.
(With aliases and simple mailing lists and .forwards that forward mail to people who also have .forwards, you can have a many level chain of descendant addresses that were created from a single top level address.)
When Exim is doing this recursive routing for a particular top level
address, it remembers which routers have already handled which
addresses. Then if the address currently being routed is the same as
one of its ancestor addresses and the ancestor address has already been
processed by a particular router, Exim skips that router, acting as
if the router was inapplicable to the address or wasn't there at all
(instead of having the router re-process an address that it has already
processed once); processing the address will fall through to the next
router (or routers). In a typical Exim configuration, what's next after
the router that handles .forwards is the router that sends people's mail
to /var/mail/<user>.
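For reference, in Exim's stock configuration that pair of routers looks roughly like this (trimmed of several options):

    userforward:
      driver = redirect
      check_local_user
      file = $home/.forward
      no_verify

    localuser:
      driver = accept
      check_local_user
      transport = local_delivery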
This skipping of routers has to happen separately for each top level
destination. If an email message is sent both directly to cks and to
sysadmins, an alias that cks is on, you don't want cks to have
one copy of the message handled by his .forward and another copy wind
up in /var/mail/cks. Also, this skipping of routers is completely
separate from how Exim merges several copies of the same destination
together and does only a single delivery to each unique destination (so
that in this case cks's .forward will handle only one copy of the
message).
(In fact the check has to be separate for each chain of address expansion. We need to be sure that this skipping is only triggered for genuinely recursive addresses and routers.)
In theory this skipping of routers applies to any type of router. In practice only a few of Exim's various types of routers can replace addresses with new addresses and so can possibly trigger this; most of the routers simply give destinations for addresses. At the same time, nothing restricts this to only happening to your router for .forwards; for example, an accidental alias loop will cause the alias handling router to be skipped in a similar way and the results there could be a lot more odd and peculiar (I suspect that one common result would be a 'no such user' error in addition to the message getting delivered to everyone on the alias).
One corollary of all of this is that it's potentially dangerous to create an address-expanding router that returns different results depending on stuff that can change during address routing; for example, a router that returns a different expansion based on the envelope sender address. Such a router won't get invoked a second time on the same address in a recursive situation, even if it would have returned a different, non-looping result. In its loop-breaking behavior, Exim implicitly assumes that every router returns the same thing when recursively invoked on the same address.
(Exim does not literally memoize the result of evaluating the router for a given address, although it does cache and memoize the result of a lot of lookups that routers do.)
Sidebar: one way to get an alias loop
Suppose that you have a generic group alias, and a member of the group is going to be away. They think 'I know, I'll forward my email to the generic group alias to make sure that things get handled even if people email me directly'. The pernicious thing is that this appears to work if they test by mailing themselves, because then it's a .forward loop; the incoming mail goes .forward → alias → .forward, the .forward is skipped the second time around, and it all looks good. Only when the group alias is emailed directly does it become an alias loop (going alias → .forward → alias). Pick a rarely used group alias and it could be a while before this blows up.
PS: if you want to catch this in an Exim configuration, I think what you want is a second router that applies to all aliases and just errors them out with 'alias loop detected'. Assuming that both routers accept the exact same set of addresses in the same situations, the only time this second alias-handling router can trigger is if the first one is skipped for some reason, and generally the only way that that can happen is in the situation above, ie there's a loop.
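A sketch of what that second router might look like, with a conventional lsearch-based alias router in mind; the condition here is a placeholder for however your real alias router actually matches addresses:

    aliases_loop_check:
      driver = redirect
      # must accept exactly the same addresses as the real alias router;
      # this condition stands in for that router's matching logic
      condition = ${if !eq {${lookup{$local_part}lsearch{/etc/aliases}}}{}}
      allow_fail
      data = :fail: alias loop detected for $local_part@$domain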
(Disclaimer: I just came up with this idea and haven't actually tested it.)