2015-09-28
On not having people split their time between different bosses
In some places, it is popular (or at least occasionally done) to say something like 'well, this area only has the money for 1/3rd of a sysadmin and this area has the money for 2/3rds of a sysadmin, so I know, we'll hire one sysadmin and split her time between them'. It is my personal view that this is likely to be a mistake, especially as it is often implemented. There are at least two pathologies you can run into here.
The basic pathology is that humans are frequently terrible at tracking their own time, so it is quite likely that you are not going to wind up with the time split that you intended. Without strong work against it, it's easy to get pulled towards one side because it's more interesting, clearly needs you more, or the like, and then have that side take over a disproportionate amount of your time. Time splitting might go well if your one sysadmin is a senior sysadmin with a lot of practical experience at doing this and a collection of tools and tricks for making it work. If your one sysadmin is a junior sysadmin thrown into the lion's cage with no support, guidance, tools, or monitoring, well, you're probably going to get about the results that you should expect.
The more advanced pathology is that you are putting the sysadmin in the unhappy position of having to tell people no for purely bureaucratic reasons (or to go over and above their theoretical work hours), because sooner or later one of the areas is going to want more work than fits in the X amount of the sysadmin that it is entitled to. At that point the sysadmin is supposed to say 'sorry, I switch over to area Q now; I know that you feel that your work is quite important, maybe more important than area Q's work, but I am not supposed to spend any more time on you until next week'. This is going to make people unhappy with the sysadmin, which is a stressful and unpleasant experience for the sysadmin, and people don't like inflicting those experiences on themselves.
(The actual practical result is likely to be either overwork or that once again the actual time split is not the time split you intended.)
I feel strongly that the lesson of both pathologies is that management, or at least team leadership, should be deeply involved in any such split-sysadmin situation. Management should be the ones saying 'no' to areas (and taking the heat for it), not sysadmins, and management should be monitoring the situation (and providing support and mentoring) to make sure the time actually winds up being split the way it's intended.
(There are structural methods of achieving this, such as having areas 'purchase' X hours of work through budget/chargeback mechanisms, but they have their own overheads such as time tracking software.)
If you like, of course, you can instead blame the sysadmin for doing things wrong or not properly dividing her time or the like. This is the 'human error' explanation of problems and as always it is not likely to give you a solution to the problem. It will give you a great excuse to fire people, though. Maybe that's what you actually want.
2015-09-25
Do we want to continue using a SAN in our future fileserver design?
Yesterday I wrote that we're probably going to need a new Linux iSCSI target for our next generation of fileservers, which I optimistically expect us to begin working on in 2018 (when the current ones will be starting to turn four years old). But as I mentioned in an aside, there's a number of things up in the air here and one of them is the big question of whether we want to keep on using any sort of SAN at all or move to entirely local storage.
We had a number of reasons originally for using an iSCSI SAN, but in practice many of them never got used much. We've made minimal use of failover, we've never expanded a fileserver's storage use beyond the pair of backends that 'belong' to it, and while surviving single backend failures was great (cf), a large part of the reason for those backend failures was that we bought inexpensive hardware. If our current, significantly better generation of hardware survives to 2018 without similar large scale failures, I think there's a real question about whether we should carry on with the model.
I've written about our general tradeoffs of a SAN versus disk servers and they remain applicable. However, two things have changed since writing that last year. The first is that we now have operational experience with a single fileserver that has a pile of disk space and a pile of load on it, and our experience overall is that we wish it was actually two fileservers instead. Even when we aren't running into OmniOS issues, it is the fileserver that is most prone to have problematic load surges and so on, simply because it has so much disk space and activity on it. One thing this has done is change my opinion about how big a disk server we'd want to have; instead of servers as big as our current fileservers with their paired backends, I now think that servers roughly half the size would be a good choice (ie, with 8 pairs of data disks).
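To put 'roughly half the size' in concrete terms: with purely local storage, and assuming the pairs remain ZFS mirrors as in our current setup, such a server's data disks would look something like the following. This is only a sketch; the device names are made up and it's shown as a single pool purely for simplicity.

    # Hypothetical half-sized disk server: 8 mirrored pairs of data
    # disks, 16 data disks in total. Device names are illustrative only.
    zpool create tank \
        mirror disk0 disk1 \
        mirror disk2 disk3 \
        mirror disk4 disk5 \
        mirror disk6 disk7 \
        mirror disk8 disk9 \
        mirror disk10 disk11 \
        mirror disk12 disk13 \
        mirror disk14 disk15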
The second is that I now believe we're going to have a real choice of viable OSes to run ZFS on in 2018, and specifically I expect that to include Linux. If we don't need iSCSI initiator support, need only a moderate number of disks, and are willing to pay somewhat extra for good hardware (as we did this generation by avoiding eSATA), then I think hardware support in our OS of choice is likely to be much less of an issue. Put bluntly, both Linux and FreeBSD should support whatever disk controller hardware we use and it's reasonably likely that OmniOS will as well.
There are unquestionably downsides to moving away from a SAN (as I've covered before). But there are also attractive simplifications, cost savings, and quite possibly performance increases (at least in an all-SSD environment). Moving away from a SAN is in no way a done deal (especially since we like the current environment and it's been quite good for us) and a lot can (and will) change between now and 2018, but the thought is now actively in my mind in a way that it wasn't before.
(Of course, part of this is that occasionally I play around with crazy and heretical what-if thoughts about our architecture and systems. 'What if we didn't use a SAN' is just one iteration of this.)
2015-09-08
How we disable accounts here (and why)
In my opinion, the hardest part of disabling accounts is deciding what 'disabled' means, which can be surprisingly complex. These days, most of the time we're disabling an account as a prelude to removing it entirely, which means that the real purpose of disabling the account is to smoke out anything that would cause people to not want the account to be deleted after all. Thus our goal is to make it look as much as possible as if the account has been deleted without actually deleting anything.
These days, what this means is:
- scrambling their password, so they cannot log in to our Unix
  systems, access their files through our Samba servers, read their
  email via IMAP, and so on. If necessary, this gets 'reverted'
  through our usual password reset process for people who have
  eg forgotten their password.
(Given that Samba has its own password store, it's important for us to actively use passwd to scramble the password instead of just editing /etc/shadow to lock or disable it (cf).)
- making the user's home directory and web pages completely
  inaccessible (we 'chmod 000' both directories). This blocks other
  people's access to files that would be (or will be) deleted when
  the account gets deleted. Explicitly removing access to the
  account's web pages has been important in practice because people
  sometimes forget or miss that deleting an account deletes its web
  pages too.
(I believe this will stop passwordless SSH access through things like authorized keys, but I should actually test that.)
- making the user not exist as far as the mail system is concerned,
which stops both email to them and email through any local
mailing lists they may have.
(This automatically happens when someone's home directory is mode 000, and automatically gets reverted if their home directory becomes accessible again.)
- entirely removing their VPN credentials and DHCP registrations. Both of these can be restored through our
self-service systems, so there's no reason to somehow lock them
instead.
- finding and commenting out any crontab entries they have, and
  stopping any user-run web servers. All of this
  should normally stop anyways because of mode 000 home directories,
  but better safe than sorry.
- setting their Unix shell to a special shell that theoretically prints a message about the situation. We use this more as a convenient marker of disabled accounts than anything else; the scrambled password means that the user can't see the message even if they actually try to log in to our Unix systems (which they may not really do these days).
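Mechanically, most of this boils down to a handful of standard commands. Here is a rough sketch of the sequence for a hypothetical login; the paths, the use of chpasswd and smbpasswd, and the disabled shell's location are illustrative assumptions, not our actual tools and procedures:

    # Sketch only: disabling the hypothetical login 'auser'. Run as root.
    U=auser
    # Scramble the Unix password (we actually use passwd with a random
    # password; chpasswd is just a convenient non-interactive stand-in)
    # and disable the separate Samba password.
    echo "$U:$(head -c 30 /dev/urandom | base64)" | chpasswd
    smbpasswd -d "$U"
    # Make the home directory and web pages inaccessible to everyone.
    chmod 000 "/h/$U" "/web/$U"        # hypothetical paths
    # Save aside and remove any crontab entries (rather than editing them).
    crontab -l -u "$U" >"/root/saved-crontabs/$U" 2>/dev/null && crontab -r -u "$U"
    # Mark the account as disabled with a special shell.
    chsh -s /local/sbin/disabled-shell "$U"    # hypothetical shell path
    # (Removing VPN credentials and DHCP registrations happens in our
    # site-specific systems and isn't shown here.)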
We don't try to find and block access to any files owned by the user outside of their home directory, because we don't normally remove such files when we do delete the account (which is one reason we need filesystem scans to find unowned files).
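The scans themselves are conceptually just a find over each filesystem looking for UIDs and GIDs that no longer exist; a minimal version (with a made-up mount point, and much simpler than what we really run) looks something like this:

    # Report files and directories whose owner or group is no longer in
    # the password or group databases. '/fs/example' is a made-up path.
    find /fs/example -xdev \( -nouser -o -nogroup \) -ls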
If we're disabling an account for some other reason, such as a security compromise, we generally skip making the user's files inaccessible. This also keeps email flowing to them and their mailing lists. In this case we do specifically disable any SSH authorized keys and so on.
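For the authorized keys part of this, one simple approach is to move the file aside so it can be restored later; this is a sketch with a made-up home directory path, not necessarily exactly what we do:

    # Disable passwordless SSH logins for the account without destroying
    # the keys themselves; the path is illustrative.
    mv /h/auser/.ssh/authorized_keys /h/auser/.ssh/authorized_keys.disabled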
Sidebar: Keeping web pages without the rest of the account
This is actually something that people ask for. Our current approach is to leave the Unix login there, scramble the password and so on, and empty out the user's regular home directory and set it to mode 000 (to block email). This leaves the web pages behind and keeps the actual ownership and login for them (which is important because we still use Apache's UserDir stuff for people's web pages).
We haven't yet had a request to keep web pages for someone with CGIs or a user-run web server, so I don't know how we'd deal with that.
2015-09-07
Why we wind up deleting user accounts
In a comment on my entry on optimizing finding unowned files, Paul Tötterman asked a good question:
I'm surprised that you actually remove users instead of just disabling them. Care to expand on that?
At one level, the answer is that we remove users when account sponsors tell us to. How our account management works is that (almost) every user account is sponsored by some professor; if the account's sponsor removes that sponsorship, we delete the account (unless the person can find another sponsor). Professors sponsor their graduate students, of course, but they also sponsor all sorts of other people: postdocs, undergraduate students who are working on projects, visitors, and so on. There's no requirement to withdraw sponsorship of old accounts and it's customary not to, but people can and sometimes do.
(For instance, we have no policy that graduated grad students lose their accounts or have them disabled. Generally they don't and many of them live on for substantial amounts of time.)
But that's not the real answer, because I've glossed over what prompts sponsors to take action. Very few professors bother to regularly go over the accounts they're sponsoring and decide to drop some. Instead there tend to be two underlying causes. The first cause is that the professor wants to reclaim the disk space used by the account because the other option is buying more disk space and they'd rather not. The second cause is that we've noticed some problem with the account (for example, email to it bounces) and the account sponsor decides that removing it is the simplest way for them to resolve the situation. This usually doesn't happen for recent accounts; instead it tends to happen to the accounts of people who left years ago.
(Account sponsors are 'responsible' for accounts that they sponsor and get asked questions about the account if there are problems with it.)
Our current approach to account removal is a multi-stage process, but it does eventually result in the login getting deleted (and sometimes that happens sooner rather than later if the sponsor in question says 'no, really, remove it now').
2015-09-05
Why we aren't tempted to use ACLs on our Unix machines
One of the things our users would really like to have is some easy way to do ad-hoc sharing of files with random collections of people. In theory this is a great fit for ACLs, since ACLs allow users themselves to extend various sorts of permissions to random people. Despite this appeal of ACLs, we have no interest in supporting them on our machines; in fact, we go somewhat out of our way to specifically block any chance that they might be available.
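To illustrate the appeal, here is roughly what ad-hoc sharing looks like with POSIX ACLs on Linux, which is only one of several mutually incompatible ACL flavours; the usernames and path are made up, and again, this is exactly what we deliberately don't support:

    # Let two specific people read and search a directory without touching
    # its group or its normal permissions. Names and path are examples.
    setfacl -m u:alice:rx,u:bob:rx /h/auser/shared-stuff
    getfacl /h/auser/shared-stuff
    # (Sharing the existing files inside needs -R, and covering files
    # created later needs default ACLs via 'setfacl -d -m ...', which is
    # where the complexity starts piling up.)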
The core problem is that in practice today, ACL support is far from universal and not all versions of it behave the same way or are equally capable. What support you get (if any) depends on the OS, the filesystem, and if you're using NFS (as we are), what the NFS fileserver and its filesystem(s) support. As a practical matter, if we start offering ACLs we're pretty much committed to supporting them going forward, and to supporting a version of them that's fully backwards compatible with our initial version; otherwise users will likely get very angry with us for taking away or changing what will have become an important part of how they work.
(The best case on an ACL system change is that people would lose access to things that they should have access to, partly because people notice that right away. The worst case is that some additional people get access to things that they should not.)
Given that neither ACL support nor ACL behavior is anywhere near universal, a need for backwards compatibility is almost certain to limit our choice of future OSes, filesystems, and so on. What if we want to switch the fileservers to FreeBSD, for example, but serving NFS from ZFS on FreeBSD doesn't support the ACL semantics we need? We'd be out of luck and stuck. If we want the most future freedom, we have to stick to the lowest common denominator, and today that is Unix UIDs, GIDs, and basic file permissions.
(This sort of future compatibility is not a universal need, of course. There are any number of environments out there where you build systems for specific needs and when those needs go away you're going to toss the systems. In that setup, ACLs today for one system don't necessarily imply ACLs tomorrow (or the same ACLs tomorrow) for another one.)
2015-09-02
Thinking about the different models of supplying computing
It's the time of year when new graduate students show up here, so one of the things on my mind has been the various ways that computers can be supplied to people in an environment like ours. There are at least three that come to mind.
First is the 'bring your own device' model where every incoming graduate student (or professor) is expected to bring their own computer (probably a laptop) and, as a corollary, to look after it. Perhaps we'd supply some niceties like external screens to hook up to them. The BYOD approach is popular partly because any number of people are going to do this anyways.
Then there is the 'hardware only' model, where we hand a computer to every new graduate student but make no attempt to manage or control it beyond that; the graduate student can run whatever they want in whatever configuration they want. Probably we'd preinstall some OS in a recommended configuration just for convenience (and many grad students would leave it as-is). Lots of people like this model for its freedom and similarity to the BYOD experience (at least until the OS install blows up in their face).
The final model is managed desktops, where we both supply hardware and maintain the OS installed on it. On the one hand, we guarantee that it works right; on the other hand, people lose the freedom to run whatever they want and have to generally live with our choices. 'We don't support that' will probably get said a lot.
(Note that these are not necessarily a good set of options for any environment other than our peculiar one.)
As you might suspect, in practice right now we have a mix of all three options. The historical evolution of our environment is that we started out providing fully managed computing because computing was too expensive for any other answer, but over time the decrease in computing costs (especially compared to staff costs) has caused more and more people to shift towards BYOD and 'here, have a box'.
(I will skip a discussion of trying to do managed Windows installs and just say that we were and are primarily a Unix shop without much expertise in that area. This leads to non-technical issues beyond the scope of this entry.)
I'm mulling this over partly because how computing gets supplied to people has a big impact on what services they're interested in consuming from us (and how). For one obvious example, in the days when we provided serial terminals on everyone's desk, having Unix servers for people to log in to was a big deal and they were very important. Today, an increasing number of people here have probably only used our login servers to change their password.
(Since we're a Computer Science department, you could actually argue that we should actively push people to interact with Unix because Unix is an important part of effective, practical computing and so something they should be learning. But that's another debate entirely.)