2012-03-31
Our sysadmin environment
I've said before that we don't have anything like a traditional ops or 'devops' environment here. This raises an obvious question: what is our work like?
The simple way to describe it is that my group is traditional university sysadmins (well, for departmental computing). This means that our job is to provide an environment with various services, things like network connectivity, reasonably decent file storage, printing, email, some machines that people can log into to do various things (including run big computational jobs), and a reasonably flexible web server that people can put stuff on. But that's pretty much it.
(What's in this central environment is mostly set by the department's computing committee, mostly based on what's easy and cheap to do and of sufficiently broad use to the department.)
If a graduate student wants something that we don't already provide, it's usually up to them (and their Point of Contact) to set it up, and if they have a program or a web service or whatever that is falling over or not performing well enough, it's entirely up to them to troubleshoot it. If a graduate student wants something that requires root permissions or a new system daemon or whatever, something they cannot set up as an ordinary Unix user, the answer is generally 'no, we can't provide that'. The escape hatch from this limited functionality is that people can put their own machines on the network and obviously do more or less whatever they want on them.
(We're willing to install standard Ubuntu packages for people, but we won't run new system daemons. So we'll install the PostgreSQL packages but not run a system PostgreSQL instance.)
This means that we do only one part of the traditional 'ops' jobs, and we do it in a very hands-off way. We don't get handed software to deploy, we don't get asked to set up various packages (eg, 'this thing needs an Oracle database, set one up for us'), and we don't even get new machines to install for people. We do have some custom software and systems that are part of our environment, but they're all stuff we've written and have full control over; we own them from start to finish, plus they're only providing internal services.
Even in a university, not all systems are run this way. There are plenty of important public-facing systems and services around here that are run in a much more conventional industry-like 'ops' way, with code and site deployments, the sysadmins being responsible for the service staying up, and so on. (Some of them lead to very interesting war stories and challenges.)
(Points of Contact probably have a more industry-like 'ops' job, but even there my impression is that graduate students usually retain a lot of responsibility for deploying their own software and keeping it running.)
2012-03-29
Scalable system management is based on principles
Here is something that I strongly believe:
Scalable system management is based on principles, not software.
If you get the important ideas that underlie scalable system management (even if this is just a gut-level understanding that you couldn't clearly articulate), the software ultimately doesn't matter and isn't necessary; you can build your own if and when you need it (although there are good reasons to use standard software). Conversely, if you do not understand the principles, all of the best practices software in the world will not necessarily help you. There is very little software that will actively prevent you from managing systems in ways that turn out to be a bad idea.
(This is a variant of the old aphorism that you can write Fortran in any language.)
Or to put it directly: you do not get scalable system management just by using Cfengine, Puppet, Chef, or today's hotness. You get it by understanding what you need for scalable system management and then using whatever tools are necessary.
(Thinking otherwise is a cargo cult approach, where you believe that you can get the same results just by going through the same motions with enough fidelity to the originals.)
What makes this especially unfortunate in my view is that the actual principles of scalable system management are generally not really set out anywhere. Most people are left to fumble towards an understanding through (painful) experience or to pick things up through osmosis from documentation for good management software and how people use it. Perhaps it doesn't help that the standard style of system management writeups is generally long on descriptions of tools but short on the why of it all.
(Also, perhaps part of the issue is that people who've reached the point where they can write these things up wind up feeling that the principles of scalable system management are so obvious they don't really need to be mentioned.)
(It's my suspicion that the same thing is true of scalable software deployment and probably other things in the modern 'devops' arsenal, although I'm theorizing without actual experience here.)
Sidebar: some definitions
By system management I mean, well, managing systems; installing them and maintaining them and keeping them running. By scalable system management I mean a way of doing this where you can scale up the number of systems you run without having to scale up how many people you have managing them.
(System management is part of what gets called 'operations', but not all of it.)
2012-03-28
How I (once) did change management with scripts
When I read Philip Hollenback's latest entry and it mentioned someone doing (change/system) management through shell scripts (instead of, say, Puppet), my first thought was 'hey, I've done that'. So I might as well write up how I did it, either for someone to use or in case people want to marvel at the crazy person.
(Now, a disclaimer: by now this was more than half a decade ago, and some of my memories of the fine details have undoubtedly faded (ie, are now wrong).)
The basic environment this happened in was a lab environment with (at its height) on the order of a hundred essentially identical PC machines running Linux (this is the same environment where we needed efficient update distribution). Most of the system management was handled through packages and automatic package updates, but every so often there was something that was best handled in a shell script.
Each separate change was a separate little shell script, all of which lived in a common directory (actually one directory for each OS release). Script filenames started with a sequence number (eg they had names like '01-fix-something'), and scripts were run in sequence. The driver system kept track of which scripts had already succeeded and did not re-run them; a script that exited with a failed status would be retried the next time the driver system ran. The driver system ran once a day or (I believe) immediately after system boot, and processed scripts after applying package updates. Scripts were expected to check if they were applicable before doing anything and exit if they weren't (with status 0 if they were definitely not applicable to this system or with status 1 if they should be retried the next time).
(If I were doing this again, I think I would make the driver script not run further scripts if an earlier one failed. In our case all of the scripts were basically independent, so it didn't matter.)
There was no mechanism to rerun a script if it changed; if I changed a script and wanted to have it rerun, I needed to give it a new sequence number. If a script became unnecessary for some reason, it was just removed.
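As a rough illustration, a minimal driver in this style might look something like the following sketch; the script directory and state directory paths here are invented for the example, not what the real system used.

    #!/bin/sh
    # Sketch of the driver: run numbered change scripts in order,
    # remembering which ones have already succeeded on this machine.
    SCRIPTDIR=/local/adm/changes        # hypothetical home of the change scripts
    STATEDIR=/var/local/changes-done    # hypothetical per-machine success markers

    mkdir -p "$STATEDIR"
    for script in "$SCRIPTDIR"/[0-9][0-9]-*; do
        [ -x "$script" ] || continue
        name=$(basename "$script")
        # Skip scripts that have already succeeded here.
        [ -e "$STATEDIR/$name" ] && continue
        if "$script"; then
            # Exit status 0 (done, or definitely not applicable): never run again.
            touch "$STATEDIR/$name"
        fi
        # A non-zero exit leaves no marker, so the script is retried next run.
    done

The real driver was also hooked into the daily package update run and system boot, but the core loop was about this simple.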
All of this is actually quite short and simple to implement, and it worked quite well within its modest goals. Scripts were not particularly difficult to write, they were executed for you automatically, all machines were kept in sync, and a newly (re)installed machine would automatically pick up all of the current customizations. These days, you would put the entire directory of scripts into a VCS (and you might distribute it by having the workstations check out a copy from the central repo).
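A sketch of that VCS-based distribution (assuming git, with hypothetical paths and 'run-changes' standing in for the driver sketched above) could be as small as a daily cron job on each workstation:

    #!/bin/sh
    # Hypothetical /etc/cron.daily/apply-changes on each workstation.
    cd /local/adm/changes || exit 1
    git pull -q                   # refresh the change scripts from the central repo
    /local/adm/bin/run-changes    # then run the driver sketched above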
2012-03-23
Sometimes you get lucky
We had a power failure today in the building with our main machine room (and thus all of our core servers). When we realized what was going on and got to the machine room, we made an extremely unpleasant discovery; as far as we can tell, the automatic transfer switches in front of our UPSes, well, didn't transfer. Instead they all entered some sort of faulted state where they provided no power.
(The UPSes all at least claimed to have good battery charge and to not have run down, although it's possible they were lying and they had all died by the time we got there despite appearing healthy.)
This was very bad. The automatic transfer switches and UPSes are our primary defense against ZFS metadata corruption during power failures; with them non-functional, we were completely exposed. When the power returned, we restarted the fileservers one by one and held our breath as each ZFS pool and its filesystems came up. In the end, all of our pools survived.
(We expect to find a number of repairable checksum errors when we scrub all pools this weekend.)
I have no great lesson here beyond what's in the title: sometimes you get lucky. Good luck happens just as much as bad luck, and it's what we had today.
(If you want a small lesson, I think it's that testing what will happen in a real power failure may be surprisingly hard. Real power failures seem to involve all sorts of things happening to the line power that are not necessarily very much like what happens when you pull a power cord out or flip a PDU master power switch.)
2012-03-22
The problems of operations and sysadmin heroism
Back in DevOps and the blame problem I noted that operations has a problem getting praised, because people generally feel that the computers should just work. This leads to what I call the heroism problem for ops.
In practice, ops can easily get praised in exactly one situation: when it's clear to everyone that something exceptional is going on, that it is not just business as usual. In short, you get praised if you (visibly) fix a panic situation, and the more exceptional your efforts to fix the panic situation the better. Put together a solution with chewing gum and baling wire after staying up all night? All the better.
Everyone can probably see the perverse incentives that this creates. If you are rewarded for cleaning up after floods but not recognized for building flood prevention, pretty soon you start losing enthusiasm for trying to argue your bosses into funding that flood prevention. And in a real way this is a lot like the devops blame problem; when you reward some things and penalize others, you have told operations what your priorities are whether you like it or not.
But it gets worse, because here's the thing: this heroism is often attractive. Not just attractive because you're rewarded for it; intrinsically attractive. Heroism means that you get to make a difference in a challenging situation, one that stretches you and calls on all of your ingenuity and cleverness. It is troubleshooting writ large. We can all see that opportunities for heroism are the seeds of great stories, not stories of disasters but stories of triumphs against the odds. Who doesn't want to be part of that?
(This is especially the case if your routine job is not challenging, exciting, or even very interesting.)
Heroism is also corrosive in the long term. It is directly corrosive to lives; it is a young person's game. It is corrosive to engagement. If you have constant opportunities for heroism, people will burn out because very few people can be adrenalized all of the time; if you mostly don't have opportunities for heroism yet heroism is the only really rewarding thing about the job, people are going to check out. And, I think, it is corrosive to your ops culture. When heroism is the rewarding thing, you are implicitly creating a group of troubleshooters instead of anything else (it's certainly what you're encouraging people to get good at). Troubleshooters are certainly useful, but a well rounded ops environment needs more than that.
2012-03-14
Configuration management is not documentation, at least not of intentions
In this Sysadvent paean to configuration management I read the following little bit:
[...] In using a config management system, you are implicitly documenting the system's "desired state" - Why is the system configured this way? [...]
No you aren't. If you use very well named configuration management classes and variables and so on, you may have at best started to document the why of your configuration. Otherwise, configuration management documents the how of your configuration but it can only lightly touch on the much bigger picture of why.
(Here I want to distinguish between a CM configuration itself and any comments that you add to the CM configuration. Using a CM system doesn't require writing comments, and writing documentation on the why's of a configuration doesn't require putting that documentation into a CM configuration file as comments.)
Documentation on why needs to cover two aspects of the system, neither of which CM captures. The first aspect is why the system exists at all: what the high-level picture is of what the system and the services running on it do and how they interrelate with other machines and services. Your CM system can tell you that this system runs Apache, but there's nothing in the CM configuration itself that will necessarily tell you why. The second aspect is why this system is set up the way it is, things like why you chose a particular daemon and why its configuration is the way it is. There may be vitally important information buried in these decisions, for example the painfully acquired knowledge that on machines with X memory you cannot set Apache parameter Y larger than value Z, but a CM configuration is again silent on all of that.
(And there are also things like why you didn't use some attractive setting.)
There is also a meta-issue, which is that a CM configuration is usually an incomplete specification of the real system. Using a CM system is all about telling it what you do to a target system, ie more or less what you change on it. If you don't need to change something, if the system comes set up for it correctly from the start, it's quite likely that this knowledge will not be in your CM configuration. This is great for compact CM setups, except that once again it means that your CM configuration is missing important information about the system.
(You can use a CM system to redundantly specify everything about a system's configuration, carefully telling it to do things like enable all of the Apache modules that you need even though they're all already enabled in the default install. But I really suspect that most people writing CM configurations are not that bloody-minded and determined; instead they specify the changes and additions that they needed to make to the base system to get things working.)
2012-03-12
Why it matters whether your software works when virtualized
It's not really a secret that I love doing test installs of machines and software in virtualized machines instead of on physical ones. It's generally a lot faster, it's certainly a lot more convenient to do this work from my quiet desk instead of in a noisy lab area or machine room, and I can often snapshot images and roll back to snapshots in order to skip doing tedious rote stages over and over again. However, every so often I run into something that doesn't work in my virtualization environment; the most recent example is a couple of Solaris 10 Java patches. When this happens I get unhappy.
Generally, if something works on a virtualized machine I can assume that it works on real hardware (modulo issues with hardware drivers). When something doesn't work virtualized, maybe it will work on real hardware and maybe it won't (one would like to think that Oracle wouldn't release patches that hang on real hardware, but I'm not quite that trusting). So why not just test it on real hardware? The problem is what happens if whatever it is actually doesn't work even on real hardware. On a virtual machine, recovery is simple; it's a snapshot rollback (if I took one). On a real machine there's often no such thing, so a failure may mean a from-scratch reinstall (losing all of my work to date, or at least forcing me to recreate it). Reinstalls on real hardware are annoying in all sorts of ways and, more than that, they're a waste of my limited time.
So, let me put it compactly: things working when virtualized gives sysadmins confidence that your software will work on real hardware; not working virtualized takes away some of that confidence. The more that people use virtualization, the more important this is. Yes, even if you only officially recommend or support running on real hardware.
(Even if you don't officially support it, the benefits of doing testing in virtualized environments are so great that many sysadmins will try it anyways. If you break this, they will get grumpy. If you actively sabotage working virtualized, well.)
2012-03-03
Two ways I increase the security of SSH personal keys
It's time for me to toss some pennies into the pond of advice about good ways to use SSH securely and conveniently. However, I first need to point to my general remarks about SSH personal keys; the following things I do are a tradeoff. They work for me but they may not be right for you.
Here are two things that I do with SSH personal keys (aka SSH identities) that I don't think are always done. Both of them are useful for increasing security.
First, I have a different SSH identity for each machine that I run SSH from. This means one key for my home machine, another key for my office workstation, a third key on my office laptop, and so on. I never reuse the same identity on different machines, even if it would be convenient and I consider the machines equivalently secure (and give the machines the same access permissions for my other accounts).
(The semi-exception to this is that my account on our login and compute servers has a SSH identity for itself. Because my home directory is NFS mounted on all of them, the identity is shared across all of them.)
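Mechanically this is simple; a sketch of setting it up (with made-up key file names and comments) is just generating a separate key on each machine and pointing ssh at it:

    # On my home machine (the key name and comment are made up for illustration):
    ssh-keygen -t rsa -f ~/.ssh/id_rsa-home -C 'home machine key'

    # Then in ~/.ssh/config on that machine:
    Host *
        IdentityFile ~/.ssh/id_rsa-home

Each other machine gets its own key generated the same way under its own name.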
Second, and importantly, I use the from="..." feature in .ssh/authorized_keys to restrict what sources a particular identity will be accepted from. Each identity is restricted to the known, likely, or plausible origin IP(s) for the machine that it's associated with. For my main keys (on my home and office machines), this is a very small list (both machines have static IPs). The laptop key is restricted to the plausible wired and wireless connections that I use it with. Doing this drastically limits the damage of key disclosure, to the point where I could practically publish the secret key for one of my core identities here on the blog without it doing you any good (and not because the machines it works on are firewalled from the Internet, either).
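To illustrate (with entirely made-up addresses and an elided key body), an entry in .ssh/authorized_keys then looks something like:

    from="192.0.2.10,198.51.100.0/24" ssh-rsa AAAA...rest-of-key... home machine key

The from= option takes a comma-separated list of patterns, and OpenSSH accepts both hostname wildcards and CIDR-style address/netmask patterns, so even a laptop's plausible networks usually fit in a short list.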
(Since our systems accept password authentication over ssh, I have a fallback if I really need it.)
There are many keys that can have their sources limited this way, even if the limits are relatively broad. In many cases there's simply no need to accept a key from everywhere, and not infrequently you can be very restrictive. Excellent candidates for this are all of the keys that you use to allow automated access for scripts, backups, and similar things.
(We have a number of identities that we use to allow access this way, some of them to fairly dangerous accounts and systems. All of them have strong origin restrictions. Occasionally this is inconvenient, such as when we have to remember to add a new system to the origin restriction, but in general it makes us happier.)