2013-06-30
Our pragmatic approach to updating machines to match our baseline
A commentator on my entry on our approach to configuration management asked a good question:
the one thing that is problematic is development of the "gold installation standard". When I make some changes, sometimes it's more work to get all the older machines to the new standard state. Do you solve this somehow, or are the machines singletons over time?
Our answer is that we're pragmatic about this and as a result it depends on why we're changing the baseline installation. First off, changes to the baseline are basically always because of changes to at least some of the actual systems; the real question is thus not whether we update some systems to the new baseline but whether we update all of them to it. The answer to that depends on the change.
Some changes are things that we actively want on all of our systems (or on all of the applicable type of system, like login servers), because they're driven by users requesting things like 'can you add package X to the login servers' or by us discovering that we need to turn off some new vendor security feature. Obviously these get applied to all of the relevant servers (or at least all of the ones that we care strongly about); updating the baseline just makes sure that any new or rebuilt servers also get the change. Other changes only really apply to certain sorts of machines, but we update the baseline to do them on every machine because it's easier that way and it does no harm. In this case we don't run around updating the machines the change doesn't really apply to, even though this means that a newly (re)built version of a machine will be different from the current version.
(In theory this is okay because the difference won't create any functional difference.)
One way of summarizing this is that we usually don't bother changing machines if we think that the change won't have any observable effect (in practice, not in theory; if we'll never notice whether or not a package is installed on a machine it qualifies, for example).
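As a concrete illustration of that 'observable effect' test, here is a minimal sketch of a hand-run audit that reports baseline packages missing from a machine. The baseline file path and format here are hypothetical stand-ins for however you record your standard install, and it assumes Ubuntu's dpkg:

    #!/usr/bin/env python3
    # Report packages from the baseline list that are not installed here.
    # /etc/local/baseline-packages is a hypothetical file with one
    # package name per line; '#' lines are comments.
    import subprocess

    BASELINE = "/etc/local/baseline-packages"

    def installed(pkg):
        # dpkg-query exits non-zero if the package is unknown, and the
        # status only reads 'install ok installed' if it's fully installed.
        res = subprocess.run(["dpkg-query", "-W", "-f", "${Status}", pkg],
                             capture_output=True, text=True)
        return res.returncode == 0 and "install ok installed" in res.stdout

    with open(BASELINE) as f:
        for line in f:
            pkg = line.strip()
            if pkg and not pkg.startswith("#") and not installed(pkg):
                print("missing:", pkg)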
2013-06-22
Automatedly overwriting changed files is not a feature
A commentator on an earlier entry wrote, in (small) part, about the advantages of automated configuration management systems:
2) Enforced Consistency and Change Management: With every box picking stuff up from chef on a scheduled basis, changes to important functions are automatically set back to what they should be, rather than someone's fiddle or tweak. [...]
I've seen this view expressed in any number of places, to the point where it seems to be common wisdom in some sections of the sysadmin world. I think it is making a terrible mistake.
If people are modifying local files on individual machines, what you have is a failure of process. Something has gone wrong. This should not be happening. Cheerfully eradicating those changed files does two things: it covers up the evidence of a process failure and it probably breaks parts of your environment.
(After all, we should assume that people actually had a reason for making the changes they did to a local file and they wanted (or needed) the results of those changes. If you have people randomly editing configuration files on random machines for fun, you have even bigger problems.)
It's my belief that automated configuration management should not be silently covering up the evidence of a process failure, for all of the obvious reasons. Silently overwriting local changes with the canonical master version sounds good in theory but should not be the default behavior in practice. It's better to warn when a local change is detected, although that takes more work.
(Another way to have this happen is for some other program or automated system on a local machine to be fiddling around with the file. One frequent offender is package updates.)
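To make the warning alternative concrete, here is a minimal sketch of reporting drift instead of overwriting it. The central area and the file list are hypothetical; a real configuration management system would get these from its own configuration:

    #!/usr/bin/env python3
    # Compare deployed config files against their canonical masters and
    # warn about differences instead of silently overwriting them.
    # CANONICAL and TARGETS are hypothetical; adjust them to your layout.
    import hashlib
    import os

    CANONICAL = "/cs/admin/configs"            # hypothetical master area
    TARGETS = {"/etc/ntp.conf": "ntp.conf"}    # deployed file -> master name

    def digest(path):
        with open(path, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()

    for deployed, name in TARGETS.items():
        master = os.path.join(CANONICAL, name)
        if not os.path.exists(deployed):
            print("WARNING: %s is missing entirely" % deployed)
        elif digest(deployed) != digest(master):
            # Surface the process failure; don't paper over it.
            print("WARNING: %s differs from %s; investigate before "
                  "re-deploying" % (deployed, master))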
Sidebar: on not shooting the sysadmin
At this point it's popular to blame the person who made the local change (and to say that overwriting their change will serve to teach them not to do that). This is a failure too. People are rational, so that sysadmin was doing something that they thought was either the right thing or at least necessary despite it being wrong. You should treat this as a serious process failure because it demonstrates that somehow this sysadmin wound up with an incorrect picture of your local environment.
By the way, one of the ways that people wind up with incorrect pictures of the local system environment is that the local system environment is too complex for mere fallible humans to actually keep track of. This gives you fragile complexity.
(In this specific case, one thing to do is to have a label in all of your configuration files mentioning where the master version of the file is located. Then people at least have something that will remind them, possibly in a high-stress emergency situation, about how to do things the right way.)
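Such a label can be as simple as a comment block at the top of the deployed copy. Here is a minimal sketch of stamping one in at deploy time; the paths are hypothetical and it assumes a file format where '#' starts a comment:

    # Deploy a config file with a 'where the master copy lives' label.
    # The paths are hypothetical; real ones depend on your environment.
    def deploy_with_label(master, deployed):
        with open(master) as src:
            body = src.read()
        header = ("# MANAGED FILE: do not edit this copy.\n"
                  "# Master: %s (edit there and re-deploy).\n" % master)
        with open(deployed, "w") as dst:
            dst.write(header + body)

    deploy_with_label("/cs/admin/configs/ntp.conf", "/etc/ntp.conf")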
2013-06-20
The question of whether to rewrite an old but working service
So, we have this service. It is a web application (and some associated bits and pieces) to let our users manage some server-side anti-spam settings, things like what level of (server-side) spam filtering they want, whether they'll accept greylisting (although we don't call it that), and so on. It works, it's been there for years, and it has one problem: for reasons beyond the scope of this entry, it's a black box with no maintainer.
(Actually that's a bit of an exaggeration. We know a bit about how it works and what's inside the box, enough to know that it's not ideally built and it's not how we'd do things today. This is one reason no one has devoted the time necessary to read and understand all of its code.)
For some time this has left me wrestling with the question of whether or not I should rewrite the service. On one side is the fact that the service fully works as it is now; it's functional, it needs no attention, and it just keeps quietly working properly. On the other side is the fact that it's a black box that we don't understand and can't really change. It's not a crucial service, but still, if it breaks during some software upgrade we've kind of got a problem.
(This has actually happened a time or two. We made quick hack changes to get it working again and quietly backed away.)
On the third side is the question of whether there's anything we want to do that the service isn't currently doing, or whether this would just be a 'no functional changes' rewrite. A precautionary rewrite for the sake of putting our stamp on a program is perhaps the worst sort of rewrite there is, even if we dress it up as 'understanding the program'.
This is actually a general issue that sysadmins face every so often. We often inherit some random functioning system that we didn't build and don't have much clue about (ranging from software programs to actual servers that are configured who knows exactly how) but are now responsible for. Should we leave them alone as long as they work and don't cause problems, or preemptively redo them to avoid bigger problems down the road? I don't think there's a universal answer and I'm not sure there are better guidelines than just some general and obvious handwaving.
(In our case we actually had all of our agonizing preempted by local events. It turns out that we need some sort of general per-user email control panel web application, of which anti-spam settings are clearly a subset. This still leaves me with lots of issues to agonize over but now they're design and coding issues.)
2013-06-19
Our approach to configuration management
A commentator on yesterday's entry suggested that we're already using automated configuration management, just a home-grown version of it. To explain why I mostly disagree I need to run down the different sorts of configuration management as I see them:
- No configuration management: you edit configuration files in place on each individual machine with no version control. Your best guess at what changed recently is 'ls -l' and your only way back to an older configuration file is system backups.
- Individualized configuration management: you use some sort of version control but it's done separately on each individual machine. Rebuilding a copy of a dead machine is going to be a pain (and involve restoring bits from system backups).
- Centralized configuration management: you have a central area with canonical copies of the configuration files for all of your machines (under version control of some sort, because you aren't crazy). But you still have to update machines from this area by hand (or make changes on a machine and then copy them back to the central area).
- Automated configuration management: when you change something in your central area it automatically propagates to the affected machines. You don't have to log in to individual machines to do anything.
For the most part we have centralized configuration management, with the master copies of all configuration files living on our central administrative filesystem, but not automated configuration management. Only a few things like passwords and NFS mounts propagate automatically; everything else has to be copied around in an explicit step if we change it (sometimes by hand, sometimes with the details wrapped up in a script we run by hand).
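For illustration, here is a minimal sketch of what one of those hand-run copy scripts might look like; the hosts, paths, and use of scp are all hypothetical stand-ins for the details of our actual environment:

    #!/usr/bin/env python3
    # Push the canonical copy of one config file to the machines that
    # use it, as a single explicit, hand-run step.
    import subprocess
    import sys

    MASTER = "/cs/admin/configs/exim.conf"   # canonical copy (hypothetical)
    DEST = "/etc/exim4/exim4.conf"           # where it lives on each machine
    HOSTS = ["mailgw", "smtp1"]              # hypothetical machine names

    for host in HOSTS:
        print("pushing %s to %s:%s" % (MASTER, host, DEST))
        rc = subprocess.call(["scp", "-p", MASTER, "%s:%s" % (host, DEST)])
        if rc != 0:
            sys.exit("push to %s failed; fix that before going on" % host)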
(Actually now that I think about it we have a surprising amount of automatic propagation going on. It's just all in little special cases that we usually don't think about because, well, they're automated and they just work.)
I could give you a whole list of nominally good reasons why we aren't automatically propagating various things, but here's what it boils down to: if sysadmins are the only people changing whatever it is, it doesn't change very often, and it doesn't have to go everywhere, then we haven't bothered to automate it because it doesn't annoy us too much to do it by hand. When one or more of those conditions changes, we almost invariably automate.
(That actually suggests a number of openings for a system like Puppet. For a start it can probably handle the actual propagation on command instead of having us manually copy around files.)
2013-06-18
What's in the way of us using automated configuration management
Every so often I poke at Puppet or Chef or one of the other automation systems and consider whether we could really use it. And every time I do, I find it a hard sell, even to myself (much less, hypothetically, to my co-workers). Today I've decided to try to write down my collection of technical reasons (to go with other reasons):
- We already have a scripted install process and build instructions
for all of our machines, and also more or less scripted package
updates.
- We install almost nothing on our machines other than vendor packages or things scripted as part of the install process; we don't have applications that must be deployed to random machines.
- It's quite uncommon for us to add even vendor packages to our
machines after their initial deployment. It generally only happens
when users want some additional Ubuntu package on the login or
compute servers, which is pretty rare.
- Many of our machines are singletons, where we only have one of the
particular sort of machine; one external MX gateway, one central
mail server, one print server, and so on. The few machines that are
significantly duplicated naturally tend to have the most automated
install process.
(Installation of login or compute servers is basically completely automated. Given racked hardware, we can have a new one up and running about as fast as it can unpack many, many Ubuntu packages onto its disks.)
- We (re)build machines only very occasionally; deploying a new machine
is a rare occurrence. We don't get new hardware in very often, we have
few enough machines that they almost never break, and we only 'upgrade'
OS versions at most every few years as new Ubuntu LTS releases come out.
- We consider it a feature that it takes manual steps to deploy a change
  to a singleton machine. To put it one way, it acts to ensure that
  you're paying attention to how your change actually works.
- We have our own automated distribution mechanisms for things like the passwd file; these would have to be coordinated with or hooked into any general automation system.
The short version is that we've already automated almost everything we commonly do to more than one machine. I don't see much room for an automated configuration management system to come in and do new things for us, which means it would have to replace existing, already developed and working automation. There's some benefit to using standard tools for automation, but I'm not convinced that there's a lot.
It's possible that I'm missing things that Puppet, Chef, et al could do for us, because the usual examples I read are bright, cheerful 'let's deploy a canned web server configuration onto a random machine' ones and we don't have that problem. In a chicken-and-egg problem, I can't find the energy to read the documentation for Puppet because I suspect that I won't find anything we can use.
2013-06-17
My job versus my career: some thoughts
One of the things I've said to people in the past is that while I have a job, I don't really have a career. Not in the sense that my job is unstable or unsettled (it's actually been rock-solid), but in the sense that I have no idea of where I want to go, no particular vision of my future. With no real view of what I want, I've never had any strong basis to do things like evaluate my current situation, consider other options, or assess my progress towards, well, what should I be progressing towards? To be progressing towards something implies having some objective, and I've never had any grounds to establish one.
This has led to me having a huge inertia in my job (which has generally been both pleasant and interesting). I've changed jobs here at the university only once and even then it took a huge upheaval to do it; this puts me way out on the 'time at a single employer' and 'time at a single job' curves for computer people.
One consequence that I've been thinking about off and on is that I don't even know what it would take to attract me away from my current job, since I don't have anything I'm aiming for and I'm not sure there's anything in particular missing or wrong about it (anything where another job would simply make me happier).
(It's possible that I'm deluding myself here for various reasons. Universities are comfortable places but at the same time they're places that basically can't value IT as much as some places do out in the outside world (cf).)
Writing Wandering Thoughts has made me somewhat more aware of this (because it's prompted a certain amount of self-reflection), but what's done more to bring it to mind is reading about the interesting sysadmin-related things that other organizations are doing (Twitter has been especially good for exposing me to this). I still don't know where (if anywhere) I want to go, but at least it gives me more of an idea of what's out there.
By the way, I have no idea if it's actually important to have a career as such, in this sense. If you're happy with your job (and paid well enough), do you really need anything more, or would it just add extra stress to your life? And balancing the relative happiness of the known present against uncertain potential futures is a hard problem.
(If you're bored or unhappy in your job it's another matter, of course. Then you at least want to figure out what'd make you happier and move towards it (which is easy to say but potentially hard to do).)
(This is one of the entries in which I ramble, partly writing to myself.)