2013-12-25
Procedures are not documentation
In the introduction to this good and interesting SysAdvent entry on automatically testing your web security, I hit a sentence that raised my hackles right up:
Writing automated tests for your code is one of those things that, once you have gotten into it, you never want to see code without tests ever again. Why write pages and pages of documentation about how something should work when you can write tests to show exactly how something does work? [...]
(Emphasis mine.)
I cannot put this strongly enough: WRONG. Procedures (including tests, checklists, and even configuration management) are not documentation and cannot substitute for it. You really do want to have actual documentation (although good procedures can reduce how much of it you need).
Procedures more or less tell you the 'what' of your systems, but they do not give you any background behind that 'what'. Contrary to the implication in the sentence, they often can't really tell you how things work (just that they do). They can't give you the 'why' of the design and setup of your system, the rationale and logic behind it; knowing this is crucial for making good modifications to the system later. They don't distinguish between core features and requirements and things that are just accidents of the current implementation. Procedures also can't tell you why you aren't doing something, which can be quite useful to know.
Tests are especially prone to this problem because by their nature tests must be specific, and they're totally mute about the why of the test. For example, if you're doing a 'does it work' test of a web server and looking for specific HTML output, is that specific HTML output a requirement or simply a recognition signal, so that it could just as well be any other recognizable chunk of HTML on that page?
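To make the ambiguity concrete, here's a minimal sketch of such a check in Python (the URL and the expected HTML are made-up stand-ins, and it's not tied to any particular test framework). Nothing in the test itself tells a reader which interpretation is the right one:

    # Hypothetical 'does it work' check for a web front page.
    import urllib.request

    def test_front_page_works() -> None:
        with urllib.request.urlopen("https://www.example.com/") as resp:
            body = resp.read().decode("utf-8", errors="replace")
        # Is this exact <h1> something we promise to serve, or would any
        # recognizable chunk of the page do just as well? The test is mute.
        assert "<h1>Our Web Service</h1>" in body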
(Also, a test suite is almost always an imperfect reflection of reality, which means that if you attempt to reverse engineer reality from your test suite you are guaranteed to get something wrong.)
Note that these are not new issues. Programmers have an entire set of practices around writing good tests, detecting bad tests, understanding tests, and so on.
(Documentation surrounding procedures can tell you this information, if it exists and is maintained, but that's different. That's actual documentation. Some people will immediately say that you should never write tests or other procedures without at least some surrounding documentation to give basic context, and I'll fully agree with them, but that's not quite the position being advocated here.)
2013-12-17
You probably don't want to use Make to build your generated files
When I started working here, one of the early things I worked on was generating files of information for various things (such as getting files of all valid local email addresses). Partly because I was very clever back then and partly because we were doing this on ancient, underpowered Solaris machines, I did this the obviously efficient way: I used a Makefile to control the whole process and made sure that it did only what it absolutely needed to. If everything was up to date the whole process did nothing.
(In the process I found a Solaris make bug, which perhaps should have been a warning sign.)
Since then I have come around to the view that this is almost always being too clever. There are two problems. The first is that it's very hard for your Makefile to be completely accurate, and when inaccuracy sneaks in, files don't get (re)built as they should. This is very frustrating and leads to the other issue, which is that sometimes you want to force a rebuild no matter what. For example, perhaps you think that the output file has gotten corrupt somehow and you want to replace it with the current version. You can sort of handle this with Make, of course; you can provide a 'make clean' target and so on and so forth. But all of this is extra work for you to create a better Makefile and for everyone when they use this Make-based system (and it's probably still going to go wrong every so often).
The truth is that a Makefile-based system is almost always optimizing something that doesn't matter on modern systems. Unless the generation process is very expensive for some reason, you're not going to notice doing all of it every time, and therefore you're not saving anything worthwhile by only doing part of it. It's much easier to rip out the Makefile and replace it with a simple script that always generates everything from scratch every time. At most, optimize the final update of the live versions so that you skip doing anything if the newly generated files are identical to the existing files.
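As an illustration, here's a minimal sketch of that sort of 'regenerate everything, then install only what changed' script in Python; the destination directory, file name, and generation step are all hypothetical stand-ins for whatever your local setup actually produces:

    #!/usr/bin/env python3
    # Regenerate every file from scratch each run, but only replace the
    # live copy when the newly generated contents actually differ.
    import filecmp
    import os
    import shutil
    import tempfile

    LIVE_DIR = "/var/local/generated"   # hypothetical destination directory

    def generate_valid_addresses() -> str:
        # Hypothetical generation step; in reality this might walk your
        # user database, alias files, and so on.
        return "user1@example.com\nuser2@example.com\n"

    GENERATORS = {
        "valid-addresses": generate_valid_addresses,
    }

    def main() -> None:
        os.makedirs(LIVE_DIR, exist_ok=True)
        with tempfile.TemporaryDirectory() as tmpdir:
            for name, generate in GENERATORS.items():
                newfile = os.path.join(tmpdir, name)
                with open(newfile, "w") as f:
                    f.write(generate())
                livefile = os.path.join(LIVE_DIR, name)
                # Skip the update if the live file already has this content.
                if os.path.exists(livefile) and filecmp.cmp(newfile, livefile, shallow=False):
                    continue
                shutil.copyfile(newfile, livefile)

    if __name__ == "__main__":
        main()

The details matter less than the shape: there is no dependency tracking to get wrong, and running it always leaves you with freshly generated, current output.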
My repeated experience is that the result is simpler, easier to follow, and easier to do things with. As a sysadmin you have the reassuring knowledge that if you run the script, new versions of the files will get generated using the current procedure and (if necessary) pushed to the final destinations. You don't have to try to remember what magic bits might need to be poked so that this really happens, because it always happens.
The exception, the case where this isn't being too clever, is when the full generation process is so expensive and time-consuming that it's worthwhile (or even utterly necessary) to optimize it as much as possible. Even then you might want to look at ways of speeding up the whole process before you resort to skipping parts of it most of the time.
(This is where people replace shell scripts with Perl or Python or even Go and at a higher level try to see if there's some better way to get and process the information that the system is operating on. Note that, as always, it's almost always better to optimize the algorithms before you optimize the code.)
2013-12-08
Sometimes the right thing to do is to stop (and even to give up)
I'm generally someone who is happy to keep chasing an oddity or a mystery, to keep plugging away at the problem to at least chart it out and perhaps figure out what is going on. I suspect that this is something that a lot of sysadmins feel; if there is something wrong, we itch to figure it out and put it to right. And the satisfaction of finally succeeding is an excellent feeling. But sometimes this is absolutely the wrong thing to do. Sometimes the right thing to do is to stop with the mystery not understood, or even to give up entirely.
I've been continuing to work away on our disappearing eSATA disk problem since I wrote about it; I've tried more things, gotten more specific information, and the whole thing has gotten weirder. But at the end of this past week we decided to stop all of that. I managed to get the system to a precariously balanced point where it's stable, and that's that. In fact we're going further than just stopping with a stabilized system; in the longer run we're giving up on it entirely and will be migrating the whole thing to different hardware. We'll write off the disk enclosure as a loss (the server is a generic one and can be reused for other things).
The direct reason that this makes sense is that we have gone far enough to establish that something very odd is going on. Even if we continue investigating and discover exactly what the problem is, we have no confidence that we'll be able to fix it, and in the meantime we have managed to stabilize the system as-is. Until we can at least identify the problem, we can't trust the enclosure in general. We could do a bunch of experiments to chart out what disks we can add to the enclosure where and still have an apparently stable system, but that wouldn't make us trust it, and if we can't trust it we don't want to use it.
But the bigger reason to stop is the cost/benefit ratio of continuing to investigate the problem. I could easily spend a bunch of time and effort conducting experiments to map out the precise contours of the problem (and maybe find some clues to its cause). But by far the most likely result of these experiments is a pile of data on a disk enclosure that we no longer trust. In the best case we have minimal expansion in this enclosure and we're certainly not going to buy any more of them, so the smart choice is to say 'this is good enough, we've spent enough time on it'.
Or in short: sometimes you lose. When you are losing, the smart thing to do is to recognize that and lose fast. This is painful, since we don't like to lose, but it's also best. Try not to let it get to you.
(This would be more obvious if staff time was considered a cost on par with hardware, but universities almost never think about staff time that way.)
PS: yes, this entry is being written in part to make me feel better about throwing in the towel on this issue. We're all squishy humans with those awkward emotions.
2013-12-05
Some thoughts on a body of knowledge for system administration
Earlier this week I read an entry from this year's SysAdvent, Introducing the Guide to Sysadmin Body of Knowledge. If we're going to talk about a sysadmin body of knowledge, the first thing we need to talk about is whether this BoK is intended to be descriptive or prescriptive.
A descriptive BoK essentially restricts itself to an inventory of practices with descriptions about what good or bad things can happen when you use the particular practice. That's why it's a descriptive BoK; it simply describes things. A descriptive BoK generally should make an attempt to be even-handed or at least honest because otherwise it's not really honestly descriptive.
A prescriptive BoK says 'these are best practices and you should do them'. This almost necessarily comes with a side order of 'these practices are bad and should be avoided by all right-thinking people'. There are two problems with this. The first problem is that this is intrinsically a strong editorial stance on how people solve problems in system administration. This is going to be controversial.
The larger problem, a problem which also afflicts a descriptive BoK, is that system administration is nowhere near being a settled and fully developed field. Best practices in system administration are evolving on an ongoing basis as people come up with new solutions, try them out, refine them, work out how to make them simpler and easier to use, and so on (sometimes also with 'and discover that they don't work'). A prescriptive BoK that says 'do things this way' is freezing the state of the field as it is right now (at best). Unless the BoK is constantly updated, tomorrow and next month and two years from now it will still tell you to do things according to what were the best practices of today, not where the field has moved to by then.
To get a more solid idea of what this might mean for the field, imagine that you had a BoK for the field that was compiled five years ago. How many of what are now considered best practices would be mentioned? How many now-deprecated things would be recommended (or at least not disavowed strongly)? One can get an idea of this by looking at old books on system administration (which is basically all of them today) and asking what they're missing.
A prescriptive BoK is usually considered more desirable by people because it tells you what to do, but this very feature makes it more harmful when it's out of date. The out of date BoK is not only silent on new things, it implicitly tells people not to do them. To do things in what are now the new best practices you must actively go against what the BoK is telling you to do. The result is that a respected prescriptive BoK would effectively freeze the parts of system administration that it described; new (best) practices would move through our field only very slowly until the BoK was revised and people learned about it.
A related issue is that people are probably only going to be willing to reread the BoK so many times before they throw up their hands and abandon it entirely as too much work to keep up with. Again this has implications for a constantly evolving field where you should really be revising the BoK every few years.
Now let me be optimistic (or at least temper my pessimism here). I do think it's possible to create a body of knowledge with general knowledge and principles that have proven timeless or are likely to be, and that this would be a quite useful thing to do (especially if someone can talk coherently about various underlying principles). But such a body of knowledge is not going to deliver specific actionable 'do this thing' advice, it's just going to be a high level guide.
(This elaborates on some things I brushed lightly over back in my earlier entry on professional knowledge, certification, and regulation.)
2013-12-04
sudo is not an auditing mechanism
Here's something that I hope everyone understands about sudo but that I want to say explicitly anyways: sudo is not an auditing mechanism, not for sysadmins who are expected to use general and unrestricted root powers. There are at least three different problems with sudo as an audit mechanism:
- sudo cannot track what an intruder does as root. Once the intruder runs, say, 'sudo bash -i', sudo is simply not logging things any more. Only a very cooperative intruder will helpfully run all of their commands through sudo. Once a less cooperative one escapes into a general root environment, that's it.
- sudo can't even reliably tell you when an intruder has escaped into a general root environment if the intruder is at all subtle. Consider the following sudo invocation:
      sudo less /var/log/exim4/mainlog
  This is a perfectly reasonable way to examine your Exim mail logs. It's also a perfect way for an intruder to run arbitrary commands without any obvious signature in sudo logs, since less lets you start a shell with '!bash'.
- even if you assume no malice, sudo can't necessarily tell you if one of your sysadmins accidentally ran a command. Again, consider:
      sudo vi /some/file
  In the course of editing this file, I want to wordwrap a paragraph of something. As a hardened vi person I immediately and reflexively type !}fmt, except that I accidentally type !}famt instead and you have a famt command. This isn't logged by sudo, of course. Depending on the resulting output I may realize that something has gone badly wrong (and I may even realize what I've run), or I may just say 'whoops, some sort of vi mistake again, u and try again'. Certainly in the latter case I'm very likely to swear up and down later that I couldn't possibly have run famt, I was just editing a file, and sudo's logs will back me up.
This isn't to say that sudo's logs are useless, because of course they aren't. In many cases they'll be able to mostly or completely reconstruct the history of root actions. This can be very useful for troubleshooting (especially positive troubleshooting, where you want to know 'did I try <X>' and the sudo log will say 'yes, you did') and for reverse engineering what you did when you were flailing around in a hurry and not keeping careful notes.
(If you feel like playing the blame game they can also be used to prove that someone did not do some particular thing that they should have during an incident. Doing this is a terrible idea but that's another blog entry and anyways it mostly consists of pointers to sources like John Allspaw.)
What sudo's logs can't do is record definitively or even relatively definitively what happened and what commands were run as root (and by whom). If you need this level of knowledge and certainty, you need more than sudo logs. At the extreme you need SELinux-level audit logs.
Update: It's been pointed out that recent versions of sudo can log all session IO and allow you to examine them after the fact. This adds stronger semi-auditing, but not in any convenient form (consider how hard it would be to see the !}famt vs !}fmt error in an IO log, for example).
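For the curious, on versions of sudo that support it, the sudoers settings involved look roughly like this sketch (edit with visudo; option names and defaults can vary by sudo version and packaging, so check your sudoers(5) manpage before relying on it):

    # Record the input and output of commands run through sudo,
    # storing the session logs under iolog_dir.
    Defaults log_input, log_output
    Defaults iolog_dir=/var/log/sudo-io

Recorded sessions can then be listed and replayed after the fact with sudoreplay.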