Wandering Thoughts archives

2009-04-27

The problems of over-documenting things

There is a certain school of thought in system documentation that believes, to stereotype things, that there is no such thing as being too explicit or having too many examples. Much of Sun's Solaris documentation is a great example of this school.

Unfortunately, these people are wrong. There is such a thing as too much documentation, because having too much has a number of problems:

  • your documentation becomes less and less readable, as the important things are buried under a flood of examples, cross references, and low level walkthroughs of how to do everything in sight. All of this is irrelevant clutter if I am trying to understand your system.

  • your documentation becomes less useful as a reference work, because it is harder to skim it to extract the useful piece of information that I need to jog my memory.

  • it is potentially insulting to your audience (especially if you are writing it for a specific local audience), because it implicitly assumes that the people reading it don't already know all of the basic things and have to be walked through everything in detail.

    (Even if people don't find it actively insulting, they are probably going to assume that your documentation is not aimed at them and they should go find something else.)

In short, belabouring the obvious takes up valuable space and people's limited time, distracts people, and can annoy them. (And that's what writing really detailed documentation is.)

In theory you can get around some of these problems by pushing your detailed examples and so on off to appendices. This avoids some of the problems but it still has the drawback that you are writing extra material, material that in my opinion is mostly pointless.

(This is not to say that examples and being 'obvious' are always bad things; per DocumentationAssumptions, sometimes they're necessary.)

OverDocumentationProblems written at 01:51:45

2009-04-26

Lighttpd, CGIs, and standard error

Here's an issue that I ran into recently: from version 1.4.20 onwards, lighttpd no longer passes stderr on to CGI processes, should you be peculiar enough to run CGIs on lighttpd (which can be done). If you need error logging from your CGIs, you will have to roll your own.

(Fedora 10 has lighttpd 1.4.22 and so is affected by this.)

There is a bug report for this, but I don't know if it has been accepted by the lighttpd developers as an actual bug that will get fixed. That two versions of lighttpd have been released with this issue suggests that, at a minimum, it is not being considered an important issue.

(I suspect that relatively few people use lighttpd to run plain CGIs; most people probably use FastCGI or SCGI instead if they need dynamic things.)

The simplest thing is to just write a cover script for your CGI that dumps standard error into a logfile, but this means that it won't be timestamped. If you need that, you'll have to post-process your program's stderr while leaving standard output alone, which the Bourne shell unfortunately makes rather difficult.

(My solution was to write a program that ran a subordinate command while capturing and timestamping its stderr and leaving its stdout alone. This is relatively trivial in anything that has access to Unix system calls; I wrote mine in C for obscure reasons.)
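As a rough illustration of that approach (this is not the C program mentioned above), here is a minimal Python sketch. The function names, the command line layout, and the log line format are all assumptions made up for this example.

```python
import subprocess
import sys
import time


def run_with_timestamped_stderr(argv, log):
    """Run argv, copying each line of its stderr to `log` with a timestamp.

    The child's stdin and stdout are inherited unchanged, so CGI output
    still goes straight back to the web server. Returns the exit status.
    """
    proc = subprocess.Popen(argv, stderr=subprocess.PIPE)
    # Only stderr is piped; read it line by line until the child closes it.
    for raw in proc.stderr:
        line = raw.decode("utf-8", "replace").rstrip("\n")
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        log.write(f"{stamp} {line}\n")
        log.flush()
    proc.stderr.close()
    return proc.wait()


def main(argv):
    """Hypothetical command line: PROG LOGFILE COMMAND [ARGS...]"""
    if len(argv) < 3:
        print(f"usage: {argv[0]} LOGFILE COMMAND [ARGS...]", file=sys.stderr)
        return 2
    # Append, so that several CGI runs can share one logfile.
    with open(argv[1], "a") as log:
        return run_with_timestamped_stderr(argv[2:], log)
```

A script built on this would end with `sys.exit(main(sys.argv))` and would be installed as the CGI itself, with the real CGI program given as COMMAND.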

LighttpdCGIStderr written at 02:03:18

2009-04-21

Why your ticketing system should not be accessible to users

Here's a thesis: your ticketing system should not be accessible to users.

The problem I see with making ticketing systems accessible to users is that you periodically need to have a genuinely private conversation among the sysadmins in response to user requests. I'm not sure how often this happens in typical organizations; ours is somewhat atypical in how we structure support, and possibly as a result it seems to happen fairly often here.

If your ticketing system is (internally) public, then you can't have these conversations in the ticket; you have to find some other place for them. The best option is probably to have them in email that is not copied to the ticketing system, but even that means that your ticketing system is no longer the single place where the entire issue is kept track of, which raises the question of what it's for.

(Per OptionalTicketing, I don't think that it should be the official place that users interact with the sysadmins.)

Arguably your ticketing system can still be your tracking mechanism, but I think that it's going to be a relatively weak one; at best it can remind you of what's still in progress (and how important or urgent it is), since you may have to go to your email to find details. (You probably don't want to even mark tickets as 'we're having a private conversation', so you will also have to remember that you need to check email for the full status for particular tickets.)

The other problem with starting to use email for some conversations about tickets is that it is easy for such a conversation to wind up covering things that probably should be in the ticketing system so that the user can see them. In turn, this is going to frustrate you, your users, or both (you because you have to copy information into the ticketing system; your users because the ticketing system no longer really reflects reality).

PrivateTicketing written at 23:35:22

2009-04-18

What users see as benefits from sysadmins

Here is something important about system administration: us merely doing our work does not give our users a benefit, not in the sense that users perceive benefits. This is because, fundamentally, the job of sysadmins is to make the systems work the way that they should to start with, much like the job of the janitorial staff is to keep the building clean the way it should be.

(System administration is somewhat different from janitorial work in that what we work on was not 'clean' in its natural, pre-user state, but users don't care about that.)

One of the consequences of this is that it is not a compelling sales pitch to users to tell them that if they do extra work, our work will be easier or we can do more of our job (or do it faster). Users don't care about how easy or difficult our job is, and we should already be doing all of it to start with. (In fact, from a user perspective most of our job should be unnecessary.)

There are situations where you can persuade users that this is not the case, but I suspect that such situations are all deeply dysfunctional to start with; they are visibly understaffed, or they contain lots of things that just intrinsically break all the time (for reasons that are clearly not the fault of the sysadmins) or the like.

To provide 'real' benefits to users we must go above and beyond making the system just work; we must make it work better than the users expect, make things more convenient than they thought possible, and so on. Then we can make a compelling case to users that if they do a bit of extra work, they get actual (perceived) benefits.

(Disclaimer: all of this is for what I will call 'operations' system administration. If your job is to build new and novel things, then I suspect that there will be less of a perception that things should just work to start with.)

UserSysadminBenefit written at 01:59:13

2009-04-13

Your ticketing system should be optional

There's a fair bit of enthusiasm for ticketing systems in the sysadmin world, and it's not hard to see why. But I'm going to be a contrarian here: while ticketing systems are all well and good, you should absolutely not require users to use them in order to interact with you.

The problem with ticketing systems is that they're like bug trackers; they're internal systems that are almost always filled with fields and procedures that exist for your needs. Your users should not have to fill in these fields and jump through those procedural hoops, because they don't give your users any benefit; they just help you.

This doesn't mean that you shouldn't have a ticketing system; it just means that you should take freeform problem reports by email or whatever, and put them in the ticketing system yourself. And if some users want to file their own tickets, it may make sense to make it available to them.

(It follows that the more complicated your ticketing system is, especially in the number of fields you have to fill in to make an initial report, the worse it is for users. Unless they are unusually interested in how you run the systems, they probably have no idea what the right answers are for all of those questions that they're being asked.)

OptionalTicketing written at 01:48:42

2009-04-09

Why ssh needs to verify host keys

Suppose that your ssh does not check host keys and an attacker has not so much successfully impersonated one of your machines as persuaded you to connect to his machine instead of yours, without you knowing. How does the attacker 'win'?

We can see the obvious wins by what ssh turns off if you have StrictHostKeyChecking set to 'no' and a key fails to verify. First, the attacker wins immediately if they can get you to type your password to their ssh daemon, since they can just have their ssh daemon log it. Second, the attacker can win if they can get access to various bits of your local session, such as your X display, any forwarded local ports, or especially your authentication agent. Even a straight text connection is potentially dangerous, since many terminal environments these days can be coaxed into taking various actions in response to escape sequences.

With more work, the attacker can also probably conduct a standard man in the middle attack, passing things along to log you in to your real target and then capturing any further passwords or sensitive information you access in your session. (It's only 'probably' because your real target might not accept your authentication from the attacker's server, if you've set restrictive from="..." permissions on your keypairs or the like.)

Up until now, I naively thought that there were few risks for automated scripts using keypairs for passwordless logins; on an impersonation attempt the login would just fail because of missing keypairs. The flaw in this logic is that I believe there's nothing to stop an attacker's custom ssh server from just unconditionally accepting your keypair and letting you 'log in'.

(In theory the ssh protocol could require the server to demonstrate that it knew the 'public' half of your keypair, but I don't think that it does.)

Then if your script is pushing something, the attacker can get a copy of it; if your script is pulling something the attacker can feed you whatever data they want to. The attacker can also probably do a standard man in the middle attack to monitor what your system expects to pull (or push), and thus capture your real data and analyze what your system expects. (The qualifiers on 'probably' are as before; it depends on what your real target machine will accept the keypair from.)
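One consequence is that automated scripts want host key verification at least as much as interactive logins do. As a minimal sketch, assuming the OpenSSH client and a made-up helper name (`strict_ssh_argv`), a Python script might build its ssh command line like this so that an unknown or changed host key makes the connection fail instead of quietly going ahead:

```python
def strict_ssh_argv(host, remote_command):
    """Build an ssh command line for unattended use.

    BatchMode=yes stops ssh from prompting for passwords or passphrases;
    StrictHostKeyChecking=yes makes ssh refuse hosts whose key is unknown
    or does not match the known hosts files.
    """
    return [
        "ssh",
        "-o", "BatchMode=yes",
        "-o", "StrictHostKeyChecking=yes",
        host,
        remote_command,
    ]
```

The resulting list can be handed to something like `subprocess.run(..., check=True)`, so a host key failure shows up as an error in the script. This only helps if the real host key is already in a known hosts file that the script's ssh will read.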

(Credit where credit is due: a conversation with Pete Zaitcev about my previous entry started me thinking carefully about this issue.)

WhyVerifyHostKeys written at 00:17:19

2009-04-07

Handling ssh to generic hostnames

(This idea is not from me, it's from R Francis Smith. It is just sufficiently nifty and wrong that I'm going to write it up for posterity.)

Suppose that you have a generic hostname, a hostname that is either multiple machines (with multiple IP addresses) or a virtual host that gets pointed to different physical machines from time to time. Further suppose that inside your environment, your users ssh to that machine, or at least want to. The traditional problem with this is that for good reasons ssh's host key checking will start screaming about mismatching host keys the moment that you wind up talking to a different physical machine that has, of course, different host keys.

So, the ingenious evil solution for this problem is to have a Host stanza for the generic hostname in your /etc/ssh/ssh_config that turns off the various ssh host key verification options, so that ssh never even notices the mismatched host keys and thus never complains about them. Yes, this is kind of unpleasant, but it is better than the alternative (which is very close to not having useful generic hostnames), and you can make it less risky by turning off password-based authentication methods and other dangerous things.

This is a somewhat limited solution to the problem, since it only works within your systems. But that's probably the only place that you want it to work anyways.

(The simple evil solution to the problem is to give all of the physical hosts for the generic hostname the same host key. You probably don't want to do this.)

Sidebar: how to turn off ssh's host key checking

The options that you want are:

StrictHostKeyChecking no
UserKnownHostsFile /dev/null

With these set, ssh can do all of the host key checking it wants to but it's never going to get anywhere, and so never gets in the way.

(I will assume that the generic hostname is not in your global known hosts file, because there is no reason to put it there since it doesn't have a constant key.)
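If you generate /etc/ssh/ssh_config from a script, the whole stanza can be sketched as follows. This is only an illustration: the function name and the hostname in the usage note are hypothetical, and the three extra options (PasswordAuthentication, ForwardAgent, ForwardX11) are one guess at what 'turning off password-based authentication methods and other dangerous things' might include, not a list from the original idea.

```python
def generic_host_stanza(hostname):
    """Return an ssh_config Host stanza for one generic hostname.

    The first two options are the ones from the sidebar; the rest are
    assumed risk-reduction settings chosen for this example.
    """
    options = [
        ("StrictHostKeyChecking", "no"),
        ("UserKnownHostsFile", "/dev/null"),
        ("PasswordAuthentication", "no"),
        ("ForwardAgent", "no"),
        ("ForwardX11", "no"),
    ]
    lines = [f"Host {hostname}"]
    lines.extend(f"    {name} {value}" for name, value in options)
    return "\n".join(lines) + "\n"
```

Calling `generic_host_stanza("login.example.org")` returns text that starts with `Host login.example.org`, so the settings apply only to that name and ordinary host key checking stays on for everything else.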

SshToGenericHosts written at 22:48:26

2009-04-02

System status announcements and where your users are

Commenters on my earlier entry brought up using Twitter for system status announcements and the like. I think that this makes sense for a lot of places, but I don't think it makes sense for us, and it has to do with where our users are; not in a geographical sense, but in a network and Internet sense.

For many Internet companies, your users are outside, on the general Internet, trying to get 'in' to you, and so using a status reporting system that is outside on the general Internet makes sense; it is where your users already are. But for us, most of our users are inside trying to get out, so we need an inside status reporting system to tell them about things.

We can have an outside status reporting system too; some of our users are outside (using VPNs and so on to get in), and other users may prefer to get status updates through the outside service when possible (they already use it for other things or they like the interface better, and most system issues will hopefully not interrupt the connection between inside and outside). But for us the inside system has to be the primary, because we need to be able to easily post to it when the link to outside isn't working.

(And similarly, it may make sense for Internet companies to have an inside echo of their outside status feed, although I know less about those issues.)

UserLocations written at 01:37:49



This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.