Wandering Thoughts archives

2011-10-31

Why 'quiet' options to programs aren't as useful as you think

Every so often, someone writes a program that is overly chatty by default (usually in the interests of being user-friendly) and then thinks 'I know, I'll make sysadmins happy by giving it an option to be quiet'. They are then often surprised when sysadmins seem to find the new option not all that useful or compelling. The particular example I'm thinking of here is Debian's apt-get, but there are others.

I can't speak for other sysadmins, but I can tell you why I'm not enthused. It has to do with trust. In order to use your 'quiet' option, I have to trust that it isn't too quiet, ie that you haven't turned off any important messages along with the unimportant ones. If your quiet option turns off too much it's actively dangerous to use, in theory more dangerous than seeing all of the messages at default verbosity. Unfortunately for this trust, you've already established that you have a bad idea of what messages to print when simply by making me need a quiet option. Unless you've quite thoroughly documented what the quiet option turns off, I'm generally going to avoid it and take my chances.
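
(apt-get itself is a handy illustration here. Going from its manpage, -q merely omits progress indicators and produces output suitable for logging, while -qq suppresses nearly everything except errors and also implies -y, to the point where the manpage warns you never to use -qq without a no-action modifier such as -s or -d:

# -q: drop the progress indicators but keep the messages
apt-get -q update
# -qq: nearly silent, and implies -y, hence the -s (simulate) here
apt-get -qq -s dist-upgrade

A quiet level that quietly implies 'answer yes to everything' is exactly the sort of too-quiet behavior I mean.)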

(Paradoxically I'm more willing to forgive overly quiet programs that need a verbose switch almost all of the time. Possibly this is just my biases speaking, because at least in theory I should be just as worried about the same issue.)

Of course, in practice an overly verbose default ensures that while I may see important messages, I won't actually notice them and thus won't react. But it feels different to know that I had a chance to see a message than to worry that I've accidentally suppressed it entirely (and through a deliberate choice), even if the practical results are the same.

(One way to put it is that one is a mistake and the other is negligence. That's not quite accurate, but as analogies go I think it's pretty close.)

WhyNotQuietOptions written at 22:56:54

2011-10-30

Why we have a VPN

I recently read Die, VPN! We're all "telecommuters" now (via Hacker News), which prompted me to think about why we have a VPN server for external access and why we're likely to keep it for the foreseeable future. What it boils down to is two factors.

The major reason why we have a VPN is limitations. We have a VPN because of insecure internal software that we can't expose to the Internet, because of a presumed lack of security of some of the machines on our network, and because we simply don't have enough public IPs to directly expose all of our machines on the Internet even if we wanted to. So, for example, in order to get access to our Samba server your IP address must be inside our firewall, and that's it. There's very little that we can do about these limitations (although I suppose the still theoretical advent of IPv6 will deal with the last issue).

The minor reason is that any number of things do not have a better or at least more convenient authentication scheme than 'you have a University of Toronto IP address'. This includes both internal and external resources (such as access restricted journals that the UofT subscribes to). Even when there are additional magic ways of getting access to these things, having a UofT IP address remains the most convenient way for our users, and our VPN means that they can have one regardless of where they actually are.

(These two conveniences combined are why I set up an IPSec tunnel for my home machine so that it has an internal IP address.)

Sidebar: a little technicality

I was careful to say 'for external access' up there, because we also use our VPN as part of our internal wireless infrastructure. We're required to authenticate all wireless access, and we also want as much wireless traffic as possible to be encrypted. The easiest way to do this is to get wireless users to immediately bring up a VPN connection, which creates both authentication and full encryption of wireless traffic. I think that this doesn't really count as a pro-VPN reason because it's effectively an implementation detail (and one that we could move away from with better wireless technology).

WhyOurVPN written at 01:34:27

2011-10-25

A reason not to automate: policy flexibility

We have some bookable compute servers, machines that can be reserved for a single person. When we first introduced them and for a long time afterwards, demand for them was low and we handled all of the booking and related tracking by hand (mostly with email). Demand has been picking up lately and as a result, we've automated much of the process.

(The trigger for automation was when the machines actually got queues of people waiting to reserve them. Handling a single active booking per machine was relatively easy, but queues meant that we needed to track who was next and it became important to end a booking on schedule.)

However, one of the pieces that we did not automate was the process of actually making reservations. It would not have been difficult to do it, but after discussing it among ourselves we made a deliberate choice to leave it out. Ultimately we did it for a simple reason:

When you automate something, what your automation does becomes your de facto policy.

Regardless of what your actual policy is, the practical reality is that people will view what the automation allows and doesn't allow as the real policy. If your automation doesn't allow it, in practice your policy forbids it. If your automation allows it, in practice your policy allows it.

There are two problems with this. The obvious one is problems with your code, including cases that you didn't think to check; for example, one person reserving all of the bookable compute servers at once. When this happens you are in the position of having to take back something that you gave the person, which generally makes people angry. Sometimes you will have to make 'new' policy on the spot to justify it, which makes people more angry.

The subtle problem is that automation is always cruder than real policy, because real policy is implemented by human beings who are prepared to be flexible when it's called for. As part of this, automation also doesn't really allow for appeals or pleading your case as a special exemption; automation just says 'no, that's not allowed'.

We have a general policy on fair reservations of the bookable compute servers; it's reasonably simple and reasonably easy to articulate. But when we talked things over, we decided that we didn't want to actually code it and thus freeze it, because in reality the policy is more flexible; we're prepared to make exemptions, we'll encourage two people who both really want the same machine at once to talk to each other about it (instead of just the first person to make it to the CGI saying 'I got it and that's that'), and so on. We think that preserving all of this is much more valuable than saving some time by automating a form.

(This is a lesson I first learned in the world of MUDs, where any system you automated had this risk; people took what the code did or did not do as being the game reality, instead of simply being an approximation of it. 'The code let me do it so it's okay' and 'the code says this is what happens' both cause problems. Sadly I forget it periodically, given that I was the person who initially thought about automating even the 'make a reservation' portion of the bookable server reservation system.)

AutomationAndPolicy written at 00:41:12

2011-10-21

How I'm capturing only the last portion of standard error

As part of my migration to Fedora 15, I have been dealing with a buggy program that crashes a lot. It has an option to dump lots of debugging information as it runs and collect stack backtraces when it crashes, and of course the developers want you to do all of this when you report problems to them. Inconveniently, right after I installed all of the debugging packages necessary for good stack backtraces the program seems to have stopped crashing. For now. I don't trust it and when it crashes next I certainly want to capture everything necessary for a good bug report.

(Since the program uses threads, I suspect that it has races that are now being masked by it slowing down to print all of that debugging output.)

Now, there's a problem: this is a program that I want to leave running all of the time. All of that debugging information adds up very fast; I was looking at many tens of megabytes of output a day, most of which was going to be pointless (when the program crashes the developers are only going to want the last bits of the debug logs). What I wanted to do was keep only the last so much of the program's debugging output, not all of it.

This is of course the general issue of log rotation and standard error. I've written about this before, but now I actually needed a program to deal with the problem, something that would capture and rotate the program's ongoing log messages to keep only the last so many of them. Looking back now, the first comment on that entry has a useful index of tools for doing this, but when I needed this a couple of days ago I didn't remember the entry, so I reached for the first tool I had at hand that I knew could do the job: djb's multilog.

I have multilog sitting around because I am still using dnscache as my local caching DNS resolver for various reasons (I'm planning to switch to unbound at some point, but configuring it for my setup is a huge pain). Multilog has many features and is total overkill for this specific need, but I knew I could make it go and that counts for a lot when a sysadmin is in a hurry.

To save myself (and anyone else) the trouble of sorting through the multilog manpage, here is the command line you want:

program 2>&1 | multilog '+*' sBYTES /tmp/where

/tmp/where must be a directory. The 2>&1 is what actually sends the program's standard error into the pipe, which is the whole point of the exercise. BYTES is the size of individual log files; multilog will keep ten of them (you can change that by adding nNUM before the directory). The '+*' just tells multilog to put all log messages it sees in the log it keeps in /tmp/where. Since I have the disk space I'm using 1 megabyte log files with the default ten of them.
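
For example, if you wanted twenty 1-megabyte log files instead of the default ten, you would use something like this (an illustration, not what I'm actually running):

program 2>&1 | multilog '+*' s1000000 n20 /tmp/where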

There's very little to say about the end result. It works. I wrapped the whole invocation of the crashing program up in a little script so that I can forget about the whole thing, at least until it crashes. (And if it never crashes, well, that's what I wanted in the first place; ten megabytes of rotating logs is a minor price to pay for it.)
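
(For the curious, the wrapper script is essentially just the following, with the real program name and debug options in place of the hypothetical 'flakyprog' here; multilog creates the log directory itself if it doesn't already exist:

#!/bin/sh
# Run the program with full debug output, keeping only the last
# ~10 MB of its standard error; 2>&1 is what sends stderr into the
# pipe, and s1000000 makes each of the default ten log files one
# megabyte.
flakyprog --debug-all 2>&1 | multilog '+*' s1000000 $HOME/logs/flakyprog

)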

CapturingLastNStderr written at 01:57:29

2011-10-17

What I want out of stable device names

There's a modest mania in various systems of giving sysadmins stable (or 'consistent') device names. Solaris has done it for a long time, I believe that Irix had some version of it, Linux has made two attempts to do it for network devices, and so on. Before people design yet another system that makes yet another attempt at it, I believe it's worthwhile to step back and ask what we actually want out of 'stable' device names.

My bias is that two things make device names a pain to manage: device names that aren't what you'd get if you reinstalled from scratch, and device names that depend on what physical slot hardware has been placed in. Change-dependent device names mean that the same hardware has different names on different servers depending on what exactly you did to the server; this complicates management for the obvious reason. Physical slot dependent names mean that things go off the rails if you do not always install your expansion cards in exactly the same spot. This can easily happen from either mistakes or different needs due to the servers being in different spots in racks and the like.

(This assumes unique hardware, eg your servers only have one additional Ethernet card. If servers have two different Ethernet cards and you shuffle their relative locations from server to server, things are a lot less clear.)

Around here, there are three main sorts of hardware changes that we make to systems: we transplant a system to completely new hardware (often but not always it's to identical hardware), we shift what physical slot an expansion card is in (for example after we discover that we don't really want a network card in the server's bottom slot because then it's very hard to get the network cables out), and we add a new card to a system.

When I transplant a system (ie, move its system disks) to completely new hardware, the simplest explanation of what I want to happen is that I want the device names to be identical to what I'd get if I (re)installed the system on this hardware from scratch. If the new hardware is physically identical to the old hardware, this should normally result in the system reusing the same device names.

(It's not hard for the system to detect when it has been transplanted. There are very few explanations for all of your old devices disappearing at once, especially if an identical set of devices with new hardware identifiers all appear at the same time. Most especially if you know that some of the devices are onboard devices, not ones on expansion cards.)

When I shift an expansion card from slot to slot, I almost always want the device names attached to the expansion card to stay the same. The device names are effectively tied to the function of the card and that function isn't changing just because it's better to have the card in the top slot instead of the bottom one. The only exception to this is if I have several cards providing the same resource; if I shuffle the ordering of the cards, I think that the safest thing is to keep the device names attached to the card location instead of the specific physical cards.
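
(As a concrete Linux illustration, persistent-net udev rules, one of Linux's two attempts at this, can encode either choice; the MAC address, PCI path, and device name here are all made up:

# Tie the name to the specific physical card, via its MAC address:
SUBSYSTEM=="net", ACTION=="add", ATTR{address}=="00:16:3e:aa:bb:cc", NAME="eth1"
# Or tie the name to the physical slot, via the PCI device path:
# SUBSYSTEM=="net", ACTION=="add", KERNELS=="0000:03:00.0", NAME="eth1"

Which rule you write is exactly the policy choice in question.)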

When I add a new card to a system, I generally want existing devices to keep their current names. Certainly I think that this is the safest default option. However, I also want an easy 'rename to what you'd get if installed from scratch' option because often this is the best option for long-term management.

Some of you are going to disagree vociferously with this set of views (and for good reasons). That's okay; what it really means is that how you name devices is a policy issue, not just a technical one, and there is no 'right' answer that works for everyone. This in turn has implications for how systems should handle device (re)naming.

(The more I think about it the more I think that systems should allow device names to have aliases and then have several different naming policies for different sorts of aliases.)

StableDeviceNamesDesire written at 01:47:57

2011-10-12

The true cost of sysadmin time (actually, of anyone's time)

Here's a question for you: what's the cost to the organization of having a sysadmin spend an hour sorting out a user's problem? At one level the answer is straightforward; you take the sysadmin's fully loaded salary, work out the equivalent hourly wage, and say that that's the cost.
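
(To make up concrete numbers: a fully loaded cost of $90,000 a year spread over roughly 2,000 working hours comes out to about $45 an hour.)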

But this is too narrow a view. In most places sysadmins don't have idle time; there is always something that they could be doing (either currently needed work or long term work to improve the environment). When the sysadmin's time is all taken up, that hour of time spent on the user's problem is an hour of time not spent on something else. This means the true cost of a sysadmin's time is the value of what else they could have been working on. The big cost of a sysadmin's time is not necessarily in dollars and cents. Instead it can be in opportunity costs, the forgone gains that result from not doing other work.

Of course, this generalizes. Everyone is busy; everyone has things they could be doing, often valuable things.

(These things may or may not be clearly reflected in the organization's bottom line, for various assorted reasons.)

This is not exactly a new idea. In fact sysadmins and programmers have spent years bemoaning that they're too busy with immediate needs to undertake necessary and valuable long term work. One interpretation of this is that the true cost of doing the short term work is not being properly recognized. From a salary point of view it looks reasonable, but in fact it is startlingly expensive because you're forgoing things with a high overall payoff.

(This is not quite the same idea as technical debt, but it's close.)

TrueSysadminTimeCost written at 00:43:55

