Wandering Thoughts archives

2010-07-31

It's the indirect failure modes that will get you

The University of Toronto's Internet link went down recently (well, became really slow and lossy, so we may just be being DDoS'd or something). I'm at home, so when I noticed the link problems I shrugged and carried on; it's not as if my home machine depends on stuff from work, so I didn't expect anything beyond the annoyance of not being able to get to work networks.

(Although the network being unreachable was going to be somewhat inconvenient, since I had a WanderingThoughts entry to write.)

Except that all of my web browsing was achingly slow. Epically, totally slow. Pages either came up very slowly or came up but with the browser insisting they were still loading. This was quite puzzling; my network link wasn't busy and it's not as if I proxy my web traffic through work. A check of my DNS setup confirmed that I was using my local caching DNS server and that it wasn't bouncing everything through work.

And then I looked at my DNS server's query logs:

[...] query [...] www.flickr.com.cs.toronto.edu.
[...] query [...] www.flickr.com.toronto.edu.
[...] query [...] www.flickr.com.

An uncomfortable light dawned. I had work's domains configured as my search domain list in /etc/resolv.conf and I had the ndots option set very high (for bad reasons), so every hostname resolution attempt was trying several university domains first. Normally I don't notice these because I promptly get negative answers from work's nameservers, but with the university's Internet link down those queries instead had to time out before the lookup could move on to trying the real name.
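
To make this concrete, here is a sketch of the sort of resolv.conf that produces this behavior; the search domains match the query log above, but the exact ndots value and nameserver are illustrative rather than my actual configuration:

search cs.toronto.edu toronto.edu
options ndots:5
nameserver 127.0.0.1

With ndots:5, even 'www.flickr.com' (two dots) gets expanded through the whole search list before being tried as an absolute name.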

It turns out that modern web pages use a lot of different things from a lot of different domains. When each of these domains takes plural seconds to resolve, loading pages gets really slow. Slow on the initial load (as the browser resolves the actual website IP address) and then slow to finish, as the browser tries to fetch additional resource after additional resource.

This isn't a direct failure mode, where I was routing traffic through work; instead it was an indirect failure mode, where a couple of configuration options had a non-obvious effect that was itself relatively invisible in normal operation. Direct failure modes are easy to see and relatively easy to remember; you can, for example, see that all of your traffic goes over your VPN to work, a VPN that is not working. Indirect failures are much less obvious, which makes them much more interesting (in the sense of causing excitement) and much harder to notice in advance.

Sidebar: my ndots mistake

Many years ago when I first ran into the ndots option in resolv.conf, either it behaved differently than it does today or I just wound up with a mistaken impression about how it works. Back then, I believed that queries for names with at least ndots dots in them entirely ignored the resolv.conf search path and only ever looked up the absolute hostname. Since we love using abbreviated hostnames around here and local subdomains can have any number of dots in them, this implied that essentially no small value of ndots was safe. Thus I set a very large one and grumbled, and carried all of this forward when I configured my home machine.

This is not how ndots works today; today, ndots just sets the point at which the resolver will try the name as an absolute hostname before trying your search path, instead of trying it as an absolute hostname only after running all the way through the search path. This is safe, and implies that an ndots of 2 is generally what I want (since I make frequent use of '<host>.<subdomain>' to refer to various machines at work).
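
A minimal sketch of the safer version (again with the search list inferred from the query log earlier):

search cs.toronto.edu toronto.edu
options ndots:2

With this, a '<host>.<subdomain>' name (one dot) still goes through the search list first, while something like 'www.flickr.com' (two dots) is immediately tried as an absolute name.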

IndirectFailureModes written at 02:53:15; Add Comment

2010-07-29

Some brief notes on OpenSSH's known_hosts hashing

A number of current distributions of OpenSSH default to storing host names and IP addresses in ~/.ssh/known_hosts in a hashed form, in order to make it harder for an intruder to work out which other systems you have accounts on and access from this one (this is ssh's HashKnownHosts option). Since I recently wound up digging into this and the details are underdocumented, here's what I know about how it works.

The summary is that this is your traditional one-way cryptographic hash. The specific hash is a SHA1-based HMAC, but I strongly suggest not writing any code that knows that. The host name or IP address is treated like a password and hashed together with a random salt; both the salt and the HMAC result are stored in the known_hosts line. Matching the line later is done by extracting the salt, HMAC'ing your candidate hostname with it, and seeing if you get the same hashed result.

(The salt appears to have relatively strong randomness.)
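
To make the mechanics concrete, here is a sketch of checking a single hashed entry by hand with the openssl command line. The '|1|<base64 salt>|<base64 HMAC>' line format is OpenSSH's, but this script is my illustration rather than OpenSSH's own code, and per the above you shouldn't build anything real that knows these details:

#!/bin/sh
# usage: check-hashed '<hashed known_hosts line>' <hostname>
line="$1"; host="$2"

# extract the base64 salt and HMAC result from '|1|<salt>|<hash> <keytype> ...'
salt=$(echo "$line" | cut -d'|' -f3)
want=$(echo "$line" | cut -d'|' -f4 | cut -d' ' -f1)

# HMAC-SHA1 the candidate hostname, keyed with the decoded salt
hexsalt=$(echo "$salt" | openssl base64 -d | od -An -tx1 | tr -d ' \n')
got=$(printf '%s' "$host" |
  openssl dgst -sha1 -mac HMAC -macopt "hexkey:$hexsalt" -binary |
  openssl base64)

[ "$got" = "$want" ] && echo "$host matches" || echo "$host does not match"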

This means that checking to see if a particular host is present in a known_hosts file requires computing a separate HMAC for each line in the file. I imagine that this is not a problem in practice since most people have relatively short known_hosts files and SHA1 HMAC is relatively fast. As with unencrypted hostnames, it's possible to have multiple entries for a given host in known_hosts, each with a different key; if all of the hostnames are hashed, this may not be at all obvious.

(See sshd(8) for how multiple entries for a single host work. The short answer is that OpenSSH considers itself to have found a known host key if any of them match.)

This all means that hashed known_hosts files are system-independent and will keep working fine when moved to a different host.

(As it turned out, the problem I was seeing was because my new test system had a different system known hosts file. Once I fixed that, everything worked, but I almost went off on a complete wild goose chase worrying about potential system dependent hashing of known_hosts. Having a hashed known_hosts did make it less obvious that the other host's key wasn't even in it, though.)

KnownHostsHashing written at 01:55:39; Add Comment

2010-07-23

A realization about why my inbox keeps being my to-do tracker

I have a problem. It's not a unique or novel problem; I understand it's one that a lot of people have. My problem is that over time my email inbox quietly winds up getting used as my to-do tracker, basically regardless of how I'm theoretically trying to keep track of this stuff. I leave messages in my inbox to remind me to do things, and the latest go-round of this has now reached the point where I email myself notes about things I want to do.

Today I had an insight about why this happens: because my inbox has visibility. I look at my inbox regularly, and in fact I have to look at it, because dealing with new email that comes in is part of my job as a sysadmin. Looking at my inbox means that the to-do items are visible, which reminds me of them, which drastically raises the chances that they'll get done.

Nothing else in my environment has comparable visibility; there is no other system that I have to check to do my job in the way that I have to check email. (To the extent that there are other systems I check that often, those systems don't have any natural way of showing varied messages. Sure, I could make my little script that monitors mail queue sizes also show me the top of some to-do file, but it would be completely artificial and I would take it out in a week.)

Looked at that way, it's no wonder that I keep drifting back to using my inbox to hold to-do items. It also illuminates the problem with using my inbox for this, which is that I lose track of sufficiently old things to do. This happens because at a certain point there are enough messages in my inbox that the old messages lose visibility because they aren't within a screen or two of the most recent messages; at that point they almost might as well not exist, based on how much further attention they'll get.

This suggests certain things about any to-do program or technique that I want to be successful. Clearly I need to work out some natural way to make the list visible, in fact to shove it in front of my face on a regular basis. If I can't do that it should at least be highly accessible, in a sense the opposite of my low distraction email notifier; there should be something on the screen all the time to remind me of my to-do list, and it should make getting to the actual list as easy and simple as possible so that I invoke it frequently.

(Somehow making it part of my browser start page might do the trick, although I'm not sure if that would feel natural enough.)

WhyInboxTodo written at 00:45:54; Add Comment

2010-07-22

Why keeping /etc under version control doesn't entirely help

One of the reasons that I'm not too enthused with various schemes to put /etc under version control is that they don't give me what I really want out of the whole exercise.

First, let's assume that you have somehow divided your /etc repository up into a lot of separate modules in order to keep things straight. Without loss of generality we can look only at the evolution of a single module on a single system over time. The problem is that you really have three separate strands of development in action:

  • the evolution of the system's stock version of the files, if there is any.
  • the abstracted evolution of your general local version; what sort of changes and customizations you make and how you change this.
  • the merger between the first two, where you customize your general local changes for the current base state of the system's files.

(If you are trying to have several different systems or sorts of systems use the same repository, things get even more tangled.)

A purely time-based history will get you some tangled mixture of these three strands (probably some of them will only be implicit). It is possible to use branches, rebasing, or both to try to keep the strands separate, but at that point you start needing significant tool support (or mistake-prone manual intervention) because you can't just automatically checkpoint the state of /etc after any change; you need to figure out the context of each change and put it in the appropriate branch and then do the remaining work.
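
As an illustration of the ceremony involved, here is a minimal sketch of one way to do it with git and vendor-style branches; the branch and file names are made up for the example, and the hard part remains deciding which branch each change belongs on:

cd /etc
# strand 1: record the stock version of a file as the OS ships it
git checkout vendor
cp /mnt/newos/etc/ntp.conf ntp.conf
git commit -am 'stock ntp.conf from the OS update'
# strand 3: merge the new stock state into what this system actually runs,
# reconciling it with your general local customizations (strand 2) by hand
git checkout master
git merge vendor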

(There are still good reasons to have time-based snapshots of all of your configuration files along with some commentary on what changed and why, but there are a lot of mechanisms for doing that. Putting /etc under version control may or may not be the simplest one in any particular environment.)

EtcVCSLimitation written at 01:30:16; Add Comment

2010-07-21

The easy way to do fast OS upgrades

We recently went through the experience of upgrading all of our ZFS fileservers from Solaris 10 update 6 to Solaris 10 update 8. This took somewhere around twenty minutes of downtime per fileserver, most of which was waiting for ZFS pools to slowly import.

You might wonder how we got an OS upgrade to go so fast. The answer is that we cheated, twice.

The first way we cheated is that we didn't upgrade the OS; instead, we (re)installed Solaris 10 update 8 from scratch. This is our traditional approach with most of our servers (anything that doesn't have important local data, and we try not to have servers with important local data). We need to be able to reinstall servers anyways to cope with hardware problems, and once you have a well-tested reinstall process you might as well use it for everything.

The second way we cheated is that we didn't reinstall S10U8 on the same machine. Our ZFS fileservers have swappable disks, so we did the install on a spare server (with identical hardware) then swapped the new S10U8 disks into the actual physical fileserver during the downtime. And then, of course, we had to fix up all of the places on the system that knew what host it was and what hardware it was running on, which is really why the downtimes took more than a minute or two.
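
As a sketch of what that fixing up involves, these are the usual sorts of places a Solaris 10 system knows its own identity; the hostname is made up, and this is generic Solaris knowledge rather than our actual checklist:

echo fileserver3 > /etc/nodename   # the machine's node name
vi /etc/hosts                      # its own name to IP address mappings
ls /etc/hostname.*                 # per-network-interface configuration files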

(This also gave us a rapid fallback if we had to; we could have just pulled the S10U8 disks and put the S10U6 disks back in.)

Now, various OSes have various sorts of software based fast upgrade schemes, and some of them even work reliably. But you can be pretty sure that swapping disks will work for anything, provided only that you can rename a system and move its system disks between hardware units, and you're going to want to work out how to do both of those anyways.

(Sadly, these days systems are increasingly welded to the hardware that they were installed on in various perverse ways that require annoying amounts of effort to reverse or override.)

FastOSUpgrades written at 00:22:26; Add Comment

2010-07-18

How I solve the configure memory problem

For my sins I sometimes build programs from source that use standard autoconf-created configure scripts. The whole autoconf system has a number of problems, which you can read people rant about at length, but my problem is that I almost invariably build things using various different non-default arguments. Which leads to what I call the 'configure memory problem': when it comes time to rebuild the package for whatever reason, how do I remember which configure arguments I used and recreate them? Especially, how do I have this sitting around in a convenient form that requires as little hand work as possible?

(Yes, configure will write all of this information in various magic files. Which it puts in the source directory, and which get deleted if you do a forceful enough 'make clean', and so on.)

I've gone through a number of evolutionary steps on this problem: not worrying about it at all, and then discovering that I'd forgotten the necessary magic arguments and had to recreate them; putting a little script file that runs configure with the right arguments in the source directory, where it gets deleted when I clean up the source directory; and putting the little script file in the parent directory, where I lose track of it. None of these were ultimately satisfactory, so my current solution is a master script that I named doconfig.

My version has a bunch of complications, but at its heart doconfig can be written as the following simple script:

#!/bin/sh
# run the per-package configure script matching the current directory's name;
# $arch is assumed to already be set in the environment.
CONFDIR=$HOME/lib/$arch/src-configs

dname=$(basename "$(pwd)")
if [ -x "$CONFDIR/$dname" ]; then
   exec "$CONFDIR/$dname"
else
   echo "$0: cannot configure $dname" 1>&2
   exit 1
fi

(The complications exist to deal with all of the cases where the directory you need to run configure in is not a handily named top level directory. My version has an overly complicated file that maps some amount of trailing directory components to a script name, so that I can say that X11/fvwm-cvs/fvwm maps to the fvwm-cvs script.)
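
For illustration, a per-package script in $CONFDIR can be as simple as the following; the configure options are made up, since the whole point is that the file freezes whatever arguments that particular package needs:

#!/bin/sh
# ~/lib/$arch/src-configs/fvwm-cvs
exec ./configure --prefix=$HOME/local --disable-nls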

As you might expect, this turns out to be a handy way of capturing my knowledge of how to configure specific packages on my systems. Its simplicity means that it's easy and fast to take advantage of, and because the configurations are themselves scripts, it's easy enough to augment them for various situations (such as handling both 32-bit and 64-bit builds, where they need different configure arguments).

MyConfigureSolution written at 01:03:21; Add Comment

2010-07-17

More building blocks of my environment: tkrxterm, tkssh, and pyhosts

Given my rxterm and sshterm scripts, I need some way to run them in order to do useful things. The obvious thing to do is just to run them from the command line, but that requires having a command line sitting around, and in my disposable environment this is generally not the case.

tkrxterm and tkssh are two very simple Tcl/Tk programs that do essentially the same thing. They each throw up a window with a label and a text entry field, let me enter text, and then run the appropriate command with that text as the arguments when I hit return in the entry field. I have my window manager set to bring up an applications menu when the middle mouse button is pressed in the root window, and one or the other of these is the top entry on that menu.

(At home I pretty much only use sshterm, so it is the top entry; at work, I have enough bandwidth to use rxterm by preference.)

If I'm opening several windows on a single machine, though, it's kind of annoying to have to keep calling up one or the other of these programs. So I have a third command, pyhosts, which is designed to make it more convenient to repeatedly open windows on random machines. The easiest way to explain how it works is to show you that rarity on WanderingThoughts, a picture:

[Picture: the pyhosts window]

The empty area just below the top is a text entry field; the apps1 and mailswitch labels are buttons. If I enter a machine name in the text field and hit return, pyhosts starts an rxterm (or an sshterm if I flip it to do that) and adds a label button for the machine. Clicking on the label starts another rxterm or sshterm. Using shift-return (in the text entry field) or the middle mouse button (on the label button) runs rxterm -r to get a root shell instead of a normal shell. Machine labels are sorted in alphabetical order and only the most recently used four are kept, in order to keep the size of the window down.

(Because I am grimly decluttered in my computer interfaces, one can delete machine labels with the right mouse button and I routinely do so, reducing the pyhosts window to its minimum size.)

I am not sure that this pyhosts setup is the last word in getting decently convenient access to a reasonably large collection of random machines, especially given my flirtation with sshmenu and other hacks. However, I have not yet attempted to come up with a better approach.

(My normal full desktop environment isn't a Gnome desktop, so I can't just use sshmenu et al. Some day I will figure out how to have Gnome applets and Gnome's alert area without having to run all of Gnome.)

Credit where credit is due department: pyhosts is my adaptation of a program that I inherited from a previous coworker (who I believe may have inherited it from yet another coworker), because I saw it and thought it was a nifty idea. Most of the code is not my own.

(The chainsaw marks are my work.)

Sidebar: an interface that didn't work

I have experimented with hooking things up so that if I select a machine name (in the X selection sense) and call up a menu entry, it automatically runs an sshterm or rxterm to that machine. This sounds like it would be neat and convenient; you could do things like get email mentioning problems on machine X, highlight the machine's name in the email, pop up the menu, and bang, have a login on the machine to poke around.

In practice, I don't seem to run into this sort of scenario often enough to make this feature worth remembering. It was faster to just retype the machine name (or select and paste it, which is a couple of mouse clicks) than it was to find and invoke the special 'fast' way.

ToolsPyhosts written at 00:53:00; Add Comment

2010-07-12

On (not) logging calculated statistics

The more I look at the statistics logged by various systems and programs, the more I've come to a conclusion: logging calculated stats as well as the raw stats is almost always a waste of time. Log analysis programs can just as well change units, compute (nominal) averages per time interval, and so on; meanwhile, logging both just clutters up my logs (and sometimes subtracts clarity, often when it's not obvious that something is a calculated stat instead of a real one).

This is not a completely hard rule. Sometimes the calculated stat is both immediately useful enough for people glancing at the unprocessed logs and hard enough to work out by hand in one's head that calculating and logging it is warranted. But my gut feeling is that these cases are pretty rare.

(If your system logs only the calculated stats and doesn't record the raw information, especially if it aggregates the raw information together, you're probably annoying me. Hiding the raw data just makes it harder for me to diagnose problems that you didn't think of, where I really want the unprocessed information so that I can try to extract as much from it as possible.)

This doesn't apply to programs that just present data instead of logging it; for this sort of thing you want the information to be in as friendly a format as possible, so turning unwieldy raw stats into nice friendly calculated ones is a good thing. But watch out. Sometimes the only way of getting useful data to log is to capture the output of that 'friendly' presentation program, and then sysadmins are going to want the stats in as close to a raw format as possible.

(Note that one problem with friendly calculated stats is that the same formatting that makes them attractive to humans makes them harder for programs to parse. As a person, I like seeing things like '10 GB'; as a programmer who now has to parse that field back to some value so I can sort it or compare it with other fields, I like it a lot less.)
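
As a tiny demonstration of that parsing burden, here is an illustrative one-liner that turns a '10 GB' style field back into bytes so it can be sorted or compared; real 'friendly' output is rarely even this regular:

echo "10 GB" | awk '{ u["KB"] = 2^10; u["MB"] = 2^20; u["GB"] = 2^30; print $1 * u[$2] }'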

NotLoggingCalculatedStats written at 00:21:58; Add Comment

