2010-05-31
Another building block of my environment: sshterm
I am unreasonably fond of running X programs remotely; it's always
struck me as one of the niftier bits of X, and I like to use it as
much as possible. But even I have to admit that it's not always the
right answer, and thus sometimes my rxterm script
isn't what I want. For those times I have another script, which I
unimaginatively call sshterm.
Sshterm is the direct inverse of rxterm; instead of using ssh to run
a remote xterm, it uses a local xterm to run ssh to the remote
machine, with some trimmings. Because this is a much simpler job than
rxterm's, the script is a lot shorter, but it does have a few
important features that complicate it a bit. First, it puts the remote
machine's name in the xterm title so that I can tell my xterms apart
(although many shell environments immediately overwrite the window
title anyway, that behavior is not yet universal). Next, it turns the xterm
red if I am ssh'ing to something with 'root@' in the hostname, just
like how I have 'rxterm -r' set up. Finally, it has an option to run
gnome-terminal instead of xterm (and makes everything work just the
same with it).
(It turns out that there are a certain number of things that just
work better in a UTF-8 gnome-terminal environment than in my plain
xterm one. Usually these are programs that try drawing elaborate text
graphics, such as certain Debian and Ubuntu package installation tools.)
In theory sshterm accepts a -r argument, just like rxterm. In
practice I never use it and instead just tell sshterm to connect to
'root@wherever' when I want to be root somewhere.
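To make this concrete, here is a minimal sketch of the idea. This is not the real sshterm; the option handling, the specific red, and the 'RootRed' gnome-terminal profile name are all invented for illustration:

#!/bin/sh
# sshterm sketch: run a local terminal whose job is to ssh somewhere
# usage: sshterm [-g] [user@]host
usegnome=no
if [ "$1" = "-g" ]; then
    usegnome=yes; shift
fi
host="$1"
# turn the text red if we are ssh'ing in as root
case "$host" in
root@*) fg=red3 ;;
*)      fg='' ;;
esac
if [ "$usegnome" = no ]; then
    exec xterm ${fg:+-fg $fg} -title "$host" -e ssh "$host"
else
    # gnome-terminal has no text colour options; 'RootRed' is a
    # hypothetical profile created in advance with red text
    exec gnome-terminal ${fg:+--window-with-profile=RootRed} \
        --title "$host" -e "ssh $host"
fi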
In a sense sshterm is a silly command; it's not very difficult to
start a terminal window and then type ssh into it. But in practice
it's been one of those little lubricants that make things enough easier
that I use it all the time, because it handles all of the little fiddly
details for me.
Sidebar: on marking root (terminal) windows
I have a personal twitch where I want all windows where I am root to be clearly visually distinct, so that they instantly stand out when I look at them (even if I'm vaguely distracted). Some people use the shell prompt for this (and I do to a certain extent), but I find that this doesn't stand out quite enough for my tastes, so many years ago I arranged to make the foreground text be a pretty strong red in such windows, instead of my usual black.
In theory one could probably do this with xterm escape sequences.
In practice I do it with xterm command line options, which has the
drawback that it doesn't work in windows where I started out normal
and then su'd to root later. Fortunately I don't do that very much,
especially with tools like my rxterm script around.
(gnome-terminal has no command line options to control the foreground
text colour. Instead you have to create a new profile with a different
text colour and then use a command line option to set the initial
profile.)
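For concreteness, the two approaches look roughly like this (the colour choice is illustrative; the escape sequence is xterm's 'dynamic colours' control, where OSC 10 sets the text foreground, and it's what could in theory rescue a window you've su'd in):

# at startup, via command line options:
xterm -fg red3 -e ssh root@somewhere

# in theory, after the fact via an escape sequence, e.g. from
# root's shell startup files in an already-running xterm:
printf '\033]10;red3\007'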
2010-05-24
Give your personal scripts good error messages
Sysadmins have a habit of accumulating little personal scripts to make our lives more convenient, often things that make perfect sense to us but are too peculiar or specific to be worth installing generally. Since these are personal hacks, they're often bashed together in very casual ways; if it mostly works for us, it's good enough.
From personal experience, I now have a suggestion for these scripts: make sure that they have clear or at least comprehensible error messages, not little cryptic ones. If you do not, there will come a day when you get that cryptic error message, generally quite a long time after having written that script, and you will sit there going 'what the heck does this mean?' and scratching your head. And then you will wind up retracing and reverse-engineering your script, and that is just plain embarrassing.
This is especially important if the error message is about a can't-happen situation, or at least one that your script doesn't handle, because these are just the sort of things that come up a year or two after you've written the script. By the way, I strongly advocate making any personal scripts that summarize and amalgamate information explicitly check for situations that they don't handle. This saves them from silently producing very wrong answers and possibly having you act on those answers.
(It's common for me to take shortcuts and make assumptions in such scripts, since after all they only have to work in our very specific environment. Of course, sometimes these aspects of our environment change and my assumptions blow up.)
A good error message is one that is clear and complete enough that you know what went wrong without remembering how the script works. Especially if it's for an error you never expect to happen, err on the side of verbosity and over-explaining things; terse error messages make sense only when you're going to see them reasonably often.
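As a made-up illustration (the list file and its format here are invented), compare a bare 'bad line' message with a check that spells out the assumption it is enforcing:

#!/bin/sh
# sketch: report disk usage for filesystems listed in a file,
# assuming (narrowly) one absolute path per line
LIST=/etc/local/fs-list
while read fs rest; do
    case "$fs" in
    '') ;;                  # skip blank lines
    /*) df -h "$fs" ;;      # the only case we actually handle
    *)
        echo "$0: unhandled line in $LIST: '$fs $rest'" 1>&2
        echo "$0: this script assumes one absolute path per line;" 1>&2
        echo "$0: has the format of $LIST changed since this was written?" 1>&2
        exit 1 ;;
    esac
done <"$LIST"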
Sadly, doing this well can be harder than it looks. When you're writing a script you're immersed in what it does and how it works, so even if you think in these terms it's quite easy to over-estimate how much you'll remember six months or a year from now when you actually get that error message. Speaking from more personal experience, even rewriting the error message after you've had to retrace the script a year later doesn't entirely help, since of course now you understand the script again.
(And on that note, I should go revise an error message or two.)
2010-05-16
You should also document why you didn't do attractive things
I recently needed to do something to our MoinMoin-based wiki. As it was currently configured, doing that thing was a parade of annoyance, so I wound up rummaging through the configuration options and found one that significantly simplified my task. Now, suddenly, I had a dilemma.
Our existing MoinMoin configuration didn't have this option turned on, and the person who configured our MoinMoin instance isn't around any more to be asked questions. So, had they overlooked this option when they set up the wiki, or had they tried it and discovered that sadly it didn't actually work or worse, had some undesirable side effect elsewhere?
(We have documentation on the configuration settings that we use, but as is common it only covers what got changed from the defaults. And this option defaults to off.)
So, I have a suggestion: when you are documenting how you configured something, you should take a moment to write down all of the attractive-looking options and approaches you tried out but that turned out not to work, caused problems, or whatever. Otherwise you risk someone coming along later, seeing that you have not done something that would make their life easier, turning it on themselves on the assumption that you overlooked it, and having things explode.
Conversely, documenting the things that you tried but didn't use gives those people some confidence that any convenient option you don't mention really was just overlooked, and thus that they can turn it on with only ordinary precautions and concerns.
By the way, your future self is likely to be one of those people (unless you have a better memory than I do for things that I tried and rejected).
PS: this generalizes to more than software configuration files.
(To add a conclusion to my MoinMoin story, so far it seems that the configuration option I found works fine and has no bad side effects. I'm happy.)
2010-05-14
A sysadmin mistake: shooting your virtual foot off
Here is a mistake that we've actually made more than once.
We have NFS fileservers, and to enable basic NFS server failover each of them has both its real hostname (and IP) and a virtual IP alias. The real hostname is relatively long (we use the names of cities that start with 'san'); the fileserver's virtual hostname is short ('fsN' for some single-digit N). The result is that when we log into machines, we almost always use the virtual hostname since it's shorter, easier to remember, and what we care about.
(Sometimes when we can't recall which physical machine is which fileserver, we actually work it out by logging in to fsN and seeing what hostname it has. Hey, it's easier and faster than any other method I can think of.)
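Schematically, the setup on one of these Solaris fileservers looks something like the following (the interface name matches the example below; the exact ifconfig invocations are illustrative, not taken from our systems):

# the real hostname lives on the physical interface:
ifconfig e1000g1 sanwhatever netmask + broadcast + up
# the virtual fileserver name rides on a logical interface on top;
# Solaris calls this e1000g1:1, which matters later:
ifconfig e1000g1 addif fs9 netmask + broadcast + up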
Suppose that we want to take a virtual fileserver's IP alias off the network, either to move it between physical servers or because we're about to do something that would cause NFS clients to get spurious 'permission denied' error messages on NFS operations if they could actually talk to the fileserver. So along we go; we log in to the machine, do various other prep work, and bring down the virtual IP:
$ ssh root@fs9
[... other stuff ...]
[root@sanwhatever-fs9]# ifconfig e1000g1:1 down
And suddenly our ssh session hangs. People sit around scratching their heads and worrying about the machine crashing until suddenly the light dawns: we just shot our virtual foot off. Well, it's more that we just sawed off the branch that we were standing on.
Oh sure, we were thinking 'we logged in to the machine and took down the virtual IP alias, why did our session hang?'. But that's not what we actually did. We logged in to the virtual IP alias, because that's what the convenient short name maps to. It's just that normally the difference between logging in to the machine via the virtual IP alias and its real IP address doesn't matter, so we forget this picky distinction. However, when you're logged in to an IP address and you take that IP address down, well, yes, you lose your connection.
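One way to defend against this (a suggestion, not something we did at the time) is to check which server-side address your ssh session actually came in on before you down anything; OpenSSH exports this as the third field of $SSH_CONNECTION:

# fields: client-ip client-port server-ip server-port
[root@sanwhatever-fs9]# echo $SSH_CONNECTION
<client-ip> <client-port> <server-ip> <server-port>

If the server IP there is the virtual alias's address, 'ifconfig e1000g1:1 down' is going to take your session with it.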
(From one perspective, this is an example of an abstraction failing under a corner case. Normally we can use 'fsN' and 'sanwhatever' as the same thing, as an abstraction; this is one of the cases where we can't, but it's easy to forget because we're so accustomed to the abstraction.)
2010-05-11
Retrospectives are uncommon
Something I have been mulling over as a result of this entry (and being prompted to write it) is how uncommon retrospectives are in writeups of things. Over the years in all of the usual sources, both old and new, I've seen quite a lot of writeups of the form 'this is the shiny thing that we've implemented and here are our short-term experiences'; heck, I've written any number of them myself here on WanderingThoughts. But I've not seen very many looks back after a year or three, or after people have had the chance to do a second version of whatever system (or at least considered and rejected doing it).
(This pattern has more or less held true well before people started writing blogs; I am pretty sure I saw much the same effect in LISA proceedings in the early and mid 1990s.)
Writing up things when they are new and novel and you are enthusiastic about them is not a bad thing, and often retrospectives are less interesting than the original writeup (especially when everything works well). But I suspect that there's usually something interesting to be learned after a year or two, and I have to wonder how much we're missing by not writing and publishing retrospectives more often, or at least thinking about them.
A retrospective has a lot of potential ground to cover even if things went well in general. What worked and what didn't? What surprises came up? What would you change if you did it again? What important thing turned out to be missing, or conversely what did you spend a lot of time on that turned out to be unnecessary?
(When things turned out not to work out after all, a retrospective is even more interesting although often more painful to write.)
And even if the answers to these questions are all boring or you don't write anything up in the end, I think that there's clearly value in regularly looking back at our work this way. For better or worse, a lot of system administration is relentlessly forward looking; if something is working well enough, it's easy to put it out of our minds even if there are things we could learn from it.
(Having said all of that, I have no idea if I'll have the time and energy to put this into action on any of the various things I've written up here. Which neatly illustrates the problem with all of this; who has the time?)
2010-05-09
Why diskless Unix machines lost out
A long time ago, diskless Unix machines were all the rage. These days, they've all but vanished (although they live on in some specialized applications, such as LTSP). From my perspective, what made diskless machines lose out is threefold: performance issues, cheap disks, and complexity. Of these, the real killer issue was complexity.
The hard core of the performance issues is not so much how fast a single machine could access its 'disks' (although this should not be underestimated itself) but how fast a whole bunch of them could do this all at once. Gigabit Ethernet may be as fast or faster than modern disks, but that's only if you're the only one trying to talk to your fileserver; have a few people trying to do that, and the fileserver's network becomes a bottleneck very fast even if its disks can keep up.
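(To put rough illustrative numbers on it: gigabit Ethernet tops out at about 125 MBytes/sec of raw bandwidth, so ten diskless clients reading from one fileserver at once get at most roughly 12 MBytes/sec each, well below what even a single local disk could sustain by 2010.)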
(You can argue that single client/single server numbers are now comparable for local disks and NFS filesystems. This may be true (I haven't measured), but it definitely wasn't in the past. For a long time, NFS was slower to significantly slower than local disks, especially for certain sorts of write operations.)
The hard core of the complexity of diskless machines is that Unix systems were never redesigned to allow you to really share all filesystems between machines, especially including the root filesystem. Instead, they always needed their own root filesystem and per-machine storage and some degree of per-machine administration to go with it. Once you have to have this storage and administration, where the actual storage lives is usually a secondary issue; you might as well put it where it is cheap and fast and common and does not cause you problems.
(There's nothing intrinsic in Unix that requires a per-machine root filesystem (eg, see here), but no one has ever seriously tried to build a general-purpose Unix that way.)
(This entry was inspired by reading Scalable day-to-day diskless booting, because I disagree strongly with their view that having the OS on local disks is incompatible with large scale administration. The truth is that building systems that use a single common system image is both hard and completely unsupported by current Unixes; you can probably make it work, but you'll be building it from scratch on your own. If you don't have a single system image, you need automation regardless of where your separate system images live and how much or little space they take up.)