Wandering Thoughts archives

2010-03-31

Looking back at a year of our disk-based backup system

It's now a bit over a year since we first deployed our disk-based backup system (although I only wrote it up in May, after we built the second machine). This makes it a good moment to look back and talk about how things are going, especially since someone I know asked me just this question recently.

On the whole the answer is that things are going well and quietly; with one exception, there haven't really been any surprises or gotchas. The one exception is that we are seeing somewhat more read errors on the hard drives than we expected or are entirely happy with. It's unlikely to be a bad batch of drives, since we also put some of those drives into our iSCSI backends and they haven't been having anywhere near the same rate of errors.

(We're reasonably careful about handling the disks, including letting them spin down before we remove them from their enclosures, but we do handle them and move them around more than desktop drives probably normally experience. Possibly consumer desktop SATA drives are more sensitive and fragile than one might expect.)

Since we are using these disks strictly for relatively short-term backups, I don't consider this a problem. Things happen to backups all the time; that's why you have more than one backup of anything. Our periodic longer-term archival storage runs are still done to LTO tape (as mentioned in the original entry).

One nice benefit that we didn't entirely expect is that our disk-based backups have drastically sped up small restores of recently deleted files, which is the most common sort of restore request we get. We can now usually do these in a few minutes (and without having to get up from our desks to go move tapes around), which we quite appreciate.

DiskBackupSystemII written at 22:28:16

2010-03-29

My theory on why our worklogs work for us

I mentioned in the last entry that 'worklogs', email reports of what we've done on our systems, work for my current job here at the university but did not really work for my previous job, to the extent that we fell out of the habit of writing them. As it happens, I have some theories on why this happened; in fact, I think that there are two important contributing factors, one technical and one cultural.

The first is that we have an official search interface (and private on-web archive) of our worklogs. This makes them actively useful; they are not rarely consulted, essentially make-work history; they are live reference documentation that other people can and will actively use. This creates a strong cultural pressure to keep writing them.

(It also makes it easy to refer to previous worklog messages, since each has an archive URL, and in fact it's common for worklog messages to refer to previous ones for context or fuller explanation of procedures or whatever.)

The other reason is that worklogs are a communication method between members of the group, and this communication is necessary because we are not siloed into little independent areas and projects; we all work on pretty much all areas of our systems. This is very definitely a cultural issue because my previous job wound up being relatively strongly siloed, where each person had their distinct speciality that only they worked on.

(My current job is culturally anti-silo, in fact, in that people actively try to become familiar with areas that others are working on and are uncomfortable with one person being, say, the email specialist. This doesn't mean that we always achieve perfect parity of knowledge and capabilities (and I'm not sure it's even possible), but we do try to make gestures that way.)

WhyWorklogsWorkForUs written at 02:00:43

2010-03-28

The evolution of checklists in my work

Over the past few years, I've become a real fan of using checklists. But I wasn't always this way, and I think how I evolved to using checklists in my work is an interesting illustration of how small cultural things can make a big difference.

As far as I can reverse engineer the evolution, it goes like this:

A few years ago I switched jobs within the university, moving to a new group. The new group had and has a strong culture of writing what we call 'worklogs', an email report of anything we do on our systems. Coming into the group, I picked up on this and started doing it myself.

(In theory my old job also wrote worklogs, but for various reasons we fell out of the habit and thus the practice of doing so.)

Our worklogs usually have more than just high level descriptions of what we did; they have full details, sometimes down to the actual commands and logs of their output. Again, this comes from our local culture. The easy way to do this is to draft your worklog message as you actually go through whatever you're doing.

If I'm writing down all of my commands and actions when I do things, right down to cut & paste, I might as well first write down the commands in my draft worklog message and then copy them into my root session instead of vice versa. That way I don't have to trim out shell prompts and other extraneous bits, and I'm lazy. And once I go that far, I might as well write down the full steps in advance, commands and commentary and all, essentially writing a full draft of my worklog message before I start doing anything.

Voila, I have just written a checklist.

(I did not go through all of these steps instantly; each of them was an incremental shift over time. Nor did I see where they were taking me before I wound up writing up a checklist and walking through it.)
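
To make this concrete, a pre-written worklog draft of this sort might look something like the following. This is an invented example rather than one of our real worklogs; the host and the exact commands are made up for illustration:

    Subject: worklog: replacing failed disk sdb on sandbox

    - check which disk the RAID array thinks has failed:
        cat /proc/mdstat
        mdadm --detail /dev/md0
    - fail and remove the old disk from the array:
        mdadm /dev/md0 --fail /dev/sdb1
        mdadm /dev/md0 --remove /dev/sdb1
    - physically swap the disk, partition it to match, then re-add it:
        mdadm /dev/md0 --add /dev/sdb1
    - watch the rebuild:
        cat /proc/mdstat

Each step gets copied into the root session as I reach it, and any interesting output gets pasted back into the draft.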

ChecklistEvolution written at 01:11:37

2010-03-17

Another building block of my environment: rxterm

Like many sysadmins using Unix workstations, I spend a lot of time running xterms. Given that most of the time the remote X program I start with my rxexec script is an xterm, it's no surprise that I wrote another script to automate all of the magic involved, called rxterm.

Rxterm's basic job is to start an xterm on a remote system with all of the right options set for it; for instance, so that the xterm's title and icon title carry the name of the system that the xterm is logged in to. Like rxexec, rxterm has a number of options that are now vestigial and unused (but still complicate the code).

(Some people set the terminal window title in their prompt. I don't like that approach for various reasons.)
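
In rough outline the core of what rxterm does can be written in a couple of lines. The following is a reconstruction for illustration, not the actual script, and it assumes that rxexec takes the target host as its first argument:

    #!/bin/sh
    # minimal rxterm-ish sketch: start an xterm on a remote host with its
    # title and icon title set to that host's name.
    host="$1"; shift
    exec rxexec "$host" xterm -title "$host" -n "$host" "$@"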

If this were all that rxterm did, it would be a very short script. However it has an additional option that complicates its life a lot: 'rxterm -r <host>' starts an xterm that su's to root with my entire environment set up in advance (because you cannot combine xterm's -ls and -e arguments). Such xterms also get a special title and are red instead of my usual xterm colours.

Setting up my environment is fairly complex, because the things I need to do in the process of su'ing to root vary quite a lot from system to system. On some of them I can just go straight to su, but on others I need to run a cascade of scripts to get everything right. Rxterm has all of the knowledge of which system needs what approach, so I don't have to care. (Every now and then I need to tell it another exception.)
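
The shape of that knowledge is basically a big case statement on the host name. The following sketch is illustrative only; the host patterns and the helper script name are entirely invented:

    # sketch of the per-system part of 'rxterm -r <host>'
    case "$host" in
      oldbox*)  become="/local/adm/bin/root-env-setup" ;;  # needs the script cascade
      *)        become="exec su -" ;;                      # plain su is enough
    esac
    # the root xterm is then started along the lines of:
    #   xterm -title "root @ $host" -n "root @ $host" ... -e /bin/sh -c "$become"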

(In hindsight rxterm's approach to this problem is the wrong one, but that's something for another entry.)

Every so often I consider giving rxterm an option so that it will start a remote gnome-terminal instead of xterm. So far I keep not doing this because gnome-terminal's command line options are so different and the code isn't designed to cope with that, but by this point rxterm has so many historical remnants that I should probably rewrite it from scratch anyways.

(My short shameful confession here is that I had forgotten most of rxterm's arguments until I actually looked at the shell script in the process of writing this entry. Many probably don't work any more, and one actually has the comment 'Doesn't work any more? I lack the time to debug'.)

ToolsRxterm written at 02:51:32

2010-03-15

How to create pointless error reports (and how not to)

Linux's little love notes about software RAID consistency errors make a perfect example of something that system administrators run into all the time: pointless error reports.

It's worth noting that a pointless error report is something different from a useless error report. A useless error report tells you that something has gone wrong but doesn't identify what exactly has gone wrong, and so on; you have to hunt that down on your own. A pointless error report shouldn't even have been generated in the first place, at least not in the form that you get it in. Noise from monitoring systems is one form of pointless error report.

So what makes a pointless error report? The aforementioned software RAID errors have at least three things wrong with them, namely that the error happens all the time, that the 'error' is actually (in theory) something that happens routinely, and that there's nothing you can do about the error in practice. Complaining about non-errors that happen all the time and that you can't do anything about anyways is pretty much the jackpot in terms of pointless error reports.

We can turn this around to create a list of what makes a good error report for sysadmins:

  • it is complaining about a real error (not a routine and theoretically harmless event)
  • ... that does not happen all the time
  • ... that is actively dangerous
  • ... that you can (and should) do something about
  • it contains a clear description of what is wrong
  • it contains all of the details about the situation that are known, provided that those details are useful for resolving the problem (and not merely useful for debugging the code)

Things that fail some of these criteria may be useful to log and capture for historical purposes, but they do not rise to the level of useful error reports. Failing any of the first four points makes an error report pointless; failing the last two makes it more or less useless.
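
For contrast with the kernel's mismatch complaints, here is a minimal sketch of a software RAID check that tries to meet these criteria: it stays quiet unless an array is actually degraded, which is a real, uncommon, dangerous, and actionable condition. The sysfs path is the standard one for Linux md arrays; where the report gets mailed is a placeholder:

    #!/bin/sh
    # report only md arrays that are actually degraded, not routine
    # consistency-check mismatch counts.
    bad=""
    for md in /sys/block/md*/md; do
        [ -e "$md/degraded" ] || continue
        if [ "$(cat "$md/degraded")" != "0" ]; then
            bad="$bad $(basename "$(dirname "$md")")"
        fi
    done
    if [ -n "$bad" ]; then
        {
            echo "Degraded software RAID arrays on $(hostname):$bad"
            for dev in $bad; do mdadm --detail "/dev/$dev"; done
        } | mail -s "degraded RAID arrays on $(hostname)" root
    fi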

I include 'is actively dangerous' on the list of important points because there are always things happening on any system that might be worthy of note, for example people trying brute force attacks on your ssh port. What should create error reports is not merely something wrong, but something that is bad enough that it needs to be dealt with. Someone failing to get into your system with ssh is not worthy of a report; someone ssh'ing in as root and getting the password right but being refused access because you have PermitRootLogin set to no in the sshd configuration, now that is worthy of an error report.

GoodErrorReports written at 01:45:05

2010-03-02

A building block of my environment: rxexec

I have a confession: I kind of like reading about how people have their Unix work environments set up. In the spirit of sharing what I like, I'm going to write up some stuff about my own environment.

By now my personal Unix environment is highly evolved, which is to say that it is extremely customized (and somewhat littered with historical remnants). The customization rests on a whole pyramid of shell script tools and little programs, so I am going to talk about them one by one, starting with the base of the pyramid and working up.

The job of my rxexec script is to run X programs remotely (hence, sort of, its extremely cryptic name). I started this habit back in the days before ssh, when you had underpowered workstations plus relatively powerful servers, and it made a real difference to run xterm remotely (putting essentially no load on the local workstation) instead of running a local xterm, rlogin, and perhaps a shell.

Since it predated ssh, the original version of rxexec used rsh to run a complicated set of magic commands on the remote machine. When ssh came out, a great deal of the magic was steadily replaced by ssh features (as ssh became more and more pervasive around campus). Today, rxexec has only a few features over plain ssh -X <host> <command>:

  • it forces the command to be executed by the Bourne shell, no matter what my remote shell is.

  • it arranges to set $PATH on the remote end to include pretty much every place that X programs have ever lived, including local places, so that I don't have to care where today's system puts xterm.

  • it has a mode where it sets up my entire environment (as if I had logged in) before running the command, so that I don't have to use xterm -ls or the like.

    (This requires a very specific setup for my account on the remote system.)

  • it takes care to disassociate ssh from any controlling tty that it may have, because in my peculiar environment not doing so can cause things to go wrong.

I've deliberately chosen not to use 'ssh -f' in rxexec; I prefer to do any backgrounding outside rxexec, and this way I can use it in situations where I want to wait for the remote (X) program to complete.
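
Stripped of the historical remnants, the modern core of such a script is not very big. The following is an illustrative sketch rather than the real rxexec; the $PATH list is a guess at 'every place X programs have ever lived', and the full-environment mode and the tty disassociation are left out:

    #!/bin/sh
    # rxexec-ish sketch: run an X program on a remote host over ssh,
    # forcing the Bourne shell and a generous $PATH on the far end.
    host="$1"; shift
    xpath="/usr/bin/X11:/usr/X11R6/bin:/usr/openwin/bin:/usr/local/bin:/usr/bin"
    exec ssh -X "$host" "exec /bin/sh -c 'PATH=$xpath:\$PATH; export PATH; exec $*'"

(The quoting here only copes with simple arguments; a real version would have to be more careful than this.)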

Sidebar: what the rsh version of rxexec did

The basic incantation of the rsh based version was something like this:

echo "PATH=...; echo <magic-cookie> | xauth nmerge -; exec $command -display $display $@ <&- >&- 2>&- &" | rsh $host 'exec /bin/sh'

(There was a bunch of other code in rxexec that extracted the necessary Xauth magic cookie, mangled $DISPLAY, and so on. All of this became surplus when ssh became common, much to my relief.)

One reason for this complicated incantation was that I wanted to end up with no surplus processes lingering around on the remote machine (or for that matter, on my own workstation). The ideal outcome was to have only the remote X program itself still running; the rshd, the shell, and so on should all exit.

In the old days, this was drastically complicated by rshd's habit of gifting its children with random open file descriptors. If not closed, these would result in the rshd process not terminating until the X program did, much to my annoyance. (This is where the Bourne shell exec limitation could become very annoying.)

ToolsRxexec written at 01:03:26

