Wandering Thoughts archives

2005-08-24

Another aphorism of system administration

If a piece of information doesn't have to be correct for the system to work, sooner or later it won't be.

(Yanked out of my 'spam by ASN' summary, where a version of it got a mention in passing.)

It's easy to see this aphorism in action. For example, if your host or domain name to IP address mapping information (forward DNS) is wrong, you notice right away. But nothing common breaks if the information to map from an IP address back to a hostname (reverse DNS) is missing, and lo, it is missing all over the place.
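
To make this concrete, here is a minimal Python sketch of the sort of check involved: it verifies that each host's addresses have reverse DNS and that the reverse DNS points back at the host. The host names in it are made-up placeholders, not anything we actually check.

#!/usr/bin/env python
# Minimal sketch: report hosts whose forward DNS works but whose
# reverse DNS is missing or doesn't point back at them.
# The host list is a made-up placeholder.
import socket

HOSTS = ["www.example.org", "mail.example.org"]

for host in HOSTS:
    try:
        addrs = socket.gethostbyname_ex(host)[2]
    except socket.error as e:
        print("%s: forward lookup failed (%s)" % (host, e))
        continue
    for addr in addrs:
        try:
            rname = socket.gethostbyaddr(addr)[0]
        except socket.error:
            print("%s: no reverse DNS for %s" % (host, addr))
            continue
        if rname.lower() != host.lower():
            print("%s: %s reverses to %s instead" % (host, addr, rname))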

And of course comments in source code are the classic case. Nothing breaks if the comment doesn't describe what the code is actually doing, so often the comment doesn't.

Corollary:

Attempting to validate a non-essential piece of information will inevitably turn up lots of perfectly valid systems that have the information wrong.

This most often comes up in antispam efforts, where people desperately start insisting on correct information in previously non-essential bits and discover that lots of people have it wrong or broken, often people they still want to talk to. For example, for years almost no one cared what an SMTP HELO or EHLO greeting said, so real mail servers have all sorts of broken greetings.
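
For illustration, the kind of HELO check people attempt looks roughly like the following minimal Python sketch. It is not anyone's production filter, and it is exactly the sort of strictness that turns up legitimate but sloppy mail servers.

# Sketch of the kind of HELO/EHLO check people attempt: accept an
# address literal, otherwise insist on something shaped like a fully
# qualified domain name that actually resolves.
import re
import socket

def helo_looks_ok(helo):
    # Address literals like [192.0.2.1] are allowed by the RFCs.
    if re.match(r'^\[[0-9.]+\]$', helo):
        return True
    if not re.match(r'^[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$', helo):
        return False
    try:
        socket.gethostbyname(helo)
    except socket.error:
        return False
    return True

print(helo_looks_ok("localhost"))         # False: not fully qualified
print(helo_looks_ok("mail.example.com"))  # True only if the name resolves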

There are two fundamental reasons for this:

  • People are lazy; they don't like doing things that seem to just be make-work. (This is one big reason why security is a pain, too.)
  • It's hard to notice incorrect information that nothing depends on.

Corollary: if you want to ensure that some piece of information is correct, you must make something important check it and depend on it. The more important the better, because otherwise people may just ignore the fact that your checker is either screaming or broken.

SysadminAphorismII written at 17:35:56

2005-08-23

Diagnosing an install problem: a case study in indirect failures

Today I tried to Kickstart-install two SATA-based machines we have in for evaluation, booting them from USB memory sticks. Unfortunately it didn't work; something aborted around the time our customization stage took over. (I didn't get to see what, because the machine promptly rebooted.)

One of my longer-term irritations with Anaconda is that you only get a binary choice of 'always reboot afterwards' or 'never reboot afterwards'; there is no option for 'reboot if all went well, otherwise sit there to let me look at diagnostics'. This lack somewhat slows down troubleshooting, partly because you first have to notice that something went wrong. (Since the machine came up thinking it was called localhost.localdomain, that was fortunately easy.)

Just as I wrote this, I logged in to my test machine and discovered that it too had come up as localhost.localdomain during a test reinstall run. This was good news, because being able to reproduce a problem is always good news. However, this gave me a puzzle: before, I had assumed that the broken machines came up without networking (the usual reason for coming up with such a bad name). But I was logging in over the network; how had the machine come up with a broken hostname but still on the network?

First hypothesis: maybe the test machine had gotten the wrong nameserver somehow. I looked at its /etc/resolv.conf, but it was listing our usual caching DNS server on the mailer machine (email is the most DNS intensive thing we do, so it's the best place for the cache).

At this point it's relevant to mention that electrical work in the building with our primary machine room caused us to shut down and then restart most of our servers.

Second hypothesis: 'oh my god, did the caching nameserver daemon fail to restart when we rebooted?' Survey says: whoops, yes it did. Bad me for not noticing this for more than 24 hours; clearly we need better monitoring software.
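
(As an aside, the monitoring for this particular failure doesn't have to be elaborate. Here is a minimal Python sketch that just checks whether anything is answering on the DNS port at all; the server name is a stand-in, not our real machine, and a check like this wouldn't catch a nameserver that is up but answering badly.)

#!/usr/bin/env python
# Minimal liveness check, meant to be run from cron: is anything
# listening on the DNS port of our caching nameserver?
# The server name is a stand-in, not our real machine.
import socket
import sys

SERVER = "dnscache.example.com"

try:
    conn = socket.create_connection((SERVER, 53), timeout=5)
    conn.close()
except socket.error as e:
    print("DNS server %s is not answering on port 53: %s" % (SERVER, e))
    sys.exit(1)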

Looking at the logfile showed that it was failing to start because it couldn't read /var/named/acl.conf. This was because /var/named was owned by the wrong group, and that was because I had corrected a mis-numbered 'named' group in the course of preparing for our upgrade to Fedora Core 4 but had not changed the ownership on all of the systems. (And I had made the changes back in May or June.)

Our systems don't normally reboot and I didn't reboot when I fixed the 'named' group to have the right number, so the existing nameserver daemon on our mailer machine had kept on running (using the old numbers that matched the actual directory ownership).

We use multiple caching nameservers for redundancy, and the daemon had started fine on at least one of the fallback machines, which meant that our existing systems could still do DNS lookups after we powered all the servers back on. But when run from a USB memory stick, the Kickstart install process only uses a single nameserver; since that nameserver was the one that hadn't come back, Kickstart ended up calling the machine localhost.localdomain.

Our customization process keys a number of things off the subdomain that the machine is in. 'localdomain' is not a recognized subdomain, so our customization process immediately aborted; in turn this more or less immediately rebooted the machine.
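
(For the curious, the check involved is roughly the following sketch. This is not our actual customization code, and the subdomain names are made up.)

# Rough sketch of a subdomain check; not our actual customization
# code, and the subdomain names are made up.
import socket
import sys

KNOWN_SUBDOMAINS = ("cs.example.edu", "dept.example.edu")

fqdn = socket.getfqdn()
if not fqdn.endswith(KNOWN_SUBDOMAINS):
    sys.exit("unrecognized (sub)domain in hostname %r; stopping here" % fqdn)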

Since I found the root cause of this problem in the process of writing up a grumble about it and another problem I also hit, this may be a successful example of Rubber Duck debugging (sometimes called rubber ducky debugging instead).

DiagnosingAnInstallProblem written at 01:24:47

2005-08-22

On being nibbled to death by moths

Last Friday I used the phrase 'being nibbled to death by moths' to describe part of my day. I suppose I should elaborate on that, lest people think that Toronto has some really peculiar insects.

This is my way of summing up all of the routine five, ten, and twenty minute bits of system administrator work that come up throughout the day: reading a stream of email, making sure the backups completed, checking the logs, monitoring incoming spam streams, checking for important tech news (like security updates), talking with a co-worker about some pending issue, and before I realize it a great deal of the day has vanished.

Each individual bite is no big deal, but the entire cloud can nibble my day down to nothing before I realize it.

Sometimes there's nothing more important to do with the day than deal with all of the little things, but sometimes I realize that I've lost those solid, continuous hours of development time that I really needed to get something straightened out. Last Friday was the sort of day that's especially susceptible to this, because there were seductive pauses in my work while I waited for test reinstalls to complete; it was easy to get diverted more than I had intended.

This isn't a novel problem, especially for software developers. The routine advice for anyone doing software development is that you need several hours where you are 'in the flow' of programming in order to hit maximum productivity. The problem for me in system administration is that, unlike many software developers, dealing with the moths is a necessary part of my job too, and some of them need to be dealt with promptly.

I don't have any clever solutions, although I wish I did. Perhaps I should get more brutal about blocking off several hours in which I will refuse to be disturbed by anything short of someone calling me on the phone.

NibbledByMoths written at 23:53:06

2005-08-17

Parallelizing DNS queries with split

So there I was the other day, with 35,000 IP addresses to look up in the SBL (the Spamhaus Block List) to see if they were there. Looking up 35,000 IP addresses one after the other takes a long time. Too long a time.

The obvious approach was to write an SBL lookup program that internally worked in parallel, perhaps using threads. I was using Python, which has decent thread support, but when I started going down this route it rapidly started looking like too much work.

So instead I decided to use brute force and Unix. I had all of the IP addresses I wanted to look up in a big file, one IP address per line, so:

$ mkdir /tmp/sbl
$ split -l 800 /tmp/ipaddrs /tmp/sbl/in-sbl.
$ for i in /tmp/sbl/in-sbl.*; do \
  o=`echo $i | sed 's/in-/out-/'`; \
  sbllookup <$i >$o & \
  done; wait
$ cat /tmp/sbl/out-sbl.* >/tmp/sbl-out

What this does is take /tmp/ipaddrs, the file of all of the IP addresses, and split it up into a whole bunch of smaller chunks. Once I had it in chunks, I could parallelize my DNS lookups by starting the (serial) SBL lookup program on each separate chunk in the background, letting 44-odd of them run at once. Each wrote its output to a separate file, and once the wait had waited for them all to finish I could glue /tmp/sbl/out-sbl.* back into a single output file.
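
(The sbllookup program here is just the simple serial lookup program, which isn't shown. A rough Python sketch of something equivalent might look like the following; it assumes the usual DNS blocklist convention of looking up the reversed IP address under sbl.spamhaus.org.)

#!/usr/bin/env python
# Rough stand-in for sbllookup: read IP addresses on standard input,
# print the ones that are listed in the SBL.  This is a sketch, not
# the program actually used.
import socket
import sys

ZONE = "sbl.spamhaus.org"

for line in sys.stdin:
    ip = line.strip()
    if not ip:
        continue
    # DNS blocklists are queried with the octets reversed, e.g.
    # 192.0.2.1 becomes 1.2.0.192.sbl.spamhaus.org
    query = ".".join(reversed(ip.split("."))) + "." + ZONE
    try:
        socket.gethostbyname(query)
    except socket.error:
        continue            # not listed (or the lookup failed)
    print(ip)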

Parallelized, it took about five or ten minutes the first time around, and then only a minute or so for the second pass. (I did a second pass because the replies from some DNS queries might have been late trickling in the first time; the second time around they were all in our local DNS cache.)

ParallelDNSQueriesWithSplit written at 23:53:48

2005-08-16

Things that could happen to your backups

In case yesterday's backup horror story didn't scare you enough, here's an incomplete list of things that have been known to go wrong with backups. Are you sure that none of them are happening to your backups right now?

  • the backup program writes corrupted backups.
  • the backup program doesn't capture a usable system state because things keep changing even as it runs (databases are famous for this).
  • the backup program generates incomplete backups, especially when run in incremental mode. For example, many Unix systems have historically had problems backing up renamed files or renamed directories.
  • the backup program is not noticing or complaining enough about disk read errors. (This happened to us. We lost some somewhat valuable historical files.)
  • you're not actually backing up everything important on the machine. (Especially common on Unix systems if you add a filesystem and forget to tell the backup system about it. And again, this has happened to us.)
  • despite having set them up, you're not actually doing backups; a cron job has broken, someone is forgetting to run a necessary command, etc etc. (Lazy people happen a lot.)

  • backup media errors are being ignored.
  • things don't properly notice or handle the backups hitting the end of the media. (Embarrassingly, I once did this too; honestly, I thought that the tape robot automatically advanced to the next tape when it hit end-of-tape...)
  • your tape drive is failing to properly write the tapes.
  • your tape robot and/or backup system is not actually advancing to the next tape, it's just overwriting the same tape over and over without it or you realizing it.
  • your backup system is accidentally overwriting the backup media instead of appending new data to the end of it. (This has been so common that the Amanda backup system refuses to append to tapes to eliminate the possibility of this happening.)
  • your backup tapes are not getting properly rotated. (This is a famous 'lazy people' issue, where the minimum-wage worker you hired hasn't bothered to actually change tapes.)
  • your tape drive has drifted out of proper alignment; while it can read back tapes that it wrote, nothing else can. Woe strikes if (or when) you have to replace it or it gets repaired. (Exabyte tape drives used to be infamous for this.)
  • your tape drive isn't being made any more. If it breaks, can you get another one that can read your backup tapes back?

  • your backup system's index files that tell you what backups are on what tapes are not being backed up.
  • your backup system's index files are being backed up, but without the index files you can't tell where they went.
  • backups can only be restored by a program running on the same operating system (and architecture) that made them. Don't lose your last machine of that OS + architecture combination!
  • your commercial backup system requires a node-locked license even to restore files. If you lose the backup server, can you easily run the software on another machine?
  • your restore program has bugs, although the backups themselves are fine. (This has happened to us. It's at least somewhat fixable.)

  • your offsite backups aren't.
  • your offsite backups aren't recent enough.

While some of these are very hard to check for, the only way in general to be confident that they aren't quietly happening to you is to test restoring from your backups periodically. Backups really need an end-to-end test every so often.

(Feel free to add more in comments, of course. Note that I'm pretty much focusing on things that could be quietly going wrong in your (low-level) backup system itself, as opposed to all the additional problems that you can have in a disaster-recovery situation.)
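
As a concrete form of that end-to-end test: pick a file you know should be in every backup, restore it from the most recent backup, and compare it to the live copy. Here is a minimal Python sketch; the 'restorecmd' command line is a made-up placeholder for whatever your backup system's real restore command is.

#!/usr/bin/env python
# Minimal end-to-end restore test: restore one known file from the
# most recent backup and compare it to the live copy.  The restorecmd
# command line is a made-up placeholder for your backup system's real
# restore command and arguments.
import filecmp
import subprocess
import sys
import tempfile

CANARY = "/etc/services"        # a file that should be in every backup

workdir = tempfile.mkdtemp(prefix="restoretest.")
ret = subprocess.call(["restorecmd", "--latest", "--to", workdir, CANARY])
if ret != 0:
    sys.exit("restore command failed with status %d" % ret)
if not filecmp.cmp(workdir + CANARY, CANARY, shallow=False):
    sys.exit("restored copy of %s does not match the live file" % CANARY)
print("restore test passed for %s" % CANARY)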

PotentialBackupProblems written at 00:49:20

2005-08-15

Check your backups

Backups have been in the geek news recently, courtesy of the chilling tale of the online comic Penny Arcade's backup failure. Making backups is important, but there's an equally important and far less appreciated piece: checking your backups to make sure that you can actually restore them.

You may say 'but, I'm sure my system would scream a lot if something was wrong'. Let me tell you a story about that.

Once upon a time there was a young and innocent system administrator. He had an Ultrix MIPS DECstation to take care of (which says something about how long ago this was), and part of taking care of it was backing it up. Dutifully he arranged tape backups; because he worked at a university, they were tape backups over the network to a remote tape drive.

Unfortunately the Ultrix backup program insisted on logging in to the tape server as the wrong user (and which user it used was hard-coded). No problem; this was a university, so he had full Ultrix source code. Changing which user rdump used was a simple text edit and recompile.

While doing this, he noticed that rdump's Makefile compiled things without optimization. Since this was on a MIPS-based system, where compiler optimization was important for decent performance, the system administrator fixed that when he recompiled rdump.

About a year later he accidentally did something to a relatively unimportant file and decided he wanted to restore it from a backup. He queued up the right tape, got the rrestore program talking to it, and rrestore promptly told him that the backup was corrupted. This could happen sometimes, so he tried another tape; then another; then all of them. Not one was good.

I will cut to the punchline: the MIPS compiler had an optimizer bug. Compiling rdump with optimization on ran into this bug (which was why the original rdump Makefile didn't do that), and the bug made rdump produce corrupted and unrecoverable output while (of course) thinking all was fine.

A year's worth of backups were literally worthless. The young system administrator had a small heart attack, thanked his lucky stars he had found this before he needed to restore anything important, recompiled rdump without optimization, and immediately scheduled some full backups. And tested them afterwards, just to be sure.

So, having stubbed my toe, I strongly urge you: test your backups by trying to restore at least something from them. If you don't, you don't actually know if you have backups, you just think you do. (And remember, 'optimism is not a plan'.)

CheckYourBackups written at 01:35:15

2005-08-11

An aphorism of system administration

A great deal of system administration consists of not stubbing your toe a second time.

(Unfortunately I must mar this by adding 'on the same thing' as a footnote.)

SysadminAphorism written at 01:33:11

