Wandering Thoughts archives

2010-12-16

The elements of a non-event

Today we had an entire iSCSI backend fail. It was a heart-stopping non-event, something that took me perhaps 20 minutes to deal with. I'd like to run down some of the reasons why things worked out this way.

  • we have set up monitoring and it works (the former doesn't always imply the latter). Smartd on the iSCSI backend started mailing us about 'cannot open device' errors almost immediately, which are never a good sign, and the ZFS pool health monitoring on the ZFS fileservers raised its own alerts soon afterwards. (A minimal sketch of this sort of pool health check appears after this list.)

  • we had a hot spare backend set up and ready.

  • after feedback from my co-workers, our new custom ZFS spare management system was explicitly designed in part to make handling this situation easy and almost completely automatic.

    (My co-workers rightfully pointed out that replacing a whole backend worth of disks was one of the most tedious, time-consuming, and repetitive things that we need to do with spares. And also, sadly, one of the more common ones. Apparently we need better power supplies and UPSes.)

  • we have a documented procedure for just this situation. When disaster struck at 5:50pm with only me in the office, I did not have to try to remember everything necessary and where all the files were and what order to do things in; once I got my heart rate under control and calmed down a bit, all I had to do was look it up and follow the steps.
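
For illustration, here is a minimal sketch of the pool health side of this sort of monitoring. It is not our actual script (the real one mails us, and smartd handles the per-disk side on the backends); it just shows the general shape of a cron-driven check built on 'zpool status -x':

    #!/usr/bin/env python
    # Hypothetical sketch of a ZFS pool health check, not our real
    # monitoring. It assumes that 'zpool status -x' prints 'all pools
    # are healthy' when nothing is wrong and exits 0 either way.
    import subprocess
    import sys

    def pool_problems():
        out = subprocess.check_output(["zpool", "status", "-x"]).decode()
        return None if "all pools are healthy" in out else out

    if __name__ == "__main__":
        problem = pool_problems()
        if problem is not None:
            # In real life this would send mail; here it just complains.
            sys.stderr.write("ZFS pool problem detected:\n" + problem)
            sys.exit(1)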

I cannot overstate the importance of the last factor. In honest but embarrassing fact, I started fumbling through the necessary steps from memory and got the order wrong before I calmed down enough to come to my senses and look things up. Well, not so much look things up as stumble over the documentation in the process of looking up the command I needed to run to do what I thought was the next step, at which point I felt rather foolish and sheepish.

(This is especially ironic because I wrote the documentation myself.)

NoneventElements written at 01:09:46

2010-12-11

The danger of having system programmers around, illustrated

My little issue the other day with DTrace makes a nice illustration of a variant of programmer laziness. As I sort of alluded to in the entry, although DTrace's function tracing limitation is documented, I didn't actually find it by reading the documentation. Instead I found it the hard and indirect way, by being a system programmer.

When I ran into a kernel function that DTrace couldn't trace, I wound up:

  • using the Solaris kernel debugger to verify that the function was actually present in the kernel, and then disassembling some of the functions that called it to check that the function hadn't been inlined into its callers (which in retrospect was a silly thing to check).
  • checking several other static functions to see if DTrace perhaps couldn't trace any static function. (It can.)
  • searching high and low (with find and grep) to see if DTrace had some hardcoded database or configuration file of what kernel functions it would trace. (It doesn't; it uses existing symbol tables, and per above they include all static functions.)
  • eventually stumbling over the kernel code for DTrace's function tracing, skimming through it, and reading a nice comment that explained the whole situation.

Only after all the dust settled did I read the documentation, and lo and behold, there it was, documented for me (twice).

(This is far from the first time that I've done stuff like this. For example, I've read through Linux kernel source in order to work out when and where a specific errno result was generated. More than once.)

Now, if you are the right sort of system programmer, all of this makes perfect sense. The problem with documentation, as any sysadmin will frequently tell you, is that it is often incomplete, unreliable, or outright wrong. When you are a system programmer (or have that mindset), the easy way to answer questions is not to waste time reading the fine manual; it is to go straight to authoritative sources like reading the kernel source and disassembling code. If you can see how the system actually works, you don't have to trust the manual, and you only have to be burned a couple of times by bad or incomplete documentation before this seems like a good idea.

Which is exactly the danger of having system programmers around: we're perfectly capable of going very overboard before we do basic things like checking documentation, because we think of things like disassembling code and reading kernel source as the fast way to do things. Sometimes we're right, but sometimes we're very wrong.

(It's like people with hammers, where you know that sooner or later every solution is going to involve nails, or at least hitting things.)

SystemProgrammerDanger written at 02:08:14

2010-12-10

How to get sysadmins to never use your software again

If your software keeps some sort of database or even just reads data files, and you want happy sysadmins, make it always be backwards compatible with the file formats of older versions. If you want moderately unhappy sysadmins, give it an automated migration process so that it reads the old format files and rewrites them in the new format.

(This makes sysadmins moderately unhappy because we can't back out of an attempted upgrade without restoring the old versions of the files, and as a corollary we can't switch back and forth between the two versions.)
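
As a concrete sketch of the happy path, here is the sort of thing I mean, with an entirely made-up file format; the point is simply that the program understands both the old and new formats when reading and never rewrites anything behind your back, so you can move between versions freely:

    # Hypothetical sketch of version-tolerant reading; the format
    # details are invented for illustration. Old and new files are
    # both readable, and nothing on disk is ever rewritten.
    import json

    def load_state(path):
        with open(path) as f:
            data = json.load(f)
        version = data.get("format_version", 1)
        if version == 1:
            # Old format: a flat name -> value mapping. Normalize it
            # in memory only; the file itself is left untouched.
            return {"format_version": 1, "settings": data}
        elif version == 2:
            return data
        else:
            raise ValueError("unknown format version %r" % version)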

Conversely, if you want to get sysadmins to never use your software again, make new versions of your program require active upgrade procedures, things that the sysadmins have to look up and do by hand (even if it is just running a program). You earn bonus points if part of the upgrade process is 'and check to make sure that nothing went wrong', and if it is a multi-step, multi-command thing that requires manual intervention and careful attention.

You might be tempted to say that any sysadmin who installs a new version of your program ought to also be willing to do some extra work along with it, and that, besides, it's not that hard. There are two problems with this.

First, sysadmins often aren't actively choosing to upgrade to the new version of your program; instead we have the upgrade forced on us when we do things like upgrade our distribution. Second, we often aren't deeply experienced with your software and familiar with all of its fine details; instead we've taken a system package and slapped together something that worked. Then, when your software turns an upgrade into a four-alarm fire drill that requires migrating a test copy of the live system, testing extensively, and then re-migrating the live system when the upgrade actually happens, sysadmins swear solemn vows to never ever be caught in the same trap again.

(Please don't suggest that we should ignore the version of your software that the distribution packages and instead compile from source. That too is a non-starter, for good reasons.)

In related news, Ubuntu 8.04 has MoinMoin version 1.5.8, Ubuntu 10.04 has MoinMoin version 1.9.2, and our support site currently uses MoinMoin. It is rather unlikely to do so in the future, because we are smart enough to only stub our toes once.

RequiredMigrationPain written at 01:15:20

2010-12-05

A log message format mistake that I've made

As an illustrated example of how not to do log messages, here is a mistake of mine that I recently had my nose rubbed in. Suppose that you have a service that authenticates requests in various ways; sometimes the request has enough information to grant approval right away, but other times you need to do a callback of some sort. Of course you want to log the success or failure of these requests so that when something is going wrong you can rule out the authentication system.

When I was doing this, I wrote two log messages:

    checked X result 0 (succeeded)
    allowed X without callback check

(The former also had the obvious variants if the callback check failed.)

These are perfectly good messages individually; they are clear and tell you what happened. They even have the X in the same place in the message. The problem with them is that I need to use two grep patterns in order to see all successful request authentications, which is both more annoying and something extra to remember (especially if one of the two variants is much less common than the other).

As much as possible you want people to be able to use a single regular expression (ideally a simple one) to match every variant of something, even if it makes each individual message uglier. Of course, part of the tricky bit of this is deciding what is just a variant of something else and what is sufficiently different that it should have an entirely new message.
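
For example (with messages reworded purely for illustration), something like this lets one simple pattern cover both the callback and the no-callback case:

    # Hypothetical reworking of the two messages so that a single
    # simple pattern matches every successful authentication.
    import re

    messages = [
        "allowed X result 0 (succeeded) via callback check",
        "allowed X result 0 (succeeded) without callback check",
    ]

    # One pattern now covers both variants; the failure variants would
    # just change what comes after 'result'.
    success = re.compile(r"^allowed (\S+) result 0 \(succeeded\)")

    for line in messages:
        m = success.match(line)
        if m:
            print("successful authentication for %s" % m.group(1))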

It may help to ask yourself how you'd track things down if there seemed to be a problem with the system, but don't be satisfied with handwaving answers; actually write down the commands you'd use and make sure that they're correct, simple, and would get everything that you're interested in.

(Had I done this when I was writing the messages for this program, I could have saved myself some annoyance later on.)

Sidebar: why less-common variants are extra annoying

Unless you work with a particular system and its logs a lot, you generally don't remember all of the messages that it uses. The usual sysadmin approach is to go look back at a chunk of the logs (often the most recent ones) and then base your grep pattern on what you see. This approach can make it easy to miss less-common variants; they may just not have come up in the logs you looked at to work out what sort of messages the system logs.

(Even when you remember that there is another message format variant, now you have to actually find a message using it.)

LogMessageMistake written at 01:47:43

