2007-02-26
Things I have learned about effective sysadmin meetings
We've recently started having some regular meetings at work, which has given me an opportunity to look back at past meetings and reflect on what I think does and doesn't work in meetings for system administrators. (Note that I am pretty sure that some of these do not generalize to other sorts of meetings.)
Meetings can be an efficient way of holding discussions, but you need someone to corral the digressions. They are not usually an effective way of giving status reports, unless the status reports are really the starting points of discussions; otherwise, it's better to email everyone the status reports beforehand. Remember that people read faster than you can talk.
(The two classical cases for useful sysadmin meetings are making a decision and figuring out a problem; both benefit from the very fast back and forth you can have in a meeting.)
Meetings do not create social interaction or bring people together in and of themselves. Meetings can create social interaction if they cause people to work together, but you need something for people to be working together on to start with.
Meetings run much better with an agenda and someone who can push people along the agenda. Write the agenda down where people can see it; a whiteboard is better than email, because it's right there in front of people and you can do things like erase a topic that you're done with. You don't want a chair so much as you want a moderator, someone to watch for when discussions are digressing and cut them off or redirect them.
Always have a time goal for the meeting. Announce that the time goal is less than the amount of time you have the room reserved for, and that the extra amount of time is just in case.
If you do not write minutes in some form, you have not really captured the information from the meeting. Don't try for comprehensive minutes, just go for enough to jog people's memories, but do clearly write down any decisions and things to be done. If you're actually writing something for people who weren't at the meeting, you need to figure out what information they actually need from the meeting, and write up just that; comprehensive minutes are unlikely to be it.
(Comprehensive minutes exist not to transmit information but to allow blame to be assigned later by precisely identifying who said and did what. This is why they are carefully taken by political committees and board meetings.)
See also Greg Wilson, but note that he is talking about meetings for programmers, which I feel are somewhat different. In particular, if you are kicking around a problem it can be very handy to have a laptop or two that you can use to look up any additional information on the spot, instead of having to stop the meeting, go off to get the information, and then come back later.
(I tend to feel that if people are distracting themselves with laptops, they are not sufficiently involved in the meeting to actually be present and you might as well let them take off.)
2007-02-23
The downside of distinctive hostnames
Recently a co-worker pointed out that the downside to giving machines distinctive hostnames is that users become attached to them, and when you introduce newer, better machines the users don't migrate to them. Once I thought about it, this made sense; after all, by naming things we make them distinct, and thus no longer quite so generic and equivalent.
(At this point I am tempted to think thoughts about brand loyalty and the distinctiveness of brand logos.)
One contributing factor is that with many generic naming schemes it is very easy to derive a bunch of machine names from a single memorized one. If I remember 'cluster3' I can easily predict that there is probably a 'cluster5' I can also use; the same is not true of a name like 'epoch'.
In hindsight, I've seen this behavior in action before, for example at times when users are slow to migrate from a loaded server to a less loaded one. I even do this myself; I have no idea which of our Linux login servers is less loaded or faster, I just use the one I found first.
(As it turns out, I probably didn't pick the best one.)
However much I like distinctive hostnames, I was forced to admit that my co-worker had a convincing argument, so our future user-visible generic machines are probably going to have generic names. (At least we won't have to come up with a suitable naming scheme.)
(Server machines that users don't use directly will probably keep getting distinctive names.)
2007-02-22
My zeroth law of compromised machines
If you can't find anything wrong, you haven't looked carefully enough.
The immediate corollary is also important:
If you can't find anything, the intruders are still there.
The leading cause for not finding anything wrong on a machine you know is compromised is that you haven't detected the rootkit that is hiding things from you.
2007-02-19
Another aphorism of system administration
Noticing when something shows up is easy; detecting when it goes away is hard.
Like all aphorisms, this has exceptions. And if you want to see it that way, it's a corollary of an earlier aphorism.
(An aphorism brought to mind as I contemplate our DHCP configuration files and wonder just how many of those Ethernet addresses are currently mouldering in a dump somewhere.)
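One rough way to hunt for entries that have quietly gone away is to compare the addresses in the configuration against the addresses actually seen on the network recently. The sketch below assumes ISC dhcpd style 'hardware ethernet' lines and a list of recently seen addresses gathered some other way (lease files or ARP tables, say). The function names and that input are made up for illustration; this is not a description of our actual setup.

```python
import re

# Assumption: the config uses ISC dhcpd style lines such as
#   hardware ethernet 00:11:22:33:44:55;
HW_RE = re.compile(r"hardware\s+ethernet\s+([0-9A-Fa-f:]{17})\s*;")


def configured_macs(dhcpd_conf_text):
    """Return the set of Ethernet addresses in the config, lowercased."""
    return {mac.lower() for mac in HW_RE.findall(dhcpd_conf_text)}


def stale_macs(dhcpd_conf_text, recently_seen):
    """Return configured addresses that are not in recently_seen.

    recently_seen is a hypothetical iterable of MAC address strings
    collected elsewhere; how you collect it is up to you.
    """
    seen = {mac.lower() for mac in recently_seen}
    return sorted(configured_macs(dhcpd_conf_text) - seen)
```

An address that shows up in the result is only a candidate for removal; a machine that is merely turned off for a month looks exactly like one that is mouldering in a dump.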
2007-02-18
Why we do NFS fileserving with a SAN
Our storage infrastructure here has a number of NFS servers sitting in front of a pool of SAN RAID storage boxes using commodity SATA disks. This is a somewhat unusual setup for a comparatively small environment like ours; a far more common setup is to have the disks directly attached to the fileservers.
We have a SAN setup for a simple reason: failover between the NFS server machines. We consider the server machines to be the things most likely to suffer failures, either hardware or software, or just to need downtime. With all of the storage in a SAN pool, accessible to any of the frontend machines, we can easily move NFS service from one machine to another.
The actual implementation uses virtual NFS servers and Solaris DiskSuite's failover support, which works quite nicely (although it is not high availability automated failover; we have to kick it off by hand). DiskSuite also lets us mirror important partitions across multiple SAN RAID controllers, so that they'll stay available even if a controller goes down.
It's not clear to me how to set up a similar relatively fast failover environment with directly attached SATA disks. I can think of two approaches: giving each server a relatively small number of disks and keeping cold-spare servers waiting, or putting the disks in external shelves and giving all of your NFS servers spare SATA controllers. In either case, 'failover' would be enough work and user disruption that you would be unlikely to use it for things like applying OS patches.
(Then there is the really crazy approach where each server mirrors its local disks over the network to disks on another server via some disk-over-network protocol, whether NBD or iSCSI or the like. The downside is that you need twice as much disk space and either twice as many servers or servers with twice as many drive bays, and in the latter case taking down a server means that you lose redundancy on two disk pools, not just one.)
(Credit where credit is due: the crazy approach was suggested by someone at Unix Unanimous.)
2007-02-15
Something all full-service backup systems should have
Having spent much of today wrestling with this very question, I have a small suggestion for people designing full-service backup systems (by which I mean ones that have individual file indexes and an environment for restoring single files):
Please provide a command that summarizes all of the versions of a file that you know about.
Most full-service backup systems can go back in time, so you can ask for things like 'the version of the file on January 28th'. But what I really want is some way to ask the backup system for a rundown of when the file was created, changed, or deleted. What tends to happen, at least around here, is not that a user deletes a file and immediately wants it back, but that they notice that they have a damaged or deleted file now and need to get back the last good version, whenever that was.
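To make the request concrete, here is a minimal sketch of such a summary. It assumes the backup index can be viewed as a chronological list of (date, {path: fingerprint}) snapshots, where the fingerprint is something like an mtime or checksum. That data model and the function name are hypothetical and not taken from any real backup product.

```python
def file_history(snapshots, path):
    """Summarize when path was created, changed, or deleted.

    snapshots: list of (date, index) pairs in chronological order, where
    index maps each backed-up path to a non-None fingerprint.
    'created' here really means 'first seen in a backup'.
    """
    events = []
    previous = None  # fingerprint in the prior snapshot, None if absent
    for date, index in snapshots:
        current = index.get(path)
        if previous is None and current is not None:
            events.append((date, "created"))
        elif previous is not None and current is None:
            events.append((date, "deleted"))
        elif previous is not None and current != previous:
            events.append((date, "changed"))
        previous = current
    return events
```

With output like this, finding the last good version is a matter of reading down the list to the entry just before the damage, instead of guessing at dates one restore at a time.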
For serious bonus points, support this for directories too, with filtering so I can look only at deleted or added files. That would make it easy to deal with the situation when a user tells us that they've accidentally deleted some files from a directory, but they're not sure exactly what they deleted (after all, it was an accident).
(Note that 'restore the directory as of <X>' is not a really good solution. Users don't necessarily notice the accidental deletion right away, and so they may have new or updated files in the directory that they don't want to lose; they don't want the old version of the directory back, they just want the old files back.)
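The directory version might look something like the following sketch, again under the assumption that the backup index is available as a chronological list of (date, {path: fingerprint}) snapshots. The function name, the 'kinds' filter, and the data model are all hypothetical.

```python
import posixpath


def directory_changes(snapshots, directory, kinds=("added", "deleted")):
    """List files that appeared in or vanished from directory over time.

    snapshots: list of (date, index) pairs in chronological order, where
    index maps backed-up paths to fingerprints. kinds selects which
    events to report, so you can ask for only the deletions.
    """
    directory = directory.rstrip("/") or "/"
    events = []
    previous = None  # files directly in the directory at the prior snapshot
    for date, index in snapshots:
        current = {p for p in index if posixpath.dirname(p) == directory}
        if previous is not None:
            if "added" in kinds:
                events.extend((date, "added", p) for p in sorted(current - previous))
            if "deleted" in kinds:
                events.extend((date, "deleted", p) for p in sorted(previous - current))
        previous = current
    return events
```

Calling it with kinds=("deleted",) gives just the list of files that vanished and when, which is what you need in order to restore the old files without touching anything the user has added or updated since.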
2007-02-04
A sysadmin twitch about dump
In dump (and ufsdump, and other close cousins) you can specify the filesystem that you want to dump in two ways: by the name of its mountpoint, or by the name of the (raw) device that it's on. One of my little twitches is that I always specify the filesystem to dump by its mountpoint. Like a lot of my little twitches, this has a history behind it.
The problem is that at least some old versions of dump were perfectly willing to write their output to anything, including raw disk devices, and they had defaults for what filesystem to dump (and where to dump it to), and as a bonus they had an argument parsing scheme that made accidents really easy.
So, if you accidentally wrote, say:
dump 0usf /dev/rmt0 /dev/rrf0g
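You could destroy a filesystem, as some people did once. As I understand the old key-letter argument parsing, the 's' key takes the next argument (here /dev/rmt0) as the tape size, which leaves 'f' to take /dev/rrf0g as the output file, and dump then falls back to its default for what filesystem to dump.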
However, dump can't write to directories. So once I read that sad story in comp.risks, I started always using the filesystem mount point instead of the raw device; that way if I made a mistake, dump would just die with complaints that it couldn't write to its output.
(Another lesson that one can draw from this is to always run dumps from an account that only has read access to the raw disk devices.)
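If you run dump from a wrapper script, the same habit can be enforced mechanically. The sketch below is an illustration, not something from any dump distribution: require_mountpoint is a made-up helper that refuses anything that is not a mounted directory, so a raw device name never reaches the dump command line at all.

```python
import os.path


def require_mountpoint(path):
    """Return path unchanged if it is a mount point, else raise ValueError.

    os.path.ismount() is false for raw devices, ordinary files, and
    nonexistent paths, so all of those are rejected here.
    """
    if not os.path.ismount(path):
        raise ValueError("refusing to dump %r: not a mount point" % (path,))
    return path
```

A wrapper would call this on the filesystem argument before building the dump command. It does nothing about what the output argument is, so it complements running dumps from an account with only read access to the disks.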