2007-01-31
The inherent fragility of complex systems (in system administration)
Complex software systems are not inherently fragile for the usual reason, namely that they have more places and pieces to go wrong than simple systems do; unlike physical machines, computer software has no mechanical wear and thus doesn't just break on its own (barring intrinsic flaws). Because it lives in a digital world, computer software that works keeps working forever until something changes.
The real problem with complex systems is that it's very hard for people to keep track of all of the interrelationships, and thus to see the full effects of doing things. As a result, when you go to do something or change something, it's too easy to overlook a consequence and create an explosion.
(And it is very frustrating, because usually things are so obvious in hindsight. But this is because when you look back afterwards you don't have to try to keep track of everything, just the bits involved in the failure. Then you clearly see, far too late to be useful, how when you change A it causes B to shift sideways and so C goes completely off the rails.)
It does no good to tell people, yourself included, to study your complex system harder and to be more careful. People simply have a limit to how much they can hold in their head at once, and no amount of exhortation can change it.
(And system administration, to a first order approximation, is about change.)
Why I am not fond of DHCP in lab environments
Using DHCP to assign IP addresses is pretty popular in environments with lots of machines. You'd think that student labs, full of generic machines, would thus be a great environment for DHCP, but actually I disagree; I believe that (normal) DHCP is not a great match for a lab environment.
The problem with DHCP is that it ties the IP address to the wrong thing. In a lab environment you don't want a machine's IP address to be tied to its hardware; you want its IP address to be tied to its physical position in the lab, so that you can actually find it without having to search through the whole place.
(Given that automatically determining a machine's physical position is hard, I'd be happy if I could dynamically assign IP addresses based on what switch port a machine was plugged into; student lab wiring is usually pretty regular and static, and can thus be mapped easily into a physical position. And in theory you could get this information from sufficiently intelligent switches, and do it on the fly to make up DHCP replies.)
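(As a very rough illustration of what I mean, here is a Python sketch that asks a switch for its MAC-to-port forwarding table via the standard BRIDGE-MIB and turns that into per-position IP addresses. Everything specific in it is made up: the switch name, the SNMP community, the port-to-IP table, and the example MAC. It also assumes the net-snmp snmpwalk tool and glosses over details such as bridge port numbers not always matching physical port numbers.)

    #!/usr/bin/env python3
    # Sketch: figure out what IP a machine should get based on which
    # switch port it is plugged into.  Assumes the switch answers
    # BRIDGE-MIB queries and that net-snmp's snmpwalk is installed;
    # all names and addresses here are invented for illustration.
    import subprocess

    SWITCH = "lab-switch-1"        # hypothetical lab switch
    COMMUNITY = "public"
    FDB_PORT_OID = "1.3.6.1.2.1.17.4.3.1.2"   # BRIDGE-MIB dot1dTpFdbPort

    # Our imaginary wiring plan: switch port -> IP for that desk position.
    PORT_TO_IP = {1: "192.168.10.11", 2: "192.168.10.12", 3: "192.168.10.13"}

    def mac_to_port():
        """Return a dict of MAC address -> bridge port from the switch's
        forwarding table (dot1dTpFdbPort, indexed by MAC)."""
        out = subprocess.check_output(
            ["snmpwalk", "-v2c", "-c", COMMUNITY, "-On", SWITCH, FDB_PORT_OID],
            text=True)
        table = {}
        for line in out.splitlines():
            # Lines look roughly like:
            #   .1.3.6.1.2.1.17.4.3.1.2.0.22.62.170.187.1 = INTEGER: 3
            # where the last six OID components are the MAC in decimal.
            oid, _, value = line.partition(" = INTEGER: ")
            if not value:
                continue
            octets = oid.strip().split(".")[-6:]
            mac = ":".join("%02x" % int(o) for o in octets)
            table[mac] = int(value)
        return table

    def ip_for_mac(mac):
        """Return the IP for the position a MAC is plugged into, or None."""
        port = mac_to_port().get(mac.lower())
        return PORT_TO_IP.get(port)

    if __name__ == "__main__":
        print(ip_for_mac("00:16:3e:aa:bb:01"))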
While you can do this with DHCP, you're doing so indirectly, which means that you can't move machines inside a lab without updating your DHCP configuration. Given that you have to do something when machines are moved around anyways, I prefer just giving machines static IP addresses and updating them directly when things move; it has fewer moving parts.
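(The indirection I'm complaining about is the usual style of DHCP host declarations, where each fixed address is keyed to a machine's MAC address instead of to its desk. In ISC dhcpd terms it looks something like the following; the hostnames, MACs, and addresses are invented. Swap two machines between desks and you get to edit and swap their entries.)

    host lab-r1-d1 {
        hardware ethernet 00:16:3e:aa:bb:01;   # whatever box currently sits at row 1, desk 1
        fixed-address 192.168.10.11;
    }
    host lab-r1-d2 {
        hardware ethernet 00:16:3e:aa:bb:02;   # row 1, desk 2
        fixed-address 192.168.10.12;
    }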
2007-01-25
A clever trick to deal with students powering off workstations
One of the eternal curses of student Unix labs is students casually turning off workstations the way they turn off other PCs when they're done with them. (A problem made worse by vendors putting glowing power buttons on the front panel of machines, where they become an easy temptation.)
In theory the answer to this is Wake-on-LAN. You probably have at least one computer per lab that students can't get at, so run a daemon on it that notices when workstations are down and sends out a WoL packet to get them back up. In practice, WoL requires a fair number of things to be working just right and can be a bunch of work to get going.
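(If you do go the WoL route, the magic packet itself is simple: six 0xff bytes followed by the target's MAC address repeated sixteen times, sent as a UDP broadcast. Here is a minimal Python sketch; the MAC and broadcast address are placeholders, and a real daemon would wrap this in something that notices which machines are down.)

    #!/usr/bin/env python3
    # Minimal Wake-on-LAN sender: broadcast a 'magic packet' for one MAC.
    import socket

    def send_wol(mac, broadcast="192.168.10.255", port=9):
        """Send a WoL magic packet: 6 bytes of 0xff followed by the
        target MAC repeated 16 times, as a UDP broadcast."""
        mac_bytes = bytes.fromhex(mac.replace(":", "").replace("-", ""))
        if len(mac_bytes) != 6:
            raise ValueError("bad MAC address: %s" % mac)
        packet = b"\xff" * 6 + mac_bytes * 16
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        s.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        s.sendto(packet, (broadcast, port))
        s.close()

    if __name__ == "__main__":
        send_wol("00:16:3e:aa:bb:01")   # placeholder MAC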
Recently a co-worker shared an interesting low-tech solution to the problem. What we care most about is powered-off machines missing out on the nightly updates and automatic maintenance (and, to a lesser extent, having them ready for students in the morning). The simple way to deal with that is to set the PC BIOSes to power the machines on at 2am or so, somewhat before the scheduled nightly stuff starts.
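(The only coordination needed is making sure the nightly jobs start a bit after the power-on time, which is just a scheduling decision; something along the lines of the following crontab entry, with an invented script name, would do.)

    # root's crontab: kick off nightly maintenance a bit after the 2am BIOS power-on
    30 2 * * * /usr/local/sbin/nightly-maintenance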
The scheduled power-on does nothing to machines that are already on but revives any machine that the students have powered off, and setting it for a time well after your labs close means that students won't be around to turn them off again. (If you run 24-hour labs, I suppose you'll just have to hope.)
Also, modern machines with ACPI usually let you control what happens when the front panel power button is pushed. I recommend that you make it reboot the machine; making it do nothing will just encourage the students to reach around behind the machine and yank the power cord. (It is also cheap insurance against the machine actually needing a reboot to get out of some peculiar state.)
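(On Linux, this is typically acpid's job; an event file along these lines, assuming your machines deliver the usual 'button/power' ACPI event, turns the front panel button into a clean reboot. The file name is just convention.)

    # /etc/acpi/events/powerbtn
    event=button/power.*
    action=/sbin/shutdown -r now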
2007-01-21
Sometimes system administration requires a hacksaw
I have to admit that the hacksaw was kind of a special situation.
Our hacksaw usage came about when we were pulling an old rack from our machine room to make space for a new and better rack. The old rack had a bottom plate that was just a frame, with a big opening in the center, and long ago we'd routed the power cord for the rack's power distribution unit (PDU) through that opening (presumably to keep things neat) instead of just out the back.
Normally you'd just turn off the PDU and pull up its cable. Unfortunately, the PDU had one remaining power cord connected to it, which went to a power bar in the next rack over, which in turn was connected to two crucial systems we couldn't power down (partly because one of them is so old that we're not certain how many more power cycles it has in it). The net effect was a loop through the rack's bottom plate that was pinning the rack more or less into place; certainly we weren't going to get it out of the room.
So we (and by 'we' I mean 'someone else, while I watched a bit') took a hacksaw to the rack's bottom plate, cutting things apart enough that in the end we could pull the PDU's power cable out without having to turn it off. (I believe the PDU is now stashed in the rack that it helps power, probably on top of that rack's own PDU. This is par for our course.)
This probably sort of ruins the rack, but it was an old shallow rack that we have no remaining use for anyways, and it is in the same grand spirit that has seen us drilling out rack screw holes to widen them enough so we can screw rails in. (And to think that I used to innocently believe that racks only came in one universal flavour.)
2007-01-08
What I really want from an automounter
As peculiar as it sounds, I don't like the automatic mounting and unmounting of NFS filesystems. I don't have unreliable NFS fileservers and I don't like the various side effects of not having everything mounted; pretty much the only thing it seems good for is making our systems slower and more fragile.
However, one thing that the automounter is good for is maintaining a nice, organized list of NFS mounts in a format that can be easily distributed around without problems, and even automatically getting it right on both the clients and the NFS servers without having to do anything special. Having in the past built programs to try to maintain /etc/fstab 'by hand', this is a service that I can reluctantly appreciate.
So what I really want from the automounter is a magic flag that says 'mount everything (in this map) all the time', with a way to get the automounter to re-read the map and add any new NFS mounts (we rarely remove NFS mounts, so we could handle that by hand). The automounter noticing on its own when the map changes and automatically shuffling mounts around as necessary would be ideal.
While I could build all this by hand, the automounter is ideally placed to do it for me; it already has the maps, the tracking of NFS mounts, and so on. All it would need is a modest amount of additional intelligence and some additional options.
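(To make the by-hand version concrete, here is a rough Python sketch: read a Sun-format indirect map of 'key [-options] server:/path' lines and NFS-mount anything in it that isn't already mounted. The map file and base directory are invented, and a real version would also have to notice and unmount entries that disappear from the map.)

    #!/usr/bin/env python3
    # Sketch: 'mount everything in the map, all the time', done by hand.
    # Reads a Sun-format indirect automounter map and NFS-mounts every
    # entry under BASE_DIR.  Map path and base directory are invented.
    import os
    import subprocess

    MAP_FILE = "/etc/auto.home"    # hypothetical indirect map
    BASE_DIR = "/h"                # where the automounter would put them

    def parse_map(path):
        """Yield (key, options, server:/path) tuples from an indirect map."""
        with open(path) as f:
            for line in f:
                line = line.strip()
                if not line or line.startswith("#"):
                    continue
                fields = line.split()
                key, rest, opts = fields[0], fields[1:], ""
                if rest and rest[0].startswith("-"):
                    opts, rest = rest[0].lstrip("-"), rest[1:]
                if rest:
                    yield key, opts, rest[0]

    def mount_all():
        """Mount every map entry that isn't already mounted."""
        for key, opts, location in parse_map(MAP_FILE):
            mountpoint = os.path.join(BASE_DIR, key)
            if os.path.ismount(mountpoint):
                continue
            os.makedirs(mountpoint, exist_ok=True)
            cmd = ["mount", "-t", "nfs"]
            if opts:
                cmd += ["-o", opts]
            subprocess.call(cmd + [location, mountpoint])

    if __name__ == "__main__":
        mount_all()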
Unfortunately, my odds of persuading anyone else of the wisdom of this crazy idea are probably low. (And I suspect that current automounters don't explicitly check for changed maps and deal with them; they just deal with them implicitly because sooner or later the old NFS mounts that aren't in the maps any more all time out and get unmounted.)
(The automounter is on my mind lately because my (semi-)new job's current environment is heavily automounter based, which causes us periodic heartburn, and we are trying to figure out if we want to keep using the automounter in the new environment we're building or switch to static NFS mounts. I am unfortunately not enthused about either choice; one is convenient heartburn, the other is irksome custom toolbuilding.)