2005-10-23
One reason why I like Unix
; uptime
 19:33:45 up 264 days, 12 min, [...]
; ps -e -o start,comm
 STARTED COMMAND
  Feb 01 init
[...]
  Feb 01 X
  Feb 01 xterm
  Feb 01 fvwm2
  Feb 01 wish8.4
  Feb 01 exmh
[...]
  Apr 12 ssh
[...]
I don't think long uptimes are an exclusive Unix virtue; every operating system can and should have them. But there's machine uptime and then there's total user environment uptime, and my impression is that many systems today are far less good at the latter.
Not only have my office workstation and its system programs been running since February 1st, but I've been logged in and running X Windows continuously since then, along with my window manager, my mail reader, and several other programs that I keep running all the time. I use the machine reasonably intensely; I routinely compile large programs, watch video, play music, and so on.
The ssh command started April 12th has been forwarding X for my environment on the remote machine (which has obviously been up since then; in fact, it was rebooted that day), and that remote environment has been running ever since:
; ps -o start,command
[...]
12Apr05 xrun [...]
12Apr05 xlbiff -title [...]
It is very nice to just be able to expect this kind of quiet, long-term operation from everything that I run; it makes the computer my servant, instead of me the computer's servant ('I am annoyed with life; quit some of your programs to make me happy').
(Now, mind you, I am out of touch with the Microsoft Windows world; it is quite possible that multi-month Windows sessions are now perfectly normal if you want to stay logged in that long. Data points from Windows people are welcome.)
(The observant will gather from this that I have not installed Fedora Core 4 on my office workstation. Surprise, surprise. At this point I may wait for Fedora Core 5, unless I get impatient with outdated software.)
2005-10-18
Another aphorism of system administration
If you haven't tested it, it doesn't work.
I got this from the Extreme Programming movement, but it's just as applicable to systems and procedures as it is to developing software. (Remember verifying backups? That's an instance of this aphorism.)
This has an important corollary:
If you aren't checking it, it's broken.
For example, when you have a spiffy hardware RAID system you should make very sure that you do indeed get notified when a disk goes bad and your RAID ceases to be redundant. Otherwise, you will sooner or later appear in a comp.risks digest story.
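As a sketch of what 'checking it' can look like in practice, here is a minimal degraded-array check. It assumes Linux software RAID, where the state is readable from /proc/mdstat; a hardware RAID controller needs its vendor's query tool instead, but the idea is the same.

#!/bin/sh
# Hypothetical check: a degraded md array shows an underscore in its
# status string in /proc/mdstat, e.g. '[U_]' instead of '[UU]'.
if grep -q '\[U*_' /proc/mdstat; then
    echo "RAID appears degraded on $(hostname):"
    cat /proc/mdstat
fi

Run it from cron and any output gets mailed to you. And per the aphorism itself, you have to actually fail a disk at least once to know that the check (and the mail delivery) really works.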
2005-10-14
On the naming of machines
The other thing we had to do the other day was name the new server we were bringing up, which is often harder than it looks.
Good machine names are important because they let you tell things apart. This means that good names have to be different from each other, so you can avoid fun games like 'was it ws3-05 or ws5-03 that had the problem?'
People usually resort to generic names like 'ws5-03' for two reasons: they need a lot of names, or they need a bunch of names with predictable patterns. Fortunately, there are alternative approaches.
One department here names computers after Toronto streets; servers are named after major north-south streets, and workstations after east-west ones. This has several nice attributes: they're certainly not going to run out, the names are short, different from each other, and already pretty memorable, and there's even a sequence to them.
The only disadvantage this scheme has is that it's confusing and embarrassing to copy it, so everyone else at the university has had to come up with different ones.
In another group, we needed names with a clear sequence for up to 60 or so workstations per lab, so that we could easily map between sequentially assigned IP addresses and workstation names. Instead of using names like 'ws5-03', we decided to use the names of the elements. This has some advantages:
- there is a clear sequence that runs all the way up to 103 machines per lab. (Yes, we've had machines named 'Lawrencium'; it was convenient to steal the high-numbered names for other purposes.)
- the names are quite distinct; we aren't likely to mis-remember Chromium as Oxygen.
- there is both a long and a short form of each name, e.g. Mercury and Hg.
- it feels appropriately educational.
- some of the elements have genuinely cool names. (My favorite is Technetium.)
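As a hypothetical sketch of how the sequence works (the element list is truncated here and the subnet is made up), workstation N simply gets the name of element N, so names and sequentially assigned IP addresses map to each other trivially:

lab=10
n=0
# continues in atomic-number order, up to lawrencium at 103
for name in hydrogen helium lithium beryllium boron carbon nitrogen oxygen; do
    n=$((n + 1))
    echo "192.168.$lab.$n  $name"
done

Going the other way is just as easy: the last octet of a machine's address is its element's atomic number.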
I also like to think that users like logging in to machines called Chromium and Oxygen and Mercury more than they would like logging in to 'ws5-03', because it makes the machines feel less impersonal. I firmly believe that people just plain respond better to names than to numbers.
2005-10-13
Try things out with new machines
I just gave someone at work this advice today, so I might as well repeat it here. We were bringing up a new machine, a nice dual processor server with a hardware RAID-1, and I suggested to him that before he even thought about putting it into production he take the opportunity to yank one of the drives and see what happened.
There are occasions where you would do this to a machine in production, but not very many. New hardware (or idle hardware) is about your only chance to experiment, to see what happens, what goes wrong, and how to fix it.
For example, with hardware RAID there's a collection of interesting questions:
- how does the machine react to a drive going missing?
- does your monitoring notice the problem? (You have monitoring, right?)
- how do you re-add a 'replacement' drive?
- does anything odd happen if you just plug the old 'dead' drive you pulled back in and don't do anything else?
I certainly don't want to be finding this sort of thing out on a machine that's in production. (The users will probably be very irate if I make a mistake in the RAID BIOS and eat the good disk.)
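If the new machine were using Linux software RAID instead of a hardware controller, a dry run through those questions might look something like this (the device names are made up, and a hardware controller has its own BIOS or CLI equivalent of each step):

mdadm --manage /dev/md0 --fail /dev/sdb1     # simulate the dead drive
cat /proc/mdstat                             # did your monitoring notice?
mdadm --manage /dev/md0 --remove /dev/sdb1   # take it out of the array
mdadm --manage /dev/md0 --add /dev/sdb1      # re-add the 'replacement'
cat /proc/mdstat                             # watch the resync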
And if something goes wrong, new machines have expendable software; you can always reinstall it, since there's nothing very important on the disks yet. (I just reinstalled my Solaris 9 test machine yesterday and today, because the first time around the / filesystem was too small.)
In fact, reinstalling your new machine can often result in a cleaner configuration, because the second time around you know much more about setting the machine up and just what you want and need. (And you'll have stubbed your toe already.)
So take the opportunity to live excitingly. Yank the UPS's power cord (for extra fun, plug it back in at the last moment). Pull that RAID disk. Have an 'accident' that locks you out of the root account. It's fun and educational, and only occasionally horrifying.
2005-10-04
Keeping changing systems stable
Back in June, I wrote about how unchanging systems should be stable. There is an important consequence of this:
To control stability, you must control change.
(This clearly follows from change being the only thing that can destabilize your unchanging stable systems.)
I think this is subtly different from being careful with changes merely because the changes themselves might be broken. It is the realization that any change may perturb the overall system and require it to be restabilized; it is the need to control what we have defined as the root cause of instability.
(There are certainly ways to structure systems so they are resilient in the face of change, although I'm not sure if this is a well understood area.)
This also gives me a new perspective on a lot of sysadmin twitches about change, especially change that's not under our control; for example, automatically applied vendor updates. It's not just that vendors might release updates with problems, it's that having a system change unpredictably all the time makes stabilizing it that much more difficult.