2007-01-31
The inherent fragility of complex systems (in system administration)
Complex software systems are fragile, but not for the usual reason of having more places and pieces that can go wrong than simple systems do. Unlike physical machines, software has no mechanical wear and so doesn't just break on its own (barring intrinsic flaws). In the digital world, software that works keeps working until something changes.
The real problem with complex systems is that it's very hard for people to keep track of all of the interrelationships, and thus to see the full effects of doing things. As a result, when you go to do something or change something, it's all too easy to overlook a consequence and create an explosion.
(And it is very frustrating, because usually things are so obvious in hindsight. But this is because when you look back afterwards you don't have to try to keep track of everything, just the bits involved in the failure. Then you clearly see, far too late to be useful, how when you change A it causes B to shift sideways and so C goes completely off the rails.)
It does no good to tell people, yourself included, to study your complex system harder and to be more careful. People simply have a limit to how much they can hold in their head at once, and no amount of exhortation can change it.
(And system administration, to a first-order approximation, is about change.)
Why I am not fond of DHCP in lab environments
Using DHCP to assign IP addresses is pretty popular in environments with lots of machines. You'd think that student labs, full of generic machines, would thus be a great environment for DHCP, but actually I disagree; I believe that (normal) DHCP is not a great match for a lab environment.
The problem with DHCP is that it ties the IP address to the wrong thing. In a lab environment you don't want a machine's IP address to be tied to its hardware; you want its IP address to be tied to its physical position in the lab, so that you can actually find it without having to search through the whole place.
(Given that automatically determining a machine's physical position is hard, I'd be happy if I could dynamically assign IP addresses based on what switch port a machine was plugged into; student lab wiring is usually pretty regular and static, and can thus be mapped easily into a physical position. And in theory you could get this information from sufficiently intelligent switches, and do it on the fly to make up DHCP replies.)
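As a rough illustration of the switch port idea, here is a minimal Python sketch of the lookup involved. Everything in it is hypothetical: the switch names, the port-to-seat table, the 192.0.2.0/24 network (a documentation-only range), and the address offset are made up for the example. It also assumes you already have some way of learning which switch port a request came in on (the DHCP relay agent information option, option 82, is one mechanism that can carry this), and it leaves out the DHCP server side entirely.

```python
import ipaddress

# Hypothetical lab wiring: (switch name, port number) -> seat number in the lab.
# In a real lab this table would come from the wiring plan.
PORT_TO_SEAT = {
    ("lab1-sw1", 1): 1,
    ("lab1-sw1", 2): 2,
    ("lab1-sw2", 1): 25,
}

# Assumed addressing plan: a documentation-only network, with seat N getting
# host address 10 + N so the low addresses stay free for infrastructure.
LAB_NETWORK = ipaddress.ip_network("192.0.2.0/24")
FIRST_HOST_OFFSET = 10


def address_for_port(switch, port):
    """Return the IP address for whatever is plugged into this switch port,
    or None if the port is not a known lab seat."""
    seat = PORT_TO_SEAT.get((switch, port))
    if seat is None:
        return None
    return LAB_NETWORK.network_address + FIRST_HOST_OFFSET + seat


if __name__ == "__main__":
    print(address_for_port("lab1-sw1", 2))
    print(address_for_port("lab9-sw1", 7))
```

The point of the sketch is that the address depends only on where the cable is plugged in, so swapping the hardware at a seat changes nothing in the table.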
While you can tie IP addresses to physical positions with DHCP, you're doing so indirectly, which means that you can't move machines inside a lab without updating your DHCP configuration. Given that you have to do something when machines are moved around anyway, I prefer just giving machines static IP addresses and updating them directly when things move; it has fewer moving parts.