The inherent fragility of complex systems (in system administration)

January 31, 2007

Complex software systems aren't inherently fragile for the usual reason, which is that they have more places and pieces to go wrong than simple systems do. Unlike physical machines, computer software has no mechanical wear and thus doesn't just break on its own (barring intrinsic flaws). In the digital world, software that works keeps working until something changes.

The real problem with complex systems is that it's very hard for people to keep track of all of the interrelationships, and thus to see the full effects of doing things. As a result, when you go to do something or change something, it's too easy to overlook a consequence and create an explosion.

(And it is very frustrating, because usually things are so obvious in hindsight. But this is because when you look back afterwards you don't have to try to keep track of everything, just the bits involved in the failure. Then you clearly see, far too late to be useful, how when you change A it causes B to shift sideways and so C goes completely off the rails.)
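To make the A, B, C chain concrete, here is a minimal sketch in Python of walking a map of who depends on what to list everything a change might touch. The `DEPENDENTS` table and the `affected_by` function are hypothetical names invented for illustration, and the sketch assumes that someone has already written the interrelationships down correctly, which is exactly the part that is hard in a real system.

```python
from collections import deque

# Hypothetical map for illustration: each key lists the things that
# directly depend on it. A real system's map is far larger and is
# rarely written down completely.
DEPENDENTS = {
    "A": ["B"],
    "B": ["C"],
    "C": [],
    "D": [],
}


def affected_by(change, dependents):
    """Return everything downstream of `change`, in breadth-first order."""
    seen = []
    queue = deque(dependents.get(change, []))
    while queue:
        item = queue.popleft()
        # Skip anything already visited so that cycles terminate.
        if item == change or item in seen:
            continue
        seen.append(item)
        queue.extend(dependents.get(item, []))
    return seen


print(affected_by("A", DEPENDENTS))
```

With three entries the answer is obvious by inspection. The walk only helps as far as the map is accurate, and a relationship nobody recorded is still a relationship nobody sees.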

It does no good to tell people, yourself included, to study your complex system harder and to be more careful. People simply have a limit to how much they can hold in their heads at once, and no amount of exhortation can change it.

(And system administration, to a first-order approximation, is about change.)
