The big trick of running lots of systems
There are a lot of rules for running large-scale systems with lots of machines. I'll probably write up my own version of them at some point. But they all really come down to one big trick:
Don't administer individual machines.
That's it. Everything else is in the implementation details. (Of course, the devil is always in the details.)
But what does it mean? More or less what it says: you should never deal with machines one by one, ideally not even if one of them is exploding. Dealing with machines one by one is somewhat like trying to get through a swamp on foot; you can make progress, but oh so very slowly, and slogging through the mud is very tiring.
This deep principle underlies a lot of large-scale system administration tools, including things like LDAP, NIS, and automounters. (Which are just ways of making it so that you don't have to worry about /etc/fstab and so on on each machine.)
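To make the idea a bit more concrete, here is a minimal sketch of what 'not administering individual machines' can look like for something like /etc/fstab. It is only an illustration, not how LDAP, NIS, or any particular automounter works: the role names, host names, file server, and the `render_fstab` and `render_all` helpers are all made up for the example. The point is that you edit one central description and every machine's file is generated from it.

```python
# Hypothetical central description of NFS mounts, keyed by machine role.
# Nothing here is specific to any one machine.
MOUNTS_BY_ROLE = {
    "all": [("fileserver:/home", "/home", "nfs", "rw,hard")],
    "compute": [("fileserver:/scratch", "/scratch", "nfs", "rw,hard")],
}

# Hypothetical inventory: each machine is just a name plus its roles.
HOSTS = {
    "node1": ["compute"],
    "node2": ["compute"],
    "login1": [],
}


def render_fstab(roles):
    """Return fstab-style text for a machine with the given roles."""
    lines = []
    # Every machine gets the "all" mounts, then the mounts for its roles.
    for role in ["all", *roles]:
        for device, mountpoint, fstype, options in MOUNTS_BY_ROLE.get(role, []):
            lines.append(f"{device}\t{mountpoint}\t{fstype}\t{options}\t0\t0")
    return "\n".join(lines) + "\n"


def render_all(hosts):
    """Generate the file contents for every machine in one pass."""
    return {name: render_fstab(roles) for name, roles in hosts.items()}


if __name__ == "__main__":
    for name, text in render_all(HOSTS).items():
        print(f"# {name}")
        print(text)
```

Adding a new mount for all compute machines is then a one-line change to `MOUNTS_BY_ROLE`, no matter how many machines have that role. Actually getting the generated files onto the machines is a separate problem that this sketch deliberately leaves out.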
(Like the best big tricks this is in some ways a very Zen thing, so it's hard to find much to say about it that doesn't feel like belaboring the obvious.)
There's two sorts of large systems
A lot of what I think about and do is keeping large computer systems running. In a recent online discussion I (re)realized that there are two different sorts of 'large systems', each with very different challenges.
One sort of 'large system' is large because it handles a lot, for example a mail system that handles email for 25,000 people on a single, beefy machine. Running a single large machine like this is mostly a matter of tuning, and it's not what I deal with. (Not that tuning a large complex system is easy; there are many complex and sometimes counterintuitive parts.)
The other sort of 'large system' is large because it has lots of machines. This is what I deal with (although somewhat on the small end of it). My experience is that when you have lots of machines, you have an entirely different set of problems than administering a single machine, no matter how beefy.
When you have large problems you'll often wind up dealing with lots of machines, because adding machines has become the easiest way to scale up compute power for a lot of them.
One important resource for this field is the annual LISA (Large Installation System Administration) conference put on by Usenix (technically I believe it's organized by SAGE, which is a Usenix subgroup). Usenix has a number of past LISA proceedings available online on their web site (in the publications section); it's well worth your time to browse them and read any interesting papers.