There are two sorts of large systems
A lot of what I think about and do is keeping large computer systems running. In a recent online discussion I (re)realized that there are two different sorts of 'large systems', each with very different challenges.
One sort of 'large system' is large because it handles a lot of load, for example a mail system that handles email for 25,000 people on a single, beefy machine. Running a single large machine like this is mostly a matter of tuning, and it's not what I deal with. (Not that tuning a large, complex system is easy; there are many intricate and sometimes counterintuitive parts.)
The other sort of 'large system' is large because it has lots of machines. This is what I deal with (although somewhat on the small end of it). My experience is that when you have lots of machines, you have an entirely different set of problems than you do when administering a single machine, no matter how beefy.
When you have large problems you'll often wind up dealing with lots of machines, because adding machines has become the easiest way to scale up compute power for many of them.
One important resource for this field is the annual LISA (Large Installation System Administration) conference put on by Usenix (technically I believe it's organized by SAGE, which is a Usenix subgroup). Usenix has a number of past LISA proceedings available online on its web site (in the publications section); it's well worth your time to browse them and read any interesting papers.