2005-07-20
Please produce useful error messages
I just helped someone whose Fedora Core 3 installation was producing the following error message:
    # yum check-update
    Setting up repositories
    Cannot find a valid baseurl for repo: updates-released
This is a beautiful example of a Unix tendency that people gripe about: technically correct but completely useless error messages. While yum is correctly complaining that it cannot generate a valid URL for the 'updates-released' RPM repository, it would be much more useful if it told us why, with an error message such as:
Cannot fetch mirror list: unable to resolve hostname 'fedora.redhat.com'
This would have immediately led us to wonder why the machine could not
resolve the name, which in turn would have led us straight to the
actual problem, a broken /etc/resolv.conf.
Generating useful and accurate error messages is an art, which means you should think about it when writing programs. In particular, think about how propagating useful errors back to the top level will affect your program's structure, because it often does. This especially applies to system programs, which are the ones that often break in mysterious ways and leave people worried and lost.
Python makes it pretty easy to do a variant on this, where at each step of handling an error you prepend your packet of context information and then pass it upwards. The end result winds up with errors that look like this:
Cannot find a valid baseurl for repo updates-released: while getting mirror list: cannot fetch URL http://download.redhat.com/<blah>: Hostname not found.
This is long but at least complete, and lets you know what went wrong at the lowest level and all of the steps backwards to the high level failure. (Python programs often use simple strings as the error messages, but there is no reason why a GUI program cannot put more structure in and thereby display a more concise dialog. Possibly with an 'expand for the gory intermediate details' option.)
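As a rough sketch of the pattern (in modern Python, with made-up function names and URL), each layer catches the error from below, prepends its own bit of context, and re-raises:

    import urllib.request

    def fetch_url(url):
        # Lowest level: turn the raw I/O failure into an error that
        # names the URL involved. (URLError is a subclass of OSError.)
        try:
            return urllib.request.urlopen(url).read()
        except OSError as e:
            raise RuntimeError("cannot fetch URL %s: %s" % (url, e))

    def get_mirror_list(url):
        # Middle level: prepend what we were trying to do and pass it up.
        try:
            return fetch_url(url).splitlines()
        except RuntimeError as e:
            raise RuntimeError("while getting mirror list: %s" % e)

    def setup_repo(name, url):
        # Top level: the final message reads from the high-level failure
        # all the way down to the low-level cause.
        try:
            return get_mirror_list(url)
        except RuntimeError as e:
            raise RuntimeError("cannot find a valid baseurl for repo %s: %s"
                               % (name, e))

Each layer only knows its own bit of context, but the message that reaches the user carries the whole chain.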
2005-07-17
Skills I use when troubleshooting
A while back I wrote about FutureSysadminJobs and suggested that people who wanted satisfying careers as system administrators over the long term should develop the skills to be troubleshooters. Which raises the question: what are those skills?
The first and most important skill is: you have to find all this interesting. Enjoying being a curious packrat is not entirely required, but I think it helps a whole lot, especially as you'll probably need to learn a number of things on your own time.
Other than that, based on my own experiences troubleshooting various issues I'm going to say:
* Troubleshooters have to know how to program. Really program, not just write little scripts. Sometimes you'll write programs and sometimes you'll have to understand them, and you can't do either if you can't program yourself. (This is probably an unpopular view, but I feel that anyone who can't program is fundamentally crippled here.)
* Troubleshooters have to know how to debug, which is harder than it looks. Debugging is part instincts and part paranoia and part obsessive completeness and almost entirely without useful textbooks, which means you have to learn it the hard way, by doing it.
* Troubleshooters have to know how things work, because if you don't understand how things work you can't see where they can go wrong. (This means that you are going to be storing away a lot of trivia in your mind. It will help a lot if you like doing this.)
* Troubleshooters need to know how to dive into big programs, zoom right in on the one little relevant bit, understand it, and then change it. This is a distinctly different skill than normal program maintenance, and like debugging you mostly get to develop it by being thrown in the deep end.
* Similarly, you need to be able to dive into a complex system and work out what bit is doing what. Systems are more loosely coupled than programs, so I tend to think that this is a somewhat different set of skills.
* Troubleshooters need to be able to learn fast. Part of that is being able to research things, to figure out what articles or books or chapters have the stuff that you need to know right away, and what bits you can skim or omit.
* It's certainly helped me to know a number of different computer languages and be reasonably familiar with a number of different systems. Pick nicely divergent ones, so that you get exposed to a bunch of different ideas.
(I have probably omitted a number of things. I may update this later, and comments are welcome.)
2005-07-10
Tools versus frontends in systems programs
I feel that people writing programs for system administration should be forced to make a choice: are you writing a tool or a frontend? 'Both' is not a good answer, because nine times out of ten that leads to a program that does neither very well and irritates everyone, leading to things like Friday's entry on why apt-get is not my favorite program.
A system administration tool is a reusable building block, part of the larger system. It's designed to be easily used by scripts and other programs, as well as by system administrators. By contrast, a frontend is 'user friendly' (and program-hostile): chatty, redundantly informative, and often interactive.
If you think about system administration programs this way, building a frontend without an accompanying set of tools, making it the only way to do something, is clearly a mistake. Also, tools are clearly more important than frontends; you can always build a frontend on top of tools, but you can't build tools on top of a frontend very easily (if at all).
Tools are vitally important because people are always needing to administer systems in new ways you didn't expect. If they have tools, they can; if they don't, they're trapped. (If you think you know everything that people are going to need from your program to administer their systems, you're fooling yourself.)
For all its issues, Red Hat's rpm is a tool, not a
frontend. This means it is easy to build other systems around the core
it provides (whether directly as a command or through using the RPM
libraries); in turn this has created a pile of frontends and other
tools based on this core, from up2date and yum to our local tools for
automated RPM management across a large number of systems.
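To make 'tool' concrete, here is a minimal sketch of the sort of automation a scriptable program permits (rpm's --qf query-format option is real; the script around it is just an illustration):

    import subprocess

    def installed_packages():
        # rpm used as a tool: one machine-parseable line per package,
        # in a format of our choosing, with no chatty output to scrape.
        out = subprocess.check_output(
            ["rpm", "-qa", "--qf", "%{NAME} %{VERSION}-%{RELEASE}\n"])
        pkgs = {}
        for line in out.decode().splitlines():
            name, verrel = line.split(None, 1)
            pkgs[name] = verrel
        return pkgs

A frontend that printed the same information only in human-oriented prose would force every such script to scrape and guess instead.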
A big part of my issues with apt-get is that apt-get is more of a frontend than a tool, and there is no tool program for doing the same things (or at least, if there is it is hideously underdocumented). This means that people who want to use apt-get as part of larger processes are forced into contortions that are at best awkward.
2005-07-04
The big trick of running lots of systems
There are a lot of rules for running large-scale systems with lots of machines. I'll probably write up my own version of them at some point. But they all really come down to one big trick:
Don't administer individual machines.
That's it. Everything else is in the implementation details. (Of course, the devil is always in the details.)
But what does it mean? More or less what it says: you should never deal with machines one by one, ideally not even if one of them is exploding. Dealing with machines one by one is somewhat like trying to get through a swamp on foot; you can make progress, but oh so very slowly, and slogging through the mud is very tiring.
This deep principle underlies a lot of large scale system
administration tools, including things like LDAP, NIS, and
automounters. (Which are just ways of making it so that you don't have
to worry about /etc/passwd and /etc/fstab and so on on each
machine.)
(Like the best big tricks this is in some ways a very Zen thing, so it's hard to find much to say about it that doesn't feel like belaboring the obvious.)
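Still, here is one concrete shape the principle takes: drive every machine from a central list, never by hand. A minimal sketch in Python, assuming ssh access and a made-up host list:

    import subprocess

    # In real life this comes from a central inventory, not a constant.
    HOSTS = ["node1", "node2", "node3"]

    def run_everywhere(cmd):
        # Apply one action to all machines, collecting failures rather
        # than stopping to fiddle with any single machine.
        failures = []
        for host in HOSTS:
            if subprocess.call(["ssh", host, cmd]) != 0:
                failures.append(host)
        return failures

    if __name__ == "__main__":
        bad = run_everywhere("uptime")
        if bad:
            print("failed on:", ", ".join(bad))

The script itself is trivial; the point is the shape. The machine list is data, and the administration happens to all of the machines at once.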
There are two sorts of large systems
A lot of what I think about and do is keeping large computer systems running. In a recent online discussion I (re)realized that there are two different sorts of 'large systems', each with very different challenges.
One sort of 'large system' is large because it handles a lot, for example a mail system that handles email for 25,000 people on a single, beefy machine. Running a single large machine like this is mostly a matter of tuning, and it's not what I deal with. (Not that tuning a large complex system is easy; there are many complex and sometimes counterintuitive parts.)
The other sort of 'large system' is large because it has lots of machines. This is what I deal with (although somewhat on the small end of it). My experience is that when you have lots of machines, you have an entirely different set of problems than administering a single machine, no matter how beefy.
When you have large problems you'll often wind up dealing with lots of machines, because lots of machines have become the easiest way to scale up compute power.
One important resource for this field is the annual LISA (Large Installation System Administration) conference put on by Usenix (technically I believe it is organized by SAGE, a Usenix subgroup). Usenix has a number of past LISA proceedings available online on its web site, in the publications section; it's well worth your time to browse them and read any interesting papers.