2010-07-20
The sysadmin view of messages from programs
A lot of programs and systems like producing warning messages in various circumstances, or just informative messages. Sometimes the authors of these systems are surprised when sysadmins have very grumpy reactions to those messages. Fortunately, I have realized that there is a very simple and general rule about when your messages will irritate sysadmins and when they won't, and it even ties into current jargon:
You should only produce actionable messages (by default).
An actionable message is a message where a sysadmin can and in fact should take some action in response to it. It is not an error (or your program would stop), but it is reporting something that we should go deal with. Ideally it's reporting something that we must go deal with.
Non-actionable messages are noise because there is nothing sysadmins can or should do with them. Your program shouldn't produce noise both because it annoys people and because it makes the actual signal harder to see. In particular, anything that's expected to happen in normal operation is by definition not actionable, or at least it shouldn't be; your program should just deal with it.
This rule should not be a surprise, because it is the same rule that you should already be using for general user interfaces. We've long since passed the point where programs aimed at ordinary users keep popping up alerts about routine things (well, I hope we have) or asking people stupid questions. These days the only excuse you have for bothering the user is if there's something they have to deal with right now or your program can't proceed.
(And as everyone understands, you had better be prepared for them to blindly click whatever option makes your question go away the fastest. Extensions to sysadmin behavior are left as an exercise for the reader.)
The corollary to this is that generally there should be some way of shutting your program up about its actionable message because we've looked at the issue you've reported and concluded that it's actually not a problem in our environment. Perhaps you think that this issue is always a serious problem, no matter what; in that case, perhaps your warning message should be an error instead.
(Note that this is specifically about what messages get produced by default. When someone is debugging and digging for problems, sure, report all of the non-actionable things that you want. But this is not the default situation.)
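As a rough sketch of what this looks like in a shell script: the VERBOSE and SUPPRESS variables, the function names, and the sample messages below are all made up for illustration, not taken from any real program. Informational chatter is off unless someone asks for it, and each warning has a name so it can be switched off once a sysadmin has looked at it.

```shell
#!/bin/sh
# Hypothetical knobs: VERBOSE=1 turns on non-actionable chatter,
# SUPPRESS is a space-separated list of warning names that the
# sysadmin has already looked at and dismissed.
VERBOSE=${VERBOSE:-0}
SUPPRESS=${SUPPRESS:-}

# Informational messages: silent unless someone is debugging.
info() {
    if [ "$VERBOSE" = 1 ]; then
        echo "info: $*" >&2
    fi
}

# Actionable warnings: shown by default, but each one has a name so
# that it can be turned off individually.
warn() {
    name=$1
    shift
    case " $SUPPRESS " in
        *" $name "*) return 0 ;;
    esac
    echo "warning [$name]: $*" >&2
}

info "processed 42 records"
warn disk-nearly-full "/var is at 95%; clean up or grow it"
```

Run as is, this prints only the warning. Setting SUPPRESS=disk-nearly-full silences it, and VERBOSE=1 brings back the informational line.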
2010-07-11
When is using SQL the right answer?
One of the things I've been thinking about as a result of my 'too much SQL' war story is how to distinguish between when using SQL is the right answer and when it's the wrong answer. I've just started on this, but I've come to one obvious sign: data reduction.
The downside to just asking your SQL server for the entire database and doing all of the processing in your own program is that the SQL server ships you a lot of data. Thus, one important thing that good SQL queries do is reduce the amount of data that the server has to feed to your program. If your query is reducing a six million row record set to five hundred rows of results, you are likely making good use of SQL.
The same holds true for sub-components of your query, all of the things that you are joining with and selecting on and so on. If they are reducing the amount of data that goes to your program, that's a good sign. Not reducing the amount of data is not necessarily a bad sign, but I think that it is at least a warning sign. It means that that bit of your SQL is just mutating what data your program gets back, not reducing it. Sometimes mutating the data in SQL is the easiest and best way, but sometimes it is leading you down a dangerous path.
(In the future, I'm going to look much more closely at how much work and
complexity my SQL has when I'm just mutating rows. Simple SQL mutations
like plain JOINs are probably a good thing; crazy complex things
like I tried to do are clearly the wrong answer.)
Bear in mind that this is a rule of thumb, not a rule, and thus there are lots of exceptions. Sometimes even doing all of the data reduction in SQL is the wrong answer; I suspect that the classical case is if you need multiple levels of reduction and summarization. You can do this in SQL but you need either temporary result tables or multiple queries that run over your full data, so it may well be simpler to have your program do all of the higher levels of reduction.
(For a simple example, consider my case. We needed to compute both the total daily volume for each NAT gateway and the daily traffic volume for each inside host. Even if mapping inside hosts to NAT gateways was a process that was amenable to SQL, there is no way to get a single query to give us both results at once. The right answer in this case is to have the program use SQL to generate the inside host report and then calculate the per-gateway volume itself.)
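Here is a minimal sketch of that second stage, in shell and awk. The sample rows are made up and stand in for whatever the per-host SQL query returns, and the host to gateway mapping (the third octet of the address picks the gateway) is purely hypothetical; the real mapping was not this simple.

```shell
#!/bin/sh
# Made-up per-host daily totals, standing in for the rows an SQL
# query would return. Columns: day, inside host, bytes.
per_host_rows() {
    cat <<'EOF'
2010-07-10 10.0.1.5 1000
2010-07-10 10.0.1.9 500
2010-07-10 10.0.2.7 200
2010-07-11 10.0.1.5 300
EOF
}

# Second level of reduction, done in the program instead of in SQL.
# Hypothetical mapping: the third octet of the host picks the gateway.
per_gateway_totals() {
    awk '{
        split($2, octet, ".")
        total[$1 " gw" octet[3]] += $3
    }
    END { for (key in total) print key, total[key] }' | sort
}

per_host_rows | per_gateway_totals
# prints:
# 2010-07-10 gw1 1500
# 2010-07-10 gw2 200
# 2010-07-11 gw1 300
```

The per-host rows are still there for the inside host report, and the per-gateway numbers cost one extra pass over data that the program already has.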
2010-07-07
A gotcha with the Bourne shell's set -e and &&
Suppose that you have the following Bourne shell code:
    set -e
    cmd1 && cmd2 && cmd3
    echo all done
Now suppose that cmd2 exits with a non-zero status. Do you expect
the script to abort, or to print out 'all done'?
My assumption when I was writing a script recently was that the script
would abort; after all, I had set -e turned on. But this is not what
happens, and in fact most Bourne shell manpages spell it out explicitly;
set -e only exits if what fails is a simple command that is not
having its exit status tested. Everything except the final command in a
series of &&'s is having its exit status tested, so failure of cmd2
here merely causes cmd3 not to be run.
(Different Bourne shell implementations use different wording about
the exact conditions, but I suspect that they behave the same. See
the Single Unix Specification description of set
for perhaps the authoritative wording on it.)
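
This is easy to see for yourself. The following sketch uses true and false as stand-ins for the commands and runs each snippet in a child shell, so that set -e there cannot kill the script doing the testing; run_snippet is a made-up helper name.

```shell
#!/bin/sh
# Hypothetical helper: run a snippet in a child shell with set -e on,
# then report what it printed and how it exited.
run_snippet() {
    status=0
    output=$(sh -c "set -e
$1") || status=$?
    printf '%s (exit status %s)\n' "${output:-<no output>}" "$status"
}

# A command in the middle of the && list fails: the script carries on.
run_snippet 'true && false && true
echo "all done"'
# prints: all done (exit status 0)

# The last command of the && list fails: set -e aborts the script.
run_snippet 'true && true && false
echo "all done"'
# prints: <no output> (exit status 1)
```

Only the second case aborts, because only there is the failing command the final one in the series.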
If you are writing shell scripts, the immediate consequence of this is
that it is not entirely safe to start out writing a script with various
coded error checks and then later decide that you always want things to
just exit on errors and add a set -e to handle it all; you may find
that your script is not aborting when you want it to, or alternately
that the script's failure to abort is quietly hiding the fact that a
single command did fail and that something else didn't get run.
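
If you do want the script to abort when any command in such a series fails, two approaches come to mind. Neither is from the script I was writing, so treat this as a sketch; again true and false stand in for the real commands, and exit_status_of is a made-up helper that runs each variant in a child shell.

```shell
#!/bin/sh
# Hypothetical helper: run a snippet under set -e in a child shell and
# print only its exit status.
exit_status_of() {
    status=0
    sh -c "set -e
$1" >/dev/null || status=$?
    echo "$status"
}

# Variant 1: one command per line, so each is a simple command whose
# exit status is not being tested.
one_per_line='true
false
true
echo "all done"'

# Variant 2: keep the && chain, but end it with an explicit exit.
chain_with_exit='true && false && true || exit 1
echo "all done"'

echo "one per line: exit status $(exit_status_of "$one_per_line")"
echo "chain with || exit: exit status $(exit_status_of "$chain_with_exit")"
```

Both variants stop before reaching the final echo and exit with status 1. The first relies on set -e alone; the second spells out the failure handling and so behaves the same way whether or not set -e is on.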