When is using SQL the right answer?
One of the things I've been thinking about as a result of my too much SQL war story is how to distinguish between when using SQL is the right answer and when it's the wrong answer. I've just started on this, but I've come to one obvious sign: data reduction.
The downside to just asking your SQL server for the entire database and doing all of the processing in your own program is that the SQL server ships you a lot of data. Thus, one important thing that good SQL queries do is reduce the amount of data that the server has to feed to your program. If your query is reducing a six million row record set to five hundred rows of results, you are likely making good use of SQL.
The same holds true for sub-components of your query, all of the things that you are joining with and selecting on and so on. If they are reducing the amount of data that goes to your program, that's a good sign. Not reducing the amount of data is not necessarily a bad sign, but I think that it is at least a warning sign. It means that that bit of your SQL is just mutating what data your program gets back, not reducing it. Sometimes mutating the data in SQL is the easiest and best way, but sometimes it is leading you down a dangerous path.
(In the future, I'm going to look much closer at how much work and
complexity my SQL has when I'm just mutating rows. Simple SQL mutations
JOINs are probably a good thing; crazy complex things
like I tried to do are clearly the wrong answer.)
Bear in mind that this is a rule of thumb, not a rule, and thus there are lots of exceptions. Sometimes even doing all of the data reduction in SQL is the wrong answer; I suspect that the classical case is if you need multiple levels of reduction and summarization. You can do this in SQL but you need either temporary result tables or multiple queries that run over your full data, so it may well be simpler to have your program do all of the higher levels of reduction.
(For a simple example, consider my case. We needed to compute both the total daily volume for each NAT gateway and the per inside host daily traffic volume. Even if mapping inside hosts to NAT gateways was a process that was amenable to SQL, there is no way to get a single query to give us both results at once. The right answer in this case is to have the program use SQL to generate the inside host report and then calculate the per-gateway volume itself.)