When is using SQL the right answer?

July 11, 2010

One of the things I've been thinking about as a result of my too much SQL war story is how to distinguish between when using SQL is the right answer and when it's the wrong answer. I've just started on this, but I've come to one obvious sign: data reduction.

The downside to just asking your SQL server for the entire database and doing all of the processing in your own program is that the SQL server ships you a lot of data. Thus, one important thing that good SQL queries do is reduce the amount of data that the server has to feed to your program. If your query is reducing a six million row record set to five hundred rows of results, you are likely making good use of SQL.

The same holds true for sub-components of your query, all of the things that you are joining with and selecting on and so on. If they are reducing the amount of data that goes to your program, that's a good sign. Not reducing the amount of data is not necessarily a bad sign, but I think that it is at least a warning sign. It means that that bit of your SQL is just mutating what data your program gets back, not reducing it. Sometimes mutating the data in SQL is the easiest and best way, but sometimes it is leading you down a dangerous path.

(In the future, I'm going to look much closer at how much work and complexity my SQL has when I'm just mutating rows. Simple SQL mutations like plain JOINs are probably a good thing; crazy complex things like I tried to do are clearly the wrong answer.)

Bear in mind that this is a rule of thumb, not a rule, and thus there are lots of exceptions. Sometimes even doing all of the data reduction in SQL is the wrong answer; I suspect that the classical case is if you need multiple levels of reduction and summarization. You can do this in SQL but you need either temporary result tables or multiple queries that run over your full data, so it may well be simpler to have your program do all of the higher levels of reduction.

(For a simple example, consider my case. We needed to compute both the total daily volume for each NAT gateway and the per inside host daily traffic volume. Even if mapping inside hosts to NAT gateways was a process that was amenable to SQL, there is no way to get a single query to give us both results at once. The right answer in this case is to have the program use SQL to generate the inside host report and then calculate the per-gateway volume itself.)

Written on 11 July 2010.
« People forget exceptions
On (not) logging calculated statistics »

Page tools: View Source, Add Comment.
Search:
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Sun Jul 11 00:22:34 2010
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.