2009-12-17
The good and bad of SQL
The good of SQL is that it is an excellent ad-hoc query and reporting tool. If you have an SQL database with a decently formed schema (and in many ways a denormalized schema is the easiest), you can bang together lots of things very fast and easily. Raw SQL is an excellent way to explore your way around a bunch of data, for example to see if there are signs that something odd is going on.
(It also makes it possible to answer silly amusing questions that would not be worth writing actual serious reports for.)
Essentially, raw SQL is a REPL interface for your data, with most of the REPL advantages that you have in any good language. Of course, a good SQL client (one with command line editing, history recall, and so on) helps a great deal; things slow down to some extent if you're editing files and re-running commands to get your latest dose of results, instead of hitting up-arrow and doing some editing.
(I say 'most of' the advantages because there are some interesting issues in trying to validate and debug complex queries.)
The bad of SQL is that it is a Turing tar pit. You can spend acres of time fiddling with your SQL queries, either to get them to give you output that is set up just right or to figure out how to do some tricky query or aggregation entirely in SQL, when perhaps you should have given up long ago and done at least part of the work in a programming language. Modern SQL database servers have just enough functions and additional features to make it look likely that you can do anything if you try hard enough, and there's that old programmer pleasure in discovering the latest neat trick for bending SQL to your will, especially since it's usually accomplished by stringing things together in ever more baroque sequences.
(Guess what I spent all today doing.)
Sidebar: today's PostgreSQL aggregation trick
Without going into why, suppose that you are interested in counting how many distinct /24 subnets are involved in something, except that your raw data contains bare IP addresses. PostgreSQL has no direct way to convert an IP address to its containing /24, but you can do it indirectly:
network(set_masklen(table.ip, 24))
This also neatly illustrates the Turing tar pit nature of the whole process.
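(For comparison, the 'do part of the work in a programming language' route is not much code either. Here is a minimal sketch in modern Python using the standard ipaddress module; the function name and the sample addresses are just made up for illustration.)

    import ipaddress

    def count_distinct_24s(ips):
        # ip_network(..., strict=False) masks off the host bits, giving
        # the containing /24 of each bare address.
        nets = {ipaddress.ip_network(ip + "/24", strict=False) for ip in ips}
        return len(nets)

    print(count_distinct_24s(["192.0.2.10", "192.0.2.200", "198.51.100.7"]))
    # -> 2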
2009-12-09
My views on inheritance versus interface
Due to some recent entries, I've been thinking more about inheritance and interfaces and I've come around to a particular view, which you could call either the duck typing view or the jaundiced C programmer's view. Put simply:
Interfaces are what you should care about from outside a module. Inheritance is, or should be, strictly an implementation issue for inside the module, except when inheritance is the simplest way to expose certain functionality for reuse by the outside world.
(In Python, Exception is one example of such exposed functionality.)
In (strictly typed) languages with neither duck typing nor explicit interfaces, you have no choice but to use inheritance as a standin for interfaces, but this is an artifact of the language's limitations; it's not a desirable state in general. In languages with explicit interfaces, you can use them directly. In a language with duck typing, you should use that or construct an interface system yourself.
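(As a small illustration of what I mean, here is a toy Python sketch; the names are invented. The caller only cares that the object has a read() method, not what it inherits from.)

    class StringSource:
        # Provides the method callers expect without inheriting from
        # any 'interface' base class.
        def __init__(self, text):
            self._text = text

        def read(self):
            return self._text

    def shout(source):
        # Works with real file objects, StringSource, or anything else
        # with a read() method; there is no isinstance() check.
        return source.read().upper()

    print(shout(StringSource("hello")))   # -> HELLO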
I further think that insisting on using inheritance when you really meant interfaces places you on a collision course between two opposing views of how much freedom sub-classes have to modify the functionality of their parent class. If inheritance is interface, it makes sense to rigidly lock down parent class behavior against modification by children; you are enforcing the interface. If inheritance is exposing functionality, it makes just as much sense to allow sub-classes more freedom, even if this means that they could potentially do things that would violate assumptions that are relied on by parent methods.
In a duck typed language, insisting on using inheritance as an interface still exposes you to this conflict. As a result, it significantly constrains the freedom of the module implementor and adds to their work; they must now document and support how to do things with their 'interface' classes, and once they do this their freedom to revise the class implementation is limited to what they can do around the edge of the (public) documentation.
Since using inheritance this way isn't necessary in a duck typed language, I thus believe that doing so is a mistake; it imposes unnecessary costs. (Worse, the costs are not necessarily borne by the people clamouring for the ability to do this, since they are not usually the module author.)
2009-12-05
Overcoming the drawbacks of preforking accept() servers
I was going to say that while preforking accept()-based servers don't in practice have a thundering herd problem, they do have two other issues, namely there's no way to tell when you need to grow the number of processes you're using and no way to tell if a process has frozen. (The more complicated preforking scheme has neither problem.)
However, some thought showed me that it's possible to get around these problems while retaining the advantages of the pure accept()-based scheme. The key change to the simple version is that each child tells the master process when it goes idle and when it handles a new connection (doing so through a pipe that the master process set up for this purpose). Since it sees state transitions, the master process can now easily keep track of when a child has been busy on a single request for too long and kill it.
(Things will be more reliable under load if the child sends a timestamp in its messages, since the master may not process child messages immediately.)
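(To make this concrete, here is a rough Python sketch of the reporting side of things; the message format, the 30-second limit, and the names are all assumptions for illustration, not a finished design.)

    import os, signal, time

    BUSY_LIMIT = 30.0    # assumed maximum time a child may spend on one request

    def report(wpipe, state):
        # Child side: called right after accept() returns ('busy') and right
        # after the request is finished ('idle'). The timestamp lets the
        # master judge how stale the message is if it reads it late.
        msg = "%d %s %.3f\n" % (os.getpid(), state, time.time())
        os.write(wpipe, msg.encode())

    class ChildTracker:
        # Master side: remembers each child's last reported state and when
        # that state transition happened.
        def __init__(self):
            self.state = {}    # pid -> (state, timestamp)

        def handle_line(self, line):
            pid, state, ts = line.split()
            self.state[int(pid)] = (state, float(ts))

        def kill_stuck(self):
            now = time.time()
            for pid, (state, ts) in list(self.state.items()):
                if state == "busy" and now - ts > BUSY_LIMIT:
                    os.kill(pid, signal.SIGKILL)
                    del self.state[pid]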
Deciding when to shrink the worker pool is relatively simple; the master process can track the minimum number of idle workers over the last N seconds. When this low-water mark stays high enough, it can either not restart workers when they exit after reaching their request limit or outright ask them to die (via a signal, for example).
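(In sketch form, with the threshold and the choice of SIGTERM as placeholders:)

    import os, signal

    SPARE_IDLE = 3    # assumed number of idle workers we are happy to keep around

    def maybe_shrink(idle_low_water, idle_workers):
        # idle_low_water is the lowest idle-worker count seen over the last
        # observation window. If we never came close to running out of idle
        # workers, retire one; alternatively, simply don't restart the next
        # worker that exits after reaching its request limit.
        if idle_low_water > SPARE_IDLE and idle_workers:
            os.kill(idle_workers[-1], signal.SIGTERM)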
Deciding when to spin up more workers is more challenging. The only approach I can think of is for the master process to monitor the server socket; if the socket has stayed accept()-able for some length of time and no worker process has changed its state, you start another one.
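(In sketch form, with the wait time made up; a real master would fold this into its main loop instead of calling a separate function.)

    import select

    GROW_WAIT = 2.0    # assumed: how long the socket may stay accept()-able before we add a worker

    def needs_growth(listen_sock, report_pipe):
        # If a connection is already waiting, watch the children's report pipe
        # for GROW_WAIT seconds; if no child changes state in that time and a
        # connection is still waiting to be accepted, every worker is
        # presumably busy and it's time to start another one.
        if not select.select([listen_sock], [], [], 0)[0]:
            return False                 # nothing waiting to be accepted
        if select.select([report_pipe], [], [], GROW_WAIT)[0]:
            return False                 # a worker changed state; reassess later
        return bool(select.select([listen_sock], [], [], 0)[0])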
(The simpler approach to this is just to say that processes are cheap and so you will always start your full pool of workers up front, instead of trying to grow and shrink the pool dynamically. This didn't make sense for Apache but probably does these days for a backend server that is not worrying about talking to slow clients.)