Thinking about when SQL normalization can improve performance
To follow up on an aside I wrote a while back in SQLNormalization, I'm going to take a crack at describing the sort of query where a normalized schema could win over a denormalized schema. First off, I'm going to restrict this to queries that need to join all of the normalized schema tables to create something that's equivalent to the denormalized table. As noted in a comment on the original entry, if you only need partial information normalization can provide you with an immediate win, but it's not really an interesting win; the interesting win is to beat a denormalized schema on its home ground.
Let's start by talking about why normalization costs you performance. The simple answer is that the database must pull data from multiple tables in order to answer a query, and doing this costs you extra IO operations. To turn this around and have a normalized schema perform better on a full cross-table query, we need to create a situation where the normalized schema can actually perform less IO.
If we start with the denormalized table, we can view normalization
as splitting the data from each row of the table apart into multiple
tables. Each normalized table has a size, the number of records it
hold, and when we do a query with
WHERE clauses involving the table
it will return some subset of the records, the result size. Because
normalization pulls out repeated, redundant data, some of the normalized
tables will be smaller than others, sometimes a lot smaller (a table of
department names might be much smaller than the table of employees).
We can say that the query (and the overall denormalized view of the
normalized data that's constructed with joins) narrows when it goes
through such smaller tables; a few records in those tables correspond to
many records in the denormalized view.
(We can also say that a normalized table is a full sized table where it has as many records as the denormalized table does. Often you will have at least one full sized table.)
We now have an answer: normalization saves on IO when the database engine can use such a narrowing to cheaply exclude a lot of (denormalized) records from the query's result set (or equivalently, to cheaply work out records to include). Now, this is not a simple answer because a lot of things affect whether or not the database engine can save any IO this way. For instance, if it's already doing a full table scan of a full sized table there is probably no savings possible (unless the denormalized table has abnormally large rows, scanning a full sized table involves as much IO as scanning the denormalized table would since both have the same number of records). Equally, if the engine can already get a small result set purely from index checks on a full table you will probably not get much further reduction from a narrow table.
Thus I think that the ideal query to save IO would have two
characteristics. First, it would involve a
WHERE restriction on a
narrow table that can't be answered from an index (or at least not from
an index that would exist in the denormalized table) and that excludes
a lot of records. Second, at the same time the
WHERE restrictions on
larger tables (especially full tables) should still leave you with lots
of records that will not make it to the final result set (the extreme
version of this is that you have no such
WHERE restrictions on the
full table at all). This creates a situation where the narrow tables do
most of the filtering of the result set and the database engine could
not do the equivalent filtering cheaply on a denormalized table.
Whether any particular database can achieve these results depends a lot on how smart the database's query planner and optimizer is and also how well it can estimate what the most effective way to execute a query is. In particular, I don't know how smart common database engines like PostgreSQL and MySQL are.
(Disclaimer: all of this is theoretical thinking. Consult a database professional if you want real answers.)