Thinking about when SQL normalization can improve performance

July 18, 2011

To follow up on an aside I wrote a while back in SQLNormalization, I'm going to take a crack at describing the sort of query where a normalized schema could win over a denormalized schema. First off, I'm going to restrict this to queries that need to join all of the normalized schema tables to create something that's equivalent to the denormalized table. As noted in a comment on the original entry, if you only need partial information normalization can provide you with an immediate win, but it's not really an interesting win; the interesting win is to beat a denormalized schema on its home ground.

Let's start by talking about why normalization costs you performance. The simple answer is that the database must pull data from multiple tables in order to answer a query, and doing this costs you extra IO operations. To turn this around and have a normalized schema perform better on a full cross-table query, we need to create a situation where the normalized schema can actually perform less IO.

If we start with the denormalized table, we can view normalization as splitting the data from each row of the table apart into multiple tables. Each normalized table has a size, the number of records it hold, and when we do a query with WHERE clauses involving the table it will return some subset of the records, the result size. Because normalization pulls out repeated, redundant data, some of the normalized tables will be smaller than others, sometimes a lot smaller (a table of department names might be much smaller than the table of employees). We can say that the query (and the overall denormalized view of the normalized data that's constructed with joins) narrows when it goes through such smaller tables; a few records in those tables correspond to many records in the denormalized view.

(We can also say that a normalized table is a full sized table where it has as many records as the denormalized table does. Often you will have at least one full sized table.)

We now have an answer: normalization saves on IO when the database engine can use such a narrowing to cheaply exclude a lot of (denormalized) records from the query's result set (or equivalently, to cheaply work out records to include). Now, this is not a simple answer because a lot of things affect whether or not the database engine can save any IO this way. For instance, if it's already doing a full table scan of a full sized table there is probably no savings possible (unless the denormalized table has abnormally large rows, scanning a full sized table involves as much IO as scanning the denormalized table would since both have the same number of records). Equally, if the engine can already get a small result set purely from index checks on a full table you will probably not get much further reduction from a narrow table.

Thus I think that the ideal query to save IO would have two characteristics. First, it would involve a WHERE restriction on a narrow table that can't be answered from an index (or at least not from an index that would exist in the denormalized table) and that excludes a lot of records. Second, at the same time the WHERE restrictions on larger tables (especially full tables) should still leave you with lots of records that will not make it to the final result set (the extreme version of this is that you have no such WHERE restrictions on the full table at all). This creates a situation where the narrow tables do most of the filtering of the result set and the database engine could not do the equivalent filtering cheaply on a denormalized table.

Whether any particular database can achieve these results depends a lot on how smart the database's query planner and optimizer is and also how well it can estimate what the most effective way to execute a query is. In particular, I don't know how smart common database engines like PostgreSQL and MySQL are.

(Disclaimer: all of this is theoretical thinking. Consult a database professional if you want real answers.)

Written on 18 July 2011.
« My view on ORMs
Our ZFS spares handling system (part 3) »

Page tools: View Source, Add Comment.
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Mon Jul 18 22:15:32 2011
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.