Wandering Thoughts archives

2011-07-18

Thinking about when SQL normalization can improve performance

To follow up on an aside I wrote a while back in SQLNormalization, I'm going to take a crack at describing the sort of query where a normalized schema could win over a denormalized schema. First off, I'm going to restrict this to queries that need to join all of the normalized schema tables to create something that's equivalent to the denormalized table. As noted in a comment on the original entry, if you only need partial information normalization can provide you with an immediate win, but it's not really an interesting win; the interesting win is to beat a denormalized schema on its home ground.

Let's start by talking about why normalization costs you performance. The simple answer is that the database must pull data from multiple tables in order to answer a query, and doing this costs you extra IO operations. To turn this around and have a normalized schema perform better on a full cross-table query, we need to create a situation where the normalized schema can actually perform less IO.

If we start with the denormalized table, we can view normalization as splitting the data from each row of the table apart into multiple tables. Each normalized table has a size, the number of records it hold, and when we do a query with WHERE clauses involving the table it will return some subset of the records, the result size. Because normalization pulls out repeated, redundant data, some of the normalized tables will be smaller than others, sometimes a lot smaller (a table of department names might be much smaller than the table of employees). We can say that the query (and the overall denormalized view of the normalized data that's constructed with joins) narrows when it goes through such smaller tables; a few records in those tables correspond to many records in the denormalized view.

(We can also say that a normalized table is a full sized table where it has as many records as the denormalized table does. Often you will have at least one full sized table.)

We now have an answer: normalization saves on IO when the database engine can use such a narrowing to cheaply exclude a lot of (denormalized) records from the query's result set (or equivalently, to cheaply work out records to include). Now, this is not a simple answer because a lot of things affect whether or not the database engine can save any IO this way. For instance, if it's already doing a full table scan of a full sized table there is probably no savings possible (unless the denormalized table has abnormally large rows, scanning a full sized table involves as much IO as scanning the denormalized table would since both have the same number of records). Equally, if the engine can already get a small result set purely from index checks on a full table you will probably not get much further reduction from a narrow table.

Thus I think that the ideal query to save IO would have two characteristics. First, it would involve a WHERE restriction on a narrow table that can't be answered from an index (or at least not from an index that would exist in the denormalized table) and that excludes a lot of records. Second, at the same time the WHERE restrictions on larger tables (especially full tables) should still leave you with lots of records that will not make it to the final result set (the extreme version of this is that you have no such WHERE restrictions on the full table at all). This creates a situation where the narrow tables do most of the filtering of the result set and the database engine could not do the equivalent filtering cheaply on a denormalized table.

Whether any particular database can achieve these results depends a lot on how smart the database's query planner and optimizer is and also how well it can estimate what the most effective way to execute a query is. In particular, I don't know how smart common database engines like PostgreSQL and MySQL are.

(Disclaimer: all of this is theoretical thinking. Consult a database professional if you want real answers.)

programming/SQLNormalizationII written at 22:15:32;

My view on ORMs

Via Hacker News, I recently read ORM's hidden cost. As you might guess, I have an opinion on this.

First off, I agree with the article; ORMs have problems. Among other things that the article talks about, there are fundamental mismatches between the object oriented approach and the SQL approach to a problem. But ultimately the issues about ORMs are a lot like the SQL hairshirt issues: they matter in theory but they do not matter in practice to most people.

Despite all of these issues, ORMs work great at a small size and a small size is where almost all web applications live. Private and internal web apps are almost invariably small, and many public web apps will never grow very big. What matters in this domain is fast and easy development, which ORMs deliver handily because they give you a database that you don't have to think about or learn.

(Many people in this area would be just as happy with an object store that they could query. Really, that's what they're using SQL for.)

If your app or database gets big, this may not work any more and you'll need to get familiar with your actual database (which includes going to the SQL level). But if your app gets big you'll probably have to deal with any number of issues anyways, because big websites are fundamentally different from small websites; you'll get to care about load balancing, front end caches, structuring your HTML and CSS and JavaScript for fast loading, and all sorts of issues in addition to efficient use of your database.

(You can worry about these issues in advance if you want to, and I've done so sometimes for web applications (DWiki is one example). But it will probably slow down your development unless you're pretty familiar with the easy best practices in this area, familiar enough to do them by reflex.)

This implies that the first thing you should do when you're thinking about using an ORM is estimate how big a database and how active a web app you're likely to wind up with. Of course you should be doing that anyways, because it'll tell you how much you have to care about all of the SQL hairshirt issues.

programming/MyViewOnORMS written at 00:27:41;


Page tools: See As Normal.
Search:
Login: Password:

This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.