My understanding of SQL normalization

June 2, 2011

In SQLHairshirtIssue, I brought up SQL normalization in passing in a way that may have left people thinking that it was done for performance. My understanding is that this is not the case, or at least not why you do normalization.

In short, normalization is not about performance; normalization is about not making your life unnecessarily painful later. To simplify slightly, normalization is more or less about not repeating yourself (with a sideline of not making extra work for yourself); each piece of information appears in only one entry in one table, instead of being repeated in multiple entries. This insures that if you need to add, change, or delete something you can do it in only one place and that you are sure to have made a complete change. Schemas that are not normalized require you to do more work when you change things, among other drawbacks.

Because normalization creates more tables (by splitting large kitchen sink records up), queries against normalized schemas normally join across more tables. This may or may not have a performance impact, and which way it goes depends very much on how smart your database engine is and what sort of query you're doing. However, since you're joining across more tables I would not be surprised if it often performed worse since you now need to do more disk IO in order to retrieve the same data (again, this somewhat depends on the specifics).

(I was going to try to describe the sort of query where I see a normalized schema winning over a denormalized schema, but right now I can't seem to write a simple explanation of the intuitively obvious picture in my head.)

Increasing performance by denormalizing your schema is a straightforward optimization tradeoff: you're making the common case of queries work faster at the expense of uncommon updates requiring more code and probably more database activity. Hopefully you have actually measured and know what is common and uncommon in your environment (and what's fast and slow).

At the same time, extensive normalization can still be overkill; it can create more little tables that you have to keep track of and make your SQL expressions more complicated (even if they run just as fast) because you are now joining over three different tables instead of having the data right there in one. At many database sizes and many database modification rates (especially for a typical small scale web application), the extra work that not having normalized data creates is basically in the noise while the annoyance of a neurotically normalized schema is quite visible.

(I tend to design normalized schema because it fits the way I usually think and like to work. But I'm both lazy and crazy, I really hate repeating myself, and I started out with SQL databases at the raw SQL and SQL schema layer instead of through an ORM.)

Written on 02 June 2011.
« The programmer's problem with WikiText systems
Ints, __slots__, and Python 3 »

Page tools: View Source, Add Comment.
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Thu Jun 2 01:11:43 2011
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.