My understanding of SQL normalization

June 2, 2011

In SQLHairshirtIssue, I brought up SQL normalization in passing in a way that may have left people thinking that it was done for performance. My understanding is that this is not the case, or at least that performance is not why you normalize.

In short, normalization is not about performance; normalization is about not making your life unnecessarily painful later. To simplify slightly, normalization is more or less about not repeating yourself (with a sideline of not making extra work for yourself); each piece of information appears in only one entry in one table, instead of being repeated in multiple entries. This ensures that if you need to add, change, or delete something, you can do it in only one place and be sure that you've made a complete change. Schemas that are not normalized require you to do more work when you change things, among other drawbacks.
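
To make this concrete, here is a minimal sketch of the difference for a hypothetical orders database (all of the table and column names here are invented for illustration):

    -- Denormalized: one kitchen-sink table where the customer's name and
    -- address are repeated in every order row.
    CREATE TABLE orders_denorm (
        order_id      INTEGER PRIMARY KEY,
        customer_name TEXT,
        customer_addr TEXT,
        item          TEXT,
        quantity      INTEGER
    );

    -- Normalized: the name and address appear in exactly one row of one
    -- table, and orders refer to that row by ID.
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT,
        addr        TEXT
    );

    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(customer_id),
        item        TEXT,
        quantity    INTEGER
    );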

Because normalization creates more tables (by splitting large kitchen-sink records up), queries against normalized schemas normally join across more tables. This may or may not have a performance impact; which way it goes depends very much on how smart your database engine is and what sort of query you're doing. However, since you're joining across more tables, I would not be surprised if normalized schemas often performed worse, since you now need to do more disk IO to retrieve the same data (again, this somewhat depends on the specifics).
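
Using the hypothetical tables from the sketch above, the difference in query shape looks like this:

    -- Normalized schema: fetching an order along with the customer's
    -- address means joining two tables.
    SELECT o.order_id, o.item, c.name, c.addr
      FROM orders o
      JOIN customers c ON c.customer_id = o.customer_id
     WHERE o.order_id = 42;

    -- Denormalized schema: the same information comes out of a single
    -- table with no join.
    SELECT order_id, item, customer_name, customer_addr
      FROM orders_denorm
     WHERE order_id = 42;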

(I was going to try to describe the sort of query where I see a normalized schema winning over a denormalized schema, but right now I can't seem to write a simple explanation of the intuitively obvious picture in my head.)

Increasing performance by denormalizing your schema is a straightforward optimization tradeoff: you're making the common case of queries work faster at the expense of uncommon updates requiring more code and probably more database activity. Hopefully you have actually measured and know what is common and uncommon in your environment (and what's fast and slow).
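
Again using the hypothetical tables from earlier, the update side of this tradeoff looks like this:

    -- Normalized: an address change is a single-row update in one place.
    UPDATE customers
       SET addr = '123 New Street'
     WHERE customer_id = 7;

    -- Denormalized: the same change has to find and rewrite every order
    -- row that repeats the address, and you have to get the matching
    -- condition right everywhere the data was duplicated.
    UPDATE orders_denorm
       SET customer_addr = '123 New Street'
     WHERE customer_name = 'Jan Example';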

At the same time, extensive normalization can still be overkill; it can create more little tables that you have to keep track of, and it makes your SQL expressions more complicated (even if they run just as fast) because you are now joining over three different tables instead of having the data right there in one. At many database sizes and many database modification rates (especially for a typical small-scale web application), the extra work created by not having normalized data is basically in the noise, while the annoyance of a neurotically normalized schema is quite visible.
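
As an invented example of this, suppose that addresses are split out into their own table and cities into yet another; now even "show me this customer's city" is a three-table join:

    SELECT c.name, a.street, ci.city_name
      FROM customers c
      JOIN addresses a ON a.customer_id = c.customer_id
      JOIN cities ci   ON ci.city_id = a.city_id
     WHERE c.customer_id = 7;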

(I tend to design normalized schemas because it fits the way I usually think and like to work. But I'm both lazy and crazy: I really hate repeating myself, and I started out with SQL databases at the raw SQL and SQL schema layer instead of through an ORM.)


Comments on this page:

From 70.31.31.32 at 2011-06-02 20:12:15:

(I was going to try to describe the sort of query where I see a normalized schema winning over a denormalized schema, but right now I can't seem to write a simple explanation of the intuitively obvious picture in my head.)

You have your customer order (history) and shipping information in a different place than your customer billing information.

When someone logs in with a user name, their ID is looked up and used for the duration of the session. When they finally want to do a check-out, you only hit the billing information (with people's credit cards) at the very end.

This also helps if you want to do testing: you can grab the information from your PROD machines (all of it, or a subset), copy it over to DEV/QA/STAGE/whatever, and know you're pretty close to recreating your PROD environment. Credit card companies are very particular about where the CC information goes, so if your DEV database instances only have anonymous "customerID" columns, then the security of your DEV machines isn't as big of a deal. You can (say) just grab the "customerID" stuff and randomly generate data for the rest of the fields (a la Perl's Data::{Generate,Maker,Faker} modules).
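
(A hypothetical sketch of what that copy might look like, with made-up schema-qualified table names; the exact mechanics of moving data between databases vary by engine:)

    -- Copy the order history, which only carries anonymous customer IDs,
    -- from the production schema into the development schema.
    INSERT INTO dev.orders
        SELECT * FROM prod.orders;

    -- Instead of copying the real billing/contact rows, generate fake
    -- customers for whatever IDs appear in the copied orders.
    INSERT INTO dev.customers (customer_id, name, addr)
        SELECT DISTINCT customer_id, 'Fake Name', 'Fake Address'
          FROM dev.orders;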

Similarly, you wouldn't copy over the contact and shipping tables/databases either, as they can be classified as "personal data", which has legal protections in many places (including Canada, AFAIK).

Similar things are probably done with health-care related stuff: most records only have a "patientID" column that ties them together, and there's one database/table that has their contact information (which is protected like Fort Knox because of PHIPA in Canada and HIPAA in the US).

