In two previous blog posts I touched on the subject of Data Warehousing and how and why it differs from the sort of Relational Database Management Systems (RDBMS) that we are now all so used to using. In the comments on one of those posts was a request to expand a bit more on the subject of “Normalizing” data. Normalizing data is at the root of RDBMS database structures. In my opinion it is far more art than science, and there is rarely a 100% correct way to normalize data.
Over the years I have found, to my eventual pain, that over-normalizing data leads to diabolical SQL statements whenever there is a need to report on that data. So then, what is normalizing, why is it done, and do you really care? I found a very interesting post by some engineers who were around at IBM when the whole subject of normalizing data and RDBMS was first tossed around. This was back in the 1980s, and I had always thought that normalization emerged because of a perceived need to reduce data duplication, like this.
If you have 1,000 employees in the same location and you want each to use the location address, why have it in an employee table 1,000 times? Create an “employee” table and a “location” table, give each employee a unique identifier and each location a unique identifier, and “relate” the two together (the relating bit is where the Relational Database Management System concept comes in). At a basic level you could put a locationUID column in the employee table and simply insert the UID from the location table in that column. The great thing then is that any address change need only be made once, in the location table, and applies to all 1,000 employees immediately.
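A minimal sketch of that idea, using Python's built-in sqlite3 module; the table and column names (employee, location, locationUID) follow the example above, and the sample data is made up for illustration:

```python
import sqlite3

# In-memory database: a "location" table and an "employee" table,
# related through a locationUID column, as described above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE location (
        locationUID INTEGER PRIMARY KEY,
        address     TEXT NOT NULL
    );
    CREATE TABLE employee (
        employeeUID INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        locationUID INTEGER REFERENCES location(locationUID)
    );
""")

# One location row, shared by every employee at that site.
conn.execute("INSERT INTO location VALUES (1, '10 Main St')")
conn.executemany(
    "INSERT INTO employee (name, locationUID) VALUES (?, 1)",
    [("Alice",), ("Bob",)],
)

# One UPDATE on the single location row changes the address
# for every employee related to it -- no per-employee edits.
conn.execute("UPDATE location SET address = '99 New Ave' WHERE locationUID = 1")

# A join reassembles the denormalized view for reporting.
rows = conn.execute("""
    SELECT e.name, l.address
    FROM employee e
    JOIN location l ON e.locationUID = l.locationUID
    ORDER BY e.employeeUID
""").fetchall()
print(rows)  # both employees now show the updated address
```

With 1,000 employees instead of two, the same single UPDATE would still do the job, which is exactly the duplication-removal argument.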
Despite this obvious advantage, it turns out the major driver behind RDBMS was more a response to databases getting larger and a need to minimize resource use (CPU cycles, memory, etc.), as both were significantly more expensive back in the 80s. One other key thing that has changed dramatically since the 80s is typical database size, in terms of numbers of tables and columns.
At that time data was often entered via punch cards; an 80-column punched card averaging 20 fields per record was typical. In today's RDBMS it is not uncommon to see hundreds of columns across thousands of tables in a single database. It is this which often necessitates the diabolical SQL, with gratuitous levels of joins, needed to run even simple reports. Hence my evolving series on the need for Data Warehousing…
Data Warehousing Part 1 – OLAP and OLTP
Data Warehousing Part 2 – Dimensional Modeling
Comments on: "Ab Normal-ized Databases – An Historical Thought" (3)
“Normalizing data is at the root of RDBMS database structures. In my opinion it is far more art than science and there is rarely a 100% correct way to normalize data.”
I think I know what you’re trying to say, but to be pedantic, I’d like to point out that DB normalization is in fact science. It all comes down to duplication of data, or the attempt to remove it.
As far as practicality is concerned – I agree, there is no 100% best practice. You need to design for particular needs, and sometimes that involves higher or lower normal forms, dependent on the need (sometimes duplication means faster queries).
This article on Wikipedia sums it up fine: http://en.wikipedia.org/wiki/Database_normalization
I’d also read the related articles on each order of normal form if you’re interested in the theory behind it.
Sammy, thanks for taking the time to comment, and I do not disagree with your thoughts; my feeling that normalization is more of an art is drawn from my own experience and opinions. I have seen so many DB designs, and few if any have managed to avoid duplication without complicating reporting unduly. Much of that has come from DBs growing organically beyond anything envisioned when they were first created.
Mike – I know. =) Like I said, I was just being pedantic =)