In two previous blog posts I had touched on the subject of Data Warehousing and how and why that differs from the sort of Relational Database Management Systems (RDBMs) that we are now all so used to using. In the comments I got from one of those posts was a request to expand a bit more on the subject of “Normalizing” data. Normalizing data is at to root of RDBMs database structures. In my opinion it is far more art than science and there is rarely a 100% correct way to normalize data.
Over the years I have found to my eventual pain that over-normalizing data leads to diabolical SQL statements whenever there is a need to report on that data. So then, what is normalizing, why is it done and do you really care? I found a very interesting post by some IBM engineers who were around at IBM when the whole subject of normalizing data and RDBM’s was first tossed around. This was back in the 1980’s and I had always thought that this emerged because of a perceived need to reduce data duplication, like this.
If you have 1,000 employees in the same location and you want each to use the location address why have it in an employee table 1,000 times? Create an “employee” table and a “location” table, give each employee a unique identifier and each location a unique identifier and “relate” the two together (the relating bit is where Relational Database Management System concept comes in). At a basic level you could put a locationUID column in the employee table and simply insert the UID from the location table in that column. The great thing then is that any address changes need only be made once, in the location table and can apply to all 1,000 employees immediately.
Despite this obvious advantage it turns out the major reason for RDBM’s was more in response to databases getting larger and a need to minimize or reduce resource use (CPU cycles, memory etc.) as both were significantly and relatively more expensive at that time back in the 80’s. One other key set of issues which changed dramatically since the 80’s is the typical database size in terms of numbers of columns etc.
At that time data was often entered via punch cards, an 80 column punched card averaging 20 fields per record was typical. In today’s RDBMs it is not uncommon to see hundreds of columns in thousands of tables in a single RDBMs database. It is this which often necessitates the diabolical SQL with over gratuitous levels of joins in order to run simple reports. Hence my evolving series on the needs for Data Warehousing… Data Warehousing Part 1 OLAP and OLTP Data Warehousing Part 2 Dimensional Modeling