Search the site

From the Bookstore:

The Data Governance ImperativeThe Data Governance Imperative

Steve Sarsfield

Executing Data Quality Projects: Ten Steps to Quality Data and Trusted Information (TM)Executing Data Quality Projects

Danette McGilvray

 

Data Driven

Tom Redman

 

Data Quality Assessment Data Quality Assessment

Arkady Maydanchik

 

 

Education partners
sponsors

  

 

« Data Quality Blog Roundup - October 2009 | Main | Profit by Data Quality Best Practices »
Tuesday
03Nov2009

5th Normal Form: The Achilles Heel of ETL & Data Warehouse Projects

imageThe lack of awareness of what fifth normal form is (or that it actually exists) is a major threat to data quality and integrity in all organisations.

The highest threat exists in those organisations carrying out Data Warehousing (DW) or Extract, Transform, Load (ETL) Projects.

In this post, John Owens, creator of the Integrated Modelling Method and an expert in business systems and process analysis provides a detailed account of 5th Normal Form, how it is breached and how to ensure data quality and integrity is maintained.

5th Normal Form: The Achilles Heel of ETL & Data Warehouse Projects

What is 5th Normal Form?

I first started talking to people about 5NF some years ago when lecturing for Oracle on relational systems design. The definition that was available at the time included phrases like "relations that obey no symmetric constraint", etc. and made very little sense.

I have searched the Internet and have been still been unable to find a good definition anywhere. On the contrary, many of the definitions I have found are completely wrong because, what they define as 5NF, is in fact a breach of it!

I find that Fifth Normal Form is best explained by an example of how it is breached.

 

An Example

Imagine the following is a table that lists manufacturers, the products they produce and the retailers that sell these products:

image 

Over enthusiastic data analysts could look at this table and say: "There is redundancy here. We should normalise this table further.”

They would then break the table down into three separate tables like these.

image

image

 image

Redundancy has been removed, so problem solved? Far from it!

If you now do a select * joining the three tables the query will return the rows below:

image

The row marked with the arrow is a spurious row. In the original table Debenhams did sell irons but only those made by Morphy Richards. In their drive to remove redundancy, the analysts have destroyed the integrity of the data.

This example shows that the original three-column table conforms to 5NF whereas the three normalised tables breach it.

 

How Do Breaches of 5NF Arise?

The most common ways in which 5NF is breached are:

  • Normalising tables in databases without first creating the normalised structures in an ERD
  • Mechanistic and simplistic modelling in the Corporate Data Model
  • Importing data from disparate databases into a single database without first modelling the required normalised structures in the Corporate Data Model
Normalising Tables in Databases

This error has been described above in the erroneous normalisation carried out on the table with the columns Manufacturer, Product and Retailer.

Simplistic Data Modelling
The most common cause of this is the mechanistic resolution of many-to-may relationships in a data model.

An example will once again help to explain this.

The following diagram shows three many-to-many relationships linking the Data Entities Manufacturer, Product and Retailer:

image

One rule of data modelling is that many-to-may relationships are resolved by introducing intersection entities. Applying this rule mechanistically and without real analysis of what the above data structure really represents would give us the following structure:

image

But does it breach 5NF?

One simple way to find out is to trace a route from any entity and see if you can traverse the other entities and get back to where you started. Doing this for the original structure and for the resolved structure would give us the following:

image

image

In both cases we can trace a route back to where we started. This is an indication that the structure is breaching 5NF.

But what have we done wrong? We have followed the basic rules for resolving many-to-may relationships. How can the solution be wrong?

We have made two basic errors:

  1. The first of these is that we have mechanistically resolved the many-to-may relationship without asking, “Do we need to know about this relationship and, if so, what information do we need to hold about it?” All intersection entities represent information about a many-to-may relationship so, unless we need to hold information about a many-to-may relationship, we should not resolve it.
  2. The second, yet more crucial error is that we have created data structures in isolation from the Function Model. The key rule for quality data modelling is that the only data that should be included in a Corporate Data Model (and in any resulting database) is that required to support the Business Functions of the enterprise.

From this it is obvious that, in order to know what data we need to hold and what the structure of that data should be, we must first model the Business Functions.

 

The Function Solves the Problem

The three original Data Entities of Supplier, Product and Retailer could be valid in an electrical goods distribution business.

So what Functions in such a business would create a data structure that would bring these three Data Entities together?

The Marketing Department might want to establish those products and manufacturers that would be preferred by retailers with whom the company would like to do business.

Let’s call this Business Function “Establish Potential Customer Preferences”.

The data structure that would support this Function would look like this.

image

If we convert this data structure into a database, then the Data Entity “Retailer Preferences” would be implemented as a table that had columns called Retailer, Product and Manufacturer.

This is exactly like the table we started with in our original example.

So by driving the data structures through the Business Functions we automatically arrive at a properly normalised structure.

 

Merging Data from Disparate Databases

We have seen from the above example how 5NF can be breached by modelling errors. In Data Warehousing and ETL projects it can inadvertently be breached by bringing together data structures that in their original environment complied with 5NF, but when brought together, do not.

It is quite possible that an enterprise has an application with a table in its database with the columns Manufacturer and Product, another application with a table that has Manufacturer and Retailer and a third application with a table with Retailer and Product.

These tables in their own separate worlds will cause no problems to the business and will return correct results query after query.

But bring them all together into a single database as three separate tables (as they existed in their original applications) and the business has a major data quality and integrity problem. 

Summary

Fifth Normal Form is not a myth. It is essential to data integrity in all relational databases and in data warehouses that derive their data from relational databases.

Breaches are caused by:

  1. Mechanistic modelling in the Corporate Data Model.
  2. Fragmenting data tables under the guise of “normalising” them.
  3. Importing tables from disparate databases into a single database without first modelling and normalising the required structures in the Corporate Data Model.

Breaches can be avoided by adhering to following fundamental rules:

  1. This is the most essential rule: Only ever model data and data structures required to support the Business Functions of the enterprise.
  2. Never do data normalisation in a database. Always model the normalised structure in the Corporate Data Model first.
  3. Only merge data after you have modelled the required structure in the Corporate Data Model and have then created the appropriate structures in the database receiving the imported data.

 

Did you find this article useful? If so please share on twitter:

http://bit.ly/1lVjBw Read 5th Normal Form: The Achilles Heel of ETL & Data Warehouse Projects, by @JohnIMM on @dataqualitypro #dataquality

Useful Resources

Integrated Modelling Method: An Introduction

Business Systems Modelling: Function Modelling (Tutorial 1)

Business Systems Modelling: Data Structure Modelling (Tutorial 2)

IMM Website

IMM Approach

Business Function Modelling

Data Structure Modelling

Business Process Modelling

Information Flow Modelling

 

Author Profile

John Owens is passionate about bringing simplicity, power and elegance to the world of Business Systems Analysis, Business Process Modelling and BPM. He is an international consultant and mentor to a wide range of enterprises of all sizes in the UK, Ireland, Europe and New Zealand. He has put all of this knowledge into a set of books and the Integrated Modelling Method (IMM™) which is available at his website www.integrated-modeling-method.com. John is based in New Zealand and provides mentoring to enterprises of all sizes, from start-ups to large corporations, to aid them improve their business and increase their cash flow.

Reader Comments (8)

Have I missed something here? The example is too simplistic. How was the original table derived in the first place? Data cannot go to Fifth Normal Form until all preceding forms are satisfied. Where are the unique keys? If the source table had been normalised correctly to start with it would probably just contain ManufacturerID, ProductID and RetailerID. In that case the 2 Irons would clearly have different product ID's.
If the table is as it came then the product codes are nonsense. "Iron" clearly cannot be made by 2 different manufactures, the product has to be keyed uniquely to identify the Irons (first normal form). "Iron" in the example is a description and not a ProductID.

November 5, 2009 | Unregistered CommenterPhilA

Hi,

I enjoyed the example and accepting the possibility that I may be wrong I believe the example is overly simplified. I understand that a bad 5 NF form can create a flawed database design but what is the biggest inconvenience of a breach of 5NF form that can be translated into a compelling business case? I have a db design that may or may not emulate this breach of 5NF and I may do a feasibility study if I find a compelling business case.

November 9, 2009 | Unregistered CommenterSid

PhilA

The example has been kept deliberately simple in order to clearly demonstrate the point. It is far from simplistic as it clearly shows how a major breach of data integrity has been created. If one wanted to impress, one could think up a far more complicated example. However, the outcome would be the same; the overnormalised tables breach 5NF.

Only in the archaic reverse engineering technique of Relational Data Analysis (RDA) could it be said that "data cannot go to Fifth Normal Form until all preceding forms are satisfied". This is not the case when using techniques such as Entity Relationship Diagramming (ERD), as used in the Integrated Modelling Method. Here only First Normal Form needs to be specially addressed. All normal forms are satisfied by default, and simultaneously, by the application of a few basic rules.

The original table in the example was properly normalised and each column contains the unique identifier of a Product Type, a Manufacturer and a Retailer. You could change the unique identifiers for QUACKs (see below), such as PR9763, MFR998 and RET5667 but the result would still be the same. When the analyst makes the mistake of overnormalising this three column table, queries joining the the three tables resulting will return spurious rows.

An Iron is indeed a type of product that can be made by many manufacturers.

Your statement that "a product has to be keyed uniquely" confuses analysis with design. The unique identifier of any data entity (including product) is never a key or a code. It is always one or more meaningful attributes or a combination of attributes and relationships. (See more on QUACKs) below.

First Normal Form has nothing to do with keys. It is all about removing "repeating groups". In fact, no properly constructed Logical Data Model will have keys anywhere. Keys are design artifacts, not analysis artifacts.

The unique identifier of a product is always a description - never a code!!

QUACKs (Quack Unique Alternative Codes or Keys) are commonly mistaken for unique identifiers. This mistake is most commonly made by analysts who erroneously do their data modelling in the physical world of the database as opposed to the logical world of the data model.

For more information on QUACKs Keys and UIDs see the article at http://www.integrated-modeling-method.com/data-modeling/data-modelling-the-power-of-quacks-and-uids

Pick up a Free rapid Guide to Data Structure Modeling at http://www.integrated-modeling-method.com/imm-bpm-business-process-modeling-store

November 10, 2009 | Unregistered CommenterJohn Owens

Sid

As you will see from my response to PhilA, the answer is definitely not over simplified, rather deliberately kept simple to clearly demonstrate a major breach of data integrity. I used an equally simple example when lecturing on Relational Systems Design for Oracle UK. All Oracle database designers attend this course. None of them found it simplistic nor did the Oracle Corporation, who built it into the course.

A breach of 5NF is not an inconvenience. It is a major business problem. All reports from the database that use the tables in question are suspect and it is not possible to tell which rows are responsible.

If you can generate one spurious recored from 3 tables with just 3 rows in each then how many spurious rows could you generate from tables with 300, 30,000 or 3 million rows?

The business case is that having such a structure in a database destroys the integrity for the data concerned. Management cannot rely on any queries that join these tables. It represents a significant commercial risk to the business.

I hope that this helps.

November 10, 2009 | Unregistered CommenterJohn Owens

I think there is a serious problem with your exposition. I agree fully that 5NF isn't known widely enough, and that this is a case of overzealous decomposition, and that the example is connected with lack of knowledge of higher normalization. But your example and terminology are seriously off.

I mean, both the original table and the "overly normalized" ones are in fact in 5NF; all of the join dependencies are implied by key constraints. The problem here is not as much 5NF, as the fact that the original table cannot be further non-loss decomposed. Were it so that the decomposition in the example actually was non-loss, this would represent the standard, simplest example of a three column, all-key table not being in 5NF, and actually requiring extra decomposition as a result.

The real beef with normal forms is knowing a) when you don't have one so that you have to decompose further (the standard example), b) knowing when to stop (your example), c) knowing what you lose when you go far enough (e.g. non-loss decompositions can still be non-dependency preserving, and in general are beyond 3NF), and d) particularly in the case of 5NF, understanding that there are decompositions which cannot be carried out piecemeal (i.e. if you try splitting off one of the three tables first, the decomposition will be lossy, whereas it will again be non-loss if you split three ways in a single step; this is really the defining characteristic of 5NF as opposed to 4NF).

November 20, 2009 | Unregistered CommenterSampo Syreeni

Hi Sampo

To say "... both the original table and the "overly normalized" ones are in fact in 5NF " is a contradiction. A data structure cannot be in breach of a normal form and comply with it at the same time.

The statement "...whereas it will again be non-loss if you split three ways in a single step; " is not true. It is precisely this action that caused the breach of 5NF.

The terminology is spot on. The example is valid and powerful.

Many perceive 5NF as something so mysterious that it cannot be understood by mere mortals. Others would try to make it so. But it is essentially simple and can be simply explained. No need for complexity or jargon.

The major problem that many people have with this example is that it is so simple, yet so powerful. Surely something so "complex" as 5NF cannot be so simply explained?

The major point being made by this example is that breaching 5NF in this way is NOT a mere "academic gaffe" but a serious breach of data integrity within a database. Trying to reconstruct the relationships that were represented by the original table will always introduce spurious rows and it will not be possible to determine which these are.

A note on terminology: Decomposition is not the the means by which a data structure is moved from one normal form to another. First Normal Form is not "decomposed" into Second Normal Form, and so on. Decomposition is a technique used to break down hierarchical structures into greater detail. The structures that we are dealing with here are relational.

November 26, 2009 | Unregistered CommenterJohn Owens

Hi Jon,
While I understand what you are getting to from the 5th normal form, and I understand your logic, this problem seems to me to be better solved from a dimensional modelling perspective than a normal-form perspective.

To model the first table, you are asking the question 'which products are sold at which stores'. In the dimensional notation (which depends on fully normalised fact tables but ends up at them in a specifically formal way) this would be a 'factless' fact table.

(BTW You would also want to model for time in there - either as from and to, or period sold)

Using the dimensional method there would be no chance that you end up with the wrong hybrid that appears while stepping through the normal forms.

Have a think about it - is a factless fact table always the same as a 5nf normalised table?

Happy to debate and to translate between the different terminologies (as that is all they are)

November 30, 2009 | Registered CommenterNathan Jones

"To say "... both the original table and the "overly normalized" ones are in fact in 5NF " is contradiction."

It is not. I quote "overly normalized" because you haven't actually normalized there, only lossily decomposed. Two different things. Both before and after the table is in 5NF, but in the process you have lost other valuable things, like dependency-preservation and lossless recomposition.

"The statement "...whereas it will again be non-loss if you split three ways in a single step; " is not true. It is precisely this action that caused the breach of 5NF."

In your case. I was talking about the standard example. You really should amp up on your basic theory. Start with one of a) Fagin's paper which first formally defined PJ/NF (the original, strongest, and purest of the three separate normal forms which are today called 5NF), or b) in fact either of Codd's first papers on the relational model, where he already mentions as an example a situation where a two-way lossless decomposition wrt a natural join is not possible, while a three-way one is.

"The terminology is spot on. The example is valid and powerful."

The example is powerful, in that it shows that decomposition shouldn't be carried too far. At the same time it muddles up the issue of why, because the terminology is non-standard. You're not talking about a violation of 5NF, but the violation of going beyond it and in the process breaking losslessness.

"Many perceive 5NF as something so mysterious that it cannot be understood by mere mortals."

It isn't too difficult, true. As long as you don't decompose unless you have to, but have to when any "multivaluedness stuff" is about to happen otherwise, you'll end up with 5NF. But the fact thta somebody like you would go so far amiss does make me worry.

"The major point being made by this example is that breaching 5NF in this way is NOT a mere "academic gaffe" but a serious breach of data integrity within a database."

It is one, either way. But right now you're talking about the more serious gaffe, namely a lossy decomposition. Not going downto 5NF, from say 3NF, is much less bad than that.

"Trying to reconstruct the relationships that were represented by the original table will always introduce spurious rows and it will not be possible to determine which these are."

As I said, your decomposition is lossy, and now you say as much. A table schema not being in normal form never causes this kind of anomaly. It only makes it difficult to update the data while upholding the dependencies/constraints which apply.

At the same time, I have had to design a database where this precise, simplest kind of not-two-but-necessarily-three-way decomposition had to be made, in order to bring the database into 5NF, whereas it would have been only in 4NF before. When this happens, you don't lose anything, but you gain easy updatability and constraint maintenance.

"Decomposition is not the the means by which a data structure is moved from one normal form to another."

Oh yes it is. More precisely, every single normal form from 2NF to 6NF, DK/NF, and many other, more esoteric ones, is the result of decomposition with respect to natural join, that is, the result of projecting on (usually) overlapping sets of columns. Only 1NF is different in the normal cascade, because any data before that is not strictly speaking relational at all.

"Decomposition is a technique used to break down hierarchical structures into greater detail. The structures that we are dealing with here are relational."

Okay, let's entertain that terminology, then... Relations are well able to represent hierarchical structures, and that is precisely what happens when we go from whatever to 1NF. You only have to think about sets of attributes as defining a type, and then of the sub and supertype relations between them. Suddenly both 2NF and 3NF translate into ways of "dealing with the hierarchical", by breaking the type semilattice into multiple separate relations.

December 3, 2009 | Unregistered CommenterSampo Syreeni

PostPost a New Comment

Enter your information below to add a new comment.

My response is on my own website »
Author Email (optional):
Author URL (optional):
Post:
 
Some HTML allowed: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <code> <em> <i> <strike> <strong>