Necessity of Conceptual Data Modeling for Information Quality by Pete Stiglich

Data Quality

Aug 18

The ability to correctly identify the business entities and to model the relationships between them is pivotal to good information quality. Most organisations opt for physical modeling and create application-specific schemas that often lack the high-level vision for how the business really needs to utilise its data.

In this post, Pete Stiglich, senior consultant at EWSolutions, presents an excellent account of the importance of Conceptual Data Modeling in ensuring information quality.

Why not use his advice to benchmark your current modeling approach and identify areas for future improvement?

Definition of Information Quality

A short definition of information quality is “the fitness of information for an intended use”. While true, this definition is obvious but not particularly useful. To understand what information quality is, let’s first define what information is.

I define information as high quality data that is well defined (has formalized meta data – business and technical) and is understood in the proper context with other pieces of data.

Another way to express this is:

Information = Data + Meta Data + Data Quality + Data Context.

The degree to which your data is defined, is of high quality, and is well understood in context with other pieces of data determines “the fitness of information”.

It is impossible to have good data quality without understanding what the data element is supposed to represent, what the valid domain values are, etc. – in other words, meta data.

I will be focusing on the last part of the equation – Data Context.

Understanding the proper data context – how data relates to other data – should initially fall under the domain of a Conceptual Data Model.

What do we mean by Data Context?

Data Context can be likened to a puzzle in which each piece is a data entity (a concept or thing of importance to your business about which you want to collect data). To get the pieces to fit together, you need to understand the proper relationship of each piece to the other pieces.

Like my 4-year-old, you can try to force the connectors to make a piece fit, but in the end you do not get the right picture – even if it “fits”.

It is often much the same way in our information systems – they may seem to fit together well, but sometimes the data in our solutions does not relate at the level required by the business.

For example, a major bank did not know how many customers it had because the data was always stored at the account level. Identifying information was available only for the first customer on the account. For any additional customers, only the names were captured, and then as a repeating group in a single column.

Another minor issue caused by this was that they could not always tell how much a customer owed the bank without a lot of manual effort.
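
To make the bank example concrete, here is a minimal sketch of what becomes possible once Customer is treated as an entity in its own right, related many-to-many to Account. The identifiers, balances, and function names are hypothetical and not taken from the bank in question. The sketch also assumes that a customer's exposure is simply the sum of the balances on every account they are a party to, which a real bank would need to define more carefully for joint accounts.

```python
# Hypothetical data: balances are stored per account, and a separate
# relationship records which customers are parties to which accounts.
account_balances = {"A1": 500.0, "A2": 1200.0, "A3": 300.0}
account_holders = [
    ("C1", "A1"),
    ("C2", "A1"),  # joint account: two customers on A1
    ("C1", "A2"),
    ("C3", "A3"),
]


def count_customers(holders):
    """Count distinct customers, regardless of how many accounts they hold."""
    return len({customer_id for customer_id, _ in holders})


def total_balance_for(customer_id, holders, balances):
    """Sum the balances of every account the customer is a party to."""
    return sum(
        balances[account_id]
        for holder_id, account_id in holders
        if holder_id == customer_id
    )


customer_count = count_customers(account_holders)
c1_total = total_balance_for("C1", account_holders, account_balances)
```

With only account-level rows and a free-text list of additional names, neither question can be answered without manual matching.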

A Conceptual Data Model is the picture on the puzzle box that provides the vision of what the information puzzle should look like at the end of the day, regardless of whether your solution is a Data Warehouse, SOA, ERP, Master Data Management, or OLTP.

Note that the information puzzle picture will probably look significantly different from the solution Logical and Physical Data Models, but in the end, the information in our systems should accurately reflect the Conceptual Data Model.

The Conceptual Data Model is our star chart to keep us true to our goal.

What is a Conceptual Data Model?

A Conceptual Data Model is a diagram identifying the business concepts (entities) and the relationships between these concepts in order to gain, reflect, and document understanding of the organization’s business, from a data perspective.

“It shows how the business world sees information. It suppresses non-critical details in order to emphasize business rules and user objects. It typically includes only significant entities which have business meaning, along with their relationships.”

— Applied Information Science website

A Conceptual Data Model usually takes the form of an Entity Relationship Diagram (ERD) or Object Role Model (ORM). The Conceptual Data Model typically does not contain attributes or, if it does, only significant ones.

The Conceptual Data Model is technology and application independent.

The Conceptual Data Model should reflect relationships from a historical, longitudinal perspective.

For example, a relationship between Store and Employee may usually be considered one-to-many, but when viewed from a historical perspective the relationship may actually be many-to-many – what if the Employee begins work at another Store?
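
The difference between the point-in-time view and the historical view can be sketched as follows. The assignment records, dates, and the `stores_for` helper are illustrative assumptions, not part of any particular system.

```python
from datetime import date

# Hypothetical assignment history: (employee, store, start, end).
# An end date of None means the assignment is still current.
assignments = [
    ("emp1", "storeA", date(2020, 1, 1), date(2021, 6, 30)),
    ("emp1", "storeB", date(2021, 7, 1), None),
    ("emp2", "storeA", date(2020, 3, 1), None),
]


def stores_for(employee, on=None):
    """Return the stores an employee is linked to.

    With `on` set, only assignments active on that date are counted.
    With `on` left as None, the full history is considered.
    """
    result = set()
    for emp, store, start, end in assignments:
        if emp != employee:
            continue
        if on is None or (start <= on and (end is None or on <= end)):
            result.add(store)
    return result


current_stores = stores_for("emp1", on=date(2022, 1, 1))
historical_stores = stores_for("emp1")
```

On any single date each employee here maps to one store, yet across the whole history the same employee maps to two, which is the many-to-many relationship the conceptual model should record.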

In addition, Conceptual Data Models should reflect both as-is and to-be in the same model. To-be here does not mean what the data entities/relationships will be in the solution to be developed; rather, to-be here represents changes to the business that are currently in the works, or are considered highly probable.

Make sure the distinction is clearly marked on the model when an entity or relationship is “to-be” – perhaps use a different color for this. Capturing “to-be” entities/relationships allows for the possibility of designing flexibility (within reason) into your solution.

What are the problems encountered by the lack of a Conceptual Data Model?

Without a Conceptual Data Model, you might stumble through and uncover most of the relationships required, but it is very easy to miss a big picture relationship when you are down in the details of developing a new system and burning the midnight oil to deliver it.

At a pet supply company, the relationship between Customer and Pet was not properly identified – a one-to-many relationship between Customer and Pet was modeled in the physical model (there was no Conceptual or Logical Model…) and the CDI (Customer Data Integration) hub went into production.

The problem was that in reality, there was a many-to-many relationship between these two entities.

Each member of a family could be considered a Customer and so could be considered an owner of the Pet.

Whenever a customer, other than the original customer associated with the pet, brought that pet in for services, the system required a duplicate Pet record to be created.

There was a non-identifying relationship from Customer to Pet. In this model, a Pet record could be associated with only one Customer.

The result was complex and time-consuming processing to deduplicate the data, as well as a loss of confidence by the business in this system.
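
A small sketch shows how the one-owner-per-pet model produces duplicates, and how recording ownership as a separate relationship avoids them. The function names and data are hypothetical, and identifying a pet by its name is a simplification made only to keep the example short.

```python
def visit_one_owner_model(pets, pet_name, customer_id):
    """One-to-many model: each Pet row carries exactly one owner.

    A visit by a different family member finds no matching row,
    so a second row for the same animal has to be created.
    """
    for pet in pets:
        if pet["name"] == pet_name and pet["owner"] == customer_id:
            return pet["pet_id"]
    pet_id = len(pets) + 1
    pets.append({"pet_id": pet_id, "name": pet_name, "owner": customer_id})
    return pet_id


def visit_shared_owner_model(pets, ownership, pet_name, customer_id):
    """Many-to-many model: ownership is kept as (customer, pet) pairs."""
    pet_id = pets.setdefault(pet_name, len(pets) + 1)
    ownership.add((customer_id, pet_id))
    return pet_id


# The same two visits, recorded under each model.
flat_pets = []
visit_one_owner_model(flat_pets, "Rex", "mom")
visit_one_owner_model(flat_pets, "Rex", "dad")

shared_pets, ownership = {}, set()
visit_shared_owner_model(shared_pets, ownership, "Rex", "mom")
visit_shared_owner_model(shared_pets, ownership, "Rex", "dad")
```

After both visits, `flat_pets` holds two rows for one animal, while `shared_pets` holds a single pet with two ownership pairs.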

In your Conceptual Data Model, you should also take advantage of cardinality notation to accurately describe relationships.

A Conceptual Data Model is a key source of business rules that can be identified by the cardinality notated on your model.

Be sure to differentiate between identifying and non-identifying relationships in the Conceptual Data Model – this is a critical distinction.

An identifying relationship indicates that the child entity does not make sense apart from a relationship to the parent entity.

For example, an Order Line entity does not exist apart from a relationship to an Order Header.

Identifying relationships determine the granularity of the child entity and are a key method of defining the meaning of the entity.

In a logical model or physical model, an identifying relationship results in the foreign key being a component of the primary key of the child entity/table.
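
The Order Header and Order Line example can be sketched at the physical level with an in-memory SQLite database. The table and column names are assumptions chosen for illustration. Note that SQLite only enforces foreign keys when the `foreign_keys` pragma is switched on for the connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforcement is off by default
conn.executescript(
    """
    CREATE TABLE order_header (
        order_id INTEGER PRIMARY KEY
    );
    -- Identifying relationship: the foreign key to the parent
    -- is a component of the child's primary key.
    CREATE TABLE order_line (
        order_id INTEGER NOT NULL REFERENCES order_header(order_id),
        line_no  INTEGER NOT NULL,
        product  TEXT NOT NULL,
        PRIMARY KEY (order_id, line_no)
    );
    """
)

conn.execute("INSERT INTO order_header (order_id) VALUES (1)")
conn.execute("INSERT INTO order_line VALUES (1, 1, 'widget')")


def try_insert_orphan_line():
    """Attempt to add a line for an order that does not exist."""
    try:
        conn.execute("INSERT INTO order_line VALUES (99, 1, 'gadget')")
    except sqlite3.IntegrityError:
        return False
    return True


orphan_accepted = try_insert_orphan_line()
```

Because `order_id` is part of the key of `order_line`, a line is only ever identified in the context of its order, and the orphan insert is rejected.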

What are the differences in types of data models?

Logical and Physical Data Models often look significantly different from the Conceptual Data Model and from each other due to the roles they play.

For example, many-to-many relationships are perfectly acceptable (and very common) in a Conceptual Data Model, but these must be resolved in the Logical Data Model (usually with an associative data entity).

The Physical Data Model in turn may look significantly different from the Logical Data Model in order to achieve the performance necessary (e.g. vertical partitioning – splitting an entity into multiple entities based on stability).

If the Conceptual Data Model is not developed, or is not maintained, the big information picture is lost and misunderstandings can arise when referring to Logical and Physical Data Models apart from the Conceptual Data Model.

Foreign keys are frequently disabled in the Physical Data Model due to performance concerns.

If there were no Conceptual or Logical Data Model to refer to (ever reverse engineer a source system?), you would not know the cardinality except by profiling the data, which can be an intensive task, especially without a tool.
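
As a rough illustration of what such profiling involves, the sketch below classifies the relationship observed between two key columns. The function name and sample pairs are hypothetical. Keep in mind that profiling only reveals what the data happens to contain, not the business rule behind it.

```python
from collections import defaultdict


def infer_cardinality(pairs):
    """Classify the observed relationship between two key columns.

    `pairs` is an iterable of (left_key, right_key) tuples, for example
    the customer and pet identifiers found together on the same rows.
    """
    rights_per_left = defaultdict(set)
    lefts_per_right = defaultdict(set)
    for left, right in pairs:
        rights_per_left[left].add(right)
        lefts_per_right[right].add(left)

    left_has_many = any(len(v) > 1 for v in rights_per_left.values())
    right_has_many = any(len(v) > 1 for v in lefts_per_right.values())

    if left_has_many and right_has_many:
        return "many-to-many"
    if left_has_many:
        return "one-to-many"
    if right_has_many:
        return "many-to-one"
    return "one-to-one"


observed = infer_cardinality([("c1", "p1"), ("c2", "p1"), ("c1", "p2")])
```

A result of one-to-many from a small sample does not prove the business rule is one-to-many, which is one more reason to have the Conceptual Data Model to check against.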

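If surrogate keys are used in the Logical Data Model, relationship identification will differ from the Conceptual Data Model: a child entity keyed by its own surrogate no longer carries the parent’s key in its primary key, so a relationship that is identifying conceptually will appear non-identifying.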

Conclusion

In order for your solution application to be successful, you must correctly identify the proper business data entities and their relationships.

It is unavoidable!

The question is, when will you identify these – early in the project during requirements definition (when Conceptual Data Models should be created), during development, during implementation, or once in production?

The later in the project lifecycle problems are corrected, the greater the cost will be. However, the greatest cost is often the cost of poor Information Quality when the business cannot receive the information it needs, or is forced to make decisions based on incorrect or incomplete information.

The Conceptual Data Model helps us to see the Data Context in order to provide the information solutions the business requires.

Dylan Jones