In this guest post William Sharp of The Data Quality Chronicle provides some practical advice for leveraging data modelling at the outset of a data quality activity.
In today’s world of increasing feature sets it is easy to become bedazzled by the latest instalment of new functionality. Data quality software is no different than other enterprise applications in this regard. However, perhaps more important than ever, it is critical that users of data quality softwareknowthe data they are analyzing in order to fully leverage the tool and truly deliver good quality analysis.
First things first
With all the advances in data quality software it is easy to become enamoured with the bells and whistles and thus lose sight of the fundamentals. As these rich feature sets enable the less technological oriented business users to participate in data quality exercises, the fundamentals of data analysis become more, well, fundamental.
One of my first questions when I start a new data quality project is, “Does anyone have a data model?”. Sometimes I get lucky and the DBA has one. Sometimes when I am really lucky the DBA also has a data dictionary. I don’t play lotteries due to my lack of being “really lucky”, if you know what I mean.
If I get a data model, I sit and examine it like a CSI agent does blood spatter. It is, after all, my roadmap to solving some mysteries. If there is not a data model available, I dig-in and create one. Most times I start out with pencil and paper (I know, how arcane!). Most data quality projects involve one or two main entities, the customer or a product. As such, I use this as my starting point.
For the sake of simplicity, let’s concentrate on a customer-focused data quality initiative. Commonly referred to as Customer Data Integration, or CDI, these projects involve the most critical person in any business; the customer! Customers are a complex animal, particularly from a data perspective. Organizations often focus on collecting as much data regarding customers as possible and rightfully so. As a result, there is usually a fair amount of data in, or related to, the customer entity.
What works for me is to start out with a big picture and narrow my focus with further analysis. My first picture step is to build what I call my “customer frame”. The customer frame consists of the customer entity and each entity to which it is related. In the figure below you can view some of my basic customer frame based on a typical instance of Microsoft Dynamics CRM (in pencil nonetheless).
What did I learn from the exercise? For starters we can see that:
- The ContactBase Entity has a relationship to the ContactExtensionBase entity but I didn’t find relevant/helpful data in the ContactExtensionBase table (which is why it contains only the ID field relating the tables)
- This was useful in identifying what entities are essential and which are no
- Since the ContactExtensionBase table is a storage place for custom defined details regarding a contact, it was critical to examine this entity so I could be sure I was not missing very specific contact details
- The ContactBase Entity can be related to the AccountBase entity by the ContactBase.AccountID <> AccountBase.AccountID
- This allows me to, among other things, determine if there are active contacts associated with inactive accounts
- The ContactBase Entity can be related to the CustomerAddressBase entity via the ContactBase.ContactID <> CustomerAddressBase.ParentID
- This allows me to relate a contact to their address on record. Addresses play a critical role in CDI projects so this is a crucial piece of information and a relationship I’ll know in my sleep before long
- Knowing this relationship allows me to check for contacts with no address records or, worse yet, orphaned addresses (addresses without an association to a contact)
The list above is just a simple example of how studying the data model translates into practical knowledge which is critical to the success of a data quality initiative.
One of the more subtle points I touch on in this example is the identification of entities in the data model that are not necessarily useful. This is particularly true of Microsoft Dynamics CRM. Microsoft Dynamics CRM provides an “extension” table for just about every table in the database so that organizations can define and store data unique to their enterprise. As such, this is where to look for data that is “near & dear” to the hearts of users.
I came across a prime example of this on my last project when I discovered that my client was storing a unique identifier in one of these extension tables. This helped me identify a code that was akin to a social security number for each unique customer! On a CDI project, data that uniquely identifies a customer is a treasure well worth the time invested to discover it.
While I am definitely one to “geek-out” on new features of my favorite analysis tools, there is simply no replacement for knowing the data. Learning the basics of a data model doesn’t take very long but provides valuable insight into an organizations “data state”. I highly recommend writing up a cheat sheet like the one in the figure above and keeping it close to you as you define your data quality cleansing, standardization and matching routines.