Identifying Duplicate Customers by Jim Harris (Part 1 of 5)

Author: Jim Harris, Blogger-in-chief at OCDQ Blog

Published: February 10, 2009



How do you create a strategy for identifying duplicate customers? What techniques must you draw on?

In this feature by expert panelist and creator of the extremely popular OCDQ Blog, Jim Harris, we learn more about the challenges of managing duplicate records and why a robust methodology, not technology alone, should form the solution.

This feature is the first in a five-part series. To read the other features, please use the following links: Part 1 : Part 2 : Part 3 : Part 4


 

One of the most common data quality problems is the identification of duplicate records, especially redundant information about the same customer spread throughout the enterprise.

The need for a solution to this specific problem is one of the primary reasons that companies invest in data quality software and services.

There are many data quality vendors to choose from and all of them offer viable solutions. Many of these solutions are driven by impressive technology using advanced mathematical techniques such as probabilistic record linkage theory, bipartite graph matching algorithms, or my personal favorite, the redundant data capacitor, which makes identifying duplicate records possible using 1.21 gigawatts of electricity and a customized DeLorean DMC-12 accelerated to 88 miles per hour.

What is sometimes overlooked is that although technology provides the solution, what is being solved is a business problem. Technology sometimes carries with it a dangerous conceit – that what works in the laboratory and the engineering department will work in the board room and the accounting department, that what is true for the mathematician and the computer scientist will be true for the business analyst and the data steward.

However, what truly determines that a duplicate customer has been identified is not what scientific techniques or mathematical models can justify, but what your business rules define as a duplicate customer.

My point is neither to discourage the purchase of data quality software and services, nor to try to convince you which data quality vendor I think provides the superior solution – especially since these types of opinions are usually biased by the practical limits of your personal experience and motivated by the kind folks who are currently paying your salary or hourly rate.

My goal in this short series of posts is to focus on data quality methodology and not data quality technology.

I believe that an effective methodology for implementing your business rules for identifying duplicate customers will help you make the most of the time and effort you spend, and maximize the subsequent return on whatever technology you invest in.

One of the recurring themes in this series will be that the most significant challenge to solving this specific data quality problem is its highly subjective nature.

Data characteristics and their associated quality challenges vary from company to company. Business rules can differ from project to project within the same company. Decision makers on the same project can have widely varying perspectives. All of this points to the need for an effective methodology.

Unsuccessful data quality projects are most often characterized by the business team meeting independently to define the requirements and the technical team meeting independently to write the specifications.

Typically, the technical team then follows the all-too-common mantra of "code it, test it, implement it into production, and declare victory," which leaves the business team frustrated with the resulting "solution."

Successful data quality projects are driven by an executive management mandate for business and technical teams to forge an ongoing and iterative collaboration throughout the entire project. The business team usually owns the data and understands its meaning and use in the day-to-day operation of the enterprise, so it must partner with the technical team in defining the necessary data quality standards and processes.

During the business requirements phase of the project, some form of the following question will be asked:

"How do you define a duplicate customer?"

This is a critically important question – however, without an effective methodology, it can also prove to be a frustratingly difficult question. The participants in the requirements gathering process will most often respond with an answer that falls into one of the following two categories:


Category 1: "A duplicate customer is a duplicate customer."

In this category, the answer takes some form of stating that a duplicate customer occurs when the exact same information is repeated on multiple records, either within the same system or across multiple systems.
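To see what this exact-match definition does and does not catch, here is a minimal sketch. The records, field names, and the `exact_duplicates` helper are all hypothetical and invented for illustration; in a real project the examples should come from your actual data sources.

```python
from collections import defaultdict

# Hypothetical customer records, invented purely for illustration.
records = [
    {"id": 1, "name": "John Smith", "city": "Boston"},
    {"id": 2, "name": "John Smith", "city": "Boston"},
    {"id": 3, "name": "Jon Smith", "city": "Boston"},
]


def exact_duplicates(rows, fields):
    """Group record ids whose values in `fields` are exactly identical."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[f] for f in fields)
        groups[key].append(row["id"])
    # Only groups with more than one record count as duplicates.
    return [ids for ids in groups.values() if len(ids) > 1]


exact_groups = exact_duplicates(records, ["name", "city"])
print(exact_groups)
```

Under this definition, records 1 and 2 are grouped, while record 3 ("Jon Smith") is left alone. Whether record 3 is the same customer is exactly the kind of question the business rules have to answer.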

Sometimes, this answer is passive-aggressively provided by participants who doubt that such a problem could be prevalent in their systems. This "data denial" is not necessarily a matter of blissful ignorance, but is often a natural self-defense mechanism on the part of the data owners on the business side and/or the process owners on the technical side. No one likes to feel blamed for causing or failing to fix the data quality problem. This is one of the many human dynamics that are missing from the relative clean room of the laboratory where the technology was developed. Your methodology must consider the human factor because it will be the people involved in the project, and not the technology itself, that will truly make the project successful.

Other times, this answer is conservatively provided by participants who are concerned that being aggressive in identifying duplicate customers will negatively impact business decisions after duplicates are consolidated (either physically removed or logically linked). This answer is motivated by the fact that there is generally far greater concern about "false positives" than "false negatives" resulting from duplicate identification. What are false positives and negatives?

False positives occur when records are identified as duplicates but do NOT represent the same customer.

False negatives occur when actual redundant representations of the same customer are NOT identified.
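The two definitions above can be expressed as set differences between the pairs a matching process flags and the pairs that reviewers agree are truly the same customer. The sketch below assumes such a reviewed "truth" set exists for a sample; the record ids and the `match_errors` helper are hypothetical.

```python
def normalize_pairs(pairs):
    """Treat (a, b) and (b, a) as the same pair of record ids."""
    return {tuple(sorted(pair)) for pair in pairs}


def match_errors(predicted_pairs, true_pairs):
    """Return (false_positives, false_negatives) as sets of id pairs."""
    predicted = normalize_pairs(predicted_pairs)
    actual = normalize_pairs(true_pairs)
    false_positives = predicted - actual  # flagged, but not the same customer
    false_negatives = actual - predicted  # same customer, but not flagged
    return false_positives, false_negatives


# Hypothetical record-id pairs, invented for illustration.
predicted = [(1, 2), (3, 4)]  # pairs the matching process flagged
actual = [(2, 1), (5, 6)]     # pairs reviewers agreed are duplicates

fp, fn = match_errors(predicted, actual)
print("false positives:", fp)
print("false negatives:", fn)
```

Here the pair (3, 4) is a false positive and the pair (5, 6) is a false negative. Which of the two is more costly is a business decision, not something the code can settle.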

Later, we will look at data examples that illustrate both of these scenarios and why the harsh reality is that they can and will occur regardless of the technology or methodology.


Category 2: "Isn’t that what we are paying you to do for us?"

In this category, the answer takes some form of stating that identifying duplicate customers is either what the vendor’s software is supposed to do clairvoyantly, or that the vendor’s services team should just implement their proven methodology that worked for other clients in similar industries.

In the former case, it may be that the salesperson successfully "blinded them with science" to have such high expectations of the software. I am not trying to accuse salespeople of Machiavellian machinations (even though we have all encountered a few who would shamelessly sell their mother’s soul to meet their quota) – as I stated earlier, all data quality vendors have viable solutions driven by impressive technology.

In the latter case, the participants may share my belief that it is the symbiosis of technology and methodology that leads to implementation success. However, the project team must still participate in the definition of the business rules and not simply send the vendor off to "do the voodoo that they do so well."

Both categories of responses (but especially Category 1) help emphasize the importance of my first recommendation: defining the business rules to identify duplicate customers cannot be accomplished via a theoretical exercise.

Customer duplication is not a theoretical problem. It is a real business problem that negatively impacts the quality of decision-critical enterprise information. Data-driven problems require data-driven solutions. Business rules are best illustrated by data examples. And I mean examples in the true definition of the word: real data from one or more of the project’s actual data sources that exemplify the problem, and not data metaphors that may meaningfully demonstrate the problem but are nonetheless fictional.

Therefore, it is highly recommended that, before the requirements gathering phase, some preliminary analysis be performed on a representative sample of data from one or more of the project’s actual data sources. This preparation of effective data examples will enable a far more productive discussion of the business rules.
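One possible shape for that preliminary analysis is sketched below: group a sample of records by a crude normalized key so that candidate duplicates surface as discussion material for the business team. The field names (`name`, `postal_code`), the normalization choices, and the sample rows are assumptions made for illustration. The rows are placeholders only; the point of the exercise is to run something like this against real data from the project’s own sources.

```python
import re
from collections import defaultdict


def crude_key(name, postal_code):
    """Lowercase, drop punctuation, and collapse whitespace in the name."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", name.lower())
    cleaned = " ".join(cleaned.split())
    return (cleaned, postal_code.strip())


def candidate_groups(rows):
    """Return groups of rows that share the same crude key."""
    groups = defaultdict(list)
    for row in rows:
        groups[crude_key(row["name"], row["postal_code"])].append(row)
    return [group for group in groups.values() if len(group) > 1]


# Placeholder rows standing in for a sample drawn from a real source.
sample = [
    {"id": 1, "name": "Acme Corp.", "postal_code": "02110"},
    {"id": 2, "name": "ACME  Corp", "postal_code": "02110 "},
    {"id": 3, "name": "Acme Corporation", "postal_code": "02110"},
]

groups = candidate_groups(sample)
for group in groups:
    print([row["id"] for row in group])
```

The grouping key is deliberately simple. Records 1 and 2 surface as candidates, while record 3 does not, which makes it a useful example to bring to the requirements discussion: should "Corp" and "Corporation" be treated as the same?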









About Jim Harris

Jim Harris is the Blogger-in-Chief at Obsessive-Compulsive Data Quality, an independent blog offering a vendor-neutral perspective on data quality and its related disciplines. 

Jim is a recognized industry thought leader on data quality with over 15 years of professional services and application development experience in data quality, data integration, data warehousing, business intelligence, master data management and data governance. 

Jim is also an independent consultant, speaker and freelance writer for hire. He is very active on Twitter, where you can follow him @ocdqblog.

