Identifying Duplicate Customers (Part 1)
by Jim Harris
In this guest post by expert panelist Jim Harris, we learn more about the challenges of managing duplicate records and why a robust methodology, not technology alone, should form the solution.
One of the most common data quality problems is the identification of duplicate records, especially redundant information of the same customer throughout the enterprise.
The need for a solution to this specific problem is one of the primary reasons that companies invest in data quality software and services.
There are many data quality vendors to choose from and all of them offer viable solutions. Many of these solutions are driven by impressive technology using advanced mathematical techniques such as probabilistic record linkage theory, bipartite graph matching algorithms, or my personal favorite, the redundant data capacitor, which makes identifying duplicate records possible using 1.21 gigawatts of electricity and a customized DeLorean DMC-12 accelerated to 88 miles per hour.
What is sometimes overlooked is that although technology provides the solution, what is being solved is a business problem. Technology sometimes carries with it a dangerous conceit – that what works in the laboratory and the engineering department will work in the board room and the accounting department, that what is true for the mathematician and the computer scientist will be true for the business analyst and the data steward.
However, what truly determines that a duplicate customer has been identified is not what scientific techniques or mathematical models can justify, but what your business rules define as a duplicate customer.
My point is neither to discourage the purchase of data quality software and services, nor to try to convince you which data quality vendor I think provides the superior solution – especially since these types of opinions are usually biased by the practical limits of your personal experience and motivated by the kind folks who are currently paying your salary or hourly rate.
My goal in this short series of posts is to focus on methodology and not technology.
I believe that an effective methodology for implementing your business rules for identifying duplicate customers will help you make the most of your time and effort, and maximize the subsequent return on whatever technology you invest in.
One of the recurring themes in this series will be that the most significant challenge to solving this specific data quality problem is its highly subjective nature.
Data characteristics and their associated quality challenges are unique from company to company. Business rules can be different from project to project within the same company. Decision makers on the same project can have widely varying perspectives. All of this points to the need for having an effective methodology.
Unsuccessful data quality projects are most often characterized by the business team meeting independently to define the requirements and the technical team meeting independently to write the specifications.
Typically, the technical team then follows the all too common mantra of “code it, test it, implement it into production, and declare victory” that leaves the business team frustrated with the resulting “solution.”
Successful data quality projects are driven by an executive management mandate for business and technical teams to forge an ongoing and iterative collaboration throughout the entire project. The business team usually owns the data and understands its meaning and use in the day-to-day operation of the enterprise, so it must partner with the technical team in defining the necessary data quality standards and processes.
During the business requirements phase of the project, some form of the following question will be asked: “How do you define a duplicate customer?”
This is a critically important question – however, without an effective methodology, it can also prove to be a frustratingly difficult question. The participants in the requirements gathering process will most often respond with an answer that falls into one of the following two categories:
Category 1: “A duplicate customer is a duplicate customer.”
In this category, the answer takes some form of stating that a duplicate customer occurs when the exact same information is repeated on multiple records, either within the same system or across multiple systems.
Sometimes, this answer is passive-aggressively provided by participants who doubt that such a problem could be prevalent in their systems. This “data denial” is not necessarily a matter of blissful ignorance, but is often a natural self-defense mechanism from the data owners on the business side and/or the process owners on the technical side. No one likes to feel blamed for causing or failing to fix the data quality problem. This is one of the many human dynamics that are missing from the relative clean room of the laboratory where the technology was developed. Your methodology must consider the human factor because it will be the people involved in the project, and not the technology itself, that will truly make the project successful.
Other times, this answer is conservatively provided by participants who are concerned that being aggressive in identifying duplicate customers will negatively impact business decisions after duplicates are consolidated (either physically removed or logically linked). This answer is motivated by the fact that there is generally far greater concern about “false positives” than “false negatives” resulting from duplicate identification. What are false positives and negatives?
- False positives occur when records are identified as duplicates but do NOT represent the same customer.
- False negatives occur when actual redundant representations of the same customer are NOT identified.
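To make the two definitions concrete, here is a minimal Python sketch that compares what a matching process proposed against what a business reviewer decided. The record ids and both groupings are hypothetical placeholders, and the pairwise comparison is just one reasonable way to count errors, not a prescribed method.

```python
from itertools import combinations

def pairs(groups):
    """Expand groups of record ids into the set of unordered id pairs they imply."""
    result = set()
    for group in groups:
        for a, b in combinations(sorted(group), 2):
            result.add((a, b))
    return result

# Hypothetical record ids (assumption, not from any real system).
# truth_groups: what a business reviewer decided are the same customer.
# matched_groups: what some matching process proposed as duplicates.
truth_groups = [{1, 2, 3}, {4, 5}]
matched_groups = [{1, 2}, {4, 5, 6}]

true_pairs = pairs(truth_groups)
matched_pairs = pairs(matched_groups)

# Pairs linked by the process that the reviewer says are different customers.
false_positives = matched_pairs - true_pairs
# Pairs the reviewer says are the same customer that the process missed.
false_negatives = true_pairs - matched_pairs

print("false positives:", sorted(false_positives))
print("false negatives:", sorted(false_negatives))
```

In this made-up case, record 6 was wrongly pulled into a group (two false positive pairs) and record 3 was missed (two false negative pairs). Which of the two errors costs more is a business decision, which is why the reviewer's grouping, not the process's, serves as the reference.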
Later, we will look at data examples that illustrate both of these scenarios and why the harsh reality is that they can and will occur regardless of the technology or methodology.
Category 2: “Isn’t that what we are paying you to do for us?”
In this category, the answer takes some form of stating that identifying duplicate customers is either what the vendor’s software is supposed to do clairvoyantly, or that the vendor’s services team should just implement their proven methodology that worked for other clients in similar industries.
In the former case, it may be that the salesperson successfully “blinded them with science” to have such high expectations of the software. I am not trying to accuse salespeople of Machiavellian machinations (even though we have all encountered a few who would shamelessly sell their mother’s soul to meet their quota) – as I stated earlier, all data quality vendors have viable solutions driven by impressive technology.
In the latter case, the participants may share my belief that it is the symbiosis of technology and methodology that leads to implementation success. However, the project team must still participate in the definition of the business rules and not simply send the vendor off to “do the voodoo that they do so well.”
Both categories of responses (but especially Category 1) help emphasize the importance of my first recommendation – defining the business rules to identify duplicate customers cannot be accomplished via a theoretical exercise.
Customer duplication is not a theoretical problem – it is a real business problem that negatively impacts the quality of decision-critical enterprise information. Data-driven problems require data-driven solutions. Business rules are best illustrated by data examples. And I mean examples in the true definition of the word – real data from one or more of the project’s actual data sources that exemplify the problem and not data metaphors that may meaningfully demonstrate the problem but are nonetheless fictional.
Therefore, it is highly recommended that before the requirements gathering phase, some preliminary analysis is performed on a representative sample of data from one or more of the project’s actual data sources. This preparation of effective data examples will enable a far more productive discussion of the business rules.
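As a rough illustration of what such a preliminary analysis might look like, the Python sketch below groups records on a lightly normalized name and postal code to surface candidate duplicates for discussion. The field names (`id`, `name`, `postal_code`), the normalization rules, and the sample rows are all assumptions made for this sketch; the rows are placeholders only, and in keeping with the recommendation above they should be replaced with a sample from the project's actual data sources.

```python
from collections import defaultdict

def normalize(value):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    kept = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in value.lower())
    return " ".join(kept.split())

def candidate_groups(records):
    """Return lists of record ids that share a normalized name and postal code."""
    groups = defaultdict(list)
    for rec in records:
        key = (normalize(rec["name"]), normalize(rec["postal_code"]))
        groups[key].append(rec["id"])
    # Only keys shared by more than one record are candidate duplicates.
    return [ids for ids in groups.values() if len(ids) > 1]

# Placeholder rows (hypothetical); substitute a real sample in practice.
sample = [
    {"id": 1, "name": "Acme Corp.", "postal_code": "02134"},
    {"id": 2, "name": "ACME  corp", "postal_code": "02134"},
    {"id": 3, "name": "Acme Corporation", "postal_code": "02134"},
    {"id": 4, "name": "Bolt Ltd", "postal_code": "10001"},
]

print(candidate_groups(sample))
```

With these placeholder rows, records 1 and 2 are grouped while record 3 is not, even though a person might consider it the same customer. Cases like that are the ones worth bringing to the requirements discussion, since whether "Corp" and "Corporation" should match is a business rule rather than a technical default.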
In Part 2 of this article: We will take a look at some of my favourite data metaphors to illustrate the importance of using the data examples from your preliminary analysis during the business requirements phase.


Reader Comments (5)
Very promising writing Jim - your observations are very close to mine.
One thing I always start with when approaching a customer (or rather party) data quality improvement project is, that instead of asking “what is a duplicate?” it’s more fruitful to ask “on what hierarchy level(s) do you want to track your business partners?”.
The classic main questions are:
• If you do B2C: do you want to track on the communication channel (e.g. an e-mail), on the individual person level, on household level or to track hierarchies between these?
• If you do B2B: how do you want to track within hierarchies as personal contacts, department contacts, branch offices, headquarters, domestic mothers and global mothers?
• If you do both (most companies actually have a mix): what efforts are worth taking to classify these entities?
The classic ingredients in a solution are:
• A robust methodology supported by specific data governance policies, which will work pretty much globally
• A set of powerful technology tools, which may be globally generic but also more locally oriented
• A world of external reference data, which can be globally standardised but is very often locally founded
When doing the preliminary analysis – which I also find imperative – I do tend to gather as much data as possible and then identify and count both the usual suspects and catch the unusual cases.
I am looking forward to Part 2.
Jim,
Great beginning to what looks like a 999 page dissertation. I like the way you have kept this at a higher-level and have not gotten lost in the details right away. I think many projects involving data quality (and specifically "dedup" projects) fail to meet their objective because of what you described as the "human element": the data owners take the quality (or lack thereof) of their data personally and are too quick to implement something that will just "brush it all under the carpet before Corporate notices".
I hope the next installments will continue your attempt at staying "solution neutral" as this topic could get very interesting the deeper you dive. Keep up the good fight! Awesome stuff . . .
Great article, Jim! You have captured the gist of our early conversations with the stakeholders on the business end of a data quality project, including:
1. The need to define a “customer” and exactly what constitutes a “duplicate customer”, and...
2. The critical need to formulate rules using REAL data. We have all been there. When it comes to data problems, the truth is always much stranger than any fiction we can dream up.
I am looking forward to Part 2.
In my opinion, matching of name and address records is an art rather than a science. It relies heavily on human intervention as it’s not exactly binary - ambiguity always plays a part.
Whilst records can be safely classified as duplicates or not, there is unfortunately the twilight zone, a non binary world of free radical duplicates.
In this inhospitable place, either complex re-try logic is required or humans need to spend their valuable time reviewing and accepting or rejecting possible duplicate records.
Like most other business problems, it appears Pareto's 80:20 rule applies: 80% of the duplicates will be found in 20% of the time, and the remaining 20% will take 80% of the effort.
So if you’d like to save 80% of your time for free and find out how many duplicates you have, we’re offering a no obligation 3 month trial of our DedupeExpress product.
You can fully use the software to identify, match, review and merge data from any single system or across disparate systems for up to 100,000 records in any single session.
To register please go to our website http://www.dqglobal.com/free_dedupeexpress.html or call on +44 (0) 2392 988303
Jim -
This is really well done. I have certainly come across situations analogous to those that you have described. I particularly have seen people become defensive when I find through basic queries that, generally speaking, "someone did something incorrect." End-users are quick to plead ignorance or blame predecessors for errors. In the event that they themselves have made the mistakes (audit trails are pretty hard to dispute), the tone of the conversation is quite different. There it is in black and white: you did this on this date. I find that many people can become quite defensive.
You're dead on about the role of the consultant and the technology to identify potential or probable duplicates. It's the client's role to ultimately make the final call. Far too often, however, end-users do not have the time, desire, or skill set to make these calls.
Phil Simon