Identifying Duplicate Customers (Part 5)
DQ Techniques,
Methodology,
mp-month | by
Jim Harris
In this final part of a 5 part series, data quality expert Jim Harris presents some final tips and techniques to help improve consolidation and create a "best-of-breed" representative record.
If you missed the first 4 parts you can view them here:
(Part 1), (Part 2), (Part 3), (Part 4).
Identifying Duplicate Customers (Part 5)
So far in this series, we have discussed:
- Why a symbiosis of technology and methodology is necessary when approaching the common data quality problem of identifying duplicate customers
- How performing a preliminary analysis on a representative sample of real project data prepares effective examples for discussion
- Why using a detailed, interrogative analysis of those examples is imperative for defining your business rules
- How both false negatives and false positives illustrate the highly subjective nature of this problem
- How to document your business rules for identifying duplicate customers
- How to set realistic expectations about application development
- How to foster a collaboration of the business and technical teams throughout the entire project
In this article, the fifth and final part in the series, we will discuss topics related to duplicate consolidation, including techniques for creating a “best of breed” representative record for duplicates, physical removal vs. logical linkage, and consolidation vs. cross population.
E Pluribus Unum – Out of Many, One
We will use the following data metaphors from Part 2 of this series:
Consolidation evaluates groups of identified duplicate records and creates one representative record for each group.
Creating a “Best of Breed” Representative Record
Typically, consolidation creates the representative (i.e. “best of breed”) record using one of two techniques:
- Record Level Consolidation – choosing one complete record from within the group
- Field Level Consolidation – constructing fields from potentially different records from within the group
The business rules for performing consolidation can vary as much as the business rules for identifying duplicate customers. The selection criteria are usually fairly straightforward, however complexity can be caused when multiple criteria with nested levels of tie-breakers are needed to choose or construct the “best of breed” data for the group.
Although not a comprehensive list, here are some of the most common selection criteria:
- Completeness – for record level consolidation, this usually means selecting the record with the highest number of populated fields. For field level consolidation, this usually means selecting the fields with the longest values.
- Frequency – more common in field level consolidation where the most frequently occurring value is selected for a given field. However, it can be used in record level consolidation to select the record that has the most frequently occurring combination of values across fields. Either way, the assumption is that the most frequently occurring value indicates preferred information.
- Recency - for record level consolidation, this usually means selecting the record most recently updated. For field level consolidation, this usually means selecting the field value from the record most recently updated. Either way, the assumption is that the most recent update contains reliable information.
- Source – more common in record level consolidation to select the record that originated in a preferred source system. However, it can be used in field level consolidation to select the value for a given field from the record that originated in a preferred source system.
Examples of Record Level Consolidation:
Choosing a record based on the most frequently occurring combination of values across fields:
Examples of Field Level Consolidation:
Constructing a record based on selecting the fields with the most complete (longest) values:
"Frankenstein Consolidation"
Field level consolidation is sometimes referred to as “Frankenstein consolidation” since constructing fields from different records within the group can assemble an “unnatural data monster” by creating an invalid combination of field values. This concern typically makes record level consolidation the far more common consolidation technique.
Physical Removal vs. Logical Linkage
Typically, consolidation is implemented using one of two techniques:
- Physical Removal – where the group of identified duplicate records is replaced with the representative record, meaning either the duplicate records are actually deleted from the source system or simply excluded from the target system. Physical removal is most commonly associated with record level consolidation.
- Logical Linkage – where the group of identified duplicate records are updated with a reference identifier field whose value points to the identifier field value of the representative record, which is differentiated by either not having a reference identifier field value or with an additional indicator field.
Consolidation vs. Cross Population
An alternative strategy in consolidation is cross population, where the representative record is used to update the identified duplicate records with the “best of breed” data. Cross population is most commonly associated with field level consolidation.
Typically, cross population is implemented using one of two techniques:
1. Fill in the blanks – values from the representative record are used to update only the unpopulated fields in the records in the group.
2. Create consistent values – the representative record is used to update all fields in all records in the group to create a single consistent representation of the highest quality data available.
Applying “fill in the blanks” cross population using the above representative records:
Applying “create consistent values” cross population using the above representative records:
Useful Resources
See all posts in: DQ Techniques, Methodology


Reader Comments (3)
Thanks for such a good series of articles.
Having practiced this for decades, I learned something a few years back, about which I have written and spoken several times. I call it "uniqueness is in the eye of the beholder."
It's basically this principle: given a set of records, different business units will have different false negative and false positive rules, arising in different - often opposing - definitions of what a "unique" thing is.
Here's an example: I'm a customer of a software company, having licensed several of their products for the last 20+ years. Over those years, I've lived at several addresses, have worked at many others, have had numerous email addresses and phone numbers. Other times, I've purchased other titles from the same vendor as an owner of my own consulting company, using similar PII but sometimes the same and sometimes different credit cards. Still other times, I've licensed software for others, using different addresses but the same credit card. Over time, I've also used different privacy preferences, depending on the software and its usage. So the question is: am I the same customer or not? To the financial folks, they may demand a more conservative match criteria, saying if the billing names and addresses are different, I'm a different customer. The privacy group may not care about this, but may prefer to match on name, email, and privacy preference. The marketing department may have yet a different set of criteria based on household, any address match, name, etc., even if it confuses me as a consumer with me as a contact for a small business. The product teams match based on other things like licenses, names, etc. So, who's right? well, the *all* are!
This is not a contrived example, by the way, but real situations that made it almost impossible to create a "single definition of truth" that would satisfy multiple business users.
So, it's imperative that you can support multiple definitions of "uniqueness" at once.
This implies that when you do find a match, you never, ever merge the underlying records, and you must be very careful when propagating changes across underlying records.
I hope this is helpful for you and your readers!
Marty Moseley
CTO, Initiate Systems, Inc.
I could not agree more with Marty, I believe the term SCV (Single Customer View) to be part of the problem. If we changed it to View of Single Customer (VSC); whilst I know not quite as snappy, we'd have a better sense that there can be many views of the same thing.
When we both look at a bottle from opposite sides, I see the label and you don't. That's no different than accounts, marketing, sales, finance et. al. looking at a person or business. We all want to see different views of the same thing.
When you boil it down it's all about context. The context you require the data in order to develop actionable information which drives purposeful decisions.
As Marty say's "uniqueness is in the eye of the beholder."
I enjoyed your article greatly Jim and like yourself I have used most of the Data Quality tools in the Gartner Magic Quadrant and then some. However, the data quality professional in me could not resist picking up a slight error unless this is something that needs more explanation. In the example 'Constructing a record based on selecting the fields with the most complete (longest) values:' the resulting record is Jerome David Salinger. I would guess that your result is based on parsing the Customer Name field into it's discrete components, ie First Name, Middle Name & Last Name and then retaining the longest value from each of those fields before consolidating each of the name elements back into Customer Name and not just simply choosing the longest Customer Name from the existing Customer Name field?