Identifying Duplicate Customers (Part 2)
DQ Techniques, Methodology | by Jim Harris
In this, the second article in the series, expert panelist Jim Harris provides some fictional examples to highlight the need for a detailed, interrogative data analysis approach for duplicate customer resolution.
To see all articles in the series click on one of the options below:
(Part 1), (Part 2), (Part 3), (Part 4), (Part 5)
In Part 1 of this series I explained why a symbiosis of technology and methodology is necessary when approaching this common data quality problem. I also recommended performing a preliminary analysis on a representative sample of real project data in order to prepare effective examples for defining your business rules.
In Part 2, I will use data metaphors (i.e. fictional examples) to illustrate the importance of real data analysis as well as the highly subjective nature of this problem.
Specifically in this post, I will be focusing on the impact that false negatives can have on business rule definition.
For simplicity, the data metaphors will use the following three customer attributes:
- Customer Name – only personal names
- Postal Address – only United States address formats
- Tax ID – for better fictional values, I used dates related to the customer name
False negatives can be caused when the greater concern about false positives motivates a cautious approach to duplicate identification. This leads many projects to adopt a strategy allowing only exact matches. Therefore, let’s begin by looking for duplicates where the exact same information is repeated on multiple records – meaning where all attributes are populated and have the same value:
Would you argue that these are NOT duplicate customers?
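As a minimal sketch of what an exact match strategy amounts to, the Python below groups records only when all three attributes are populated and identical. The records, field names and the `exact_match_groups` helper are hypothetical and made up for illustration; they are not the article's example data.

```python
from collections import defaultdict

# Hypothetical records for illustration only (not the article's examples).
records = [
    {"key": 1, "name": "Jane Q Public", "address": "1 Main St, Springfield, IL 62701", "tax_id": "000-00-0001"},
    {"key": 2, "name": "Jane Q Public", "address": "1 Main St, Springfield, IL 62701", "tax_id": "000-00-0001"},
    {"key": 3, "name": "J Q Public", "address": "1 Main St, Springfield, IL 62701", "tax_id": "000-00-0001"},
    {"key": 4, "name": "Jane Q Public", "address": "1 Main St, Springfield, IL 62701", "tax_id": ""},
]

ATTRIBUTES = ("name", "address", "tax_id")


def exact_match_groups(rows):
    """Return lists of keys whose attributes are all populated and identical."""
    groups = defaultdict(list)
    for row in rows:
        values = tuple(row[attr] for attr in ATTRIBUTES)
        # Records with any empty attribute are left out of exact matching.
        if all(values):
            groups[values].append(row["key"])
    # Only groups with more than one record are duplicates.
    return [keys for keys in groups.values() if len(keys) > 1]


print(exact_match_groups(records))
```

In this made-up data only keys 1 and 2 are grouped. The abbreviated name on key 3 and the missing Tax ID on key 4 are exactly the kind of records an exact match strategy leaves behind as potential false negatives.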
Now you have an important choice – either you:
- Take the blue pill – stop reading and be content with implementing an exact match strategy and at least some of your duplicate customers will be identified
- Take the red pill – stay in Wonderland and I will show you just how deep the rabbit hole goes...
You have chosen wisely – so let’s get back to the future of identifying duplicate customers:
Exact matching missed the last three records – do you think that all five are duplicates of the same customer?
The abbreviation of first and middle names is a common challenge:
Does a matching Tax ID guarantee that a variation is a duplicate? What about when Tax ID is missing?
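One way to make the abbreviation question concrete is a name comparison that treats an initial as compatible with any name starting with that letter. The sketch below is an assumption about how such a rule might look, not a recommended business rule; `names_compatible` and its helpers are hypothetical, and the rule assumes the last token of the string is the last name.

```python
def tokens(name):
    """Split a personal name into lowercase tokens without trailing periods."""
    return [t.strip(".").lower() for t in name.split() if t.strip(".")]


def part_compatible(a, b):
    """Two name parts agree if equal, or if one is an initial of the other."""
    if a == b:
        return True
    if len(a) == 1:
        return b.startswith(a)
    if len(b) == 1:
        return a.startswith(b)
    return False


def names_compatible(name_a, name_b):
    """Return True when the names could refer to the same person.

    Last names must be equal; given and middle names are compared in
    order, and a missing middle name is treated as compatible.
    """
    ta, tb = tokens(name_a), tokens(name_b)
    if not ta or not tb:
        return False
    if ta[-1] != tb[-1]:
        return False
    # zip stops at the shorter list, so extra middle names are ignored.
    return all(part_compatible(x, y) for x, y in zip(ta[:-1], tb[:-1]))


print(names_compatible("John Q. Public", "J. Public"))
print(names_compatible("J. Public", "Jane Public"))
```

Both calls return True, which is the point: "J. Public" is compatible with "John Q. Public" and with "Jane Public" alike. Compatible does not mean duplicate, so a rule like this still needs supporting evidence from the other attributes.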
An additional challenge that can occur with abbreviated first names:
Without Tax IDs, could you determine who Keys 223 & 233 should match? If both were missing Tax ID, would they then be considered duplicates of each other?
Name and address variations can combine to present additional challenges:
Is Key 153 an old postal address for the same customer? If postal validation confirms for Key 242 that “Riverside Drive” was renamed “John F. Kennedy Drive” would that make the name variation more acceptable?
Do Keys 154 & 243 represent a possible nickname or another family member using the same Tax ID? Did you notice the transposed numbers in Tax ID on Key 243? If so, did you give any partial credit?
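The idea of partial credit for transposed numbers can be sketched as a small scoring function. Everything here is an assumption made for illustration: the `tax_id_score` name, the 0.5 partial credit value, and the choice to return `None` when a Tax ID is missing rather than counting it as a mismatch.

```python
def is_adjacent_transposition(a, b):
    """True if b equals a with exactly one pair of adjacent characters swapped."""
    if len(a) != len(b) or a == b:
        return False
    diffs = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
    return (
        len(diffs) == 2
        and diffs[1] == diffs[0] + 1
        and a[diffs[0]] == b[diffs[1]]
        and a[diffs[1]] == b[diffs[0]]
    )


def tax_id_score(a, b):
    """Score two Tax IDs: 1.0 exact, 0.5 transposed, 0.0 different, None if missing."""
    digits_a = "".join(ch for ch in a if ch.isdigit())
    digits_b = "".join(ch for ch in b if ch.isdigit())
    if not digits_a or not digits_b:
        # A missing value is no evidence either way (an assumption).
        return None
    if digits_a == digits_b:
        return 1.0
    if is_adjacent_transposition(digits_a, digits_b):
        # Hypothetical partial credit for a likely keying error.
        return 0.5
    return 0.0


print(tax_id_score("000-12-3456", "000-21-3456"))
```

How much credit a transposition deserves, and whether a missing Tax ID should be neutral or count against a match, are exactly the subjective decisions that the business rules have to settle.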
Marriages can be good for people but possibly bad for their data:
Did the hyphenated last name on Key 252 help you overcome the change of address and missing Tax ID? How do you know if Keys 261 and/or 262 are truly the same customer as Key 263?
In closing, please carefully consider the following pairs of records:
One of these pairs is a potential false negative caused by a pseudonym and the other is a potential false positive. Even if you know which is which, how do you define a business rule for this scenario?
What other data metaphors can you think of that would illustrate the challenge of false negatives?
In Part 3 of this series, we will look at data metaphors that illustrate why some of the business rules that you just defined for resolving false negatives could result in creating the false positives that you started out trying to avoid.


Reader Comments (3)
Interesting piece. The three biggest challenges I’ve found are:
1. Missing data – where there are holes in the data i.e. something versus nothing is 100% different, hence any weighting applied to a field on this basis won’t help.
2. Misaligned data – where data is in the wrong column (Postcode in the Town field, etc.), so records being compared on the same column have differing values
3. Aliases - bringing the data to a common format so that string comparisons and weightings have real value
Over the years we've had a go at solving these and if you’d like to find out how we’ve done, we’re offering a no obligation 3 month trial of our DedupeExpress product. You can fully use the software to identify, match, review and merge data from any single system or across disparate systems for up to 100,000 records in any single session.
To register please go to our website http://www.dqglobal.com/free_dedupeexpress.html or call on +44 (0) 2392 988303
A well written and nicely considered article, Jim, with some great examples. I particularly like the T. Kundera v T. Kundera case and your question about what would happen if the Tax ID was not present. It illustrates the point that identical records are not necessarily good matches.
Your final examples (Shakespeare & Twain) are interesting too. The use of alternative names is the sort of issue that our software deals with on a daily basis for companies like Barclays and Vodafone who need to identify whether any of their customers are known bad guys.
Many of these people use strings of aliases and deliberately manipulate their names, addresses and other personal details in an attempt to create a new identity. Relying on probabilistic matching or relaxing comparisons enough to find these matches would create an intolerable volume of false positives, so we use intelligence data from third parties and the business along with known name variants to identify matches with unparalleled accuracy.
Find out more about our enterprise data quality solution and its application in customer and employee screening by visiting datanomic.com or call +44 1223 228 450.
Jim, it’s always interesting when you discover that someone else out there is working with the same challenges as you.
And of course, over the years you discover a lot of situations that you want an automated fix for.
Adding to Martin’s list, here are a few:
Person names that are really two names in the same string, like “Mary & John Smith”. If, say, you have at the same address
• “Mary and John Smith”,
• “John Smith”,
• “Mary Smith”,
• “M. Smith”,
then you really have to consider your hierarchies – individual and household.
Many companies don’t have exact or reliable markings between private customer and business customers. So if you have
• “Henrik Sorensen”,
• “IT-consultant Henrik Sorensen”,
then it depends on whether you want to mix private and business entities, and on what constitutes a private or business entity.
In B2B data with contact persons your hierarchies are core. What about
• “Sorensen Inc.”,
• “Sorensen Incorporated, John Smith”.
If you want to clean for duplicate mailings – it could be regarded as a duplicate. If you want to do MDM, you have 2 linked entries on 2 different hierarchy levels.
A special situation, which I call the “echo problem”, is that you have
• “Sorensen Inc., John Smith”,
• “Sorensen Real Estate Ltd, John Smith”
on the same address and perhaps same phone. This puts some fire under your hierarchy management.
I often see that, with international data, solutions that work pretty well for data from one origin work less well, and may even be damaging, for data of a different origin. Some of the challenges are:
• Sequence and special words in names – in English “and” and & are the same, but in Danish “and” is a duck
• Different meanings, e.g. “Kim” is an Anglo female nickname, a Danish male given name and a very common Korean family name, and in Korea family names are written first
• Postal code formats, placement and granularity differ a lot between countries
The solutions I work with to tackle this mess include:
• Powerful and configurable algorithms for similarity
• Maintainable decision matrices
• Extensive use of external reference data
The availability, cost, coverage, quality and timeliness of external reference data also differ greatly between countries. So do the rules for storing and using elements similar to the US Tax ID.
As for business rules, the main goal is often just to replicate the real world. The question then is rather to what degree you are able to store, maintain and consume a true picture of the real world. Then you start prioritising.
I will refrain from putting an ad here :-)