Monday, Feb 23, 2009

Identifying Duplicate Customers (Part 3)


This is the 3rd article in the series by expert panelist Jim Harris. In this article we learn more about the challenges presented by false positives in your duplicate customer management process.

To read all articles in the series click on one of the options below:

(Part 1), (Part 2), (Part 3), (Part 4), (Part 5)

 

Identifying Duplicate Customers

In Part 2 of this series, data metaphors (i.e. fictional examples) illustrated the importance of using a detailed, interrogative analysis of real project data in your approach to identifying duplicate customers. I explained how a greater concern about false positives could motivate you to take a cautious approach that can cause false negatives, especially when you restrict yourself to using only exact matching techniques.

In this article, Part 3, additional data metaphors will illustrate the impact that false positives can have on business rule definition.

Again, for simplicity, the data metaphors will use the following three customer attributes:

  1. Customer Name – only personal names.
  2. Postal Address – only United States address formats.
  3. Tax ID – for better fictional values, I used dates related to the customer name.

Your business rule adjustments for preventing false negatives can result in linking records of related (but not duplicated) customers. Sometimes, these false positives may reveal meaningful data relationships that are useful in other enterprise information initiatives.

 

Let’s begin by looking at some false positives caused by a matching Tax ID and/or postal address:

image

For Keys 312 & 313, do you think the matching Tax ID and similar name indicate possible duplication of Key 311 despite the different postal address?

For Keys 322 & 323, do you think the exact same postal address and similar name indicate possible duplication of Key 321 despite the missing Tax IDs?

What about Keys 331 – 333, where completely different names have the exact same postal address and Tax ID?
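A minimal sketch of how such candidate links might be surfaced for human review, rather than merged automatically. The records, field names and the `candidate_links` helper are all hypothetical, invented for illustration; they are not the data shown in the image above.

```python
from itertools import combinations

# Hypothetical records, invented for illustration (not the article's data).
records = [
    {"key": 1, "name": "Ann Lee", "address": "1 Oak St", "tax_id": "111"},
    {"key": 2, "name": "Anne Lee", "address": "9 Elm Rd", "tax_id": "111"},
    {"key": 3, "name": "A. Lee", "address": "1 Oak St", "tax_id": None},
    {"key": 4, "name": "Bob Ray", "address": "5 Pine Ave", "tax_id": "222"},
]

def candidate_links(records):
    """Return (key_a, key_b, reasons) for pairs sharing a Tax ID and/or address.

    A link is only a candidate: it may be a duplicate or a false positive.
    """
    links = []
    for a, b in combinations(records, 2):
        reasons = []
        # A missing Tax ID never counts as a match.
        if a["tax_id"] and a["tax_id"] == b["tax_id"]:
            reasons.append("tax_id")
        if a["address"] == b["address"]:
            reasons.append("address")
        if reasons:
            links.append((a["key"], b["key"], reasons))
    return links

for key_a, key_b, reasons in candidate_links(records):
    print(key_a, key_b, reasons)
```

Recording the reason for each link keeps the decision (duplicate, related customer, or coincidence) with the business rules instead of hiding it inside the matching step.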

 

A common challenge is the same family name and the exact same postal address:

image

Do you think Key 413 is a duplicate of Key 412 or a son named after both of his parents?

Do you think Keys 421 – 423 are duplicates caused by multiple pseudonyms or a son, father and grandfather living in the same house?

Without Tax IDs, Keys 431 & 432 can only be differentiated by generation (i.e. “Jr.” indicating Junior and “Sr.” indicating Senior). However, do you think Key 433 is a duplicate of either of them?

Keys 441 & 442 might only be differentiated by gender, but what about Key 443?

 

An additional complexity can occur with families at the exact same postal address:

image

You may find it useful to first split the compound customer names into separate records:

image

Performing this split reveals two potential pairs (511-b/512-a & 512-b/513-a) with the exact same name and postal address but completely different Tax IDs – are these duplicates? How many customers do you think are represented by Keys 511 – 513?
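The split described above could be sketched roughly as follows. This assumes, hypothetically, that compound names join given names with "and" or "&" and that the final word is a shared family name; the `split_compound_name` helper and the sample name are illustrative only, and real customer data will need more rules.

```python
import re

def split_compound_name(key, name):
    """Split 'First1 and First2 Family' into one (sub_key, name) pair per person.

    Assumption: the last word of the compound name is the shared family name.
    Sub-keys follow the 'key-a', 'key-b' pattern used in the text.
    """
    parts = re.split(r"\s+(?:and|&)\s+", name.strip(), flags=re.IGNORECASE)
    if len(parts) == 1:
        # Not a compound name: return the record unchanged.
        return [(str(key), name.strip())]
    family = parts[-1].split()[-1]
    result = []
    for i, part in enumerate(parts):
        suffix = chr(ord("a") + i)
        # Append the family name only when the part does not already end with it.
        full = part if part.split()[-1] == family else f"{part} {family}"
        result.append((f"{key}-{suffix}", full))
    return result

print(split_compound_name(901, "John and Mary Smith"))
```

Each split record would normally carry over the original postal address and Tax ID, so the matching rules can then compare individuals rather than compound names.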

Keys 411 – 513 are also examples of a non-duplicate data relationship commonly referred to as a family household, where multiple distinct customers are linked for having the same family name and the same postal address. This relationship is useful in marketing programs that target family units (e.g. vacation packages, mobile phone plans) or that target the head of a household (i.e. customers making purchasing decisions).
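As a rough illustration, a family household link could be derived by grouping records on family name plus postal address. The sketch below is an assumption-laden example: the sample records, the small suffix list and the simple address normalization are all hypothetical, and production matching would rely on proper name parsing and address standardization.

```python
from collections import defaultdict

# Hypothetical generational suffixes to ignore when finding the family name.
SUFFIXES = {"jr", "sr", "ii", "iii"}

def family_name(name):
    """Return the last name token, lowercased, skipping generational suffixes."""
    tokens = [t.strip(".,").lower() for t in name.split()]
    tokens = [t for t in tokens if t and t not in SUFFIXES]
    return tokens[-1] if tokens else ""

def family_households(records):
    """Group record keys by (family name, normalized address); keep groups of 2+."""
    groups = defaultdict(list)
    for rec in records:
        address = " ".join(rec["address"].lower().split())
        groups[(family_name(rec["name"]), address)].append(rec["key"])
    return {k: v for k, v in groups.items() if len(v) > 1}

# Hypothetical records, invented for illustration.
sample = [
    {"key": 1, "name": "Tom Hill Sr.", "address": "12 Main St"},
    {"key": 2, "name": "Tom Hill Jr.", "address": "12  main st"},
    {"key": 3, "name": "Sue Park", "address": "12 Main St"},
]
print(family_households(sample))
```

Note that the grouping links the records without merging them: the members of a household remain distinct customers.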

 

A common family name and street name can combine to present an additional challenge:

image

Do you think that any are duplicates of the same customer(s) or relate to the same family household(s)?

 

In closing, please carefully consider the following groups of records:

image

Do you think that any are duplicates of the same customer(s)?

Keys 711 – 744 (as well as Keys 321 – 513) are also examples of a non-duplicate data relationship commonly referred to as a geographic household, where multiple distinct customers are linked for having the same postal address. This data relationship is useful in mass mailing programs that benefit from the cost savings of eliminating redundant deliveries to the same postal address.
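A small sketch of the mailing use case, assuming a geographic household is simply the set of records sharing a normalized postal address. The sample records and helper names are hypothetical, and the normalization here (lowercasing and collapsing whitespace) is far simpler than real address standardization.

```python
def normalize_address(address):
    """Lowercase and collapse whitespace; a stand-in for real address standardization."""
    return " ".join(address.lower().split())

def geographic_households(records):
    """Map each normalized postal address to the keys of the records found there."""
    households = {}
    for rec in records:
        households.setdefault(normalize_address(rec["address"]), []).append(rec["key"])
    return households

# Hypothetical records, invented for illustration.
sample = [
    {"key": 1, "name": "Ana Cruz", "address": "40 Lake Dr Apt 2"},
    {"key": 2, "name": "Raj Patel", "address": "40 lake dr  apt 2"},
    {"key": 3, "name": "Lee Wong", "address": "40 Lake Dr Apt 3"},
]

households = geographic_households(sample)
# One mail piece per address instead of one per customer record.
mail_pieces_saved = len(sample) - len(households)
print(households, mail_pieces_saved)
```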

What other data metaphors can you think of that would illustrate the challenge of false positives?

What other meaningful data relationships can you think of that may be revealed by false positives?

 

 

In Part 4 of this series, we will discuss recommendations for documenting your business rules, setting realistic expectations about the first iteration of application development, and guidelines for the necessary collaboration of the business and technical teams throughout the entire project.

 

 


Reader Comments (2)

You are still going strong, Jim.

I would like to focus a bit on the term: Business Rule.

As I also commented on the earlier articles, the diversity of international data makes defining business rules more challenging. Very often I deal with data from 3 different countries: Sweden, Norway and Denmark. The different traditions, rules and reference data availability make rule setting different here.

Some examples:

• In Denmark every address has a house number and if several apartments exist there, there is a distinct way of identifying this. Reference data for verification exists. So you are able to confirm if a geographical household is valid. But in Sweden and Norway it is not tradition to address to the single apartment in high rise buildings, and in less populated parts of Norway you only address to a location. So making household linkage is harder here.
• All 3 countries have a unique ID for each person used in every citizen role. But in Sweden the rules are much more liberal for private companies to store and use the ID, than it is in the other countries. This makes individual deduplication much better in Sweden.

If you look at MDM as the final destination of deduplication you will face the different needs from the various business units and activities: Credit risk, 1-1 CRM, direct marketing, SCM, multiple brands, analytic business intelligence and operational business intelligence.

When you ask in this article whether some rows in a database are duplicates or not, our answer will in the first place be based on a comparison to a picture of the real world. And it seems to me that this is also the ruling factor when making the decisions about these business rules in general. But then a great challenge is present:

• You require different levels of confidence between different tasks. In credit risk you tend to be absolutely sure but with direct mailing you can live with false positives and false negatives.
• Most data models, enterprise application deployments and maintaining organisations are not able to store a true picture of the real world and facilitate a differentiated consumption of these data.

Deduplication is much like squaring the circle and calculating Pi. The result could be 22/7 or 3.14159265358979323846264338327950288419716939937510 or something similar.

Jim,
Part 3 done and still more to come . . .can't wait for part 4 as it looks like you'll be getting into the relationship with the "teams" and setting some (realistic?) expectations.

The examples show how different/similar data can be viewed many ways and the potential relationships that can exist, from the close-knit family out to an apartment complex or a university campus.

So far you have kept neutral regarding what the data can represent and for that I applaud you. Hopefully you can maintain your Switzerland-like approach throughout this series :-).

Feb 26, 2009 | Unregistered Commenter Steven Cagan
