Tuesday, Mar 10, 2009

Identifying Duplicate Customers (Part 4)

Expert Data Quality Pro panelist Jim Harris continues his excellent series of articles focusing on the challenges and methods of identifying duplicate customers.

To read all articles in the series, click one of the options: (Part 1), (Part 2), (Part 3), (Part 4), (Part 5)


So far in this series, we have discussed:

  • Why a symbiosis of technology and methodology is necessary when approaching the common data quality problem of identifying duplicate customers
  • How performing a preliminary analysis on a representative sample of real project data prepares effective examples for discussion
  • Why using a detailed, interrogative analysis of those examples is imperative for defining your business rules
  • How both false negatives and false positives illustrate the highly subjective nature of this problem

In this article, Part 4 in the series, we will discuss recommendations for documenting your business rules as well as setting realistic expectations about application development and guidelines for the necessary collaboration of the business and technical teams throughout the entire project.

Business Rule Documentation

The goal of a business requirements document (BRD) is to provide clear definitions of business problem statements that include associated solution criteria. Although your project’s BRD will obviously contain other necessary material, here are a few recommendations for documenting your business rules for identifying duplicate customers:

  1. Include data examples – parts 2 and 3 of this series illustrated the effectiveness of examples for facilitating discussion. They should also be included in the documentation. Data examples convey business rules far better than either concise (but esoteric) statements, or detailed (but verbose) pages of attempted explanation.
  2. Accentuate the negative – although it may sound counterintuitive, it is simply easier to explain something when you don’t like it. Recall your answers to the questions in parts 2 and 3 of this series. When you looked at records that you believed should be considered duplicates, did you feel the need to justify your decision with an elaborate explanation? Compare that with your reaction when you looked at records that you believed should NOT be considered duplicates. This effect is known as “negativity bias,” where bad evokes a stronger reaction than good in the human mind. Just compare an insult and a compliment: which one do you remember more often? Therefore, focus on documenting the rules that identify what is NOT a duplicate customer.
  3. Avoid technology bias – it is often easier to define your business rules before vendor evaluation. Knowing how the vendor’s software works can sometimes cause a “framing effect” where rules are defined in terms of software functionality, framing them as a technical problem instead of a business problem. Remember that all data quality vendors have viable solutions driven by impressive technology. Therefore, focus on stating the problem and solution criteria in business terms.
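One way to picture the first two recommendations is a small table of documented examples, each pairing two records with the expected decision and a reason stated in business terms. The sketch below is only illustrative: the field names (`name`, `postal_code`), the sample records, and the `not_duplicate_rule` function are hypothetical and are not taken from any project or vendor product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RuleExample:
    """One documented data example: two records, the expected decision, and why."""
    record_a: dict
    record_b: dict
    is_duplicate: bool
    reason: str


# Hypothetical examples; a real BRD would use records from the project's own data.
EXAMPLES = [
    RuleExample(
        {"name": "John Smith", "postal_code": "02134"},
        {"name": "John Smith", "postal_code": "90210"},
        is_duplicate=False,
        reason="Same name at different postal codes is NOT a duplicate.",
    ),
    RuleExample(
        {"name": "John Smith", "postal_code": "02134"},
        {"name": "JOHN SMITH", "postal_code": "02134"},
        is_duplicate=True,
        reason="Differences in letter case alone do not separate customers.",
    ),
]


def not_duplicate_rule(a: dict, b: dict) -> bool:
    """Negative rule (assumed for illustration): differing postal codes mean NOT a duplicate."""
    return a["postal_code"] != b["postal_code"]


def contradictions(examples):
    """Return documented duplicates that the negative rule would wrongly exclude."""
    return [
        ex for ex in examples
        if ex.is_duplicate and not_duplicate_rule(ex.record_a, ex.record_b)
    ]


if __name__ == "__main__":
    for ex in EXAMPLES:
        label = "duplicate" if ex.is_duplicate else "NOT a duplicate"
        print(f"{ex.record_a} vs {ex.record_b}: {label} ({ex.reason})")
    print("Contradictions:", contradictions(EXAMPLES))
```

Keeping the examples as data means the same list can appear in the BRD for business review and later be replayed against whatever matching criteria the technical team implements.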

Application Development Expectations

Too many data quality initiatives fail because of lofty expectations, unmanaged scope creep, and the unrealistic perspective that problems can be permanently “fixed” as opposed to needing eternal vigilance. Here are a few recommendations for setting realistic expectations for application development:

  1. Plan for multiple iterations – in order to be successful, application development must always be understood as an iterative process. ROI will be achieved by targeting well-defined objectives that deliver small incremental returns and build momentum toward larger success over time. Projects are easy to start, even easier to end in failure, and often lack the decency of failing quickly. As with any complex problem, there is no fast and easy solution for data quality.
  2. Prepare for more reviews – review of preliminary data analysis was used to help discuss and document your business rules. Additional reviews are necessary during application development in order to refine the matching criteria before implementation. Also, most implementations will include logic for identifying scenarios of uncertainty that require manual review.
  3. Focus on the data – every vendor’s software has some way to rank match results (e.g. numeric probabilities, weighted percentages, confidence levels). Ranking is often used as a primary method in differentiating the three possible result categories: (1) automatic matches, (2) automatic non-matches, and (3) potential matches requiring manual review. Although this functionality is necessary, it can sometimes be a distraction when reviewers become obsessed with ranking to the point that they actually ignore whether or not records have been properly categorized. First and foremost, focus on the data (e.g. are the “automatic matches” truly duplicates?). Modifying matching criteria is where science meets art. Perform trending analysis on the effects caused by changing criteria to guard against doing more harm than good. I have used software from most of the Gartner Data Quality Magic Quadrant and many of the so-called niche vendors. Without exception, I have always been able to obtain the desired results by staying focused on the data.
  4. Perfection is impossible – it doesn’t matter if your vendor’s match algorithms are deterministic, probabilistic, or even supercalifragilistic. The harsh reality is that false negatives and false positives can be reduced, but never eliminated. A relentless quest to find and fix every one of them is a self-defeating cause. Although this is easy to accept in theory, it is notoriously difficult to accept in practice. For example, let’s imagine that your project is processing one billion records and that exhaustive analysis has determined that the results are correct 99.99999% of the time, meaning that incorrect results occur in only 0.00001% of the total data population, or 100 records. Now, imagine conducting a review by explaining the statistics but providing only those 100 exception records. Do not underestimate the difficulty that the human mind has with large numbers. Also, don’t forget about the effect of negativity bias. A focus on exceptions can undermine confidence and prevent acceptance of an overwhelmingly successful (but not completely perfect) implementation.
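The three result categories, the trending analysis, and the exception arithmetic above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not any vendor's method: the 0-to-1 score scale, the threshold values, the sample scores, and the function names are all hypothetical.

```python
from decimal import Decimal

CATEGORIES = ("automatic match", "automatic non-match", "manual review")


def categorize(score: float, match_threshold: float = 0.90,
               non_match_threshold: float = 0.60) -> str:
    """Place a match score (assumed to lie between 0 and 1) in one of three categories."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if score >= match_threshold:
        return "automatic match"
    if score < non_match_threshold:
        return "automatic non-match"
    return "manual review"


def category_counts(scores, match_threshold: float, non_match_threshold: float) -> dict:
    """Count how many scores fall in each category for one set of thresholds."""
    counts = {name: 0 for name in CATEGORIES}
    for score in scores:
        counts[categorize(score, match_threshold, non_match_threshold)] += 1
    return counts


def expected_exceptions(total_records: int, error_rate_percent: str) -> int:
    """Number of incorrect results implied by an error rate given as a percentage string."""
    # Decimal avoids binary floating-point rounding on very small percentages.
    return int(Decimal(error_rate_percent) / 100 * total_records)


if __name__ == "__main__":
    sample_scores = [0.95, 0.91, 0.85, 0.70, 0.40]  # made-up scores
    before = category_counts(sample_scores, 0.90, 0.60)
    after = category_counts(sample_scores, 0.80, 0.60)
    # A simple trend check: how did each category move when the criteria changed?
    for name in CATEGORIES:
        print(f"{name}: {before[name]} -> {after[name]}")
    print(expected_exceptions(1_000_000_000, "0.00001"), "exception records")
```

Comparing the counts before and after a threshold change shows only how records moved between categories. Whether the moved records are truly duplicates still has to be judged by looking at the data itself.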

Team Collaboration

As I explained in Part 1 of this series, successful projects are driven by an executive management mandate for business and technical (i.e. IT) teams to forge an ongoing collaboration. Here are a few recommendations for fostering that collaboration:

  1. Provide leadership – not only does the project require an executive sponsor to provide oversight and arbitrate any issues of organizational politics, but the business and IT must each designate a team leader for the initiative. Choose these leaders wisely. The best choice is not necessarily those with the most seniority or authority. You must choose leaders who know how to listen well, foster open communication without bias, seek mutual understanding on difficult issues, and truly believe it is the people involved that make projects successful. Your team leaders should also collectively meet with the executive sponsor on a regular basis in order to demonstrate to the entire project team that collaboration is an imperative to be taken seriously.
  2. Formalize the relationship – consider creating a service level agreement (SLA) where the business views IT as a supplier and IT views the business as a customer. However, there is no need to get the lawyers involved. My point is that this internal strategic partnership should be viewed no differently than an external one. Remember that you are formalizing a relationship based on mutual trust and cooperation.
  3. Share ideas – foster an environment in which a diversity of viewpoints is freely shared without prejudice. For example, the business often has practical insight on application development tasks, and IT often has a pragmatic view about business processes. Consider including everyone as optional invitees to meetings. You may be pleasantly surprised at how often people not only attend but also make meaningful contributions. Remember that you are all in this together.

In Part 5 of this series: We will discuss topics related to duplicate consolidation, including physical removal vs. logical linkage, techniques for creating a “best of breed” representative record for duplicates, and consolidation vs. cross population where the representative record is used to update duplicates with a consistent representation of the highest quality data available.

 


Reader Comments (1)

The article is very interesting and intuitive, and it addresses customer master data problems.

One of our customers also had painful customer data problems: they had acquired many different companies (as part of a strategic initiative), and these newly acquired companies had customers in common. We had to manage these master data sets in a way that made them easier to handle, and the approach we took was rules-based master data management.

The approach that Jim Harris is suggesting is also one of the better ways to treat master data, especially customer master data, and I really like the way he has presented it in this series of articles.

Mar 12, 2009 | Unregistered Commenter Ranjan MR
