Search the site
Subscribe to Data Quality Pro

 via email            RSS Feed

external resources
« Data Quality for Data Migration - Podcast of interview with migration expert John Morris available | Main | Now live - DQ QuickLink, find data quality materials scattered across the internet literally in seconds »
Tuesday
Aug192008

Need a free data quality pattern analyser to help trap data defects in your business data?


If you are serious about improving data quality you simply cannot live without a data quality pattern analyser in your DQ toolbox. To help our readers kick-start their data quality strategy we are now giving away the first of many free data quality tools.



What is pattern analysis?


Pattern analysis is a technique typically deployed in a data quality assessment as part of a data profiling activity.

It’s primary function is to understand which formats are valid for a particular field and it does this by examining the distribution of character types found within the column.

Take the following part numbers as an example:

102112AB
IO2112aB
0312121cj
213100JJ


A pattern analyser would return something similar to the following results:

nnnnnnAA
aAnnnnaA
nnnnnnnaa
nnnnnnAA


(Where n - number, A - uppercase, a - lowercase).

If we analyse the patterns over several thousand records we can start to uncover the data quality rules for this field as the patterns that occur with the most frequency are typically the valid rule and the ones which occur least frequently are often in error.

Sample of 10,000 records:

nnnnnnAA   98%
nnnnnnaa    1%
aAnnnnaA   0.5%
nnnnnnAA   0.5%


In this situation we could assume that the data quality rule for this attribute is:

"6 number characters followed by two uppercase characters".

Any other patterns invalidate this rule and would require defect resolution.

Why is pattern detection so beneficial?


Detecting pattern values becomes extremely important for a number of reasons:
  1. To help us identify historical defective data that may require cleansing prior to a data migration or integration activity
  2. Defective patterns can be trapped "in-flight" to prevent a defective value being committed to the data store
  3. The naked eye is incapable of identifying issues that only pattern analysis can discover (eg. non-ascii characters)
  4. ETL developers can create standardisation rules for ensuring consistent data transformation into data warehouses/MDM platforms etc.
  5. They provide the rules that help data cleanse analysts complete their tasks

How did we create the data quality pattern analyser?


The process is fairly simple but will vary slightly for each type of database platform you are using. The first product we are giving away works with Microsoft Access 2003 and higher.

Note: Very soon we will show you how to create a pattern analysis tool for embedding directly within Microsoft Excel and Oracle.

Step 1: Create a pattern lookup table

We first mapped the standard ascii character set to letters that indicate their character format.

For example, the letter ‘b’ could be mapped to ‘a’ meaning lowercase standard alphabetical character.

The number ‘5’ could be mapped to ‘n’ meaning number (some companies prefer to map numbers to ‘9’ or 'N' to make the distinction clearer).

We can also create multiple mappings. For instance, if we are not concerned about the case of a text field we could just map all characters to ‘A’ for simplicity,

The way we perform the mappings is to convert the ascii number of a character to our new mapping character as many of the characters are not visible on a standard editor.


Step 2: Write a conversion function


This step will typically require a programmer with average skills to write a function that loops through each character of the string, converting each ascii value to the new mapping.

We coded our first function in Microsoft Access so that you have maximum flexibility. You can now perform pattern analysis on any ODBC compliant database as well as your standard Access database tables.

If you follow the download instructions it will show you how to install and use the tool.


Download the free data quality pattern analyser (Microsoft Access Compatible)


Now you know a bit more about what it does and how we built it, let's download it and see how you can use it on your own data.

Just follow the steps below:



Step 1: Become a registered member of Data Quality Pro


You need to be a registered member of Data Quality Pro and signed into the site, just click on the icon opposite if you are not yet a member.

Step 2: Download the software and instructions

Once you have registered and logged in to the site using the login link (see "Members" to the left) you can visit the software download page by clicking on the icon opposite. Select the pattern analyser tool and download.



The download zip file contains instructions on how to use the tool and the Microsoft Access database has some sample data to demonstrate the concepts and techniques involved.

Over the coming weeks we will be adding new tools and tutorials to the site so you can learn more advanced techniques, make sure you subscribe to our RSS feed or visit regularly to find out more.

We hope you enjoy the tool,

The Data Quality Pro Team.

 

 

 

Reader Comments (1)

Was going through the site. You folks are doing a great job serving the community.. Thanks..

Mar 27, 2009 | Unregistered CommenterPrashanth

PostPost a New Comment

Enter your information below to add a new comment.

My response is on my own website »
Author Email (optional):
Author URL (optional):
Post:
 
Some HTML allowed: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <code> <em> <i> <strike> <strong>