How to transform your ETL tool into a data quality toolkit
Many organisations have some form of ETL (Extract-Transform-Load) tool in their business.
These products have been around for many years and have more than proven their worth for shunting data around the enterprise but it is surprising how many organisations have failed to recognise their potential as a data quality tool.
In this post we will present the case for using ETL tools to help get your data quality initiative off the ground.
We're not claiming they will be a direct replacement for the high-end data quality platforms but they might just give you an interim solution that will help you identify some quick wins that could pay for that more expensive data quality suite you've been struggling to get budget for.
How to transform your ETL tool into a data quality toolkit
In order to determine whether ETL tools can deliver a useful data quality function we need to come up with some form of benchmark. We need to classify the typical data quality tasks a business can undertake and assess whether a typical ETL tool can deliver against these needs.
To help us classify these needs we're going to use the "Data Quality Assessment" framework created by Arkady Maydanchik in the book of the same name (see our online bookstore for an overview).
Aside from being an essential read for any data quality practitioner, this book provides an exhaustive list of data quality rules that organisations should ideally be monitoring on a regular basis.
In short, this provides an excellent framework for our ETL benchmark.
After assessing each rule against the capabilities of known ETL tools we created a simple compliancy rating against each rule. Compliant, partial compliance, non-compliant.
The widget below demonstrates that nearly 70% of all the rules Arkady covers in his framework can be supported by ETL tools. 23% are partially compliant and only 7% are not compliant.
What next?
1. Understand the underlying principles of data quality assessment
Clearly, building a data quality framework with ETL is feasible but you must build out from a robust data quality assessment framework. We recommend the framework developed by Arkady Maydanchik as there is nothing that we're aware of that comes close for depth and completeness.
If you want to learn more, here are some options:
- Read some excellent extracts from the book here on Data Quality Pro
- Buy the book
- Complete the online Data Quality Assessment tutorial provided by Arkady and his new eLearningCurve platform (that we will be reviewing shortly)
2. Trial on high payoff areas of the business
If your organisation has ETL licenses, spare development resources and a reasonable understanding of data quality rules management we can now begin to trial some data quality initiatives.
But where to start? The very place that ETL tools are typically deployed is a great place to go hunting.
I've lost count of how many horror stories we have heard on our sister site Data Migration Pro where teams ignore the need for sound data quality management and data discovery, opting to dive straight into ETL development.
Our advice is to use the ETL tools earmarked for the integration or migration development to instead discover, validate, improve and control the data first.
Already finished your data migration or data integration? Why not use your ETL tool to monitor ongoing data quality health in the new target environment?
There are literally hundreds of scenarios where you can deploy ETL tools as a data quality workhorse in any organisation. Monitoring data quality rules along an entire customer service fulfillment chain for example is a great place to start.
Some of these articles should help inspire you to action:
Lean techniques to help your data quality improvement initiative (Part 1: Time Value Maps)
Lean techniques to help your data quality improvement initiative (Part 2: Little's Law)
3. Focus on creating capital to re-invest into ongoing data quality improvements
At the start of this post we highlighted the fact that long-term, it really does pay to invest in a full-blown data quality platform. The total universe of data quality rules you can manage with specialist tools is clearly greater than those of ETL tools.
What we have prescribed here is a simple "stepping-stone" approach.
By leveraging ETL tools already in your organisation you can target the low-hanging fruit which can provide much needed cash for additional staff and specialist technology to help you expand your data quality initiatives.
Case in point: One telecommunications company in the UK did exactly what was prescribed in this post. They looked at some of their most business critical processes and identified several key processes that were experiencing data quality failures at different stages of the service chain.
They implemented an improvement programme using an ETL tool that they were familiar with and had spare licenses. They modelled data quality rules and business process rules across the full life cycle of expensive fibre provisioning and created a management reporting dashboard that highlighted anomalies.
The cost-savings were dramatic. Not only did the customer retention increase as they eliminated process defects, they slashed scrap and rework costs and increased the amount of spare equipment that would have previously been re-purchased at considerable cost. All of this created significant capital that could be re-invested in other improvement initiatives.
What do you think about using ETL tools in a data quality framework? Have you tried this approach? Do you disagree? Why not share your thoughts below?
Useful Resources
Lean techniques to help your data quality improvement initiative (Part 1: Time Value Maps)
Lean techniques to help your data quality improvement initiative (Part 2: Little's Law)
Data Quality Rules by Arkady Maydanchik - Tutorial 1 of 4: Attribute Domain Constraints
Data Quality Rules by Arkady Maydanchik - Tutorial 2 of 4: Relational Integrity Constraints
Data Quality Rules by Arkady Maydanchik - Tutorial 3 of 4: Rules for Historical Data
Data Quality Rules by Arkady Maydanchik - Tutorial 4 of 4: Rules for Event Histories
Data Quality Rules Tutorial: Rules for State-Dependent Objects by Arkady Maydanchik
Essential data quality skills #1: Information Chain Management
How to create a data quality scorecard
The Service Information Quality Matrix


DQ Techniques
Reader Comments (5)
Creating a ETL tool that has clean and accurate data (data quality) is sometimes a tiring process. Thanks for the nitty gritty on making it possible to achieve data quality within my files and databases. Duplicate and inaccurate data litters databases in many companies, without a program that enables them to clean the data it injures the data flow and BI.
We have built an ETL testing application with Informatica. We can test nearly all test cases with it. We promote susccessful integration tests run in it to the regression testing environment where they are run with each subseqent release. It cut testing time in half, reduces risk in iterative development releases and inproved data and application quality.
Integration of a data quality toolkit with an ETL tool is a powerfull combination. Human Inference now offers .NET and Java connectors that allows integration of DQ functionality into, for example, Talend and SSIS as custom transforms. From the ETL tool you can then: decompose names and addresses into its elements (splitting person names into initials, prefixes, titles, surnames, firstname etc for example), validate and format addresses, match customer records from different sources, etc.
Here is how to do it in Microsoft SSIS http://tinyurl.com/bh8efr using the Melissa Data SSIS Total Data Quality Toolkit. Profiling, Cleansing, Parsing Standardization, Matching and Monitoring
An excellent post, which I think does a lot to make clear the business results that well-targeted DQ efforts can deliver, and the degree to which DQ is a matter of mindset and ongoing commitment, rather than a thing you do once and move on from, or do after the fact (or after the failure).
Crafting your own DQ solution using your ETL tool can be a way to get started but will probably eventually run into the limits of what you can easily implement, maintain and enhance on your own. (Emphasis on "easily"-- there's what's possible, what's reasonable, and what will grow with you.)
The Oracle Warehouse Builder team long since concluded that DQ integration into ETL processes was the right way to go, and since 10.2 a full assortment of data quality features have been available in Warehouse Builder. Of course, this DQ feature set is not free... but since you can also cover a wide range of data profiling, data quality and ongoing data quality monitoring use cases out of the box, without writing your own code, we think it provides good value.
Have a look at what we offer in DQ:
http://www.oracle.com/technology/products/warehouse/pdf/transforming%20data%20into%20quality%20information.pdf
http://www.oracle.com/technology/products/warehouse/pdf/warehouse-builder-11g-data-quality-datasheet.pdf
Also note that as of the Oracle 11g database, all of the DQ features of OWB are present in the standard database installation, meaning that once you're using Warehouse Builder, you can evaluate our DQ offering with no additional preparation. (Developing a real solution and bringing it to production will of course require the license, but at least you can kick the tires and understand the tool and start to get a feel for your own DQ issues.)