In this debate, we provide the case for and against data cleansing and encourage readers to share their experiences and views on this thorny topic.
View From The Data Quality Purists
Many data quality purists would argue that any form of data cleansing is a cost-based activity that sits firmly under the "scrap and rework" banner. This is hard to disagree with: cleansing tools and processes can cost considerable amounts of money to maintain.
As one of the primary goals of a data quality project is to eliminate scrap and rework caused by defective data, it seems counterintuitive that one of the most common activities in data quality improvement projects is itself a wasteful activity.
Philip Crosby, author of "Quality Is Free", expresses this point far more eloquently:
"Quality is free. It’s not a gift, but it is free. What costs money are the un-quality things – all the actions that involve not doing jobs right the first time."
So, because a company is carrying out data cleansing repeatedly, the various activities associated with cleansing could be deemed (by many) to be "anti-quality".
However, if we polled business or IT workers who use data cleansing tools, it's not inconceivable that most would believe they were increasing the quality of the data, dramatically in many cases.
In the eyes of many, data cleansing is definitely a "pro-quality" activity.
The Case for Data Cleansing
There are many instances where the need for data cleansing is irrefutable.
Data Migration: On Data Migration Pro (our sister site), one of the recurring questions we receive is "how should we cleanse data for our migration?". It's clear that without tactical data cleansing most migration projects would fail to meet their objectives. If you have 6 months to migrate a system that is to be decommissioned, you're going to have a tough sell convincing the legacy data sponsor that she needs to put in place initiatives such as root-cause prevention, a cultural shift towards data as an asset, greater stewardship and so on. What you'll typically get is a mad-dash panic to "fix" the data to a point where it can support the migration. Step forward data quality products to help accelerate and control the cleansing process.
Third-Party Data Supply: If your business depends on the supply of third-party data from many sources, it is common to see data cleansing tools in action, scrubbing, matching, standardizing, enriching, augmenting and preparing the data for use within your business. Yes, in an ideal world, every supplier should conform to an agreed standard. Service level agreements should be defined by a cross-party data governance team and regularly monitored with a focus on continual improvement. Try explaining this to a former Eastern Bloc census data supplier who sends their data on magnetic tape (I speak from experience). The fact is that it is often impossible to impose service levels on data suppliers, so cleansing is a fact of life for many who depend on external data.
These are just a couple of simple examples where data cleansing is unavoidable and there are many, many more.
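To make the third-party supply scenario concrete, here is a minimal sketch of the "scrubbing, standardizing and matching" steps applied to an incoming supplier feed. The field names (name, postcode), the sample records and the matching rule are all assumptions made for illustration; real cleansing products use far richer matching logic than an exact key comparison.

```python
import re


def standardize(record):
    """Scrub and standardize one supplier record (hypothetical fields)."""
    # Collapse stray whitespace and normalise capitalisation of the name.
    name = " ".join(record.get("name", "").split()).title()
    # Remove all whitespace from the postcode and upper-case it.
    postcode = re.sub(r"\s+", "", record.get("postcode", "")).upper()
    return {"name": name, "postcode": postcode}


def deduplicate(records):
    """Match records on the standardized (name, postcode) key, keeping the first seen."""
    seen = {}
    for raw in records:
        clean = standardize(raw)
        key = (clean["name"], clean["postcode"])
        seen.setdefault(key, clean)
    return list(seen.values())


# Illustrative feed: the first two rows describe the same organisation.
supplier_feed = [
    {"name": "  acme   ltd ", "postcode": "ab1 2cd"},
    {"name": "ACME LTD", "postcode": "AB12CD"},
    {"name": "Widget Co", "postcode": "zz9 9zz"},
]

cleaned = deduplicate(supplier_feed)
for row in cleaned:
    print(row)
```

Note that every run of this code repairs the same kinds of defect again for each new delivery, which is exactly the recurring cost the purists object to.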
The Case Against Data Cleansing
Cleansing data invariably means resolving defects downstream from the original source. It often means that data can become inconsistent around the business and we create an endless cost-centre where more and more technology is required to cope with the constant flow of data defects. The lag between when the defect was created and when it was resolved through cleansing always has some kind of negative impact.
The alternative is to design error prevention into our systems and information chains. Cut off data defects at source so poor quality information cannot flow around our business, wreaking havoc on our business operations.
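As a rough illustration of cutting off defects at source, the sketch below rejects a defective record at the point of capture rather than repairing it downstream. The record fields, the validation rules and the names validate_customer and capture are hypothetical, chosen only to show the idea; the email pattern is a deliberately loose check, not a full address validator.

```python
import re

# Loose, illustrative pattern: something@something.something with no spaces.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


class ValidationError(ValueError):
    """Raised when a record fails the entry rules."""


def validate_customer(record):
    """Return the record unchanged if it passes every rule, else raise."""
    errors = []
    if not record.get("name", "").strip():
        errors.append("name is required")
    if not EMAIL_PATTERN.match(record.get("email", "")):
        errors.append("email is malformed")
    if errors:
        raise ValidationError("; ".join(errors))
    return record


def capture(record, store):
    """Only validated records are allowed into the store."""
    store.append(validate_customer(record))


store = []
capture({"name": "Jane Doe", "email": "jane@example.com"}, store)

try:
    capture({"name": "", "email": "not-an-email"}, store)
except ValidationError as exc:
    rejected_reason = str(exc)
    print("Rejected at source:", rejected_reason)
```

The defective record never reaches the store, so there is nothing to cleanse later. The trade-off is that the person or system entering the data has to correct it immediately.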
Back to Reality
Obviously, in an ideal world data cleansing would not be required.
But we don't live in an ideal world.
We live in a world full of complex data silos, massive data volumes, ageing legacy systems, political turf wars, endless mergers and acquisitions, budget cuts, skills shortages and a relatively immature data quality profession compared to other disciplines.
So, back to the original topic of this post: Is Data Cleansing Now a Reality for the Data-Driven Business?
Is it simply a business reality that given our insatiable desire to...
- Store exponential volumes of data
- Integrate across disparate systems and sources
- Create ever more complex, data-driven services
- Rely on greater volumes of 3rd party information
...we will always need a comprehensive data cleansing capability in the modern data-driven organisation?
Is the rate of data growth such that we will never have enough resources to manage data in a "total quality management" or Six Sigma fashion?
Posted 01 October 2011
Interestingly enough, my next blog post is going to be on the business need for Data Governance, in which I will talk about the benefit of data cleansing. To me they come hand in hand.
As you say, in an ideal world we would not need data cleansing. To many it's lipstick on a pig: we should sort the pig out. But that is an idealistic view. As an example:
We live in an integrated world, with businesses sharing data constantly. It is impossible (however much we wish it wasn’t) to have an overarching Data Governance practice that ensures every business interacting with another follows standard data quality procedures. They are all different, working to different aims and agendas. Take a company that relies on prospect data from business partners, and then acts as an 'agency' for those business partners in selling to those prospects. That 'agency' has hundreds of business partners, all with different transactional systems and communication capabilities. Without the ability to cleanse the data coming in from each of those business partners, the agency will have a significantly reduced opportunity to sell, the cost of which would be crippling. You can say that the agency should work with the business partners to get the data cleansed at source, and as best practice they would. However, when those business partners are massive institutions, it would prove very difficult to get them to change.
Data cleansing is not always a one-off activity. It is still key to business success and data governance, and no matter how much Six Sigma you throw at it, it will be around for a while to come.
Nov 13, 2009 | Charles Blyth