DEBATE: Is Data Cleansing a Reality for the Data-Driven Business?

The use and benefits of data cleansing can often polarise the data quality profession.

For every practitioner who has witnessed the benefits of reducing data defects with cleansing technology, there is another who views cleansing as firmly outside the scope of best-practice data quality management.

In this debate, we provide the case for and against data cleansing and encourage readers to share their experiences and views on this thorny topic.

View From The Data Quality Purists

Many data quality purists would argue that any form of data cleansing is a cost-based activity that sits firmly under the “scrap and rework” banner. This is hard to disagree with: cleansing tools and processes can cost considerable amounts of money to maintain.

As one of the primary goals of a data quality project is to eliminate scrap and rework caused by defective data, it seems counterintuitive that one of the most common activities in data quality improvement projects is itself a wasteful activity.

Philip Crosby, author of “Quality Is Free”, expresses this point far more eloquently:

“Quality is free. It’s not a gift, but it is free. What costs money are the un-quality things – all the actions that involve not doing jobs right the first time.”

So, when a company carries out data cleansing repeatedly, the activities associated with it could be deemed (by many) to be “anti-quality”.

However, if we polled the business or IT workers who use data cleansing tools, it’s likely that most would believe they were increasing the quality of the data, dramatically in many cases.

In the eyes of many, data cleansing is definitely a “pro-quality” activity.

The Case for Data Cleansing

There are many instances where the need for data cleansing is irrefutable.

Data Migration: On Data Migration Pro (our sister site), one of the recurring questions we receive is “How should we cleanse data for our migration?” It’s clear that without tactical data cleansing most migration projects would fail to meet their objectives. If you have six months to migrate a system that is to be decommissioned, you’re going to have a tough time convincing the legacy data sponsor that she needs to put in place initiatives such as root-cause prevention, a cultural shift towards data as an asset, greater stewardship and so on. What you’ll typically get is a mad dash to “fix” the data to the point where it can support the migration. Step forward data quality products to help accelerate and control the cleansing process.

Third-Party Data Supply: If your business depends on the supply of third-party data from many sources, it is common to see data cleansing tools in action, scrubbing, matching, standardizing, enriching, augmenting and preparing the data for use within your business. Yes, in an ideal world, every supplier would conform to an agreed standard. Service level agreements would be defined by a cross-party data governance team and regularly monitored with a focus on continual improvement. Try explaining this to a former Eastern Bloc census data supplier who sends their data on magnetic tape (I speak from experience). The fact is that it is often impossible to impose service levels on data suppliers, so cleansing is a fact of life for many who depend on external data.
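
To make the scrubbing, standardizing and matching steps a little more concrete, here is a minimal sketch in Python. It is an illustration only, not a description of any particular cleansing product: the record layout, the COUNTRY_ALIASES lookup and the function names are all assumptions invented for this example.

```python
import re

# Hypothetical lookup mapping supplier spellings to one canonical country code.
COUNTRY_ALIASES = {
    "uk": "GB",
    "united kingdom": "GB",
    "great britain": "GB",
    "gb": "GB",
}


def standardise(record):
    """Return a cleaned copy of one supplier record (a dict of strings)."""
    # Scrub: collapse runs of whitespace, trim, and normalise capitalisation.
    name = re.sub(r"\s+", " ", record.get("name", "")).strip().title()
    # Standardise: map known aliases to a code, otherwise just upper-case.
    country_raw = record.get("country", "").strip().lower()
    country = COUNTRY_ALIASES.get(country_raw, country_raw.upper())
    return {"name": name, "country": country}


def deduplicate(records):
    """Match on (name, country) and keep the first record seen for each key."""
    seen = set()
    unique = []
    for rec in map(standardise, records):
        key = (rec["name"].lower(), rec["country"])
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


# Made-up feed containing the same company supplied in two different formats.
supplier_feed = [
    {"name": "  acme   ltd ", "country": "UK"},
    {"name": "ACME LTD", "country": "united kingdom"},
    {"name": "Globex", "country": "de"},
]
cleaned = deduplicate(supplier_feed)
```

Real matching is usually fuzzier than the exact key comparison used here, which is one reason dedicated tools exist for the job.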

These are just a couple of simple examples where data cleansing is unavoidable and there are many, many more.

The Case Against Data Cleansing

Cleansing data invariably means resolving defects downstream from the original source. It often means that data can become inconsistent around the business and we create an endless cost-centre where more and more technology is required to cope with the constant flow of data defects. The lag between when the defect was created and when it was resolved through cleansing always has some kind of negative impact.

The alternative is to design error prevention into our systems and information chains. Cut off data defects at source so that poor-quality information cannot flow around our business, wreaking havoc on our operations.
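
As a rough illustration of what prevention at source can look like, the sketch below rejects a defective record at the point of capture instead of letting it through to be cleansed later. Everything in it is an assumption made for the example: the Customer record, the deliberately simple email pattern and the capture function are hypothetical, and a real system would apply whatever rules its own data owners have agreed.

```python
import re
from dataclasses import dataclass

# Deliberately simple pattern for illustration; not a full email validator.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


class ValidationError(ValueError):
    """Raised when a record fails a rule at the point of entry."""


@dataclass(frozen=True)
class Customer:
    name: str
    email: str

    def __post_init__(self):
        # The checks run on creation, so a defective record never exists.
        if not self.name.strip():
            raise ValidationError("name must not be blank")
        if not EMAIL_PATTERN.match(self.email):
            raise ValidationError(f"invalid email: {self.email!r}")


def capture(name, email, store):
    """Add a customer to the store only if it passes validation."""
    try:
        store.append(Customer(name=name, email=email))
        return True
    except ValidationError:
        return False


store = []
accepted = capture("Ada Lovelace", "ada@example.com", store)
rejected = capture("", "not-an-email", store)
```

The trade-off is the one the purists point to: the defect is handed back to whoever created it, at the moment it was created, so nothing downstream has to repair it.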

Back to Reality

Obviously, in an ideal world data cleansing would not be required.

But we don’t live in an ideal world.

We live in a world full of complex data silos, massive data volumes, ageing legacy systems, political turf wars, endless mergers and acquisitions, budget cuts, skills shortages and a relatively immature data quality profession compared to other disciplines.

So, back to the original topic of this post: Is Data Cleansing Now a Reality for the Data-Driven Business?

Is it simply a business reality that given our insatiable desire to…

  • Store exponential volumes of data
  • Integrate across disparate systems and sources
  • Create ever more complex, data-driven services
  • Rely on greater volumes of third-party information

…we will always need a comprehensive data cleansing capability in the modern data-driven organisation?

Is the rate of data growth such that we will never have enough resources to manage data in a “total quality management” or Six Sigma fashion?

Should we instead view data cleansing as an inevitable reality of the modern, data-driven business, a strategic and tactical enabler, an innovator, a data quality “workhorse”, a weapon for competitive advantage and a discipline to be understood, developed and increasingly adopted?

Or, in line with so many data quality practitioners, should we strive to eliminate it from the data quality management landscape completely? What are your views?

Please add your comments below.