DEBATE: Is Data Cleansing a Reality for the Data-Driven Business?
The use and benefits of data cleansing can often polarize the data quality profession.
For every practitioner who has witnessed the benefits of data defect reduction via cleansing technology, there are those who view cleansing as firmly outside the scope of best-practice data quality management.
In this debate, we provide the case for and against data cleansing and encourage readers to share their experiences and views on this thorny topic.
Many data quality purists would argue that any form of data cleansing is a cost-based activity that sits firmly under the "scrap and rework" banner. This is hard to disagree with: cleansing tools and processes can cost considerable amounts of money to maintain.
As one of the primary goals of a data quality project is to eliminate scrap and rework caused by defective data it would appear counterintuitive that one of the most common activities witnessed in so many data quality improvement projects is therefore a wasteful activity.
Philip Crosby, author of "Quality Is Free", expresses this point far more eloquently:
"Quality is free. It’s not a gift, but it is free. What costs money are the un-quality things – all the actions that involve not doing jobs right the first time."
So, because a company is carrying out data cleanse repeatedly, the various activities associated with cleansing could be deemed (by many) to be "anti-quality".
However, if we ran a poll across any business or IT workers that used data cleanse tools, it's not inconceivable that most people would believe they were increasing the quality of the data, dramatically in many cases.
In the eyes of many, data cleansing is definitely a "pro-quality" activity.
The Case for Data Cleansing
There are many instances where the need for data cleansing is irrefutable.
Data Migration: On Data Migration Pro (our sister site) one of the recurring questions we receive is "how should we cleanse data for our migration?" It's clear that without tactical data cleansing most migration projects would fail to meet objectives. If you have 6 months to migrate a system that is to be decommissioned, you're going to have a tough sell convincing the legacy data sponsor that she needs to put in place initiatives such as root-cause prevention, a cultural shift towards data as an asset, greater stewardship etc. What you'll typically get is a mad-dash panic to "fix" the data to a point that it can support the migration. Step forward data quality products to help accelerate and control the cleansing process.
Third-Party Data Supply: If your business depends on the supply of third-party data from many sources, it is common to see data cleansing tools in action, scrubbing, matching, standardizing, enriching, augmenting and preparing the data for use within your business. Yes, in an ideal world, every supplier should conform to an agreed standard. Service level agreements should be defined by a cross-party data governance team and regularly monitored with a focus on continual improvement. Try explaining this to a former Eastern Bloc census data supplier who sends their data on magnetic tape (I speak from experience). The fact is that it is often impossible to impose service levels on data suppliers, so cleansing is a fact of life for many who depend on external data.
These are just a couple of simple examples where data cleansing is unavoidable and there are many, many more.
The Case Against Data Cleansing
Cleansing data invariably means resolving defects downstream from the original source. It often means that data can become inconsistent around the business and we create an endless cost-centre where more and more technology is required to cope with the constant flow of data defects. The lag between when the defect was created and when it was resolved through cleansing always has some kind of negative impact.
The alternative is to design error prevention into our systems and information chains. Cut off data defects at source so poor quality information cannot flow around our business, wreaking havoc on our business operations.
Back to Reality
Obviously, in an ideal world data cleansing would not be required.
But we don't live in an ideal world.
We live in a world full of complex data silos, massive data volumes, ageing legacy systems, political turf wars, endless mergers and acquisitions, budget cuts, skills shortages and a relatively immature data quality profession compared to other disciplines.
So, back to the original topic of this post: Is Data Cleansing Now a Reality for the Data-Driven Business?
Is it simply a business reality that given our insatiable desire to...
- Store exponential volumes of data
- Integrate across disparate systems and sources
- Create ever more complex, data-driven services
- Rely on greater volumes of 3rd party information
...we will always need a comprehensive data cleansing capability in the modern data-driven organisation?
Is the rate of data growth such that we will never have enough resources to manage data in a "total quality management" or Six Sigma fashion?
Should we instead view data cleansing as an inevitable reality of the modern, data-driven business, a strategic and tactical enabler, an innovator, a data quality "workhorse", a weapon for competitive advantage and a discipline to be understood, developed and increasingly adopted?
Or, in line with so many data quality practitioners, should we strive to eliminate it from the data quality management landscape completely?
What are your views?
Please add your comments below.
Useful Resources
Bloor Research Complete Market Update With Reports on Data Cleansing/Matching Solutions
Rethinking Data Quality: The Need for a Data Quality Profession
The Future of Information and Data Quality
WANTED: Data Quality Entrepreneurs
Reader Comments (27)
Great debate Dylan.
Interestingly enough, my next blog post is going to be on the business need for Data Governance, in which I will talk about the benefit of data cleansing. To me they come hand in hand.
As you say, in an ideal world we would not need data cleansing; to many it's lipstick on a pig, and we should sort the pig out. But that is an idealistic view. As an example:
We live in an integrated world, with businesses sharing data constantly. It is impossible (however much we wish it wasn’t) to have an overarching Data Governance practice that ensures that every business that interacts with another follows standard data quality procedures. They are all different, working to different aims and agendas. Take a company that relies on prospect data from business partners, and then acts as an 'agency' for that business partner in selling to those prospects. That 'agency' has hundreds of business partners, all with different transactional systems and communication capabilities. Without the ability to cleanse the data coming in from each of those business partners the agency will have a significantly reduced opportunity to sell, the cost of which would be crippling. You can say that the agency should work with the business partners to get the data cleansed at source, and as best practice they would; however, when those business partners are massive institutions, it would prove very difficult to get them to change.
Data cleansing is not always a one-off activity. It is still key to business success and data governance, and no matter how much Six Sigma you throw at it, it will be around for a while to come.
Great point Charles, liked the way you backed it up with a real-life example.
What triggered this post was the recent DM&IQ conference. After witnessing several of the presenters discuss the "evils" of cleanse I wandered the halls and spoke to a number of people from large organisations I deal with or have dealt with in the past. All use data cleanse extensively throughout their businesses so it triggered the debate.
I think it's common sense that root-cause prevention is the preferred option but I'm intrigued to hear people's observations, exactly like yours, as to where they find that cleansing is simply impossible to replace.
Thanks again for your comments Charles, appreciate you kicking off the debate.
One answer to whether data cleansing is pro-quality or anti-quality is whether the benefits to cleansing outweigh the costs. For instance, the retail supply chain has spent hundreds of millions of dollars on a Global Data Synchronization Network to share product data. The initial goal was to reduce the (up to 70%) discrepancies between brand owners and retailers with respect to the data.
The end result is a network that supposedly shares product data in a machine-to-machine fashion, but the data is still bad. Supplier data isn't good and retailers sometimes rekey, so problems persist. Often suppliers are hand-entering that data into spreadsheets and then doing a "machine-to-machine" transfer of that keystroke-error-ridden data.
If a retailer wants to be sure of what they are ordering, they need to make sure that data is good. Some hire consultants to validate that the basic information about a product is accurate so their orders will have the best chance of success. Still others are using product data quality tools to ensure complete, accurate, consistent and normalized data. Without these cleansing activities, thousands of orders could go wrong (some still do!) and millions of dollars will be lost.
The thing is, these activities of cleaning the data have gone on for years - even before GDSN. Several years ago there was only one company showing product data quality tools at the annual big industry supply chain event - U-Connect. Last year it was up to nine. We're finally moving beyond manual-only solutions which should reduce the overall costs and increase the overall benefits.
If the cleansing processes don't yield significant benefits, I don't think you could call them pro-quality. But if they can radically improve operational performance (in this case supply chain operations and on-shelf availability) they are almost assuredly pro-quality.
Great story there Bryan, demonstrating the very real value of data cleansing.
I like your point about ROI. If a cleansing tool generates sustained profits, who are we practitioners to say it doesn't comply with a best-practice view of quality? Profits and customer satisfaction are the primary drivers here.
But then, to play devil's advocate, is the fact that data cleansing tools are being used symptomatic of a poorly designed process that requires improved controls and governance?
Dylan, there is an iterative sequence question around when to do Data Governance and when to do cleansing.
Right now I am engaged in a data management initiative at a public transportation authority. We have bad data and this has been going on for years. A lot of training and guidance has been directed at the bus drivers in order to solve the matter at the root. This will continue, but we will also implement automated correction processes because we can – and because the main task for a bus driver is not operating the onboard computer but bringing passengers safely from point A to B.
Excellent debate Dylan, thanks for kicking this off.
I agree with Charles, data cleansing will be around for some time to come.
I love your point "If you have 6 months to migrate a system that is to be decommissioned you're going to have a tough sell convincing the legacy data sponsor that she needs to put in place initiatives such as root-cause prevention". Clearly, data cleansing is an "evil necessity" in this scenario.
However, in the above scenario, it should be possible to reuse the lessons learnt on the data cleansing project to prevent similar problems occurring in the new system.
Data cleansing is effectively 'bug fixing' or 'defect fixing'. Data Cleansing fixes the symptom, but does not address the cause of the problem. We know that in life, we should learn from our mistakes. There will always be a need for Data Cleansing, but it is an opportunity wasted if the lessons learnt are not applied to prevent the problem recurring in the future.
Rgds Ken
Good morning;
There are two parts to my response to this excellent question.
I tell my clients that data is like water, and the reasons for cleaning it up are the same for filtering the water that comes in to our homes and our bodies. Now, if water was the only 'ingredient' in a healthy information environment, they would be home free. The fundamental truth is that while data comes in many forms, pure data creates robust information. It is also true that while there are an infinite number of frequencies between middle C and D, there are conventions necessary to tune the orchestra, which leads to my next point.
The second thing I tell my clients is that, in spite of appearances, all 'languages' created by human beings (as opposed to, say, birdsong) have familiar patterns, not only of grammatical and syntactical use but in their elemental nature. I use a phrase called 'Quantum Semantics' to invoke a sense of all individual pieces of data belonging to one of a fixed number of element classes, which combine to form all of the meaningmatter - past, present and future. It turns out, according to my research and application of the theory, that there are only nineteen.
Like all of the laws of physics, the laws of Quantum Semantics are invariant: they apply equally and everywhere. This assertion begins to solve one of the more intractable challenges with data cleansing: the fact that it never ends. With a reliable framework in place we can confidently say that, while we will be filtering our fundamentals for as long as we have human beings in the mix, at least we can predict how the elements behave.
John O'
Thanks for starting up another great debate Dylan,
I have blogged about the Reactive (i.e., data cleansing) vs. Proactive (i.e., defect prevention) debate by describing what is necessary as Hyperactive Data Quality.
Proactive data quality is the best practice. Root cause analysis, business process improvement, and defect prevention will always be more effective than the endlessly vicious cycle of reactive data cleansing.
However, a data governance framework is necessary for proactive data quality to be successful. Patience and understanding are also necessary. Proactive data quality requires a strategic organizational transformation that will not happen easily or quickly.
The unavoidable reality is that reactive data quality will occasionally be a necessary evil that is used to correct today's problems while proactive data quality is busy trying to prevent tomorrow's problems.
Therefore, data cleansing is NEVER going away. As a profession, we have to stop trying to wish it away, berating people for doing it, and advocating idyllic frameworks that exclude it.
Just like any complex problem, data quality has no fast and easy solution. Fundamentally, a hybrid discipline is required that combines proactive and reactive aspects into a realistic and achievable best practice.
Best Regards,
Jim
Data Cleansing is a reality for any business, from vendor list management and spare parts maintenance to purchasing requisitions, engineering data, etc. The discussion needs to focus on the processes and metrics that standardize the definition of data quality, including completeness, accuracy and data structure.
As for the "data quality purists" argument that data cleansing is a cost-based activity sitting firmly under the "scrap and rework" banner: cleansing shouldn't happen after the non-cleansed data has been moved from one system to another, starting with the engineering bill of material, purchasing system or maintenance system. This never-ending cycle of data corruption will never be corrected without standardized processes in place, especially to cleanse legacy data. Don't forget that cleansing legacy data will typically affect physical inventory, storeroom setup, purchasing contracts, etc., and this is a much more complicated data cleansing effort, which is a true partnership between the business and the cleansing company. The processes and governance to manage data provenance need to be in place for the data cleansing, verification, structuring, completeness review and language translation before the data enters the business systems, so that the business has one master record standardized for use in all systems utilizing the information.
Posted by C.Lwanga Yonke on the IAIDQ Forum:
If data-driven businesses are ever to become success-focused businesses, I believe they must relentlessly focus on process improvement, so a data set that is cleansed once never has to be cleansed again.
Paraphrasing Tom Redman, if you wanted clean water from a lake that is continually polluted by upstream factories, would you just build a water treating plant? Or would you also vigorously attempt to eliminate the sources of pollution upstream by working with the factories?
Data cleansing is useful and important. However, if it's not closely followed or preceded by process improvement to eliminate the causes of the dirty data, then it is just a recipe for ultimate failure in my book.
World class maintenance management calls for 35% of efforts invested in corrective maintenance (repairs), and 65% invested in preventive and pro-active activities. Why not adopt a similar benchmark for data quality?
Of course, the solution involves smart governance, horizontal thinking and management between data suppliers and data consumers, working across organizational boundaries etc. Similar problems have been successfully tackled in Manufacturing. Many organizations are doing the same for data quality. So this is not theory.
Is it easy? No. But it is cheaper than cleansing the same data over and over and over and over and…
Data quality has always been a need in places where IT supports, or rather elevates, business. Every business needs to take informed decisions, and data is the key to that. I have been thinking for a long time about the need for data quality checks at the data capture stage, say a Point of Sale system in a retail application or, as Jacqueline points out, in BOM/purchasing systems in a manufacturing setup.
I read a research report by GXS and AMR which suggests that close to 40% of data feeding ERP systems comes from external systems!!! Just imagine the amount of ambiguity this would cause in terms of data format and quality, among others.
To keep it simple, it is all about moving data quality UP in the ‘information value chain’, i.e. into operational systems. Traditionally this has meant cleaning data (ETL process) in a data warehouse, with operational systems continuing to engulf data of sub-standard quality. This is in fact what MDM professes: a ‘Single Version of Master Entity’ that is accessed by all upstream/downstream systems. With MDM, all systems, be they operational or analytical, view the same version of data.
@Jim - Valid points, proactive is a no-brainer, we all instinctively preach and teach this but I think many organisations simply have no option but to opt for your hybrid approach until they mature.
@Jacqueline - Thanks for your comments, great point about partnership, just because we're cleansing doesn't mean it shouldn't be a biz/IT effort, governance is also critical for cleansing as it is in root-cause prevention, all part of the same journey I guess.
@Lwanga - Agree with all your points. However, let me play devil's advocate: we so often cite the manufacturing systems metaphor, but this is a closed system. When we design a car production line the parts and materials are generally fixed; in data "manufacturing" systems this is increasingly not the case. The rate at which new data "products" flow into our business systems is growing rapidly, so do the same rules apply? Are we seeing so much downstream cleanse not just because of poor practices but because data is simply far more complex and changeable than the closed systems we see in manufacturing? I don't have the answer but I do value your opinion.
@Satesh - Interesting stats about ERP, kind of re-affirms my previous point, is this massive influx of external data making it virtually impossible for us to pre-empt all the data quality rule violations that we will encounter? Of course many companies don't have any kind of measurement and control but I do feel we're placing all our eggs in the "MDM" basket at times, not convinced that will really eliminate this issue but you make some great points, thanks for the contribution.
Thanks for the feedback, keep it coming...
In the past 20 years I have seen many databases in many sectors, primarily though not all ERP, and many examples of databases in an EDI client/vendor arrangement where they are intended to be in alignment but in truth are disparate and require much remedial work to take them to the point where they are reconciled.
In my humble opinion data integrity for the most part is a neglected area and one that will only come under the spotlight when there is a burning platform that forces the business to address the issue. There are still too many cave managers who still do not understand or appreciate that efforts made to run a clean system pay dividends in terms of the returns of elimination of non value adding activity for staff who interact with the database. Often these benefits are realised two or three stages up or down stream from the cause of pollution and hence invisible to the culprits, just as very real complaints by finance, for example, are often dismissed as white noise.
I have been employed as a specialist to analyse and clean data and, much as it has been financially rewarding, the feeling of futility sometimes is one that draws remarkable parallels to a holiday job as a street cleaner as a student. It looks great for now, but before long, due to no change in behaviour in the greater population, it will deteriorate into chaos once more.
For the past 7 years I have been an advocate that some of the improvement programmes within the greater business should be extended to the business system. Lean has been a mantra around manufacturing and service industries for a number of years now and the workplace organisation tool 5S or 5C as a mindset applied to database integrity along with user education and understanding is a far more appropriate methodology to ensure long term sustainability than employing a specialist at times of trauma or system migration to fix an endemic problem that is driven by poor business processes or poor adherence to good processes.
Data Quality is everyone's responsibility - just like Health & Safety.
Just my 2 cents.
http://en.wikipedia.org/wiki/5S_methodology
Comment posted by Nigel Devenish at the Data Quality Association LinkedIn Group:
Does cleansing data mean clean data, or business-usable-data and what is the difference?
Once you have business-usable-data quality, can an organisation deploy agents to maintain that quality within their systems, or does the cleansing continue like the painting of large bridges?
Comment posted by Sanjib Mallik at the Data Quality Association LinkedIn Group:
Data Cleansing has always been a fundamental action to achieve Data Quality. What is changing is the way we use Data Cleansing processes now.
Historically we used to execute Data Cleansing on large volumes of data in a batch environment to execute business objectives because operational databases were inflexible to accept any changes or corrections to data we have processed through such processes.
Now we find that Data Cleansing can be applied to individual records as they arrive at our databases or sometimes even as Users enter the data at any point of entry.
There is still need for bulk Data Cleansing because sooner or later data from one organization has to be exchanged with data from another and you need processes to harmonize/standardize the data to a standard so data can be compared meaningfully to each other.
This need has not reduced even after several industry standard formats have been created to exchange data between various entities.
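As a rough illustration of the point above, the same standardization rule can serve both a single record at the point of entry and a bulk batch exchanged between organizations. The sketch below is a minimal Python example under stated assumptions: the field names (`name`, `postcode`), the rules and the function names are all hypothetical and are not taken from any particular cleansing product.

```python
import re


def standardize_record(record):
    """Return a standardized copy of one record (hypothetical rules)."""
    cleaned = dict(record)
    # Collapse repeated whitespace and use title case for names.
    cleaned["name"] = " ".join(record.get("name", "").split()).title()
    # Remove all whitespace from the postcode and upper-case it.
    cleaned["postcode"] = re.sub(r"\s+", "", record.get("postcode", "")).upper()
    return cleaned


def standardize_batch(records):
    """Bulk cleansing: apply the same rule to every record in a batch."""
    return [standardize_record(r) for r in records]


# Record-level use, e.g. at the point of entry.
single = standardize_record({"name": "  jane   SMITH ", "postcode": "ab1 2cd"})

# Bulk use, e.g. when harmonizing data received from another organization.
batch = standardize_batch([
    {"name": "ACME  ltd", "postcode": " zz9 9zz"},
    {"name": "jane smith", "postcode": "AB12CD"},
])
```

Because both paths share one rule, a record cleansed on entry and the same record arriving later in a bulk file end up in the same form, which is what makes them comparable.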
Comment posted by Tony O'Brien at the Data Quality Association LinkedIn Group:
Data cleansing is only a supporting element in our quest for quality data and yes, we may have to clean regularly, but it is not an alternative to enhancing our people and processes.
Comment posted by Nigel Devenish at the Data Quality Association LinkedIn Group:
Tony in that enhancement of our people and processes, how valuable would it be that once you have brought the operational data to an acceptable business usage level, that you are able to deploy data maintenance agents within the processes to enforce accurate entry, or check imports of data at source?
Thus attempting to alleviate the constant "bridge painting" exercise and rather making agile changes to the embedded agents to accommodate changes to data sources or business requirements.
Comment posted by Tony O'Brien at the Data Quality Association LinkedIn Group:
Ahh Nigel the devil is in your final phrase....accurate entry & getting it right at source
Posted by Henrik Liliendahl Sørensen at the Data Quality Association LinkedIn Group:
Getting the right data entry at the root is important, and it is agreed by most (if not all) data quality professionals that this is a superior approach compared with doing cleansing operations downstream.
The problem, however, is that most data erodes as time passes. What was right at the time of capture will at some point in time not be right anymore.
Therefore data entry ideally must not only be a snapshot of correct information but should also include raw data elements that make the data easily maintainable.
More here:
http://liliendahl.wordpress.com/2009/11/17/ongoing-data-maintenance
Posted by Sanjib Mallik at the Data Quality Association LinkedIn Group:
The issue comes up because "right data" needs to be clearly defined to the Users so they know what the right data is and also the technology should be an enabler ensuring only "right data" is accepted and stored in the database.
In my experience, we never have the time and budget to develop software that covers every possible hole through which "bad data" can slip in, or to train every new hire accessing the system not to enter such data.
Another important aspect that generates "bad data" is when data is translated and stored as opposed to the original data because over time no one remembers the business rules that guide such translation.
Hence my first rule. Store Data only in its original form and never transformed.
Then, I found we need three continuous processes:
1) Identify gaps in the solution that are introducing "bad data", then initiate and prioritize projects to close them.
2) Identify poorly trained Users who are adding "bad data" despite their best intentions and train them.
3) Execute bulk updates to correct past data where possible to reduce "bad data".
This approach makes sure error rates improve over time and finally level off.
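One possible reading of the "store data only in its original form" rule, sketched in Python: keep the raw value untouched, derive the standardized value on demand from an explicit rule table, and count the values no rule can translate so the error rate can be tracked over time. Everything here is an assumption made for illustration, including the country-code example, the `COUNTRY_RULES_V1` table and the function names.

```python
from dataclasses import dataclass

# Hypothetical translation rules, kept explicit so they are never forgotten.
COUNTRY_RULES_V1 = {"uk": "GB", "u.k.": "GB", "united kingdom": "GB", "gb": "GB"}


@dataclass(frozen=True)
class StoredValue:
    raw: str  # exactly what the user or source system supplied


def country_code(value, rules=COUNTRY_RULES_V1):
    """Derive a standardized code from the raw value; None if no rule matches."""
    return rules.get(value.raw.strip().lower())


def error_rate(values, rules=COUNTRY_RULES_V1):
    """Share of stored values that no current rule can translate."""
    if not values:
        return 0.0
    failures = sum(1 for v in values if country_code(v, rules) is None)
    return failures / len(values)


stored = [StoredValue("UK"), StoredValue(" United Kingdom"), StoredValue("Untied Kingdom")]
codes = [country_code(v) for v in stored]
rate = error_rate(stored)
```

Since the raw values are never overwritten, a corrected or extended rule table can be re-applied to past data later, and the measured rate gives a simple way to see whether errors are falling and levelling off.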
Posted by William Sharp at the IAIDQ LinkedIn Forum:
I think the benefits of it are more clear now than in previous years. Certainly accountability has been driven into organizations more now than ever. And in my opinion accountability aids (heavily) data quality.
The more accountable people are, the more they hold their data accountable.
Posted by Daragh O Brien in the IAIDQ LinkedIn Forum:
What would you prefer to have happen in a hospital? Would you prefer that processes for maintenance, cleaning, handling of supplies etc. were constantly reviewed and improved to ensure hygiene and reduce risk of infections, or would you rather have patients (for example tiny newborns) treated with expensive medications each time they pick up a bug in the hospital, with the risk that the infections might become drug resistant?
Data Cleansing is (to my mind) a shot of antibiotics for your data. Very useful for restoring the function of your organisation to what is expected (controlling the illness). It can be very effective. But it is costly each time you take that approach. And eventually you build up a resistance to the clean up ("what? we need to put how many people on manual review of exceptions?")
Tackling the process and governance issues is the equivalent of hygiene as a preventative measure for infection in hospitals. By understanding how and where your data gets contaminated you can change your process to prevent the infection or at least implement a "clean as you go" culture.
Ultimately you need to do both to cure the patient and prevent future illness, but simply prescribing drugs (illness cleansing) isn't enough to actually control infection... you need governance and hygiene as well.
In relation to Dylan's point about multiple data sources... I would take a leaf from Stephen Covey here and suggest that we should seek to control and change the things that we can control and change.
The governance and hygiene issues that need to be considered then become ones of ensuring that you select good sources of data and manage the interface with each source well (e.g. ensuring APIs are properly coded and kept up to date, or XML or XBRL schemas are properly documented).
Going back to healthcare as an example... hospitals can't control the hygiene of visitors. However they can work to ensure that they make it easy for visitors to cleanse their hands by putting disinfectant gels and reminders in prominent locations as well as working to educate visitors as to the risks of poor hygiene in infection control.
An analogy in IQ would be making sure you have clearly documented the standards and meta-data definitions you require for your data and share that with your data source providers or 3rd party integrators so that they know what your hygiene standards are. The equivalent of the disinfectant gel could be a shareable platform for checking quality against those shared standards.
Japanese auto manufacturers work with their suppliers to ensure that the components they purchase in to build the cars meet the standards required. This breaking down of barriers is a component of lean manufacturing which requires high quality product inputs to be effective. This article ( http://www.qmisolutions.com.au/article.asp?aid=198 ) gives an insight. Just like in hospital infection control, there is a strong emphasis on education of the supplier as to the standard that is required.
So... while the challenges might appear different, there are clear fundamental principles of communication and collaboration to ensure quality (could we tick this off as a Deming-esque "Breaking down barriers"... I think so).
Hi Dylan,
I think that for most customers, data quality is like a journey. It often starts as a reaction to a catastrophic event - unacceptable to the organization e.g. a failed project with major consequences – both financially and emotionally for many individuals. Alternatively a change management event is the catalyst to change the culture and begin the journey. The journey typically starts in reactive mode e.g. a batch process which cleanses data after data enters the application. The reason for this is simply that a batch process is typically easier to set up and is a non-intrusive process.
Preventative data quality requires a culture change as well as a technology shift. People need to take responsibility for data quality – within applications, within business units and across enterprise processes. Technology can help by supporting data quality processes at the point of capture, the exception management process and the monitoring process.
Given the volume of evidence that data quality continues to impact on business processes to the tune of millions of dollars per year per organization, it is clear that real barriers exist preventing organizations from moving up the maturity curve from reactive data cleansing to a more pervasive and proactive data quality status. These barriers include the lack of role based tools for the business owners of the data, lack of a comprehensive approach or strategy for data quality across all data domains (customer, product, asset, financial) and the lack of an ability to manage common data quality rules to support all applications. By reducing these barriers, organizations will be able to implement not just a data cleansing strategy but a data quality management strategy to improve business processes.
So in summary, in my view, in the practical world, data cleansing will continue to be a core part of any end to end data quality process.
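To make the idea of common data quality rules a little more concrete, here is a minimal sketch in Python of one shared rule set being used both at the point of capture and in a reactive batch pass. Everything in it is an assumption for illustration only: the rule names, the record fields (`email`, `country`), the tiny country list and the helper functions are hypothetical and do not come from any particular data quality product.

```python
import re
from typing import Callable, Dict, List, Tuple

Record = Dict[str, str]
Rule = Tuple[str, Callable[[Record], bool]]

# Hypothetical shared rule set: each rule is a name plus a check that
# returns True when the record passes.
RULES: List[Rule] = [
    ("email_present", lambda r: bool(r.get("email", "").strip())),
    ("email_format", lambda r: re.fullmatch(
        r"[^@\s]+@[^@\s]+\.[^@\s]+", r.get("email", "").strip()) is not None),
    ("country_code", lambda r: r.get("country", "").strip().upper()
        in {"GB", "US", "DE"}),  # illustrative list, not a real reference set
]


def failed_rules(record: Record) -> List[str]:
    """Return the names of every rule the record fails."""
    return [name for name, check in RULES if not check(record)]


def validate_at_capture(record: Record) -> Tuple[bool, List[str]]:
    """Preventative use: check a single record before it is saved."""
    failures = failed_rules(record)
    return (not failures, failures)


def batch_audit(records: List[Record]):
    """Reactive use: split stored records into clean ones and exceptions."""
    clean, exceptions = [], []
    for record in records:
        failures = failed_rules(record)
        if failures:
            exceptions.append((record, failures))
        else:
            clean.append(record)
    return clean, exceptions
```

The design point is simply that both the data entry screen and the batch job call the same `failed_rules` function, so a rule changed in one place applies everywhere. The exceptions list from `batch_audit` is what would feed an exception management process.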
@Tommy - great points, thank you for responding and sharing your views.
I guess I am a bit late to this debate. As a data quality practitioner with 20 years of experience I must say first that regardless of the theory of the matter, data cleansing has been a significant component of data quality management in the past. And, I am certain it will remain significant in the foreseeable future. In fact, I am sure more companies would do data cleansing if they could, but unfortunately data cleansing is one of the more difficult technical aspects of data quality management.
I have had this debate on many occasions with various colleagues. I like to use the analogy of the flu. In a flu season, people are recommended to take vaccines and use other preventive measures to avoid getting sick. Those are very important to the overall health of the population. Lots of money is spent on improving these preventive measures. Without them, flu would spiral out of control, claiming more and more victims every year.
However, there is another side to fighting flu - various flu medicines and other forms of medical care, including hospital procedures for those with complications. If we did not have those and all we had were vaccines and preventive measures, people who get flu (and millions do!) would be suffering greatly and many would die! So, to say that flu treatment is a waste of money and effort and is somehow counter-productive would sound ridiculous. At least, not until far superior means of prevention were developed than what we currently possess.
The same is true for data. Data quality problems are like flu - invading our database organisms, hurting and harming them, and spreading from database to database through countless interfaces. Preventive measures are very important. Yet, most of our databases are sick today - many are very sick with serious complications! They need treatment. This will be true until we create a really effective prevention mechanism.
But will we ever create such preventive mechanisms? Though it sounds pessimistic, I seriously doubt it will happen in the near future. The reality is - our data quality continues to deteriorate year after year. This is largely because we are gathering more and more data and propagating it faster and faster to more and more systems we build, and we use the data for more and more purposes. We want "more and faster", but those conflict with "better". So, in my view, even though I spend my days defending the "better" side, I have to recognize the importance of the "more and faster" side, and therefore I do not expect the need for data cleansing to disappear any time soon.
I wrote an article on the subject in the last issue of TDWI's FlashPoint. If you are interested you can read it, and I will certainly post it here in 3 months, as the publisher allows.
Great post and a very lively thread of comments.
There's a fundamental tension on most IT projects between IT and the business. IT tends to be charged with cleaning the data that the LOBs created and (in theory) own. Of course, many people in HR, marketing, sales, or whatever reject this notion at some level.
@Tommy - great point.
"Preventative data quality requires a culture change as well as a technology shift. People need to take responsibility for data quality – within applications, within business units and across enterprise processes. Technology can help by supporting data quality processes at the point of capture, the exception management process and the monitoring process."
Culture is a huge and typically overlooked part of IT projects, particularly as it pertains to DQ. As long as people view DQ as an IT function, either implicitly or explicitly, the organization has little chance of succeeding. Even if IT steps up to the plate, the lack of DG and of newfound awareness means that DQ problems will manifest themselves again, and soon.
Excellent post
While "doing things right the first time" is the way to go, in reality most businesses are dealing with data that they didn't originally create. Acquisitions, purchased lead lists and old systems are the norm in business today. Having free-form text entered by customers is also the norm, and possibly better than a rule-based form.
As an example, we moved from a free-text entry form for new customer leads to a rules-enabled form. Leads immediately dropped by 55% because the extra data cleansing work pushed onto potential customers aggravated them.