Wednesday, Feb 11, 2009

Interview with Kasper Sørensen, creator of DataCleaner

We've recently been reviewing a great Open Source data quality app called DataCleaner, created by a very close supporter of the Data Quality Pro community - Kasper Sørensen from Copenhagen.

Kasper has created both a solid product and a comprehensive online environment surrounding the tool.

In particular, his focus on building the community and social media aspects of the product would put many established vendors to shame.

As this product is free, simple to use and has some of the core data quality functions that many organisations are typically lacking, we've created a detailed tutorial on how to use the tool, to be published tomorrow.

If you're new to data quality management and want to learn some of the functions typically found in more expensive products, it's a great way to get some new skills at no cost.

The tool also integrates well with some of our existing free data quality tools that have already been published here on Data Quality Pro so check out the tutorial tomorrow for more details.

Before publishing the tutorial we caught up with Kasper to find out more about his goals for the DataCleaner project and how the tool has benefited him personally.

 

 

DQPRO: What was your motivation for creating DataCleaner?

Kasper Sørensen: There are two answers to this question.

In short it was based on frustration as well as fascination. I used to work as a software developer for one of the large proprietary BI-vendors who also offer a substantial DQ package.

I eventually came to work as a consultant and experienced two things in relation to data quality:

  1. Most customers didn't really care about DQ because the products were so expensive
  2. Those who did care had a hard time because the tools were so hard to figure out

So I got the notion that something was out of tune with the proprietary solutions in this field.

In my mind high quality data is something that we should all care about as we all benefit from initiatives that improve the quality of data everywhere. It's not just something that the companies who apply DQ strategies benefit from.

For instance, I want my bank to make sure that they have got my customer details right and I want my government to be able to manage the public with greater care and insight.

At the same time I was beginning to form an interest in Open Source software. DataCleaner started as a term paper project I did at the Copenhagen Business School where I wanted to investigate how Open Source communities work and how they attract and motivate contributors.

 

DQPRO: What are your goals for the product long-term?

Kasper Sørensen: I have my own goals for DataCleaner, such as making it the most popular Open Source DQ product and ensuring that it's easy to use so that the learning curve in the DQ-field will improve.

On the other hand I'm also engaged because I think it's interesting and fun to work with. What turns me on about Open Source software development and DataCleaner in particular is the idea that people around the world show interest and some of them even participate directly in the development of the product. So in that sense I think that the greatest goal for me is just to have the most participative and flourishing community to work with.

In terms of features I have some ideas for the future.

As it is, DataCleaner is both a Java framework for DQ and an end-user application. The framework part has been pretty constant for the last few releases and I think that the time is nearing when we do an upgrade to support more advanced types of profiling. What we have now works really well but within some limitations to functionality. For instance, we are currently not able to update or create new data based on profiling results. We also need some technical improvements in order to make a join-testing profile, which is one of the few crucial profiling features I think we are still missing.

 

 

DQPRO: Do you know of any organisations or individuals who are using the product in a business environment?

Kasper Sørensen: Yes, there are a few organizations and individual professionals who have actively proclaimed that they use it. One of our biggest supporters is FAP Europe, who (as far as I know) use it on a daily basis for profiling and validating a lot of incoming data in varying formats before they migrate it into a grand ETL scheme. Developers from FAP Europe have also contributed to DataCleaner with improvements to various features.

Also we have some of the most professional and devoted DQ consultants that I have known listening in on the development mailing lists and they have been great at promoting the tool on the tasks that they work on with different customers. I know there are a lot of consultants that use the tool from time to time like this.

 

DQPRO: When was the product created?

Kasper Sørensen: I started designing and programming the application in November 2007. I handed in my term paper about the project just after the 1.0 release in April 2008.

 

DQPRO: What is your day job?

Kasper Sørensen: I recently finished my master's degree in Business Administration and Business Computing from Copenhagen Business School. I now work with Lund & Bendsen (http://www.lundogbendsen.dk/display/web/Open+source) which is one of Denmark's leading Java-training and qualification companies.

At L&B we are also contemplating opening an Open Source support and customization branch of the company and will of course include DataCleaner as a cornerstone of our product portfolio in this regard, but I can't really tell you much more about that at this point in time.

 

DQPRO: For the uninitiated – what does Open Source actually mean in terms of DataCleaner?

Kasper Sørensen: It means a lot of things.

Of course there is the obvious benefit in that it's free. But I would really like to stress that Open Source software is not so much about licensing as it is about flexibility and the freedom to use, extend and do a lot more exciting things with the product than what is possible with proprietary products.

DataCleaner has specifically been designed to be easy to extend for your own uses. This means that with basic Java skills you can develop your own profiles and validation rules for example.
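
As a rough illustration of what such an extension could look like, here is a minimal sketch of custom validation rules in plain Java. The `ValidationRule` interface and the `NotBlankRule`, `MaxLengthRule` and `ValidationRuleDemo` classes are hypothetical names invented for this example. They are not DataCleaner's actual API, so check the project's own documentation for the real extension points.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical interface for illustration only; NOT part of DataCleaner's API.
interface ValidationRule {
    boolean isValid(String value);
}

// Rejects null, empty and whitespace-only values.
class NotBlankRule implements ValidationRule {
    @Override
    public boolean isValid(String value) {
        return value != null && !value.trim().isEmpty();
    }
}

// Rejects values longer than a fixed number of characters (null passes).
class MaxLengthRule implements ValidationRule {
    private final int maxLength;

    MaxLengthRule(int maxLength) {
        this.maxLength = maxLength;
    }

    @Override
    public boolean isValid(String value) {
        return value == null || value.length() <= maxLength;
    }
}

class ValidationRuleDemo {
    // Returns the values that fail at least one of the given rules.
    static List<String> findInvalid(List<String> values, List<ValidationRule> rules) {
        List<String> invalid = new ArrayList<>();
        for (String value : values) {
            for (ValidationRule rule : rules) {
                if (!rule.isValid(value)) {
                    invalid.add(value);
                    break; // report each value only once
                }
            }
        }
        return invalid;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("Alice", "  ", "Bartholomew-Montgomery");
        List<ValidationRule> rules =
                Arrays.<ValidationRule>asList(new NotBlankRule(), new MaxLengthRule(10));
        System.out.println(findInvalid(names, rules));
    }
}
```

Keeping each rule behind a single small interface is one way to let new checks be added without touching the code that runs them.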

Making more fundamental changes is also quite possible and we of course encourage anyone to do so and contribute it back to the community. It also means that integrating DataCleaner into your other applications and infrastructure is possible at a much finer level where you can "pick and choose" the things in DataCleaner that you want to use in different contexts, e.g. for your ETL-flow, data-entry applications or even simple things like a website – it's all possible because we provide you with the full insight into the application.

 

DQPRO: What new features do you have in store for the tool in the near future?

Kasper Sørensen: Yes I believe that the 1.5 release which is coming up soon will be one of the most important releases in terms of maturity for DataCleaner. Most of the effort for this release has gone into making the application 1) scalable to a "many-millions-of-rows" level, 2) easier to use and 3) more flexible in terms of result export and command-line execution. There are a few new profiles (date mask matcher and regex matcher) also but the main features are cross-cutting to make the application suitable for an enterprise scenario.
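
To give a feel for what a regex matcher profile might compute, here is a minimal sketch that counts how many column values fully match each of a set of named patterns. It is not DataCleaner's implementation: the class name `RegexMatcherSketch`, the method `countMatches`, the pattern names and the sample values are all assumptions made up for this example.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical sketch; NOT DataCleaner's regex matcher profile.
class RegexMatcherSketch {
    // Counts, for each named pattern, how many values match it in full.
    static Map<String, Integer> countMatches(List<String> values,
                                             Map<String, Pattern> patterns) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map.Entry<String, Pattern> entry : patterns.entrySet()) {
            int matches = 0;
            for (String value : values) {
                // Null values never match; matches() requires a full match.
                if (value != null && entry.getValue().matcher(value).matches()) {
                    matches++;
                }
            }
            counts.put(entry.getKey(), matches);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Illustrative sample column with mixed date formats.
        List<String> column =
                Arrays.asList("2009-02-11", "11/02/2009", "Feb 11 2009", "2008-04-01");

        // Illustrative pattern names chosen for this example.
        Map<String, Pattern> patterns = new LinkedHashMap<>();
        patterns.put("iso-date", Pattern.compile("\\d{4}-\\d{2}-\\d{2}"));
        patterns.put("slash-date", Pattern.compile("\\d{2}/\\d{2}/\\d{4}"));

        System.out.println(countMatches(column, patterns));
    }
}
```

A result like this shows at a glance which formats dominate a column and how many values fit none of the expected patterns.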

Other than that I can refer to the roadmap (http://eobjects.org/trac/roadmap) but it's always a work-in-progress thing and all it takes to modify the roadmap is the will of the participants.

 


 

DQPRO: Finally, how have the personal aspects of building an Open Source product benefited your career?

Kasper Sørensen: It has proven to be a great way to promote my skills in software development and Business Intelligence in particular.

I've had the pleasure of speaking and presenting the product and my visions at the Open Source Days conference last year and also at the Business Intelligence group at the Danish IT Society ("Dansk IT").

So in that way it has connected me to a lot of interesting people. The networking value has been essential when looking for a job and getting a number of freelance roles back when I was a student.

 

Next Steps

 

To download the DataCleaner tool just visit the product website here: http://datacleaner.eobjects.org/

To see all our tutorials, including other free software, start by clicking here.

We will be posting a complete DataCleaner tutorial tomorrow. Why not get Data Quality Pro daily tutorials and articles delivered to your inbox?

 
