Open Source DataCleaner gets major update, Human Inference enters Open Source Data Quality Market

Author : Dylan Jones (Community Manager/Editor)
Published February 14, 2011

Editorial Categories:
[TB.1] Data Profiling and Data Quality Assessment, [TB.6] Data Cleansing and Data Matching, [TD.1] Data Quality Technology Briefing,
[TE.1] Supplier Briefing 



Human Inference enter the freemium data quality market with their recent acquisition of eObjects DataCleaner and its founder, Kasper Sørensen.

I recently interviewed Kasper to find out more about the changes to DataCleaner 2.0 and I also spoke with Sabine Palinckx, CEO of Human Inference, to understand why Human Inference are taking such a significant step into the OS data quality arena.






Data Quality Pro: What’s new in DataCleaner 2.0 - is this a minor update or a major revision?

Kasper: Definitely a major shift in what the application can deliver. We basically rewrote the complete application, while of course re-using the good parts from version 1.x. But 2.0 contains so many new features that it’s hard even for me to comprehend.

If I were to point to the major change, it would be the fact that a job in DataCleaner 2.0 can now consist of chained components. We’re aiming to provide pre-processing facilities that target what the DQ user needs, so that they don’t have to turn to a fully-fledged ETL tool for this. In the typical DataCleaner style we’re also trying to keep things very simple, so that processing your data isn’t something that takes hours and days of practice to learn.
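To make the idea of chained components concrete, here is a minimal sketch in Python. It is not DataCleaner's API (DataCleaner is a Java application); the function names `trim_whitespace`, `lowercase_email`, `count_distinct_emails` and `run_job` are hypothetical and exist only to illustrate pre-processing steps feeding an analysis step within one job.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, str]
Transformer = Callable[[Row], Row]


def trim_whitespace(row: Row) -> Row:
    """Hypothetical pre-processing step: strip surrounding whitespace."""
    return {key: value.strip() for key, value in row.items()}


def lowercase_email(row: Row) -> Row:
    """Hypothetical pre-processing step: normalise the email column."""
    cleaned = dict(row)
    cleaned["email"] = cleaned["email"].lower()
    return cleaned


def count_distinct_emails(rows: Iterable[Row]) -> int:
    """Hypothetical analysis step: count distinct email values."""
    return len({row["email"] for row in rows})


def run_job(rows: List[Row],
            transformers: List[Transformer],
            analyzer: Callable[[Iterable[Row]], int]) -> int:
    """Pass every row through the chain of transformers, then analyse."""
    def process(row: Row) -> Row:
        for transform in transformers:
            row = transform(row)
        return row

    return analyzer(process(row) for row in rows)


rows = [
    {"email": " Ann@Example.com "},
    {"email": "ann@example.com"},
    {"email": "bob@example.com"},
]

# The analysis sees the cleaned values, so the two "Ann" rows collapse.
distinct = run_job(rows, [trim_whitespace, lowercase_email], count_distinct_emails)
```

Running the same analysis with an empty transformer list would count the untrimmed, mixed-case value as a separate email, which is the kind of pre-processing gap the chained design is meant to close without a separate ETL tool.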

Data Quality Pro: Based on the new changes, how do you think DataCleaner 2.0 impacts the data quality OS marketplace?

Kasper: Well, I see a tendency to move beyond “just” profiling.

In 2.0 there’s some exciting new themes like matching, transformation, cleansing and exporting data that we’re just starting to explore.

I definitely think that this gives an edge to DataCleaner as it will now reside in a sweet spot between the very complicated data mining tools, the time-consuming ETL tools and the constrained “pure profiling” applications.

In the OS space I feel that DataCleaner definitely has the lead in terms of functionality here.

So in my opinion our main challenge is actually more about getting the message out, because I do see that some competing projects have been better at that.

Data Quality Pro: What’s on the roadmap for DataCleaner 2.0 this year and beyond?

Kasper: Well we’re already starting to plan for DataCleaner 2.1 and the next versions after that. The next items on our schedule are things like:

  • More flow control, including the ability to merge/join sub-flows (this actually works already in the “engine”, and we’re using it internally in HI; we just couldn’t keep up with the features in the UI).
  • Integration with Human Inference’s on-demand platform, where you can get Name cleansing, Address cleansing, Deduplication and more.
  • Support for a different type of analyzers called “explorers”, which will be able to do custom querying and mining in your datastore (i.e. not a part of the main processing flow, but defining its own). This will be used for e.g. metadata comparison and referential integrity checks.
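As an illustration of the kind of referential integrity check mentioned in the last item, the following Python sketch finds child rows whose foreign key has no matching parent. It assumes in-memory lists of dictionaries; the function `find_orphans` and the sample tables are hypothetical and say nothing about how the planned explorers are actually implemented.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def find_orphans(child_rows: List[Row],
                 parent_rows: List[Row],
                 foreign_key: str,
                 primary_key: str) -> List[Row]:
    """Return child rows whose foreign key matches no parent primary key."""
    parent_keys = {parent[primary_key] for parent in parent_rows}
    return [child for child in child_rows
            if child[foreign_key] not in parent_keys]


# Hypothetical sample data: order 11 points at a customer that does not exist.
customers = [{"id": 1}, {"id": 2}]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 3},
]

orphans = find_orphans(orders, customers, "customer_id", "id")
```

A check like this queries two tables on its own terms instead of consuming rows from a single processing flow, which matches the interview's description of explorers as defining their own flow.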

Data Quality Pro: Did you have a particular user type in mind when you redesigned the product?

Kasper: We had in mind that we wanted to retain the user friendliness and simplicity of the application. We also wanted to get rid of some of the restrictions, such as the old “select whether you want to profile, validate or compare” schism, which was bad. Basically we want to provide a workbench that is easy to use for any data quality analysis related task.

The use case that is most in my mind when thinking about DataCleaner’s user interface is the case where you want to do an analysis of your (or your customers’) data without too much hassle. You want it to be fast, reliable and easy to install. You don’t want the customer to complain because he has to change something on his side. Basically you should be able to click a link in a browser, on a machine where you don’t have administrative rights, and off you go. This is achieved through our Java Web Start client and it works like a charm.

Another important thing that differentiates the user experience in DataCleaner from some of its competitors is the fact that we bundle measures together. If you look at, for example, the String analyzer component, it contains 20 measures that all pertain to string analysis. These measures are often related, but of course you cannot know that in advance. A lot of competing products let the user select which particular measures they are interested in, and I think that’s a big mistake because profiling is also exploring - and if you’re only looking at one particular measure then you’re not going to find out about all the related ones. This is why I think DataCleaner is much more friendly to its user, because we don’t expect him/her to know everything up-front.
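The bundling idea can be sketched as a single function that always returns a whole set of related string measures, so the user does not have to pick them one at a time. This is a hypothetical Python illustration, not the String analyzer itself; the measure names below are assumptions and do not reproduce the component's actual 20 measures.

```python
from typing import Dict, List, Optional, Union

Measure = Union[int, float, None]


def analyze_strings(values: List[Optional[str]]) -> Dict[str, Measure]:
    """Compute a bundle of related string measures in one pass over a column."""
    non_null = [v for v in values if v is not None]
    lengths = [len(v) for v in non_null]
    return {
        "row_count": len(values),
        "null_count": len(values) - len(non_null),
        "blank_count": sum(1 for v in non_null if v.strip() == ""),
        "min_length": min(lengths) if lengths else None,
        "max_length": max(lengths) if lengths else None,
        "avg_length": sum(lengths) / len(lengths) if lengths else None,
        "uppercase_chars": sum(1 for v in non_null for c in v if c.isupper()),
        "digit_chars": sum(1 for v in non_null for c in v if c.isdigit()),
        "values_with_whitespace": sum(
            1 for v in non_null if any(c.isspace() for c in v)
        ),
    }


# Hypothetical column with a null, a blank and a value containing a digit.
report = analyze_strings(["Ann", "bob smith", None, "", "X9"])
```

Someone who only asked for the maximum length would miss the blank value and the stray digit in this column; returning the measures together surfaces them without the user knowing in advance to look.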

Data Quality Pro: What prompted Human Inference to get so heavily involved in OS Data Quality?

Sabine Palinckx: As a long-recognized DQ vendor for the enterprise market, we recognize the importance of the SME market, especially if you want to expand into regions other than your home turf. We see that components of our current portfolio are of absolute benefit for the SME market; think about contact cleansing and matching.

We recognized that it is not that easy to transform an enterprise company for that new market. We already had long-term contact with eObjects, and Kasper Sørensen in particular, and in those discussions we recognized that starting a community is not the right thing, but feeding and participating in such a community is what works.

Analysts did position DataCleaner as a very good profiling tool; however, it was lacking cleansing capabilities.

Within our Data Quality Lifecycle we start with our own legendary profiling tool, but saw that DataCleaner had many more capabilities and more execution power. Basically we were already using DataCleaner within our company and saw its strengths in content gathering and filtering, and in data quality analyses. Why use it only internally and not provide/promote it to our existing customer base? Both of us saw an enormous synergy in combining the two.

Data Quality Pro: How do you see your involvement evolving over time?

Sabine Palinckx: We will actively participate in the community so that new releases of DataCleaner and MetaModel will be provided to the community. People can now sign up for active support on these products.

You will definitely see Human Inference, Human Reasoning elements enter in filters and validations.

Some high-end cleansing and filtering will be provided in packages on a more commercial basis, as you see with similar open source products such as Pentaho.

To find out more, visit the open source project area for DataCleaner:

http://datacleaner.eobjects.org



