2012-04-17 : DataCleaner adds data profiling to Pentaho

Today we announce an exciting new partnership with Pentaho, the leading open source Business Intelligence and Business Analytics stack! Over the past few years, Human Inference, members of the DataCleaner community, and Pentaho have been in close contact to design a new data quality package for the Pentaho Suite. DataCleaner plays a key part in this new solution.

DataCleaner’s integration in Pentaho is primarily focused on the open source ETL product, Pentaho Data Integration (aka Kettle). Pentaho and Human Inference will be running a joint webinar on May 10th to tell everyone about all the new features (register for the webinar here), but until then – here’s a summary!

Profile ETL steps using DataCleaner

When working with ETL you often find yourself asking what kinds of values to expect from a particular transformation. With the data quality package for Pentaho we offer a unique integration of profiling and ETL: simply right-click any step in your transformation, select ‘Profile’, and DataCleaner will start up with the data that the step produces, ready for profiling! Not only is this a great feature for Pentaho Data Integration, it is also one of a kind in the ETL space. We are very excited to see this great use of embedding DataCleaner into other applications.

Profile with DataCleaner in Pentaho Data Integration / Kettle
Right-click any step to profile


Execute DataCleaner job

Another great feature in the Pentaho data quality package is that you can now orchestrate and execute DataCleaner jobs using Pentaho Data Integration. This makes it significantly easier to manage scheduled executions, data quality monitoring and orchestration of multiple DataCleaner jobs. Mix and match DataCleaner’s DQ jobs with Kettle’s transformations and you’ve got the best of both worlds.

Execute DataCleaner jobs as part of your ETL flow
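
To give an impression of what executing a DataCleaner job "from the outside" can look like, here is a minimal Java sketch that launches a saved DataCleaner job through the tool's command-line console. The launcher name, the -conf/-job options, the file names and the installation path are assumptions for illustration only, not the exact interface of the job entry shown above.

    import java.io.File;

    // Minimal sketch, assuming a DataCleaner command-line launcher is available on the
    // machine. The script name, the -conf/-job options and all paths below are
    // hypothetical; consult your DataCleaner installation for the exact interface.
    public class RunDataCleanerJob {
        public static void main(String[] args) throws Exception {
            ProcessBuilder pb = new ProcessBuilder(
                    "./datacleaner-console.sh",                  // assumed launcher script
                    "-conf", "conf.xml",                         // assumed configuration file option
                    "-job", "customer-profiling.analysis.xml");  // assumed saved job file option
            pb.directory(new File("/opt/datacleaner"));          // hypothetical installation directory
            pb.inheritIO();                                      // forward DataCleaner's output to this process

            int exitCode = pb.start().waitFor();
            if (exitCode != 0) {
                throw new IllegalStateException("DataCleaner job failed, exit code: " + exitCode);
            }
        }
    }

In the actual integration you would of course use the DataCleaner job entry in Pentaho Data Integration rather than launching a process yourself, so that scheduling, monitoring and orchestration stay within your Kettle jobs.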


EasyDQ integration

Additionally, the data quality package for Pentaho contains the EasyDQ cleansing functions as ETL steps, similar to the ones you already know from DataCleaner.


Deduplication and merging via DataCleaner

In addition to embedding DataCleaner for profiling steps, you can also start up DataCleaner when browsing databases in Pentaho Data Integration. This will create a database connection suitable for more in-depth interactions with the database. For example, you can use it to find duplicates in your source or destination databases.

Detect duplicates in your sources


For more information:

The press release from Pentaho:
Pentaho announces new Data Quality solution

Installation instructions and information from Pentaho:
Pentaho wiki: Human Inference

Example of using the DataCleaner profiler with Pentaho:
Pentaho wiki: Kettle Data Profiling with DataCleaner

Information about the EasyDQ functions for Pentaho:
EasyDQ Pentaho page