2012-04-17: DataCleaner adds data profiling to Pentaho
DataCleaner’s integration with Pentaho focuses primarily on the open source ETL product, Pentaho Data Integration (aka Kettle). Pentaho and Human Inference will run a joint webinar on May 10th to present all the new features (register for the webinar here), but until then, here’s a summary!
Profile ETL steps using DataCleaner
When working with ETL, you often find yourself asking what kinds of values to expect from a particular transformation. With the data quality package for Pentaho we offer a unique integration of profiling and ETL: simply right click any step in your transformation and select ‘Profile’, and DataCleaner will start up with the data that the step produces, ready for profiling. Not only is this a great feature for Pentaho Data Integration, it is also one of a kind in the ETL space. We are very excited to see DataCleaner embedded in another application in this way.

Right click any step to profile
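To make "profiling" a little more concrete, here is a minimal Python sketch of the kind of summary a profiler produces for one column of a step's output. It is only an illustration of the idea, not DataCleaner's or Kettle's API: the `profile_column` helper and the sample rows are hypothetical, and DataCleaner's own analyzers compute far richer metrics.

```python
from collections import Counter


def profile_column(rows, column):
    """Summarise one column of a list of dict rows (hypothetical helper)."""
    values = [row.get(column) for row in rows]
    # Treat None and empty strings as missing values.
    non_null = [v for v in values if v is not None and v != ""]
    counts = Counter(non_null)
    return {
        "row_count": len(values),
        "null_count": len(values) - len(non_null),
        "distinct_count": len(counts),
        "most_common": counts.most_common(3),
    }


# Hypothetical rows standing in for the output of an ETL step.
rows = [
    {"country": "DK", "age": 34},
    {"country": "NL", "age": None},
    {"country": "DK", "age": 51},
    {"country": "", "age": 34},
]

country_profile = profile_column(rows, "country")
age_profile = profile_column(rows, "age")
print(country_profile)
print(age_profile)
```

Even a summary this small answers the "what values should I expect?" question: how many rows are missing a value, how many distinct values there are, and which ones dominate.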
Execute DataCleaner job
Another great feature in the Pentaho data quality package is that you can now orchestrate and execute DataCleaner jobs using Pentaho Data Integration. This makes it significantly easier to manage scheduled executions, data quality monitoring and orchestration of multiple DataCleaner jobs. Mix and match DataCleaner’s DQ jobs with Kettle’s transformations and you’ve got the best of both worlds.

Execute DataCleaner jobs as part of your ETL flow
EasyDQ integration
Additionally, the data quality package for Pentaho provides the EasyDQ cleansing functions as ETL steps, similar to their counterparts in DataCleaner.
Deduplication and merging via DataCleaner
In addition to embedding DataCleaner for profiling of steps, you can also start up DataCleaner when browsing databases in Pentaho Data Integration. This creates a database connection that is suitable for more in-depth interaction with the database. For example, you can use it to find duplicates in your source or destination databases.

Detect duplicates in your sources
For more information:
The press release from Pentaho:
Pentaho announces new Data Quality solution
Installation instructions and information from Pentaho:
Pentaho wiki: Human Inference
Example of using the DataCleaner profiler with Pentaho:
Pentaho wiki: Kettle Data Profiling with DataCleaner
Information about the EasyDQ functions for Pentaho:
EasyDQ Pentaho page
