2012-12-17 : DataCleaner 3.1 is out

Human Inference is happy to announce that DataCleaner 3.1 has been released and is available for download now! With DataCleaner 3.1 we’ve really focused on usability and the day-to-day requirements of both the DataCleaner desktop data profiling application and the web application for continuous data quality monitoring. These are features that we feel really help users do what they want to do. Here’s a summary of what has been done.

Metric formulas – elaborated Data Quality KPIs
It is now possible to build much more elaborate Data Quality KPIs in DataCleaner’s monitoring web application. The user interface allows you to build complex formulas in a spreadsheet-like style, using variables collected by DataCleaner jobs.

Metric formulas can combine any number of metrics, constants and operations, as long as the result can be expressed as a mathematical equation.

For instance, you can measure the rate of duplicate records as a percentage of the total record count, or measure the number of product codes that conform to a set of multiple string patterns.
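
As a rough illustration of the duplicate-rate example, the arithmetic behind such a KPI can be sketched in Python. This is not the formula syntax of the monitoring web application; the names `duplicate_count` and `row_count` are hypothetical stand-ins for metrics collected by a job, and the numbers are made up.

```python
def duplicate_rate_percent(duplicate_count: int, row_count: int) -> float:
    """Return duplicate records as a percentage of all records.

    Returns 0.0 when there are no records, to avoid dividing by zero.
    """
    if row_count == 0:
        return 0.0
    return 100.0 * duplicate_count / row_count


# Hypothetical metric values, for illustration only.
rate = duplicate_rate_percent(duplicate_count=25, row_count=1000)
print(f"Duplicate rate: {rate:.1f}%")
```

The zero-record guard is a design choice of this sketch; how the monitor itself treats an empty result is not covered here.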

Ad-hoc querying – of any datastore
With DataCleaner 3.1 you can now perform ad-hoc queries against any datastore! Queries are expressed in plain SQL and can be applied to databases as well as files, NoSQL databases and more, which gives you a truly helpful query mechanism for your discovery and data profiling work.

The query option is also available through a web service to monitoring users with the ADMIN role. The query is provided as an HTTP parameter or POST body, and the result is returned as an XHTML table.

Value matcher – a new analysis option
Often you have a firm idea of which values should be allowed and expected for a particular field. In DataCleaner there’s always been the Value Distribution analysis option, which helps you check your assumptions. In DataCleaner 3.1, though, you have a more precise offering: the Value matcher. This analysis option allows you to specify a set of expected values and then perform a value-distribution-like analysis, specifically to validate the data and identify unexpected values.
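
The idea can be sketched in a few lines of Python. This is only an illustration of the concept, not DataCleaner's implementation: the function name `match_values`, the case-sensitive comparison and the separate null count are all assumptions made for the sketch.

```python
from collections import Counter
from typing import Dict, Iterable, Optional, Set, Tuple


def match_values(
    values: Iterable[Optional[str]], expected: Set[str]
) -> Tuple[Dict[str, int], Dict[str, int], int]:
    """Count each expected value and collect the unexpected ones.

    Assumptions of this sketch: matching is case-sensitive and
    missing values (None) are counted separately.
    """
    # Start every expected value at zero so values that never occur still show up.
    expected_counts = Counter({value: 0 for value in expected})
    unexpected_counts: Counter = Counter()
    null_count = 0
    for value in values:
        if value is None:
            null_count += 1
        elif value in expected:
            expected_counts[value] += 1
        else:
            unexpected_counts[value] += 1
    return dict(expected_counts), dict(unexpected_counts), null_count


# Hypothetical field values, for illustration only.
matched, unexpected, nulls = match_values(
    ["M", "F", "m", None, "F", "X"], expected={"M", "F"}
)
print(matched, unexpected, nulls)
```

Unlike a plain value distribution, the result separates the values you asked for from everything else, so the unexpected ones stand out immediately.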


Copying, deleting and management of jobs
Management of jobs and results in the DataCleaner monitor application has been improved greatly. You can now click a job on the Scheduling page of the monitor and find management options for operations such as renaming, copying, deleting and more. Each operation respects the links to other artifacts in the monitor, such as analysis results and schedules. This means that management of the monitoring repository has become a lot easier and more mature.


Manage data quality history
Sometimes you face situations where you actually want to do monitoring with historic data! You might have historic dumps or backups of databases whose story you wish to tell. You can now analyze this historic data, upload the results to the DataCleaner monitor and, using a new web service, set a historic date for that particular analysis result. This means that your timelines will plot the results at their intended date, even though you may have collected them at a later point in time.
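
The effect on a timeline can be sketched as follows. This is a conceptual illustration only, not the monitor's data model or web service: the class `AnalysisResult` and its fields are hypothetical, and the dates are made up.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class AnalysisResult:
    """Hypothetical stand-in for an analysis result in the repository."""

    name: str
    collected_on: date                   # when the analysis was actually run
    historic_date: Optional[date] = None  # the date the result should represent

    @property
    def timeline_date(self) -> date:
        # Prefer the historic date when one has been set.
        return self.historic_date or self.collected_on


def timeline(results: List[AnalysisResult]) -> List[AnalysisResult]:
    """Order results by the date they represent, not by when they were collected."""
    return sorted(results, key=lambda result: result.timeline_date)


results = [
    AnalysisResult("current database", collected_on=date(2012, 12, 17)),
    AnalysisResult(
        "backup from mid-2011",
        collected_on=date(2012, 12, 18),
        historic_date=date(2011, 6, 30),
    ),
]
for result in timeline(results):
    print(result.timeline_date, result.name)
```

Although the backup was analyzed last, it is plotted first because its historic date is earlier.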

Clustered scheduler support (EE only)
The scheduler of the DataCleaner monitor has been externalized, so that it can be replaced by means of simple configuration. In the Enterprise Edition (EE) of DataCleaner, we provide a clustered scheduler that can load balance and distribute your executions across a cluster of machines.

Single sign-on (SSO) using CAS (EE only)
In the Enterprise Edition (EE) of DataCleaner we now provide a single sign-on option for the monitor application. DataCleaner can now be an integrated part of your IT infrastructure, security-wise as well.

... And a lot more
The above is just a summary. More than thirty issues have been resolved in this release. We have addressed several requests coming from the forums and community, and we encourage everyone to use this medium as a vehicle for change. We’re very happy to see the development of DataCleaner so heavily influenced by the community.