2012-03-28 : DataCleaner 2.5 is out!

Today we announce the general availability of DataCleaner 2.5! This release is the result of months of hard work by the core DataCleaner crew, the EasyDQ group and the community at large.

Let’s get straight to the “What’s new” question. There are plenty of major improvements in this release:

Saving results to disk
With DataCleaner 2.5 you can save, archive and share your analysis results. This is not only a time-saver for those who used to export analysis results manually; it is also a way to improve your methodology around handling profiling results, sharing them with colleagues and archiving historical profiles of your data.

Save results to disk


Saving is implemented so that future versions and/or custom solutions can take advantage of the results and potentially use them for scheduled profiling, data quality monitoring and more.

Data structure transformers
With the rise of Big Data and NoSQL databases come more advanced data structures. In next-generation databases we see key/value pairs and list structures that are cumbersome to deal with in tools built for traditional relational data. To solve these issues DataCleaner 2.5 ships with a new set of “data structure” transformers, which allow you to easily wrap and unwrap structures so that you can get to the parts that you want to analyze or process.

Data structures


The data structure transformers also include parsers and writers for JSON data, which is one of the more common representations of NoSQL data structures.
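
To illustrate the general idea of unwrapping, here is a minimal Python sketch. It is not DataCleaner's implementation and does not use its API; the function names (select_keys, spread_list) and the sample record are made up for illustration. It parses a JSON document, pulls selected keys out of a key/value structure, and spreads a list over a fixed number of columns.

```python
import json


def select_keys(mapping, keys):
    """Pull selected keys out of a key/value structure into flat columns.

    Hypothetical helper for illustration; missing keys become None so that
    every row ends up with the same columns.
    """
    return {key: mapping.get(key) for key in keys}


def spread_list(values, width):
    """Spread a list over a fixed number of columns, padding with None."""
    trimmed = list(values[:width])
    return trimmed + [None] * (width - len(trimmed))


# A made-up record, as it might be stored in a document database.
raw = (
    '{"name": "Jane Doe", '
    '"emails": ["jane@example.com", "jd@example.org"], '
    '"address": {"city": "Copenhagen", "country": "DK"}}'
)

document = json.loads(raw)                          # parse the JSON text
top_level = select_keys(document, ["name", "emails", "address"])
address = select_keys(top_level["address"], ["city", "country", "zip"])
email_1, email_2, email_3 = spread_list(top_level["emails"], 3)

print(top_level["name"], address["city"], email_1, email_2, email_3)
```

After unwrapping, each of the extracted values can be treated as an ordinary column and analyzed like relational data.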

Filters and transformers are now all "Transformations"
Since DataCleaner 2.0 we’ve been pushing the idea of transformers and filters. The strength of these two types of components was evident from a technical perspective, but for the end-user the distinction has proven to distract from the main use case: processing data in a flow of actions. Therefore DataCleaner 2.5 has consolidated these two terms and made them available under a common metaphor for the user: Transformations. This means that users no longer have to look in multiple menus to find the component they are looking for.

New EasyDQ transformations: Merge duplicates and Due diligence check
The EasyDQ on-demand data quality platform team has also been busy. We present to you three new functions and an optional extension for the advanced users.

First is the Merge duplicates transformation. With this transformation you can turn your results from Duplicate detection into merged, golden records! The merge component is designed to handle a hierarchy of criteria when merging, to make sure that criteria such as well-formedness, update date and manual overriding are taken into account.

Secondly, we’ve introduced two services for Due diligence checks. These are transformations that help you validate that the people you are doing business with do not appear on sanction lists covering terrorism, narcotics trafficking and other security threats.

These new features, as well as the other EasyDQ functions, are described in detail in the EasyDQ reference documentation.

Lastly, there's a new extension available, the EasyDQ essentials, which we recommend as a handy extra toolkit for those who want to dive deep into the features of EasyDQ.

Defining datastore properties on the command line
One of the areas that has been heavily strengthened in recent releases of DataCleaner is the command line interface. Using this interface you can set up DataCleaner to execute in all environments, in a scheduled or managed fashion. In DataCleaner 2.5 we’ve also made it possible to override datastore properties from the command line. Why? Because it allows you to reuse the same job on different datastore definitions. If, for example, you are scanning a directory for CSV files and want to run a DataCleaner job on each file, this is the solution for you. Refer to the documentation for further explanation and examples.
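
As a rough sketch of the CSV scenario, the Python snippet below builds one command line per file. Everything specific in it is an assumption for illustration: the executable name, the -job flag, the -D override syntax, the "filename" property and the datastore name "my_csv" are placeholders, so check the documentation for the exact form before using it. The snippet only builds and prints the commands; it does not run them.

```python
def build_commands(csv_files, job_file, datastore_name,
                   executable="datacleaner-console.sh"):
    """Build one command line per CSV file, reusing the same job.

    ASSUMPTION: the executable name, the '-job' flag and the
    '-DdatastoreCatalog.<name>.filename=<path>' override syntax are
    illustrative placeholders, not confirmed by this announcement.
    """
    commands = []
    for csv_file in csv_files:
        # Point the named datastore at the current file (assumed syntax).
        override = f"-DdatastoreCatalog.{datastore_name}.filename={csv_file}"
        commands.append([executable, "-job", str(job_file), override])
    return commands


# Hypothetical file names; in practice these could come from
# sorted(pathlib.Path("incoming").glob("*.csv")).
files = ["incoming/customers_a.csv", "incoming/customers_b.csv"]

for command in build_commands(files, "profile_customers.analysis.xml", "my_csv"):
    print(" ".join(command))
```

Each resulting list could be handed to subprocess.run once the flags have been verified against the documentation.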

Drill to detail information in value distribution results
The Value distribution analyzer now contains a drill-to-detail option, which makes it possible to see the source records behind each value in the distribution. This greatly helps usability when doing exploratory data profiling.
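
Conceptually, drill to detail means keeping the source records alongside the counts. The Python sketch below shows that idea only; it is not how DataCleaner implements it, and the function name and sample records are invented for illustration.

```python
from collections import defaultdict


def value_distribution(records, column):
    """Count the values in a column and keep the records behind each value.

    Hypothetical helper: returns (counts, details), where details maps each
    distinct value to the list of source records that contain it.
    """
    details = defaultdict(list)
    for record in records:
        details[record[column]].append(record)
    counts = {value: len(rows) for value, rows in details.items()}
    return counts, dict(details)


# Made-up sample records.
customers = [
    {"id": 1, "country": "DK"},
    {"id": 2, "country": "NL"},
    {"id": 3, "country": "DK"},
    {"id": 4, "country": "dk"},  # a likely data quality issue
]

counts, details = value_distribution(customers, "country")
print(counts)          # the distribution itself
print(details["dk"])   # drill to detail: the records behind one value
```

Looking up the records behind an unexpected value such as "dk" is the kind of step the drill-to-detail option is meant to speed up.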

Database-specific connection panels
The dialogs for setting up database connections have been enhanced with database-specific connection properties. This makes it a lot easier for the end-user to connect to a database without having to know the details of constructing a connection URL.

Database connection dialog


Database-specific configuration panels have been created for MySQL, PostgreSQL, Microsoft SQL Server and Oracle. Other database types are supported using the traditional way of connecting, as in previous versions of DataCleaner.

Execution and scheduling of DataCleaner jobs using Pentaho Data Integration
Pentaho Data Integration (PDI, a.k.a. Kettle) is an open source ETL product that the EasyDQ and DataCleaner team has interacted with a lot. With the DataCleaner 2.5 release we are announcing that in the next version of Pentaho Data Integration you will be able to execute and schedule DataCleaner jobs using Pentaho’s infrastructure.

Execution in Pentaho Data Integration


While this is not yet available as released software, we are looking forward to telling you more about it in the near future!

For those still reading, we also made some minor improvements in DataCleaner 2.5:
  • We’ve added some number transformations for generating IDs, incrementing numbers and more.
  • Implemented a Date range filter, similar to the Number range and String range filters.
  • Support for matching against Synonym catalogs in the Reference data matcher (previously known as the Matching analyzer).
  • Now all components have flow visualizations in their configuration panel. This feature helps retain the overview when working with large analysis jobs.
  • The sample data (the ‘orderdb’ database) has been reworked to contain better examples of data quality issues.
  • User experience improvements; more elegant dialog designs and trimming of window layout.
We hope you all enjoy the new release of DataCleaner 2.5. Please let us know what you think on the forums, or on our LinkedIn group, or on Google Plus, or on Blogger, or tweet it, or...