2013-05-01 : DataCleaner 3.5 released

We are very proud and happy to present DataCleaner 3.5, which has just been released!

With the 3.x branch of DataCleaner we set forth on a mission to deliver monitoring, scheduling and management of your data quality directly in your browser. And now with the new release, we are building upon this platform to deliver an even richer feature set, a comfortable user experience and massive scalability through clustering and cloud computing.

To be more precise, these are the major stories that we've worked on for the DataCleaner 3.5 release:

Connectivity to Salesforce and SugarCRM

One of the most important sources of data is usually a company's CRM system. But it is also one of the more troublesome data sources if you look at the quality. For this reason we've made it easier to get the data out of these CRM systems and into DataCleaner! You can now use your Salesforce.com or your local SugarCRM system as if it was a regular database. Start by profiling the customer data to get an overview. But don't stop there - you can even use DataCleaner to also update your CRM data, once it is cleansed. More details are available in the brand new focus article about CRM data quality.

Wizards and other user experience improvements

The DataCleaner monitor is our main user interface going forward. So we want the experience to be at least as pleasant, flexible and rich as the desktop application. To meet this goal, we've made many user interface and user experience improvements, amongst others:
  • Several wizards are now available for registering datastores; including file-upload to the server for CSV files, database connection entry, guided registration of Salesforce.com credentials and more.
  • The job building wizards have also been extended with several enhanced features; Selection of value distribution and pattern finding fields in the Quick analysis wizard, a completely new wizard for creating EasyDQ based customer cleansing jobs and a new job wizard for firing Pentaho Data Integration jobs (read more below).
  • You can now ad-hoc query any datastore directly in the web user interface. This makes it easy to get quick or sporadic insights into the data without setting up jobs or other managed approaches of processing the data.
  • Once jobs or datastores are created, the user is guided to take action with the newly built object. For instance, you can very quickly run a job right after it's built, or query a datastore after it is registered.
  • Administrators can now directly upload jobs to the repository, which is especially handy if you want to hand-edit the XML content of the job files.
  • A lot of the technical cruft is now hidden away in favor of showing simple dialogs. For instance, when a job is triggered a large loading indicator is shown, and when finished the result will be shown. The advanced logging screen that was previously there can still be displayed upon clicking a link for additional details.
Distributed execution of jobs

To keep up with the massive amounts of data that many organizations are juggling with today, we had to take a critical look at how we process data in DataCleaner. Although DataCleaner is among the fastest data processing tools, it was previously limited to running on a single machine. For a long time we've been working on a major architecture change that enabled distribution of a DataCleaner job's workload over a cluster of machines. With this new approach to data processing, DataCleaner is truly fit for data quality on big data. More details are available in the documentation section.

Data visualization extension

Data profiling and data visualization do share some common interests - both are disciplines that help you understand the story that your data is telling. There are obviously also some differences, mainly being that data profiling is more targeted at identifying issues and exceptions rather than deriving or measuring business objectives. But confronted with visualization tools we've realized that sometimes there's a lot of profiling value in progressively visualizing data. For instance, a scatter plot can easily help you identify the numerical outliers of your datasets. This idea gave fuel to the idea of a visualization extension to DataCleaner. Therefore DataCleaner now also let's you do basic visualization tasks to aid you in your data quality analysis.

National identifiers extension

A very common issue in data quality projects is to validate national identifiers, such as social security numbers, EAN codes and more. In our commercial editions of DataCleaner, we now offer a wide range of validation components to check such identifiers.

Custom job engines

We've made the ultimate modularization of the DataCleaner monitoring system: The engine itself is a pluggable module. While we do encourage to use DataCleaner's engine as the primary vehicle for execution in DataCleaner monitor, it is not obligatory anymore. You can now schedule and monitor (both in terms of metric monitoring and history management) other types of jobs. For instance, you can provide your own piece of Java code and have it scheduled to run in DataCleaner monitor using the regular web user interface.

Pentaho job scheduling and execution

One major example of a pluggable job engine was introduced that we think deserves special attention: You can now invoke and monitor execution metrics of Pentaho Data Integration transformations. DataCleaner monitor by default ships with this job engine extension which connects to the Pentaho DI server ("Carte") and supervises the execution and result gathering of it. After execution you can track your Pentaho transformations in the timeline views of the monitoring dashboard, just like other metrics. For larger deployments of DataCleaner it may be convenient with dedicated ETL-style jobs in your data quality solution, and with this extension we provide an integration with a leading open source solution for just that. More details are available in the documentation section.

... And a whole lot more!

There's even a lot more to the 3.5 release than what is posted in these highlights. Take a look at the milestone page on the bugtracker for a more thorough listing of improvements made.

A non-functional aspect of DataCleaner is the reference documentation, which we've also done a lot to update. Additionally all the documentation pages now have a commenting feature, so that you can ask questions or provide feedback to the help that is in there. We'll be continuously providing more and more content in the documentation and on the website for you to get the best resources at your hands.

... Stay tuned for more!

On the front page of the DataCleaner website we'll be posting "feature focus" articles in the weeks to come. Please help us spread the word by promoting the release and the articles to your friends, colleagues and whom else might be interested.