2015-10-22 : DataCleaner 4.5 - The capacity to correct

We are happy to announce that DataCleaner 4.5 has just been released, and we would like to tell you about this important milestone for everybody's favourite Data Quality solution.

It really is an important milestone for DataCleaner because with this release we've removed a couple of critical restrictions in the underlying engine that limited its capacity to combine certain types of components in the same job. This change unlocks new potential to deliver an even more powerful Data Quality solution with virtually all the flexibility and ease of use you could ask for.



Let's jump right in. The team implemented more than 200 improvements and bug fixes - here are the most interesting ones...

Output data streams
At the engine and API level we've added the concept of "output data streams", which means that every component can publish streams of data that can be consumed by other components. Users of our API can utilize this feature by implementing the HasOutputDataStreams interface.
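For the developers among you, here's a minimal sketch of what implementing it can look like: a transformer that upper-cases a column and publishes the rows with missing values as a separate output data stream. Class and stream names are illustrative, and details such as package locations may differ slightly from this sketch - the reference documentation has the authoritative example.

    import javax.inject.Named;

    import org.apache.metamodel.query.Query;
    import org.apache.metamodel.schema.ColumnType;
    import org.datacleaner.api.*; // Configured, Transformer, HasOutputDataStreams, ...

    // Illustrative sketch: a transformer that also exposes an output data
    // stream containing the records whose input value was missing.
    @Named("Upper case (with missing-value stream)")
    public class UpperCaseTransformer implements Transformer, HasOutputDataStreams {

        @Configured
        InputColumn<String> column;

        private OutputRowCollector collector;

        @Override
        public OutputDataStream[] getOutputDataStreams() {
            // Declare the stream so that other components can consume it.
            return new OutputDataStream[] { OutputDataStreams
                    .pushDataStream("missing values")
                    .withColumn("row id", ColumnType.INTEGER)
                    .toOutputDataStream() };
        }

        @Override
        public void initializeOutputDataStream(OutputDataStream stream, Query query,
                OutputRowCollector outputRowCollector) {
            // Called when a consumer is attached to the stream in a job.
            collector = outputRowCollector;
        }

        @Override
        public OutputColumns getOutputColumns() {
            return new OutputColumns(String.class, "upper case value");
        }

        @Override
        public Object[] transform(InputRow row) {
            final String value = row.getValue(column);
            if (value == null && collector != null) {
                collector.putValues(row.getId()); // publish on the side-stream
            }
            return new Object[] { value == null ? null : value.toUpperCase() };
        }
    }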

If this sounds too technical for you, just appreciate that this capability underpins the following four features and improvements.

Duplicate detection and merging in the same job
With the major updates we made to the UI in DataCleaner 4.0, it became clear that our users are becoming more and more empowered to take on elaborate tasks with DataCleaner. One of the most frequent limitations we encountered in this respect was that it was not possible to combine two complex tasks, such as duplicate detection and merging, in a single job - yet for experienced users this is a very useful scenario. Using the new data stream originating from Duplicate detection, you can now combine duplicate detection with duplicate merging, or with any other duplicate post-processing step you might have.

Example job containing standardization, duplicate detection, merging and writing


Combine tables and data sources using the Union component
We have added a core transformation function to DataCleaner called 'Union'. Its functionality is comparable to the SQL UNION operation: it appends two or more datasets together as if they were one. In other words, if you have multiple data sources, or just multiple tables, with the same type of content, you can use the Union component to utilize them as if they were one big table.
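As a plain-Java illustration of these append semantics (this is just the concept, not the DataCleaner API): rows from several same-shaped sources are concatenated, and everything downstream sees a single table.

    import java.util.Arrays;
    import java.util.List;
    import java.util.stream.Collectors;
    import java.util.stream.Stream;

    // Concept only: two same-shaped datasets appended and processed as one.
    public class UnionConcept {
        public static void main(String[] args) {
            List<String[]> crmCustomers = Arrays.asList(
                    new String[] { "Jane Doe", "NL" },
                    new String[] { "John Smith", "UK" });
            List<String[]> webshopCustomers = Arrays.asList(
                    new String[] { "Erika Mustermann", "DE" });

            // The Union "appends" the sources; downstream steps see one table.
            List<String[]> allCustomers = Stream
                    .concat(crmCustomers.stream(), webshopCustomers.stream())
                    .collect(Collectors.toList());

            allCustomers.forEach(row -> System.out.println(row[0] + " (" + row[1] + ")"));
        }
    }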

Example job using a Union to perform analysis on multiple customer databases and files


The Union transformation can be used in conjunction with a Composite datastore. That way you can combine data from different data sources such as CSV files, relational databases, ElasticSearch indices or Salesforce.com (to give a few examples).

Check if your contacts have moved or passed away - and update your source - all in the same job.
Via the Neopost family of data quality companies we have integrated several address correction, movers check, deceased check and similar services for specific regions. Currently we cover the United Kingdom, the United States of America, Germany and the Netherlands with such functionality. With DataCleaner 4.5, using these functions has become a lot easier: because they are integrated via output data streams, you can perform the checks, get reports on the results and do the post-processing of the results all in a single job!

Example result-screen report from UK movers, deceased and do-not-mail check.


Process the complete or incomplete records found by the Completeness analyzer
Completeness is one of the major dimensions of data quality, and DataCleaner addresses it with the Completeness analyzer as well as with filtering techniques. In DataCleaner 4.5 the analysis of completeness no longer necessarily ends with the incomplete records. You can now also use the Completeness analyzer as an intermediate step - feeding e.g. the complete or incomplete records into automated post-processing steps.

Connect DataCleaner to its big sister, DataHub
Did you know that DataCleaner is a key architectural piece of the Human Inference/Neopost customer MDM solution, DataHub? DataHub serves the enterprise market for customer MDM and single customer view, and we've improved the integration a lot in this release of DataCleaner - most notably with the DataHub connector, which allows DataCleaner users to seamlessly consume data from, and publish data to, DataHub.

The processing pipeline in DataHub.


Product data components: GTIN, EAN, VIN
We have added a new category of Data Quality functions which revolve around Product data.

New 'Product data' category.

With these functions, and more to come in the future, we are building a suite of ready-to-use components that validate and standardize the use of common industry codes for products in your database.

Component library restructured
The component library structure has been revisited and redesigned so that the menus and the search function are optimized for the tasks at hand. As you can also see from the screenshot above, the Improve category has changed a lot - it now focuses more on specific domains of data and data quality checks.

Secure ElasticSearch connections with Shield
We now support ElasticSearch with Shield-based security. The connection you define for an ElasticSearch index can be reused both as a regular read/write style datastore and for the searches, matching and other ElasticSearch-integrated functions that we provide.

Easy access to the Hadoop Distributed File System (HDFS)
As Hadoop is becoming more and more the system of choice for Big Data projects, we've decided to improve the user experience for analyzing and working with files located on HDFS, the Hadoop Distributed File System.

Browsing HDFS to select your DataCleaner source data.

Now browsing and finding files on HDFS is just as convenient as it has always been on your local machine. It's no secret that the roadmap for DataCleaner involves tighter and tighter integration with Hadoop, and this is our first step to make the Hadoop-DataCleaner experience both effective and pleasant.
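And if you script against the DataCleaner API, an HDFS-based datastore can be set up along these lines (a simplified sketch - the class names follow the 4.5 codebase, and the hostname, port and path are examples only):

    import org.datacleaner.connection.CsvDatastore;
    import org.datacleaner.util.HdfsResource;

    // Sketch: point a CSV datastore at a file on HDFS instead of local disk.
    public class HdfsExample {
        public static void main(String[] args) {
            // Example URL; substitute your own namenode host, port and path.
            HdfsResource resource = new HdfsResource("hdfs://namenode:9000/data/customers.csv");
            CsvDatastore datastore = new CsvDatastore("customers on HDFS", resource);
            System.out.println("Created datastore: " + datastore.getName());
        }
    }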

A new Delete from table component
We have added a component in the "Write" category that deletes records from a table in a datastore. Use it in conjunction with filtering functions to e.g. delete dirty records or the non-survivors found after merging duplicates.

Online component library reference
A lot has been done to further improve our reference documentation. In addition to updated chapters, we've launched the online Component library, which provides a quick way to navigate the documentation at the level of individual components.

We're confident that you will enjoy the improved DataCleaner. Version 4.5 is a major step and we are proud to share it with you!