2011-12-14 : Easy as DataCleaner 2.4!

Merry christmas! Today we announce the release of DataCleaner 2.4, which marks a huge joint effort by the community and the team at Human Inference to bring together the best ideas of both open source and cloud-based Data Quality.

Here's what's new in DataCleaner 2.4:

EasyDataQuality integration

With DataCleaner 2.4 we've made an alliance with the newly launched EasyDQ.com service, which offers cloud-based Data Quality services. The services provided are:
  • Duplicate detection (aka. Deduplication or Fuzzy matching of records), which is free to use for up to 500,000 values.
  • Address data validation and cleansing. This allows you to check if addresses exist, if they are correctly formatted and even to suggest corrections in case you have mistakes.
  • Name data validation and cleansing. With the Name service, EasyDQ does not only format your names consistently, but also checks for misspellings and interprets the name parts.
  • Email and phone validation and cleansing. These services provide checking of email and phone data, making sure that email domains exist, that country codes are correct and much more.
No, these are not open source services, but they are offered at a reasonable price as well as a free starter package, and we thoroughly believe that the integration allows DataCleaner to become a much better tool for those who want it.

New analysis job components

Many of DataCleaner's users have reported that they use DataCleaner as a lightweight ETL tool. This is because we currently support basic reading, transformation and writing capabilities. With 2.4 we've added a few crucial components to add to this use-case where you want to do ad-hoc transformations, data quality checks and actually write the data back to your database.
  • Table lookup which allows you to look up any number of values based on any number of conditions. The lookup component has an intelligent caching mechanism and is highly performant. (Docs).
  • Insert into table is a new option when writing data. With this option we are making it possible for DataCleaner to not only produce new files, but also to insert records into existing databases. That makes it a much more flexible writing option.

MongoDB support! And a few more...

Another theme in DataCleaner 2.4 is support for the popular NoSQL database MongoDB. The support is offered both as a profiling service (eg. reading and analyzing data), but ALSO for writing data to MongoDB collections, using the Insert into Table component, which makes DataCleaner the first open source tool that offers data flow modelling and ETL functionality for MongoDB! We also improved on a few other datastores:
  • Support for MongoDB datastores, which are both readable and writable with DataCleaner. MongoDB uses a schemaless design principle, so you have the choice of either letting DataCleaner auto-detect a virtual schema, or define it yourself. (Docs).
  • Added more configuration options to Fixed width value files. Specifically, there is now the option to specify header line number.
  • Added support for custom table mapping of XML structures. For large XML files this is a recommended approach, since with a fixed table model, DataCleaner can do SAX-based XML parsing which is much less memory intensive and a lot faster. (Docs).
  • The Command Line Interface (Docs) has been further improved, by allowing you to inject job variables from the command line, which makes it possible to parameterize jobs and thereby reuse jobs for different purposes.

Besides these points, a few bugfixes where fixed and some minor features added. For a full list of changes, check out the DataCleaner 2.4 milestone description in trac.

We hope you enjoy DataCleaner 2.4. We built it to be used, so go grab it right away on the downloads page!