2017-07-07 : DataCleaner 5.2.0 released

Finally, after months of waiting we are proud to present you with a new DataCleaner release, 5.2.0!

Both the open source and commercial editions contain these changes:
  • IBM's ICU library used for better Unicode standard compatibility.
  • Hashing component added
  • Fixed Pentaho plugin
  • Extended dictionary matcher (added possibility to ignore diacritic signs)
  • Added scroll bar for value distribution (result) page
  • Made connection lines lighter in desktop graphical job representation
  • Monitor is no longer available for the community edition
Commercial only:
  • Express Data Prep: a brand new wizard on the home screen of DataCleaner desktop. In just a few steps, it allows users to create a complete cleansing job for UK contact data and push it to DataCleaner monitor for repeated execution on data coming in from a hot folder.
  • Monitor: improved handling of Cron expression errors
  • Monitor: improved validation of uploading files
  • Monitor: status of last job execution is now visible on schedule page
  • Monitor: now putting the actual data file in the hot folder can be used as a trigger to start a job, which then uses that data file as source data store
  • Duplicate Detection: updated to newest version, incl. better training performance with low column count.
  • Wizards made pluggable in DC Desktop.
  • Extended license information panel
  • Added Salutation generation transformer.
  • Added Sample select transformer.
  • Packaged new 'Name & Company Correction' component into DC Enterprise edition
  • Fixed Sanction list check component
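
The dictionary matcher option to ignore diacritic signs, listed among the shared changes above, can be illustrated with a small sketch. This is not DataCleaner's implementation: it uses Python's standard `unicodedata` module, and the function names and sample dictionary are hypothetical. It assumes that "ignoring diacritics" means comparing values after stripping combining marks from both the input and the dictionary terms.

```python
import unicodedata
from typing import Iterable


def strip_diacritics(text: str) -> str:
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def in_dictionary(value: str, dictionary: Iterable[str],
                  ignore_diacritics: bool = False) -> bool:
    """Hypothetical lookup: exact match, optionally folding diacritics."""
    terms = set(dictionary)
    if not ignore_diacritics:
        return value in terms
    folded_terms = {strip_diacritics(term) for term in terms}
    return strip_diacritics(value) in folded_terms


cities = {"Zürich", "Málaga", "Århus"}  # sample data, made up for illustration
print(in_dictionary("Zurich", cities))                          # False
print(in_dictionary("Zurich", cities, ignore_diacritics=True))  # True
```

Letters that have no canonical decomposition (for example "ø" or "ł") are left untouched by this approach, so a production-grade matcher would need additional folding rules.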

2016-11-21 : DataCleaner 5.1.5 released

We had a rather bad issue with 5.1.4, so we decided to make a quick release of 5.1.5.

  • Running DataCleaner on OSX could fail.
  • CSV, Excel and staging writers could not edit output column names, and could cause lockups in the desktop edition.
We recommend that all users upgrade to 5.1.5, especially if you're using 5.1.4.

2016-11-08 : DataCleaner 5.1.4 released

Release 5.1.4 incoming!

We have another release for you, this time including these bugfixes:
  • Make requirements and scope buttons update properly when job changes
  • Stability fixes to Union and Coalesce components
  • Stability fix for Excel writer
  • Fix jTDS single-connection-multiple-threads writing exception
  • Save button only enabled when a job is being built
  • Monitor: Metrics from output datastreams available for timelines.
  • Monitor: Bad repository file upload does not delete old repository
  • Monitor: JavaScript transformer now works in monitor
... as well as these nice improvements:
  • Improve job loading and execution startup times
  • Reduce Excel writer validation overhead
  • Improve flexibility of date range filter
  • License overview panel (commercial only)
  • URL Parser results improved
  • Monitor: Improve scheduling page load times
  • Monitor: Old repository backed up when new repository is loaded
  • Monitor: Reference data can now be configured in UI
There's a known issue that upgraders should be aware of:
  • Duplicate Detection results saved in previous versions cannot be opened in 5.1.4
Enjoy! :)

2016-09-14 : DataCleaner 5.1.3 released

End of summer is nearing, but fret not, we have a new release for you!

5.1.3 is mostly a bugfix release, containing the following fixes:
  • Remove spurious logging from grouper and UI.
  • Remove non-functioning view online regex button.
  • Exported HTML report and monitor graphs should now work properly.
  • Remove error when setting up timeline for pattern finder analyzer with grouping.
  • Fix on-premise name correction (commercial only).
  • Multiple issues in deduplication component (commercial only).
We also have a single improvement:
  • Make duplicate detection training more intuitive (commercial only).

2016-08-16 : DataCleaner 5.1 released

A new release of DataCleaner has just hit the site!

Now you can trigger jobs in the monitor simply by dumping files into a folder. This is great for processing jobs that get delivered by other services. You can even use a .properties file if you need to change the configuration for the job execution.

We've also:
  • Extended our monitor UI to support adding HDFS datastores.
  • Added support for simple fixed-width mainframe/EBCDIC files
  • Made it possible to run a job on another configured datastore from the command line.
  • Extended the monitor REST job trigger to support setting configuration properties.
  • Added a news channel into DataCleaner, so you can catch new releases
... And of course fixed bugs and performance issues along the way.

We hope you enjoy DataCleaner 5.1. Please go to the Download page to try it out.

2016-07-01 : datacleaner.org downtime

A scheduled server update will be carried out during this time:

Date: Saturday, 2 July 2016
Time: From 10:00 to 12:00 CEST (UTC/GMT + 2 hours)
Duration: 2 hours

This period of downtime will be scheduled for necessary updates to be applied to our servers.
Unfortunately, the DataCloud services will not be available during this time.
We apologize for the inconvenience that this may cause.

Thank you for your attention and understanding.

DataCleaner team

2016-04-01 : DataCleaner 5.0 released!

The next revolutionary step for your favorite Data Quality solution

We are very happy and proud to announce the release of our product DataCleaner 5.0! In this new version, two important themes were tackled with regard to DataCleaner functionality: native support for Hadoop, enabling the management of quality on Big Data, and online data enrichments through the new DataCloud platform.

As of this version, DataCleaner is offering native integration with the Hadoop eco-system. Hadoop has been disrupting the world of data in recent years by overcoming traditional barriers to data growth and sparking new business opportunities based on detailed modelling of user behaviour and enabling data-driven decision making. With DataCleaner 5.0 we offer two styles of integrating with Hadoop. Next to the fact that the Hadoop Distributed File System (HDFS) can be used as a file system in DataCleaner, you can also treat Hadoop as an execution platform using the Spark processing engine. This way the processing is brought to the data, leveraging the computing power of your Hadoop cluster, instead of the other way around.

Read more about DataCleaner on Hadoop.

Furthermore, we are introducing a new cloud platform: DataCloud. Using DataCloud with DataCleaner, enrichment services are automatically discovered as soon as they are published, and added as functions in the DataCleaner user interface. DataCloud has been implemented with the so-called remote components capability. This capability can also be used as a private cloud for custom services inside your organization.

Logging in to DataCloud

At this point in time, the following services are being offered via DataCloud:
  • Address correction
  • Name correction
  • Email correction
  • Phone correction
  • Consumer check for the Netherlands
  • Deceased check for the Netherlands
Naturally, many more services will be added in time. With a connection to DataCloud, you will get online data enrichments in your DataCleaner quickly and easily.

Finally, DataCleaner 5.0 is offering some additional new features. As of now, we are offering an in-house 2 or 3 day training program for you and your team. Next to this program, we are offering a new connector to ElasticSearch via the REST protocol, a new connector to Neo4j, and lots of improvements on web user experience, desktop+monitor integration and configurability.

As you can see, DataCleaner 5.0 is another giant leap forward in the development of this smart data quality solution.

We want you to enjoy DataCleaner 5.0! Get it now.

2016-03-07 : New logo and visual identity

Maybe you've noticed new colors and imagery on our website? Well, we're happy to announce a new logo and look'n'feel for DataCleaner! As we are rapidly approaching the release of DataCleaner 5.0, we've decided to already launch the new visuals for DataCleaner online.

As can be seen from the visualization above, the graphical identity of DataCleaner has changed only a few times since 2008 when it was first launched. We are happy to introduce the new logo which aims to embrace a simpler, lighter and web-friendly style. You can expect DataCleaner to continuously evolve in this direction.

2016-01-15 : DataCleaner 4.5.4 released

An update to DataCleaner has just been released. This new version 4.5.4 contains a number of small but nice improvements to our favorite Data Quality tool:
  • Performance and interactivity of the Duplicate detection function has been improved.
  • The training mode of the Duplicate detection is now more tolerant towards changing its configured input columns.
  • The performance and memory footprint of the Datastore dictionary was improved.
  • The 'Grouper' component now has a number of new aggregation types for you to select. And its memory footprint was decreased too.
  • A bug pertaining to the stability of the wiring of the Fuse / Coalesce fields component was fixed.
  • A bug pertaining to the output of the Invoke Child Analysis Job component was fixed.
  • The functionality of the Merge duplicates (simple) component has been slightly improved by also taking into account the most frequently occurring values in the merging of many duplicates into a single record.
We hope you enjoy this new release of DataCleaner which is available for download now.
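
To illustrate the idea behind the "Merge duplicates (simple)" improvement mentioned above, here is a minimal sketch of merging a group of duplicates by picking the most frequently occurring value per field. The release note only says that frequency is "also" taken into account, so the exact rules of the real component are not known from this text; the function name, the handling of empty values and the sample records are assumptions made for illustration.

```python
from collections import Counter


def merge_group(records):
    """Hypothetical merge: per field, keep the most frequent non-empty value."""
    # Collect field names in first-seen order across all records.
    fields = []
    for record in records:
        for field in record:
            if field not in fields:
                fields.append(field)

    merged = {}
    for field in fields:
        values = [r[field] for r in records if r.get(field) not in (None, "")]
        # most_common(1) returns the value with the highest count.
        merged[field] = Counter(values).most_common(1)[0][0] if values else None
    return merged


duplicates = [  # made-up sample data
    {"name": "Jon Smith", "city": "London"},
    {"name": "John Smith", "city": "London"},
    {"name": "John Smith", "city": ""},
]
print(merge_group(duplicates))
```

With many duplicates in a group, a frequency-based choice like this tends to be more robust against a single misspelled record than simply taking the first value.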

2015-12-14 : DataCleaner 4.5.3 released

DataCleaner 4.5.3 has arrived with a bundle of feature improvements and bugfixes for everyone. It is available for immediate download for customers and trial users. Please find below the listing of changes:
  • Feature additions:
    • A new filter called Compare has been added. The compare-filter is used to compare two fields, or a field and a fixed value, using a dynamic set of operators such as "Greater than", "Equal" or even "Like" (equivalent to the LIKE operator in SQL).
    • In the Remove dictionary matches component we've added a new output column called "Removed matches". This output column will contain all the dictionary terms that were removed from the inspected string.
    • A new transformation called Remove substring. This transformation is useful in data standardization scenarios where a substring has been extracted from a larger string, and the larger string needs to be sanitized.
  • Bugfixes:
    • An excessive logging setting in the Command-Line Interface (CLI) was discovered and changed, yielding better performance when executing jobs in this environment.
    • The "Create Excel spreadsheet" component would break if two fields had been mapped with the same name. This will now be detected and reported to the user as a validation error.
    • The HDFS file chooser dialog had an issue which caused "blur" events to reset the port number of the Hadoop namenode. This has been fixed.
    • The "Merge duplicates (advanced)" component had an issue in which re-opening a merge configuration could reset the configuration to scratch. This has been fixed.
    • The "Fuse/Coalesce fields" component's preview functionality wasn't working when columns in the job had changed names. This has been fixed.
We encourage everyone to get the latest DataCleaner version from the downloads page. Enjoy!
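
As a rough illustration of how a Compare-style filter with a "Like" operator could work, the sketch below dispatches on operator names taken from the text above and translates a SQL LIKE pattern into a regular expression. It is not the DataCleaner implementation: the function names are hypothetical, escape characters are not handled, and matching is case-sensitive (in SQL this depends on the database).

```python
import re


def like(value, pattern):
    """SQL LIKE semantics: % matches any run of characters, _ exactly one."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))  # treat everything else literally
    regex = re.compile("".join(parts), re.DOTALL)
    return regex.fullmatch(value) is not None


# Hypothetical operator table; only a few operators are shown.
OPERATORS = {
    "Greater than": lambda left, right: left > right,
    "Equal": lambda left, right: left == right,
    "Like": like,
}


def compare(left, operator, right):
    """Compare a field value with another field value or a fixed value."""
    return OPERATORS[operator](left, right)


print(compare("DataCleaner", "Like", "Data%"))  # True
print(compare(5, "Greater than", 3))            # True
print(compare("abc", "Like", "a.c"))            # False: "." is literal here
```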

2015-11-26 : DataCleaner 4.5.2 released

Hello everybody!

We have just released DataCleaner 4.5.2 and are excited to tell you about all that we did to improve the premier open source data quality solution!
  • There’s a new component: Grouper. The Grouper component provides a simple way to do a group-by and aggregation like transformation. The user can pick any group key and select between a number of aggregation types for each column to be included in the output.
  • We added the ability to preview transformations that are in the scope of an Output Data Stream. For instance you can now do a preview of Merge duplicates when it is following a Duplicate detection. Or another example: You can use the new Grouper component and preview a transformation that consumes its grouped data stream.
  • We’ve improved the Regex parser component by adding a "match mode" property to allow also regex "find" semantics in addition to the "matches" semantics that was already there.
  • The "Cancel job" button/function can now more effectively interrupt and stop jobs from running even though some task is performing a long-running operation.
  • We've added keyboard shortcuts for interacting with the job editing canvas. Here’s a quick overview of the shortcuts we’ve added: F2 for rename, Enter for "open", ESC for "close", F5 for refresh, DEL/Backspace for "remove".
  • And a lot more minor bugfixes and improvements.
You can check out the GitHub milestone for a complete overview of the changes made.
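
For readers unfamiliar with group-by style transformations, the following sketch shows the general idea behind a component like Grouper: rows are grouped on a key and each selected column is reduced with an aggregation type. The aggregation names, the function signature and the sample rows are hypothetical and are not taken from DataCleaner.

```python
from collections import defaultdict

# Hypothetical aggregation types; the real component's list may differ.
AGGREGATIONS = {
    "sum": sum,
    "min": min,
    "max": max,
    "count": len,
    "first": lambda values: values[0],
}


def group_rows(rows, key, aggregations):
    """Group dict rows by `key` and aggregate each column in `aggregations`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    result = []
    for key_value, members in groups.items():
        out = {key: key_value}
        for column, aggregation in aggregations.items():
            out[column] = AGGREGATIONS[aggregation]([m[column] for m in members])
        result.append(out)
    return result


orders = [  # made-up sample data
    {"customer": "A", "amount": 10, "country": "NL"},
    {"customer": "B", "amount": 5, "country": "UK"},
    {"customer": "A", "amount": 7, "country": "NL"},
]
grouped = group_rows(orders, "customer", {"amount": "sum", "country": "first"})
print(grouped)
```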

We hope to see this release picked up by the community and customers - thank you for using our favourite Data Quality toolkit :-)
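
The difference between the "matches" and "find" semantics mentioned for the Regex parser above can be shown with Python's `re` module, where `fullmatch` requires the whole value to match and `search` accepts the first occurrence anywhere in the value. The function and its `match_mode` parameter are a hypothetical sketch, not the component's actual API.

```python
import re


def regex_groups(pattern, value, match_mode="matches"):
    """Return the captured groups, or None when the value does not match.

    "matches": the entire value must match the pattern.
    "find":    the first occurrence anywhere in the value is used.
    """
    compiled = re.compile(pattern)
    if match_mode == "matches":
        match = compiled.fullmatch(value)
    elif match_mode == "find":
        match = compiled.search(value)
    else:
        raise ValueError("unknown match mode: " + match_mode)
    return match.groups() if match else None


year_month = r"(\d{4})-(\d{2})"
print(regex_groups(year_month, "2015-11"))                       # ('2015', '11')
print(regex_groups(year_month, "released 2015-11-26"))           # None
print(regex_groups(year_month, "released 2015-11-26", "find"))   # ('2015', '11')
```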

2015-11-09 : DataCleaner 4.5.1 released

DataCleaner 4.5.1 is out and we're happy to say that the DataCleaner product is becoming faster, better and maturing at the same time. This release is primarily about fixing minor issues. Here are a few of the changes we've made in this release:
  • JSON files can now be selected also from HDFS, web URLs and more.
  • Millisecond precision on the "Capture changed records" filter was implemented.
  • If you're writing Excel spreadsheets we now do a better job at validating the sheet names before attempting to write them.
  • The copy of "OrderDB" used in DataCleaner monitor did not contain the same table definition as the copy in DataCleaner desktop. This has been fixed.
  • A bug in the "Upload result to DataCleaner monitor" button was fixed.
  • ... And a lot more, see the github milestone for details.
We hope you enjoy the new version which is available for download now!

2015-10-22 : DataCleaner 4.5 - The capacity to correct

We are happy to announce that DataCleaner 4.5 has just been released and we would like to tell you about this important milestone for everybody's favourite Data Quality solution.

It really is an important milestone for DataCleaner because with this release we've addressed a couple of critical restrictions of the underlying engine's capacity to combine certain types of components in the same jobs. This change unlocks new potential to deliver an even more powerful Data Quality solution with virtually all the flexibility and easiness that you could ask for.

Let's jump right into the new features and improvements. The team implemented more than 200 improvements and bug fixes - here are the most interesting ones...

Output data streams
At the engine and API level we've added the concept of "output data streams", which means that every component can publish streams of data that can be consumed by other components. Users of our API can utilize this feature by implementing the HasOutputDataStreams interface.

If this sounds too technical for you, just appreciate that this capability is underlying the following 4 features/improvements.

Duplicate detection and merging in the same job
With the major updates we did to the UI in DataCleaner 4.0, it became clear that our users are also becoming more and more empowered to do more elaborate tasks using DataCleaner. One of the most frequent limitations we encountered in this respect was that it was not possible to combine two complex tasks like duplicate detection and merging in a single job. Yet to experienced users it is a very useful scenario. So with the use of a new data stream originating from Duplicate detection, you can now combine this job with either duplicate merging or any other duplicate post-processing step you might have.

Example job containing standardization, duplicate detection, merging and writing

Combine tables and data sources using the Union component
We have added a core transformation function to DataCleaner called 'Union'. The functionality of this transformation is comparable to that of a SQL UNION: it appends two or more datasets together as if they were one. In other words: If you have multiple data sources, or just multiple tables, with the same type of content, then you can use the Union component to utilize them as if they were one big table.

Example job using a Union to perform analysis on multiple customer databases and files

The Union transformation can be used in conjunction with a Composite datastore. That way you can combine data from different data sources such as CSV files, relational databases, ElasticSearch indices or Salesforce.com (to give a few examples).
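
The append behaviour described for the Union component can be sketched as follows. This is a conceptual illustration only: the function, the column projection and the sample rows are hypothetical, and the sketch keeps every row (like SQL's UNION ALL), since the text above does not say whether duplicates are removed.

```python
def union(*datasets, columns):
    """Append datasets row by row, projecting each row onto shared columns."""
    for dataset in datasets:
        for row in dataset:
            # Missing columns become None so all output rows share one shape.
            yield {column: row.get(column) for column in columns}


# Made-up rows standing in for e.g. a database table and a CSV file.
crm_customers = [{"id": 1, "name": "Ann", "email": "ann@example.com"}]
webshop_customers = [{"name": "Bob", "email": "bob@example.com"}]

all_customers = list(
    union(crm_customers, webshop_customers, columns=["name", "email"])
)
print(all_customers)
```

Because the result is a single stream of uniformly shaped rows, any downstream analysis can treat the sources as one big table.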

Check if your contacts have moved or passed away - and update your source - all in the same job.
Via the Neopost family of data quality companies we have integrated several address correction, movers checks, deceased checks and similar services for specific regions. Currently we cover the United Kingdom, United States of America, Germany and the Netherlands with such functionality. With DataCleaner 4.5, using these functions has become a lot easier since the flexibility in integrating these services with the use of output data streams means that you can both perform checks, get reports on the results and do the post-processing of the results in a single job!

Example result-screen report from UK movers, deceased and do-not-mail check.

Process the complete or incomplete records found by Completeness analyzer
Completeness is one of the major dimensions of data quality, and DataCleaner addresses this topic with the Completeness analyzer, as well as filtering techniques. In DataCleaner 4.5 the analysis of completeness no longer necessarily ends with the incomplete records. You can now also use the Completeness analyzer as an intermediate step - feeding e.g. the complete or incomplete records into automated post-processing steps.

Connect DataCleaner to its big sister, DataHub
Did you know that DataCleaner is a key architecture piece of the Human Inference/Neopost customer MDM solution, DataHub? DataHub serves the enterprise market for customer MDM and single customer view, and we've been improving the integration a lot in this release of DataCleaner - most noticeably with the DataHub connector, which allows DataCleaner users to seamlessly consume data from and publish to the DataHub.

The processing pipeline in DataHub.

Product data components: GTIN, EAN, VIN
We have added a new category of Data Quality functions which revolve around Product data.

New 'Product data' category.

With these functions, and more to come in the future, we are building a suite of ready-to-use components that validate and standardize the use of common industry codes for products in your database.

Component library restructured
The component library structure has been revisited and we've designed this so that the menus and search function are optimized for the tasks at hand. As you can also see from the screenshot above, the Improve category has changed a lot - now focusing more on specific domains of data and data quality checks.

Secure ElasticSearch connections with Shield
We now support ElasticSearch with Shield-based security. The connection you define for an ElasticSearch index can be reused both for a regular read/write style datastore, but also for searches, matching and other functions that we provide which integrate with ElasticSearch.

Easy access to the Hadoop Distributed File System (HDFS)
As Hadoop is becoming more and more the system of choice for Big Data projects, we've decided to improve the user experience for analyzing and working with files located on HDFS, the Hadoop Distributed File System.

Browsing HDFS to select your DataCleaner source data.

Now browsing and finding files on HDFS is just as convenient as it has always been on your local machine. It's no secret that the roadmap for DataCleaner involves tighter and tighter integration with Hadoop, and this is our first step to make the Hadoop-DataCleaner experience both effective and pleasant.

A new Delete from table component
We have added a component in the "Write" category that deletes records from a table in a datastore. Use in conjunction with filtering functions to e.g. delete dirty records or non-survivors found after merging duplicates.

Online component library reference
A lot has been done to further improve our reference documentation. In addition to updated chapters etc. we've launched the Component library online which provides a quick way to navigate documentation on an individual component level.

We're confident that you will enjoy the improved DataCleaner. Version 4.5 is a major step and we are proud to share it with you!

2015-08-21 : DataCleaner 4.0.10 released

We hope everybody is enjoying the summer. But if you're not out enjoying the sun, you might as well go get warm with the new DataCleaner release instead! ;-) In other words: DataCleaner version 4.0.10 is available.

So what is new in this release?
  • We have added support for connecting to Apache Hive via our existing JDBC interface. In addition to the existing Apache HBase connectivity, this is a good first step towards having DataCleaner as a Big Data Profiling engine for your data in Hadoop.
  • A bug related to version-conflicts in the Apache HBase connector was fixed.
  • We have made it easier to immediately register new datastores when you need them. For instance when configuring a Table lookup or the Insert into table component.
  • The result window has been improved slightly, now prioritizing analysis component reports over e.g. results from transformations.
  • Finally the reference documentation has been updated a lot, plus the index in the documentation now contains better sub-sectioning.
We hope you enjoy DataCleaner 4.0.10!

2015-07-22 : DataCleaner 4.0.9 released

DataCleaner 4.0.9 has been released today and with this release we're bringing a couple of improvements and bugfixes that we hope everybody will enjoy! Let's dive right into the news:

Improvements and new features:
  • We've made it possible to create and drop tables via the desktop UI of DataCleaner. Note that the term "table" here actually covers more than just relational database tables. It also includes Sheets in MS Excel datastores, Collections in MongoDB, Document types in CouchDB and ElasticSearch and so on... Basically all datastore types that support write-operations, except single-table datastores such as CSV datastores, support this functionality! The functionality is exposed via:
    • "Create table" enabled via the right-click menu of schemas in the tree on the left side of the application.
    • "Create table" enabled also via table-selection inputs in components such as Insert into table, Table lookup and Update table.
    • "Drop table" enabled via the right-click menu of tables in the tree on the left side of the application.
  • We've added the (optional) capability of specifying your Salesforce.com web service Endpoint URL. This allows you to use DataCleaner to connect to sandbox environments of Salesforce.com as well as to your own custom endpoints.
  • The ElasticSearch support has been improved, allowing custom mappings as well as reusing the ElasticSearch datastore definitions now also for searching and indexing.
  • The sampling of records and selection of potential duplicates in the Duplicate detection function has been improved, leading to faster configuration because the decisions made during the training session are more representative.
  • The Duplicate detection model file format has been updated which has removed the need for a separate 'reference' file in order to save past training decisions. Compatibility with the old format has been retained, but using the new format adds many benefits for the user experience.
  • A thread starvation issue was fixed in DataCleaner monitor. The impact of this issue was great, but it happened only in rare and very customized cases. If custom listener objects on the DataCleaner monitor threw an error, a resource was never freed up and kept occupying a thread from the Quartz-scheduling pool on the server. If this happened many times, the server could eventually run out of threads in that pool.
  • The vertical menu on the result screen is now doing a proper job of displaying the labels of the components that have results. This makes it easier to recognize which menu item points to what result item.
We hope you enjoy the new release. Please go to the Download page to try it out now.

2015-05-20 : DataCleaner 4.0.7 released

Hello everybody! Again we are ready to present to you a new DataCleaner release with a bit of improvements and a bit of bugfixing.

The main improvements made in this release pertain to the display of analysis results:
  • We have changed the layout of the screen so that results are organized vertically to the left instead of as tabs above. The left-side menu can be collapsed and expanded to maximize readability.
  • The 'Duplicate detection' function now allows you to export the duplicated records and pairs into any writeable Datastore you might have (whereas it used to be just Staging Tables and Excel Spreadsheets). This way the storage needed to perform deduplication can become more consolidated and be fit to your own liking.
  • The size of the result window is now remembered so that your preferred window size is retained.
Here's a screenshot of the new result screen layout and the new export functionality in Duplicate detection:

Another important feature we've enabled with this release is component documentation in the application itself. Double-click any component and then the new 'Documentation' button to display its component reference page. This is very helpful for discovering and learning about the capabilities within DataCleaner.

We hope you enjoy version 4.0.7 of DataCleaner - it's available now via the Download page so go get it!

2015-04-29 : DataCleaner 4.0.5 released

Following our DataCleaner 4.0 release a little over a month ago we have received an impressive amount of feedback. As with any major software release, the feedback certainly sparks many creative ideas and also makes us aware of things to improve. So thank you all for that.

It's because of the great feedback that we can today announce the availability of DataCleaner 4.0.5. This version of DataCleaner adds on top of the existing functions and features in DataCleaner 4.0, making them even more powerful. There is obviously also a number of minor bugfixes included in this release. Let's walk you through it:

Combined component requirements

It's now possible to combine many component requirements into one. This especially makes sense if you have a graph of validation/correction tasks and you wish to catch all invalid entries into the same "bucket" of rejected records or so. Here's an example:

Search in component library

We've added a search box to the component library of DataCleaner desktop. This makes it a lot easier to locate the component you're thinking of or to find components of relevance to what you have in mind.

Results from non-analyzers

Until now it has been so that only components of the technical type 'Analyzer' can produce a result. This made a clear distinction between the tasks of data correction/transformation and tasks that produced reports/results that could be displayed to the user. We have relaxed this distinction a bit, allowing transformation components also to produce a result. For now we only have a few examples of this ('Table lookup' and 'Country standardizer'), but more will certainly come in the future.

Less file-management in Duplicate detection

The configuration of the very popular Duplicate detection component was made a bit simpler by no longer requiring the user to consider file-location of the duplicate matching model. Now this file location is based on a default (which can of course be overridden by the user if wanted).

And much more

More than 10 minor bugfixes were addressed. A helpful "Component description" documentation option was added, and the general reference documentation was improved; it now holds more tutorials and explanations of all the functionality in DataCleaner.

We hope you enjoy the new release. Keep the feedback coming, and go clean your data!

2015-04-29 : The world's smartest connected data solution

We're happy to present the new Human Inference company profile video - The world's smartest connected data solution. For everybody wondering what kind of company is building DataCleaner (and other cool data solutions), have a look!

Human Inference is part of the Neopost family.

2015-04-14 : Community Edition 4.0 ready for download

Hello everybody! We're happy to tell you about the availability of DataCleaner Community Edition 4.0. Following our commercial launch of the 4.0 product line, we're now ready with the Community Edition too. The Community Edition offers a core toolkit for data quality analysis suitable for developers, students, hobbyists and the like.

Go to the Download Community Edition page to get your copy. And let us know what you think on the Discussion forum, Twitter or GitHub.

For more information about the editions of DataCleaner, head on over to the Compare editions page.

2015-03-30 : DataCleaner 4.0 out now!

For the last months we have been working hard to achieve a very ambitious goal to take DataCleaner to a whole new level. The results of months of hard work are now available to you with a 4.0 tag on it. The overarching theme of this release is user experience and flexibility. More than 120 issues and user stories resolved, 900 code check-ins and more than 5000 lines of code touched.

Want a quick glance? Here's a small teaser that will give you an impression of our new user interface:

Read on to learn what the major features of this release are.

Visual graph-based job building

A new visual way of building jobs, instead of navigating through component tabs, makes all the difference for the user and their capabilities. In DataCleaner 4.0 you benefit from a clear picture of how your data is going to be processed.

DataCleaner Job components connected in a graph display

It's now much more feasible to get an overview of large jobs

The canvas not only shows the contents of your job. It also provides hints and guidance while you build it. Modifying the job is a matter of interacting with the nodes in the graph.

All the components "within a click’s distance"

In order to make data quality functions easier to find, the "Transform-Analyze" menu was made into an easy-to-navigate part of the tree structure on the left side of the application. The categories have been divided into "Transform-Improve-Analyze-Write" which makes for a clearer separation of components based on the type of task they help you with. To add a function - just drag it onto the job graph canvas.

Easy access to functions/components in the left-side tree. Right or double click to configure.

Quick Start Wizards

New users will benefit from the welcome screen guiding them how to make the first steps in the application. Commercial editions (read more) of DataCleaner include Quick Start Wizards that will answer the questions you might have about your data. Instead of manually assembling a job, the wizard asks a couple of questions and generates a job you can start your journey with. Such a job can be tweaked later on, if needed or just executed to see the insights.

Click through the Quick Start wizards and then you're all set up. Drop your file into the application and start processing the data

Welcome screen

Along with the wizards, the new welcome screen also changes the way new jobs are built. Click the "New job from scratch" button to make use of drag-and-drop support or "Manage datastores" to work in a way known from previous versions of DataCleaner.

Refreshed look and feel

The visual part of the user interface (icons, colors etc.) has been revisited. The new clean and modern look should make working with DataCleaner more pleasant.

UK/US/DE Address Correction and Suppression features

DataCleaner has new components that integrate with UK, US and German address correction and suppression services from our partners. Now, without leaving DataCleaner's job workflow you can consult external databases for information about movers, do-not-mail declarations and verify the accuracy of address details.

Address Correction and Suppression. Address Correction wizard.

Mac OS X experience

We made sure you can work with DataCleaner on your Mac in the way you are used to. This includes support for Command + click and native “application bundle” configured from professional edition installers.

Improvements to deduplication

Deduplication scenarios have been reconsidered. Try our new “Untrained detection” mode for instant results with minimum configuration. The previous "Training Tool" and "Duplicate Detection" functions have been merged into one component. This removes the need to swap the Training Tool for Duplicate Detection partway through a customized deduplication process.

Duplicate detection mode selection. Model training mode - more helpful than ever. An example duplicates report.

ElasticSearch and Apache Cassandra connectivity

We continue to expand our portfolio of supported databases. In the DataCleaner 4.0 release we are happy to announce that we now support two new NoSQL databases: ElasticSearch (read and write) as well as Apache Cassandra (read-only).

Get your hands on DataCleaner again

Already had a trial license before? No worries, you can still give DataCleaner 4.0 a try – we’ve decided that this update is so significant that all trial licenses can be renewed. Just visit the Download page to obtain the application!

2014-12-11 : DataCleaner 3.7.2 released

We've cut another release this afternoon. DataCleaner version 3.7.2 is, as the version number suggests, a minor bugfix release.

The main concern addressed in this release was the loading of extensions/plugins. We fixed several issues pertaining to the loading sequence and visibility of objects within extensions and the main distribution of DataCleaner. If you use extensions, we advise you to upgrade.

Furthermore an improvement to the "Capture Changed Records" filter was introduced - allowing it to work on numerical record version attributes instead of just update timestamp attributes. Lastly the license checking functionality of DataCleaner commercial editions was improved, making it easier to determine what is wrong when a license check is not successful.

2014-10-21 : DataCleaner 3.7 - Connect, Check, Consolidate

This morning version 3.7 of DataCleaner hit the streets and it’s ready to hook up, eager to spread tender loving care to your databases and data files. The keywords of this release are "Connect, Check, Consolidate" since this has been the focus of our development: Connecting to a lot of data sources, checking the data for inconsistencies and consolidating data through migrations and deduplication.


We've added connectivity in DataCleaner to Apache HBase and JSON files. Apache HBase is a popular Hadoop database, a distributed, scalable, big data store. JSON is a data representation format that is becoming increasingly popular for Web technologies, web services and NoSQL databases.


The analytical capabilities of DataCleaner have also been improved. We’ve added an efficient Unique Key check feature. This allows you to easily and quickly check for duplicate keys (or other expected unique values) in your datasets.
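To illustrate the idea behind such a check, here is a minimal Python sketch. It is not DataCleaner's implementation; the function name `find_duplicate_keys` and the sample records are made up for the example.

```python
from collections import Counter

def find_duplicate_keys(records, key_field):
    """Return the key values that occur more than once, with their counts."""
    counts = Counter(record[key_field] for record in records)
    return {key: n for key, n in counts.items() if n > 1}

# Hypothetical sample data: customer_id is expected to be unique.
customers = [
    {"customer_id": 1, "name": "Alice"},
    {"customer_id": 2, "name": "Bob"},
    {"customer_id": 2, "name": "Robert"},
]

duplicates = find_duplicate_keys(customers, "customer_id")
print(duplicates)  # {2: 2}
```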


Talking about duplicates, the Duplicate Detection feature of DataCleaner professional edition has been improved in many ways. We’ve made several improvements to the user interface, making more options available for the advanced users. We’ve also published an online video tutorial to get people started. On the technical side, the deduplication model is now represented in a more readable XML format and the algorithm for detecting initial duplicates for training has been improved.

Beyond these user-facing features, we've worked on several behind-the-scenes improvements. In fact, what we normally refer to as the "engine" of DataCleaner – AnalyzerBeans – was finally given the big "1.0" version tag to reflect the completeness of this core component.

All in all, it's a release that we hope you enjoy and that we are very happy about. Do go and get your free download and take it for a spin!

2014-07-17 : DataCleaner 3.6.2 released

So far the reception of DataCleaner 3.6 has been quite awesome and we are happy to see all the interest in our latest software releases. This also means that we have new stuff ready and updated, because all the interest easily translates into improvements and feature requests.

So let's look at what we have in store for today, when we announce the release of DataCleaner 3.6.2!
  • We've made several improvements to the Duplicate Detection feature. Several minor bugs were fixed and matching quality was improved - both for the initial "potential duplicates" training set generation, and for the final building of matching rules.
  • The progress bar of a running job in the desktop UI has been beautified and made more interactive - it will set colors and update itself while the job is running.
  • In clustered setups, jobs can now be cancelled across the cluster. No more waiting for all the slave instances to finish their jobs - they will cancel within seconds if the master node tells them to.
  • We've added transformations for URL encoding and HTML encoding. For usages of DataCleaner where strings are being prepared for insertion into URLs or web sites, this is a great utility.
  • For DataCleaner enterprise edition, our Hadoop integration is being improved a lot. We have fixed several minor issues here.
  • Datastores configured in the desktop UI are now automatically persisted in the conf.xml file, making it easier to manage datastores also outside of the UI.
  • A bug pertaining to the "Merge Duplicates" feature from EasyDQ was fixed.
So all in all a lot of cool but minor improvements. Go get the latest DataCleaner now!
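As a rough idea of what the URL and HTML encoding transformations mentioned above do to a string, here is a small Python sketch using the standard library. It only illustrates the two kinds of encoding; the helper names `url_encode` and `html_encode` are made up and are not DataCleaner component names.

```python
from html import escape
from urllib.parse import quote

def url_encode(value):
    """Percent-encode a string so it can be placed safely inside a URL."""
    return quote(value, safe="")

def html_encode(value):
    """Replace HTML special characters with entity references."""
    return escape(value, quote=True)

print(url_encode("Smith & Sons/London"))     # Smith%20%26%20Sons%2FLondon
print(html_encode('<b>"Smith & Sons"</b>'))  # &lt;b&gt;&quot;Smith &amp; Sons&quot;&lt;/b&gt;
```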

2014-06-27 : DataCleaner 3.6.1 released

It's time to get your DataCleaner installations updated, because we have a new release for you!

DataCleaner 3.6.1 is a bugfix and minor improvements release, but that doesn't mean in any way that it's boring stuff :-) Take a look at these changes:
  • For users that want to do transformations quickly and simply write the results somewhere, we've now allowed any job to be executed, even without any analyzers. The result will be an option dialog for selecting where to put the data.
  • In the DataCleaner monitor webapp, a critical bug was fixed which caused Linux deployments to treat the example 'DC' tenant's repository with a wrong filename. This has been fixed and the example tenant is now called 'demo'.
  • A new triggering mode has been introduced to the monitoring and scheduling functionality: One-time triggering. Using a single date and time instant, you can now get a job triggered once if needed.
  • The styling and javascript API of the DataCleaner monitor webapp has received several updates.
  • A user role "ROLE_GOD" was introduced, allowing certain users to have control over all tenants in the DataCleaner monitor webapp.
  • A fix was implemented for the clustered execution mode, ensuring that execution chunks are ordered correctly depending on the capabilities and natural ordering of the underlying datastore.
  • Clustered jobs can now be cancelled throughout the cluster. This means that the master will inform all slaves that the job should be ended and resources made free again.
We hope that you enjoy this update, and that you will go ahead and get DataCleaner right away!

2014-05-20 : DataCleaner 3.6 is out - new features, new editions

It's exciting times in the DataCleaner team - and we're happy to announce that DataCleaner 3.6 is now generally available along with new cool features, both in community and commercial editions. Go get it now.

Duplicate Detection

With DataCleaner 3.6 we are finally launching a new and extensive Duplicate Detection feature. With Duplicate Detection you can apply fuzzy logic to identify the records in your data that are duplicate entries for the same real-life thing. Use it to identify duplicate customers, products or anything else of relevance. It’s a great way to improve data quality and to have better interactions with customers, co-workers etc. Read more about Duplicate Detection here.

Referential Integrity

Another exciting new feature in DataCleaner 3.6 is the Referential Integrity analyzer. With this analyzer you can easily check the integrity between multiple tables in a single step. The analyzer works with tables from the same datastore, and also with tables from different sources. This means that you can effectively cross-check data from disparate sources that may be out of sync and cause data quality issues.
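Conceptually, a referential integrity check looks for rows whose foreign key has no counterpart in the referenced table. The Python sketch below shows that idea on two in-memory tables; the function `find_orphans` and the table contents are illustrative assumptions, not the analyzer's actual code.

```python
def find_orphans(child_rows, fk_field, parent_rows, pk_field):
    """Return child rows whose foreign key has no matching parent key."""
    parent_keys = {row[pk_field] for row in parent_rows}
    return [row for row in child_rows if row[fk_field] not in parent_keys]

# Hypothetical tables, possibly coming from two different datastores.
customers = [{"id": 1}, {"id": 2}]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 3},  # no customer with id 3
]

orphans = find_orphans(orders, "customer_id", customers, "id")
print(orphans)  # [{'order_id': 11, 'customer_id': 3}]
```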

More and better Progress Information

We’ve also done a lot to improve the responsiveness of DataCleaner’s screens while processing large jobs. The loading indicators and progress logs are now more responsive, and the layout of results has changed from being table oriented to result-type oriented. All in all it gives a nicer, smoother experience with a better overview of what is going on.

Commercial Editions revisited

Finally, the offering of commercial editions of DataCleaner has been changed to fit better with individuals and professionals using DataCleaner. Now you can get support and professional edition features for a very low entry price. This we believe will fit the marketplace well and provide an awesome commercial open source Data Quality solution that is approachable for everyone. Read more about our commercial editions here.

The documentation for DataCleaner 3.6 has also been updated quite a lot and applies to both community and professional edition - go check it out.

There's more

Actually we did a whole lot more than just this. For a full overview, please check out the milestones completed in our GitHub issue tracker:
We hope you enjoy DataCleaner 3.6!

2014-03-15 : DataCleaner 3.5.10 released

We've cut another release of DataCleaner - version 3.5.10! And although this is "just" a minor release version bump, the changes are pretty encouraging and the version numbering scheme does not really do it justice.

So... What's new then?
  • You can now compose jobs so that a DataCleaner job actually calls/invokes another "child" job as a single transformation. This is an important feature because it allows users to organize and compose complex data processing flows into smaller chunks of work. The new "Invoke child Analysis Job" transformation inlines the transformation section of the child job at execution time, which means that there is practically no overhead to this approach.

  • As a convenience for the above scenario, it is now allowed to save jobs without any analysis section in them. These jobs will thus be "incomplete", but that might actually be the point when composing and putting jobs together.
  • Another new transformation was added: Coalesce multiple fields. This transformation is useful for scenarios where multiple sets of fields are interchangeable, or when multiple interchangeable transformations produce the same set of fields. The "coalesce" transformation can roughly be translated into "pick the first non-empty values". When there are multiple sets of fields in your data processing stream, for instance multiple address definitions, and you need to select just one, then this is very convenient.

  • The handling of source columns has been simplified. Previously we tried to limit the source queries based upon only the source columns that were strictly needed to perform the analysis. But many users gave us the feedback that this caused trouble because the drill-to-detail information available in the analysis results would then be missing important fields for further exploration. So the power is now in the hands of the users: The fields added in the "Source" section of the job are the fields that will be queried.
  • A change was made to the execution engine in dealing with complex filtering and requirement configurations. Previously, if a component (transformation or analysis) consumed inputs from other components, ALL requirements had to be satisfied, which mostly just caused the requirement to never become true. Now the logic has been changed to be inclusive so that if any of the direct input sources' requirements are satisfied, then the component's inferred requirement is also satisfied. Most users will not notice this change, but it does mean that it is now possible to merge separate filtered data streams back into a single stream.
  • An issue was fixed in the access to repository files. Read/write locking is now in place which avoids access conflicts by different processes.
  • The 'requirement' button in DataCleaner has also been reworked. It did not always properly respond to changes in other panels, but now it is consistent.
  • Finally, the 'About' dialog was improved slightly and now contains more licensing information :-)
We hope you will enjoy this release of DataCleaner. Head over to the downloads page and get your copy now.
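The "pick the first non-empty values" behaviour of the coalesce transformation described above can be sketched in a few lines of Python. This is only an illustration of the concept; the function name `coalesce`, the treatment of blank strings as empty, and the sample field names are assumptions made for the example.

```python
def coalesce(*values):
    """Return the first value that is neither None nor a blank string."""
    for value in values:
        if value is not None and str(value).strip() != "":
            return value
    return None

# Hypothetical record with several interchangeable city fields.
record = {"billing_city": "", "shipping_city": "Utrecht", "office_city": "Arnhem"}
city = coalesce(record["billing_city"], record["shipping_city"], record["office_city"])
print(city)  # Utrecht
```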

2014-02-26 : DataCleaner embraces GitHub as collaboration platform

We want to stay on top of technology that enhances collaboration and involvement. Therefore we have made a significant move of the DataCleaner source code from our Subversion system towards the social coding platform GitHub. This move was made to give the community further tools for collaboration and to also benefit from the improved source control system features of Git itself.

GitHub - social coding

With GitHub we now have a central and social platform where anyone can pitch in on the development effort. One particularly useful tool for contributors is that they can submit pull requests, which are basically suggested changes made in their own maintained copies of the source code - without necessarily impacting the main code tree.

We will be embracing GitHub for the technical development of DataCleaner only. This means that end users should not be much concerned about this move, but developers should be using GitHub for source code and issue management.
  • Visit our GitHub organization 'datacleaner' and check out the projects there. That includes both the projects you know, and maybe also some new ideas that you didn't know.
  • Or go directly to the main projects; AnalyzerBeans (the processing engine of DataCleaner) and the DataCleaner project itself.

2013-11-22 : DataCleaner 3.5.7 released

Hi everyone!

We've just released DataCleaner version 3.5.7!

For this release we've made 4 important improvements to performance and stability. So although it doesn't seem like a big release in numbers or functionality, it's a good one since we spent the time on making an already good product better at what it does best.

The issues resolved in this release are:
  • A flag has been added to the CSV datastore options, making it possible to disable values in CSV files that span multiple lines. Disabling this feature in our CSV parser enabled us to increase parsing speed significantly and at the same time handle poorly/inconsistently formatted CSV files much better. Since many CSV files don't contain values that span multiple lines anyway, we think this is a great way to gain the extra performance and stability.
  • A change was made to the way we monitor progress log information. This means that we now have a much more effective and performant way to monitor progress of DataCleaner jobs, which especially speeds up performance on the server side.
  • A minor modification to the progress logs has been implemented: The progress information statements now always show the time of the statement.
  • A minor bug was fixed: The CSV datastore dialog of the monitor web application would sometimes show an unexpected error if you did not fill out escape characters, quote characters and so on.
You can grab the new version of DataCleaner at the downloads page - enjoy!
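To see why disabling multi-line values can help with badly formatted files, consider the following Python sketch. It assumes that every physical line is exactly one record, so a stray quote can only damage its own line. This is a simplified illustration using Python's `csv` module, not the parser used by DataCleaner, and `parse_single_line_csv` is a made-up name.

```python
import csv

def parse_single_line_csv(text):
    """Parse CSV text under the assumption that one line is one record."""
    records = []
    for line in text.splitlines():
        if line:
            # Each line is parsed on its own, so quoting problems stay local.
            records.append(next(csv.reader([line])))
    return records

# The second line has an unterminated quote; the third line is still parsed cleanly.
data = 'id,comment\n1,"unterminated quote\n2,fine\n'
records = parse_single_line_csv(data)
print(records[-1])  # ['2', 'fine']
```

Because lines are independent under this assumption, they could also be split and parsed in parallel, which is one plausible source of the speed gain.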

2013-10-25 : Cosmetic improvements available in DataCleaner 3.5.6

We've just cut another release of DataCleaner with some minor cosmetic/specialized bugfixes and improvements. We're happy to be able to make users happier with these little additions to our favourite open source data quality tool:
  • The monitoring webapp's CSV datastore dialog now supports TXT files as well as CSV and TSV files.
  • A bug was fixed pertaining to the "Max rows" filter's tab in the UI sometimes making uncloseable tabs for other components as well.
  • A bug was fixed that sometimes caused the order of selected input columns of a component not to be retained when saving and loading the job.
  • Various improvements to API and stability of internal utilities.
For the extra curious reader: check the milestone report. And go download DataCleaner 3.5.6 now!

2013-09-24 : DataCleaner 3.5.5 released

We've just released DataCleaner 3.5.5, which is primarily a minor bugfix release. Here's a summary of the improvements made:
  • The 'Synonym lookup' transformation now has an option to look up every token of the input. This is useful if you're doing replacement of synonyms within the values of a long text field.
  • Blocking execution of DataCleaner jobs through the monitor's web service could sometimes fail because of a bug in the blocking thread. This issue has been fixed.
  • An improvement was made in the way jobs and the sequence of components are closed / cleaned up after execution.
  • The JNLP / Java WebStart version of DataCleaner was affected by a bug in the Java runtime that, under certain circumstances, caused certain JAR files not to be recognized by the WebStart launcher. This issue has been fixed by making slight modifications to those JAR files.
  • A few dead links in the documentation were fixed.
You can download the new DataCleaner now at the downloads page! Do let us know what you think of it on the discussion forum.
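The per-token synonym lookup mentioned in the list above can be pictured with a short Python sketch. The synonym catalog, the whitespace tokenization and the function name `replace_synonyms` are assumptions for the example and may differ from how the transformation actually tokenizes text.

```python
def replace_synonyms(text, synonyms):
    """Replace each whitespace-separated token that has a known master term."""
    return " ".join(synonyms.get(token, token) for token in text.split())

# Hypothetical synonym catalog mapping variants to a master term.
synonyms = {"St.": "Street", "Rd.": "Road"}
result = replace_synonyms("12 Main St. near Mill Rd.", synonyms)
print(result)  # 12 Main Street near Mill Road
```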

2013-09-05 : DataCleaner 3.5.4 released

DataCleaner version 3.5.4 has just been released and is available for download as of now.

This is primarily a bugfix release, but a few minor improvements have also made the cut for the release. Here's a summary.
  • It is now possible to hide output columns of transformations. Hiding will not affect the processing flow at all; it simply hides them from the user interface, potentially making for a cleaner experience when interacting with other components.
  • A new web service has been added to the monitoring web application, which provides a way to poll the status of the execution of a particular job.
  • A bug was fixed, causing the HTML report to fail for certain analysis types when no records had been processed.
  • And 6 other minor bugs have been addressed.
For more details, consult the milestone summary in our issue tracking system. We hope you enjoy this release, and encourage you to provide feedback in any way possible.

2013-07-01 : DataCleaner 3.5.2 and 3.5.3 released

Hello everyone,

A little summer holiday treat for everyone: Last Friday we released DataCleaner 3.5.2 ... And then today, a few days later, we have just released DataCleaner 3.5.3. The reason is that these are bugfix releases and unfortunately one bug escaped the first release. Sorry about that, but rest assured that both releases contribute to an overall better product.

The improvements made are:
  • A bug was fixed which caused the DataCleaner monitor to show a result link for all jobs, even if they didn't produce a result. This only happened rarely though, for instance when building a custom Java job that returns null.
  • An advanced JavaScript transformer was added to the portfolio of built-in transformations. Using this transformer the user can build a stateful JavaScript object which is capable of both transforming, aggregating and filtering records.
  • Job and Datastore wizards now have 'Back' buttons.
  • A new dedicated 'extensions' folder is available in the DataCleaner desktop application. Use this folder to dump extension JAR files in, if you want them to be automatically loaded during application startup.
  • A new service was added to DataCleaner monitor, which enables administrators to download and upload (backup and restore) a complete monitoring repository in one go.
  • A bug was fixed which caused the desktop application's "DataCleaner monitor" dialog to crash when using default user preferences.
Head on over to the downloads page to get this latest release!

2013-06-12 : DataCleaner 3.5.1 released

It's always a bit difficult to write a really enthusiastic release announcement about a release that is essentially a bugfix release. And then again ... We've just released DataCleaner 3.5.1 and it is definitely mostly a "minor improvements" release, but some of these minor improvements are actually pretty cool! Let's have a look at a few highlights:

Capture changed records

A new filter was added to enable incremental processing of records that have not been processed before, e.g. for profiling or copying only modified records. The new filter's name is Capture changed records, referring to the concept of Change data capture.
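The general idea of timestamp-based change capture can be sketched as follows in Python. This is a conceptual illustration only; the field name `updated_at`, the function `capture_changed` and the way the checkpoint is stored are assumptions, not a description of how the filter is implemented.

```python
from datetime import datetime

def capture_changed(records, last_run, timestamp_field="updated_at"):
    """Keep only records modified after the previous run's checkpoint."""
    return [r for r in records if r[timestamp_field] > last_run]

# Hypothetical source records and checkpoint from the previous run.
records = [
    {"id": 1, "updated_at": datetime(2013, 6, 1)},
    {"id": 2, "updated_at": datetime(2013, 6, 10)},
]
last_run = datetime(2013, 6, 5)

changed = capture_changed(records, last_run)
print([r["id"] for r in changed])  # [2]

# After a successful run, the checkpoint would move to the newest value seen.
new_checkpoint = max(r["updated_at"] for r in records)
```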

Queued execution of jobs

The DataCleaner monitor will now queue the execution of the same job, if it is triggered multiple times. This ensures that you don't accidentally run the same job concurrently which may lead to all sorts of issues, depending on what the job does.

Minor bugfixes

Several bugfixes were implemented; see the full list on the 3.5.1 milestone page on our bugtracker.

The release is available at the downloads page and via the WebStart client. We hope you enjoy!

2013-05-01 : DataCleaner 3.5 released

We are very proud and happy to present DataCleaner 3.5, which has just been released!

With the 3.x branch of DataCleaner we set forth on a mission to deliver monitoring, scheduling and management of your data quality directly in your browser. And now with the new release, we are building upon this platform to deliver an even richer feature set, a comfortable user experience and massive scalability through clustering and cloud computing.

To be more precise, these are the major stories that we've worked on for the DataCleaner 3.5 release:

Connectivity to Salesforce and SugarCRM

One of the most important sources of data is usually a company's CRM system. But it is also one of the more troublesome data sources if you look at the quality. For this reason we've made it easier to get the data out of these CRM systems and into DataCleaner! You can now use your Salesforce.com or your local SugarCRM system as if it was a regular database. Start by profiling the customer data to get an overview. But don't stop there - you can even use DataCleaner to also update your CRM data, once it is cleansed. More details are available in the brand new focus article about CRM data quality.

Wizards and other user experience improvements

The DataCleaner monitor is our main user interface going forward. So we want the experience to be at least as pleasant, flexible and rich as the desktop application. To meet this goal, we've made many user interface and user experience improvements, amongst others:
  • Several wizards are now available for registering datastores; including file-upload to the server for CSV files, database connection entry, guided registration of Salesforce.com credentials and more.
  • The job building wizards have also been extended with several enhanced features; Selection of value distribution and pattern finding fields in the Quick analysis wizard, a completely new wizard for creating EasyDQ based customer cleansing jobs and a new job wizard for firing Pentaho Data Integration jobs (read more below).
  • You can now ad-hoc query any datastore directly in the web user interface. This makes it easy to get quick or sporadic insights into the data without setting up jobs or other managed approaches of processing the data.
  • Once jobs or datastores are created, the user is guided to take action with the newly built object. For instance, you can very quickly run a job right after it's built, or query a datastore after it is registered.
  • Administrators can now directly upload jobs to the repository, which is especially handy if you want to hand-edit the XML content of the job files.
  • A lot of the technical cruft is now hidden away in favor of showing simple dialogs. For instance, when a job is triggered a large loading indicator is shown, and when finished the result will be shown. The advanced logging screen that was previously there can still be displayed upon clicking a link for additional details.

Distributed execution of jobs

To keep up with the massive amounts of data that many organizations are juggling with today, we had to take a critical look at how we process data in DataCleaner. Although DataCleaner is among the fastest data processing tools, it was previously limited to running on a single machine. For a long time we've been working on a major architecture change that enabled distribution of a DataCleaner job's workload over a cluster of machines. With this new approach to data processing, DataCleaner is truly fit for data quality on big data. More details are available in the documentation section.

Data visualization extension

Data profiling and data visualization do share some common interests - both are disciplines that help you understand the story that your data is telling. There are obviously also some differences, the main one being that data profiling is more targeted at identifying issues and exceptions rather than deriving or measuring business objectives. But confronted with visualization tools we've realized that sometimes there's a lot of profiling value in progressively visualizing data. For instance, a scatter plot can easily help you identify the numerical outliers of your datasets. This gave fuel to the idea of a visualization extension to DataCleaner. Therefore DataCleaner now also lets you do basic visualization tasks to aid you in your data quality analysis.

National identifiers extension

A very common issue in data quality projects is to validate national identifiers, such as social security numbers, EAN codes and more. In our commercial editions of DataCleaner, we now offer a wide range of validation components to check such identifiers.

Custom job engines

We've made the ultimate modularization of the DataCleaner monitoring system: The engine itself is a pluggable module. While we do encourage using DataCleaner's engine as the primary vehicle for execution in DataCleaner monitor, it is not obligatory anymore. You can now schedule and monitor (both in terms of metric monitoring and history management) other types of jobs. For instance, you can provide your own piece of Java code and have it scheduled to run in DataCleaner monitor using the regular web user interface.

Pentaho job scheduling and execution

One major example of a pluggable job engine was introduced that we think deserves special attention: You can now invoke and monitor execution metrics of Pentaho Data Integration transformations. DataCleaner monitor by default ships with this job engine extension, which connects to the Pentaho DI server ("Carte") and supervises its execution and result gathering. After execution you can track your Pentaho transformations in the timeline views of the monitoring dashboard, just like other metrics. For larger deployments of DataCleaner it may be convenient to have dedicated ETL-style jobs in your data quality solution, and with this extension we provide an integration with a leading open source solution for just that. More details are available in the documentation section.

... And a whole lot more!

There's even a lot more to the 3.5 release than what is posted in these highlights. Take a look at the milestone page on the bugtracker for a more thorough listing of improvements made.

A non-functional aspect of DataCleaner is the reference documentation, which we've also done a lot to update. Additionally all the documentation pages now have a commenting feature, so that you can ask questions or provide feedback to the help that is in there. We'll be continuously providing more and more content in the documentation and on the website for you to get the best resources at your hands.

... Stay tuned for more!

On the front page of the DataCleaner website we'll be posting "feature focus" articles in the weeks to come. Please help us spread the word by promoting the release and the articles to your friends, colleagues and whoever else might be interested.

2013-01-22 : DataCleaner 3.1.2 is out

We're happy to announce another release of DataCleaner - version 3.1.2. This version is a minor improvement and bugfix release.

So what's new? Here's the summary:
  • We've added a web service in the monitoring application for getting a (list of) metric values. This makes the monitoring even more usable as a key infrastructure component, as a way to monitor data (quality) and expose the results to third party applications. Read more in the documentation.
Connectivity in DataCleaner monitor
  • The 'Table lookup' component has been improved by adding join semantics as a configurable property. Using the join semantics you can tweak if you wish the lookup to work semantically like a LEFT JOIN or an INNER JOIN.
  • The EasyDQ components have been upgraded, adding further configuration options and a richer deduplication result interface.
  • Performance improvements have been a specific focus of this release. Improvements have been made in the engine of DataCleaner to further utilize a streaming processing approach in certain corner cases which were not covered previously.
For more details on the individual issues worked on, visit our milestone page.
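The difference between the two join semantics of the 'Table lookup' component listed above can be shown with a small Python sketch: with LEFT JOIN semantics unmatched rows are kept with an empty lookup value, while with INNER JOIN semantics they are dropped. The function `table_lookup` and its parameters are hypothetical and only mirror the concept.

```python
def table_lookup(rows, lookup, key_field, output_field, join="LEFT"):
    """Look up a value per row; LEFT keeps unmatched rows, INNER drops them."""
    result = []
    for row in rows:
        match = lookup.get(row[key_field])
        if match is None and join == "INNER":
            continue  # INNER JOIN: rows without a match are discarded
        result.append({**row, output_field: match})
    return result

# Hypothetical lookup table and input rows.
countries = {"NL": "Netherlands", "DK": "Denmark"}
rows = [{"code": "NL"}, {"code": "XX"}]

left = table_lookup(rows, countries, "code", "country", join="LEFT")
inner = table_lookup(rows, countries, "code", "country", join="INNER")
print(left)   # [{'code': 'NL', 'country': 'Netherlands'}, {'code': 'XX', 'country': None}]
print(inner)  # [{'code': 'NL', 'country': 'Netherlands'}]
```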

The 3.1.2 release should be a drop-in replacement of other 3.x releases, so go download and upgrade now!

2013-01-09 : Country standardization and similarity evaluation in EasyDQ additionals

We've added two additional transformations to the EasyDQ additionals package, provided by Human Inference.

The two transformations are:
  • Country standardization. This allows you to get direct access to EasyDQ's country standardizer and unify different spellings, formats and more of country names and codes.
  • Similarity evaluator. This feature provides a low-level function for comparing two sets of values. For instance, if you've done a reference data lookup, you will often be interested in knowing if the result of the lookup matches the data that you already have. Using the similarity evaluator you can easily compare the incoming and resulting values and thereby make visible the improvements and changes you are doing to your data with these lookups.
The extension is available, like always, in the Extensions section of the website.

2013-01-04 : DataCleaner 3.1.1 is released

We have a nice little release for you today, which contains the usual maintenance fixes, but also some improvements and minor new features. DataCleaner 3.1.1 is ready for download as of now.

Let's dive into the news ...
  • The date and time related analysis options have been expanded, adding distribution analyzers for week numbers, months and years. All analyzers related to date and time are now grouped within a submenu called "Date and time" under "Analyze".
  • An optional "descriptive statistics" option has been added to the Number analyzer and the Date/time analyzer. This option adds additional metrics to the results of these analyzers, such as Median, Skewness, percentiles and Kurtosis. These metrics are optional since their memory footprint is somewhat larger than the existing metrics.
  • The lines in the timeline charts of the monitoring web application now have small dots in them. This is especially useful for charts with few (or even only one) observations in them - to point out exactly where the observation points are.

  • The query parser used when invoking ad-hoc queries has also been substantially improved. Queries can now contain DISTINCT clauses, *-wildcards and subqueries, and are fault-tolerant towards text-case issues.
  • Two new transformers have been added for generating UUIDs and for generating timestamps.
For the full list of improvements, go to the milestone page on our bugtracker.

We hope you enjoy this release, and go get it immediately from the downloads page.

2012-12-17 : DataCleaner 3.1 is out

Human Inference is happy to announce that DataCleaner 3.1 has been released and that it is available for download now! With DataCleaner 3.1 we’ve really focused on usability and day-to-day requirements of both the DataCleaner desktop data profiling application, and the web application for continuous data quality monitoring: features that we feel really help users do what they want to do. Here’s a summary of what has been done.

Metric formulas – elaborated Data Quality KPIs
It is now possible to build much more elaborate Data Quality KPIs in DataCleaner’s monitoring web application. The user interface allows you to build complex formulas in a spreadsheet-like formula style; using variables collected by DataCleaner jobs.

Metric formulas can combine any number of metrics, constants and operations, as long as it can be expressed in a mathematical equation.

For instance – measure the rate of duplicate records in percentage of the total record count. Or measure the amount of product codes that conform to a set of multiple string patterns.
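To make the first example concrete, here is a minimal sketch of the arithmetic behind such a formula. It is not DataCleaner's formula syntax; the function name and the metric names `duplicate_count` and `row_count` are assumptions chosen for illustration.

```python
def duplicate_rate_percent(duplicate_count: int, total_count: int) -> float:
    """Duplicate records as a percentage of the total record count."""
    if total_count == 0:
        return 0.0  # avoid division by zero on an empty dataset
    return 100.0 * duplicate_count / total_count


# Hypothetical metric values, as they might be collected by a job run.
metrics = {"duplicate_count": 25, "row_count": 1000}
rate = duplicate_rate_percent(metrics["duplicate_count"], metrics["row_count"])
print(f"Duplicate rate: {rate:.1f}%")
```

The same pattern (metrics and constants combined with basic arithmetic) applies to any KPI that can be written as an equation.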

Ad-hoc querying – of any datastore
With DataCleaner 3.1 you can now perform ad-hoc queries to any datastore! Queries can be expressed in plain SQL and will be applied to databases as well as files, NoSQL databases and more, providing a truly helpful query mechanism to extend into your discovery and data profiling experience.

The query option is also available through a web service to monitoring users with the ADMIN role. The query is provided as an HTTP parameter or POST body, and the result is provided as an XHTML table.

Value matcher – a new analysis option
Often times you have a firm idea on which values should be allowed and expected for a particular field. In DataCleaner there’s always been the Value Distribution analysis option which would help you assert your assumptions. In DataCleaner 3.1 though, you have a more precise offering – the Value matcher. This analysis option allows you to specify a set of expected values and then perform a value distribution like analysis, specifically to validate and identify unexpected values.
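The idea can be sketched in a few lines of code. The following is only an illustration of the concept, not the Value matcher's implementation; the function name `match_values` and the sample data are assumptions.

```python
from collections import Counter


def match_values(values, expected):
    """Count occurrences of each expected value and collect unexpected ones."""
    expected = set(expected)
    matched = Counter()
    unexpected = Counter()
    for value in values:
        if value in expected:
            matched[value] += 1
        else:
            unexpected[value] += 1
    return matched, unexpected


# Hypothetical column of gender codes where only "M" and "F" are allowed.
values = ["M", "F", "F", "X", "M", None, "f"]
matched, unexpected = match_values(values, expected={"M", "F"})
print("matched:", dict(matched))
print("unexpected:", dict(unexpected))
```

Note that this sketch is case-sensitive and treats missing values as unexpected; a real analysis would likely make both behaviours configurable.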

Copying, deleting and management of jobs
Management of jobs and results in the DataCleaner monitor application has been improved greatly. You can now click a job in the Scheduling page of the monitor, and find management options available for operations such as renaming, copying, deleting and more. Each operation respects the linkages to other artifacts in the monitor, such as analysis results, schedules and more. This means that management of the monitoring repository has become a lot easier and more mature.

Manage data quality history
Sometimes you’re facing situations where you actually want to do monitoring with historic data! It might be that you have historic dumps or backups of databases, which you wish to show and tell the story of. You can now do the analysis of this historic data, upload it to the DataCleaner monitor, and using a new web service, set a historic date for that particular analysis result. This means that your timelines will properly plot the results using their intended date, even though the results were collected at a later point in time.

Clustered scheduler support (EE only)
The scheduler of DataCleaner monitor has been externalized, so that it can be replaced by the means of simple configuration. In the Enterprise Edition (EE) of DataCleaner, we provide a clustered scheduler, providing the ability to load balance and distribute your executions across a cluster of machines.

Single-signon (SSO) using CAS (EE only)
In the Enterprise Edition (EE) of DataCleaner we now provide a single-signon option for the monitor application. Now DataCleaner can be an integrated part of your IT infrastructure, also security-wise.

... And a lot more
The above is just a summary. More than thirty issues have been resolved in this release. We have solved several requests coming from the forums and community, and we encourage everyone to use this medium as a vehicle for change. We’re very happy to make the development of DataCleaner be heavily influenced by the streams in the community.

2012-11-30 : Human Inference and Neopost join forces

Neopost, the European leader and number two worldwide supplier of mailroom solutions, today announced that it has completed the acquisition of Human Inference.

With products and services marketed in 90 countries and subsidiaries in 29 countries, the Neopost Group has 5,900 employees all over the world, 1,300 sales representatives and 450 R&D engineers.

As the postal sector is undergoing major changes, Neopost is anticipating the needs of its customers by bringing new services and technological innovation to the market. Therefore, Neopost has been acquiring multiple companies; several components have been added to the mix, all relating to the topic of communications between people. Satori software, a US-based data quality vendor has been part of the mix for a while and GMC, a Swiss-based Customers Communications Management vendor has been acquired recently. For Neopost, Human Inference is a strategic acquisition helping them to create the portfolio that they need to bring future-proof solutions to the market and their current customers.

Neopost has chosen Human Inference for its strong expertise, its proven solutions and its splendid reputation. We will continue to operate independently, with an unchanged management team. Our core values will remain our guidelines. Our customers will be able to enjoy an even broader set of solutions, which we believe will be a perfect fit with our single customer view strategy. In addition, Human Inference will be able to use the sales and distribution channels of Neopost, which will give us the opportunity to service new markets.

Human Inference CEO Winfried van Holland said: "We are very pleased to join Neopost. This offers us access to new markets and the support and relationships from a large organization. Our solutions fit perfectly in Neopost’s portfolio. This way Neopost customers, Human Inference customers, common customers and the DataCleaner community members will benefit from a broader range of solutions allowing them to reduce their risk, become more efficient and grow their profit by deploying a single customer view."

See here the press release on the Neopost website.

2012-11-08 : Community contributor contest!

Who will post the best content for use in DataCleaner?

Human Inference is announcing a competition for the DataCleaner community. The goal is to provide the best contribution for our favourite open source data quality tool.

What kind of contributions?
Submitted content can be of many forms:
  • Educational content like tutorials, videos etc.
  • Regular Expressions for the RegexSwap.
  • DataCleaner extensions for the ExtensionSwap.
  • Reference data for inclusion in the tool.
  • Use case descriptions – tell the community about your experiences.
  • Third party tool integration.

We do cherish everything in the community being free. But we will also be giving a nice prize to the winner with the best submission. With the prize we want to encourage further creativity and technological discovery. So the winner will have the option of either an Android tablet of their own choice (for instance the new Google Nexus 7) or a Lego Mindstorms programmable and modular robot system.

We want to send a special thank you to the CUBRID affiliates program for helping in sponsoring the prizes.

In addition to winning a prize, all submissions will be reviewed and mentioned on the DataCleaner website.

Content must be submitted before Christmas (December 24) 2012. Post a comment on this discussion topic to tell the community where and how to retrieve your submitted content. We also encourage people to join our Google+ community hangouts where authors will be invited to present their contributions.

Submitted contributions (so far)
Here's a list of the submitted contributions in the contest so far:

2012-10-31 : DataCleaner 3.0.3 is out

Dear DataCleaner users and developers,

We have a new release for you today, version 3.0.3 of DataCleaner. Grab it before your neighbor at the download page.

The focus of this release has been stability, performance and convenience for monitoring repository maintenance. Thus, the new and improved list follows:
  • We've added a service for renaming jobs in the monitoring repository. You can access this as a RESTful web service or interactively in the UI:
Renaming jobs
  • A web service was added for changing the historic date of an analysis result in the monitoring repository. This is convenient if you have historic dumps of data that you wish to include in a timeline.
  • The documentation has been updated with more elaborate descriptions of the web services available for repository navigation, job invocation and more.
  • The login dialog in the desktop application had a low-level version conflict, which caused it to be unusable. This has been fixed.
  • The web application has been made compatible with legacy JSF containers, making the range of applicable Java Webservers wider.
  • Caching of configuration in the web application was greatly improved, leading to faster page load and job initialization times.
We hope you enjoy this release. It should be 100% backwards compatible with other 3.x releases, so we encourage everyone to upgrade.

2012-10-31 : Community hangouts - sign up now

We are happy to invite everyone to a new initiative: The DataCleaner community hangout. The community hangout is a chance for users and developers of DataCleaner to meet face-to-face online every once in a while.

The last couple of weeks we've been trying out the new concept with a limited number of people, and we are now ready to extend the invite to everyone with an interest!

Community hangouts

The date of the next hangout is Tuesday the 6th of November at 10:00 CET. Please be aware of any timezone differences.

The hangouts are happening on Google+ on a semi-weekly basis. The frequency will be adjusted according to the interest in the community. To kick it off we will from the Human Inference side provide some presentations and discussion topics for the first couple of sessions. But the idea is also to engage users and friends to join the hangouts with their own input.

For the next hangout, project founder Kasper Sørensen will be demoing the new monitoring web application, and how it relates to the traditional desktop application.

For more information, go to our Google+ page and sign up to the next hangout.

2012-10-12 : DataCleaner 3.0.2 released

It's Friday afternoon and we have a little weekend gift to share with everyone. The last couple of weeks we've been working on a number of small but nice feature improvements and minor bugfixes in DataCleaner. These are now all available in DataCleaner version 3.0.2 - go grab it at the downloads page.

Here's a wrap-up of the work that we've done:
  • When triggering a job in the monitoring web application, the panel auto-refreshes every second to get the latest state of the execution.
  • File-based datastores (such as CSV or Excel spreadsheets) with absolute paths are now correctly resolved in the monitoring web application.
  • The "Select from key/value map" transformer now supports nested select expressions like "Address.Street" or "orderlines[0].product.name".
  • The table lookup mechanism has been optimized for performance, using prepared statements when running against JDBC databases.
  • Administrators can now download file-based datastores directly from the "Datastores" page.
  • Exception handling in the monitoring web application has been improved a bit, making the error messages more precise and intuitive.
We hope you enjoy the new version. It should be a drop-in replacement of previous DataCleaner 3 releases, so no need to wait, upgrade now.
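For readers unfamiliar with nested select expressions such as "Address.Street" or "orderlines[0].product.name", the sketch below shows one way such a path could be resolved against a key/value structure. It is an illustration only, not the transformer's actual code; the function name `select` and the sample record are assumptions.

```python
import re

# A path segment is either a key name or a list index in square brackets.
_TOKEN = re.compile(r"([^.\[\]]+)|\[(\d+)\]")


def select(data, expression):
    """Resolve an expression like 'orderlines[0].product.name' against nested data.

    Returns None when any part of the path is missing.
    """
    current = data
    for key, index in _TOKEN.findall(expression):
        try:
            current = current[int(index)] if index else current[key]
        except (KeyError, IndexError, TypeError):
            return None
    return current


# Hypothetical record, as it might be read from a document database.
record = {
    "Address": {"Street": "Main Street 1"},
    "orderlines": [{"product": {"name": "Widget"}}],
}
print(select(record, "Address.Street"))
print(select(record, "orderlines[0].product.name"))
print(select(record, "orderlines[3].product.name"))
```

Returning None for a missing path is a design choice of this sketch; it keeps a batch running when individual records lack a field.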

If you're using DataCleaner and think it would be fun to meet up with team members from Human Inference who work on the product, as well as consultants and other users of it - join our new Google+ page from where we will start doing community hangouts and thereby invite you to share ideas, questions and good vibes.

2012-10-01 : DataCleaner 3.0.1 released

Thank you to all for the positive attention about our recent DataCleaner 3 release. With this information we've been able to quickly and effectively identify a few minor improvements and have introduced these in a new release: Version 3.0.1.

The primary bugfix in this release was about restoring the mapping of columns and specific enumerable categorizations. For instance in the new Completeness analyzer, we found that after reloading a saved job, the mapping was not always correct.

Furthermore a few internal improvements have been made, making it easier to deploy the DataCleaner monitor web application in environments using the Spring Framework.

Last but not least, the visualization settings in the desktop application have been improved by automatically taking a look at the job being visualized and toggling displayed artifacts based on the screen size and amount of details needed to show it nicely.

DataCleaner 3.0.1 is available for download on our downloads page. We wish you good luck cleaning your data, and enjoy the software.

2012-09-20 : DataCleaner 3 released

Dear friends, users, customers, developers, analysts, partners and more!

After an intense period of development and a long wait, it is our pleasure to finally announce that DataCleaner 3 is available. We at Human Inference invite you all to our celebration! Impatient to try it out? Go download it right now!

So what is all the fuss about? Well, in all modesty, we think that with DataCleaner 3 we are redefining 'the premier open source data quality solution'. With DataCleaner 3 we've embraced a whole new functional area of data quality, namely data monitoring.

Traditionally, DataCleaner has its roots in data profiling. Over the years, we've added several related functions: transformations, data cleansing, duplicate detection and more. With data monitoring we basically deliver all of the above, but in a continuous environment for analyzing, improving and reporting on your data. Furthermore, we will deliver these functions in a centralized web-based system.

So how will the users benefit from this new data monitoring environment? We've tried to answer this question using a series of images:

Monitor the evolution of your data:

Share your data quality analysis with everyone:

Continuously monitor and improve your data's quality:

Connect DataCleaner to your infrastructure using web services:

The monitoring web application is a fully fledged environment for data quality, covering several functional and non-functional areas:
  • Display of timeline and trends of data quality metrics
  • Centralized repository for managing and containing jobs, results, timelines etc.
  • Scheduling and auditing of DataCleaner jobs
  • Providing web services for invoking DataCleaner transformations
  • Security and multi-tenancy
  • Alerts and notifications when data quality metrics are out of their expected comfort zones.
Naturally, the traditional desktop application of DataCleaner continues to be the tool of choice for expert users and one-time data quality efforts. We've even enhanced the desktop experience quite substantially:
  • There is a new Completeness analyzer which is very useful for simply identifying records that have incomplete fields.
  • You can now export DataCleaner results to nice-looking HTML reports that you can give to your manager, or send to your XML parser!
  • The new monitoring environment is also closely integrated with the desktop application. Thus, the desktop application now has the ability to publish jobs and results to the monitor repository, and to be used as an interactive editor for content already in the repository.
  • New date-oriented transformations are now available: Date range filter, which allows you to subset datasets based on date ranges, and Format date, which allows you to format a date using a date mask.
  • The Regex Parser (which was previously only available through the ExtensionSwap) has now been included in DataCleaner. This makes it very convenient to parse and standardize rich text fields using regular expressions.
  • There's a new Text case transformer available. With this transformation you can easily convert between upper/lower case and proper capitalization of sentences and words.
  • Two new search/replace transformations have been added: Plain search/replace and Regex search/replace.
  • The user experience of the desktop application has been improved. We've added several in-application help messages, made the colors look brighter and clearer and improved the font handling.
More than 50 features and enhancements were implemented in this release, in addition to incorporating several hundreds of upstream improvements from dependent projects.

We hope you will enjoy everything that is new about DataCleaner 3. And do watch out for follow-up material in the coming weeks and months. We will be posting more and more online material and examples to demonstrate the wonderful new features that we are very proud of.

2012-06-04 : The plans for DC 3.0 revealed

We are celebrating the plans to build a version 3.0 of DataCleaner, where we hope to be pushing the limits of what you can expect from your open source data quality applications. A few big themes for version 3.0 have already been decided:
  • A data quality monitoring web application.
  • A multi-tenant repository for data quality artifacts (jobs, profiling results, configurations, datastore definitions etc.)
  • Being able to edit data (in the desktop application).
  • Wizards to guide users through their first-time user experience with DataCleaner.
Go read Kasper Sørensen's blog post about the data quality monitoring application, which underlines the general direction and scope of the release!

2012-04-30 : DataCleaner 2.5.2 released

DataCleaner 2.5.2 has just been released. The DataCleaner 2.5.2 release is a minor release, but does contain some significant feature improvements and enhancements. Here's a walkthrough of this release:

Apache CouchDB support

We've added support for the NoSQL database Apache CouchDB. DataCleaner supports reading from, analyzing and writing to your CouchDB instances.

CouchDB support
Connect to CouchDB databases

Update table writer

Following our previous efforts to bring ETLightweight-style features into DataCleaner, we've added a writer which updates records in a table. You can use this for example to insert or update records based on specific conditions.

Like the Insert into table writer, the new DataCleaner Update table writer is not restricted to SQL-based databases. It works with any datastore type which supports writing (currently relational databases, CSV files, Excel spreadsheets, CouchDB databases and MongoDB databases), and the semantics are the same as with a traditional UPDATE statement in SQL.
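To illustrate what UPDATE semantics mean for a datastore that is not a SQL database, here is a minimal sketch that applies a conditional update to plain in-memory records. It is not how the writer is implemented; the function name `update_records` and the sample rows are assumptions.

```python
def update_records(records, condition, new_values):
    """Apply new_values to every record matching condition; return the count."""
    updated = 0
    for record in records:
        if condition(record):
            record.update(new_values)
            updated += 1
    return updated


# Hypothetical rows, e.g. read from a CSV file or a document database.
rows = [
    {"id": 1, "country": "DK", "status": "new"},
    {"id": 2, "country": "NL", "status": "new"},
    {"id": 3, "country": "DK", "status": "new"},
]

# Roughly equivalent to: UPDATE rows SET status = 'checked' WHERE country = 'DK'
count = update_records(rows, lambda r: r["country"] == "DK", {"status": "checked"})
print(count, "records updated")
```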

Drill-to-detail information saved in result files

When using the Save result feature of DataCleaner 2.5, some users experienced that their drill-to-detail information was lost. In DataCleaner 2.5.2 we now also persist this information, making your DQ archives much more valuable when investigating historic data incidents.

Improved EasyDQ error handling

The EasyDQ components have been improved in terms of error handling. If a momentary network issue occurs or another similar issue causes a few records to fail, the EasyDQ components will now gracefully recover and most importantly - your batch work will prevail even in spite of errors.

Table mapping for NoSQL datastores

Since CouchDB and MongoDB are not table based, but have a more dynamic structure we provide two approaches to working with them: The default, which is to let DataCleaner autodetect a table structure, and the advanced which allows you to manually specify your desired table structure. Previously the advanced option was only available through XML configuration, but now the user interface contains appropriate dialogs for doing this directly in the application.

We hope you enjoy the new 2.5.2 version of DataCleaner. Go get it now at the downloads page.

2012-04-17 : DataCleaner adds data profiling to Pentaho

Today we announce an exciting new partnership with Pentaho, the leading open source Business Intelligence and Business Analytics stack! For the past years Human Inference, members of the DataCleaner community and Pentaho have been in close contact to design a new data quality package for the Pentaho Suite. DataCleaner plays a key part in this new solution.

DataCleaner’s integration in Pentaho is primarily focused on the open source ETL product, Pentaho Data Integration (aka Kettle). Pentaho and Human Inference will be running a joint webinar on May 10th to tell everyone about all the new features (register for the webinar here), but until then – here’s a summary!

Profile ETL steps using DataCleaner

When working with ETL you often find yourself asking what kinds of values to expect for a particular transformation. With the data quality package for Pentaho we offer a unique integration of profiling and ETL: Simply right click any step in your transformation, select ‘Profile’, and it will start up DataCleaner with the data available for profiling, which the step produces! Not only is this a great feature for Pentaho Data Integration, it is also a one-of-a-kind in the ETL space. We are very excited to see this great use of embedding DataCleaner into other applications.

Profile with DataCleaner in Pentaho Data Integration / Kettle
Right click any step to profile

Execute DataCleaner job

Another great feature in the Pentaho data quality package is that you can now orchestrate and execute DataCleaner jobs using Pentaho Data Integration. This makes it significantly easier to manage scheduled executions, data quality monitoring and orchestration of multiple DataCleaner jobs. Mix and match DataCleaner’s DQ jobs with Kettle’s transformations and you’ve got the best of both worlds.

Execute DataCleaner jobs as part of your ETL flow

EasyDQ integration

Additionally, the data quality package for Pentaho contains the EasyDQ cleansing functions as ETL steps, similar to what you know from their DataCleaner counterparts.

Deduplication and merging via DataCleaner

In addition to embedding DataCleaner for profiling of steps, you can also start up DataCleaner when browsing databases in Pentaho Data Integration. This will create a database connection which is appropriate for more in-depth interactions with the database. For example, you can use it to find duplicates in your source or destination databases.

Detect duplicates in your sources

For more information:

The press release from Pentaho:
Pentaho announces new Data Quality solution

Installation instructions and information from Pentaho:
Pentaho wiki: Human Inference

Example of using the DataCleaner profiler with Pentaho:
Pentaho wiki: Kettle Data Profiling with DataCleaner

Information about the EasyDQ functions for Pentaho:
EasyDQ Pentaho page

2012-04-10 : Minor improvements and bugfixes in version 2.5.1 of DataCleaner

Today we've released DataCleaner 2.5.1. This is a maintenance release with only minor bugfixes and improvements. But nevertheless we encourage users to upgrade!

Here is the news in DataCleaner 2.5.1:
  • A bug was fixed in the Table lookup transformation, which caused it to be unable to have multiple output columns.
  • CSV file escape characters have been made configurable.
  • A minor bug pertaining to empty strings in the Concatenator was fixed.
  • Support for the Cubrid database was added.
  • The converter transformations were adapted to be able to work on multiple fields, not just single fields.
For more information, please refer to the 2.5.1 milestone in the trac system.

We hope you enjoy the new version of DataCleaner!

2012-03-28 : DataCleaner 2.5 is out!

Today we announce the general availability of DataCleaner 2.5! This release is the result of months of hard work by the core DataCleaner crew, the EasyDQ group and the community at large.

Let’s get straight to the “What’s new” question. There are plenty of major improvements in this release:

Saving results to disk
With DataCleaner 2.5 you can save, archive and share your analysis results. This is not only a time-saver for those who used to do manual exporting of analysis results, but it is also a means to improve your methodology around handling profiling results, sharing them with colleagues and archiving historical profiles of your data.

Save results to disk

Saving is implemented so that future versions and/or custom solutions can take advantage of the results and potentially use it for scheduled profiling, data quality monitoring and more.

Data structure transformers
With the rise of Big Data and NoSQL databases come more advanced data structures. In next generation databases we see key/value pairs and list structures that are cumbersome to deal with in tools built for traditional relational data. To solve these issues DataCleaner 2.5 ships with a new set of “data structure” transformers, which allow you to easily wrap and unwrap structures, to be able to get to the parts that you want to analyze or process.

Data structures

The data structure transformers also include parsers and writers for JSON data, which is one of the more common representations of NoSQL datastructures.
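As an illustration of what unwrapping a structure means in practice, the sketch below parses a JSON document and flattens its nested maps and lists into simple column-like paths. It only shows the general idea and is not the transformers' implementation; the function name `flatten` and the sample document are assumptions.

```python
import json


def flatten(value, prefix=""):
    """Unwrap nested maps and lists into a flat {path: value} dictionary."""
    flat = {}
    if isinstance(value, dict):
        for key, item in value.items():
            path = f"{prefix}.{key}" if prefix else key
            flat.update(flatten(item, path))
    elif isinstance(value, list):
        for i, item in enumerate(value):
            flat.update(flatten(item, f"{prefix}[{i}]"))
    else:
        flat[prefix] = value
    return flat


# Hypothetical JSON document from a NoSQL datastore.
raw = '{"name": "Jane", "emails": ["a@example.com", "b@example.com"], "address": {"city": "Arnhem"}}'
columns = flatten(json.loads(raw))
for path, value in columns.items():
    print(path, "=", value)
```

Once flattened, each path can be treated like an ordinary column and profiled with tools designed for tabular data.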

Filters and transformers are now all "Transformations"
Since DataCleaner 2.0 we’ve been pushing the idea of transformers and filters. The strength of these two types of components was evident from a technical perspective, but for the end-user the distinction has proven to be a distraction from the main use case: to process data in a flow of actions. Therefore DataCleaner 2.5 has consolidated these two terms, and made them available in a common metaphor for the user: Transformations. This means that users no longer have to look in multiple menus to find the component they are looking for.

New EasyDQ transformations: Merge duplicates and Due diligence check
The EasyDQ on-demand data quality platform team has also been busy. We present to you three new functions and an optional extension for the advanced users.

First is the Merge duplicates transformation. With this transformation you can turn your results from Duplicate detection into merged, golden records! The merge component is designed to handle a hierarchy of criteria when merging, to make sure that criteria such as well-formedness, update date and manual overriding are taken into account.

Secondly we’ve introduced two services for Due diligence checks. These are transformations which will help you validate that the people you are doing business with are not connected to sanction lists of terrorists, narcotics trafficking and other security threats.

These new features, as well as the other EasyDQ functions, are described in detail in the EasyDQ reference documentation.

Lastly, there's a new extension available, the EasyDQ essentials, which we recommend as a handy extra toolkit for those that want to go deep diving into the features of EasyDQ.

Defining datastore properties on the command line
One of the areas that has received a lot of attention in the later releases of DataCleaner is the command line interface. Using this interface you can set up DataCleaner to execute in all environments, in a scheduled or managed fashion. In DataCleaner 2.5 we’ve also made it possible to override datastore properties from the command line. Why? Because it allows you to reuse the same job on different datastore definitions. If you are for example scanning a directory for CSV files, and want to run a DataCleaner job on each file, this is a solution for you. Refer to the documentation for further explanation and examples.
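The directory-scanning scenario could be scripted roughly as follows. This sketch deliberately does not show DataCleaner's actual command line options (see the documentation for those): the command template, including the `my-job-runner` executable and its `--input` flag, is a made-up placeholder that you would replace with the real invocation.

```python
import tempfile
from pathlib import Path


def commands_for_csv_files(directory, command_template):
    """Build one command (as an argument list) per CSV file in a directory.

    Every "{file}" placeholder in command_template is replaced by the file path.
    """
    commands = []
    for csv_path in sorted(Path(directory).glob("*.csv")):
        commands.append(
            [arg.replace("{file}", str(csv_path)) for arg in command_template]
        )
    return commands


# Demonstration with a temporary directory holding two CSV files and one other file.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("customers.csv", "orders.csv", "notes.txt"):
        Path(tmp, name).write_text("id\n1\n")

    # Placeholder template: NOT real DataCleaner options.
    template = ["my-job-runner", "--input", "{file}"]
    for command in commands_for_csv_files(tmp, template):
        print(" ".join(command))
```

Each resulting argument list could then be handed to `subprocess.run(command, check=True)` to execute the job once per file.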

Drill to detail information in value distribution results
The Value distribution analyzer now contains a drill to detail option, to make it possible to see the source records for each value in the distribution. This greatly helps usability when doing explorative data profiling.

Database-specific connection panels
The dialogs for setting up database connections have been enhanced with database-specific connection properties. This makes it a lot easier for the end-user to connect to a database without having to know the details of constructing a connection URL.

Database connection dialog

Database-specific configuration panels have been created for MySQL, PostgreSQL, Microsoft SQL Server and Oracle. Other database types are supported using the traditional way of connecting, as in previous versions of DataCleaner.

Execution and scheduling of DataCleaner jobs using Pentaho Data Integration
Pentaho Data Integration (PDI, aka. Kettle) is an open source ETL product that the EasyDQ and DataCleaner team has had a lot of interactions with. For the DataCleaner 2.5 release we are now announcing that in the next version of Pentaho Data Integration you will be able to execute and schedule DataCleaner jobs using Pentaho’s infrastructure.

Execution in Pentaho Data Integration

While this is not available, released software as of today, we are looking forward to telling you more about this in the near future!

For those still reading, we also did some minor improvements in DataCleaner 2.5:
  • We’ve added some number transformations for generating IDs, incrementing numbers and more.
  • Implemented a Date range filter, similar to the Number range and String range filters.
  • Support for matching against Synonym catalogs in Reference data matcher (which was previously known as the Matching analyzer).
  • Now all components have flow visualizations in their configuration panel. This feature helps retain the overview when working with large analysis jobs.
  • The sample data (the ‘orderdb’ database) has been reworked to contain better examples of data quality issues.
  • User experience improvements; more elegant dialog designs and trimming of window layout.
We hope you all enjoy the new release of DataCleaner 2.5. Please let us know what you think on the forums, or on our LinkedIn group, or on Google Plus, or on Blogger, or tweet it, or...

2012-02-06 : EasyDQ releases patch for DataCleaner 2.4.2

The EasyDQ on-demand data quality platform, which DataCleaner is integrated with, has released a patch for DataCleaner version 2.4.2. The patch includes a critical bugfix for the Inter-Dataset matching analyzer.

If you're using this functionality, please download the patch and place it in the lib/ folder of DataCleaner. This will automatically apply the fix and matching multiple datasets will be working again.

The patch has also been applied to the Java WebStart version of DataCleaner, so WebStart users will not need to do anything.

2012-01-24 : DataCleaner 2.4.2 released

We've just released DataCleaner version 2.4.2, which is a bugfix and minor enhancements release. Please update to this latest version, which has a whole bunch of items fixed:
  • Database connections can now specify whether multiple connections can be made or not. This solves an issue related to databases that did not allow this, and a potential application halt if no more connections were available.
  • There's now a separate distribution of DataCleaner specifically for Mac OS. Using this version of DataCleaner you'll see a much nicer OS integration than previously.
  • Performance of the engine has been improved by providing some job-level metrics as lazy loaded values. For instance, the estimated row count is now lazy loaded, so in situations where this metric is not needed (eg. the command line interface and embedded use of DataCleaner), it will not be calculated.
  • The command line interface now has additional options to save the results of an analysis to a file, given a variety of output formats. Saved files can later be opened in the User Interface, allowing for a DIY data quality monitoring solution (see Kasper Sørensen's blog for more details).
  • An issue with correct prefixing of table names in INSERT statements was fixed in the downstream dependencies for the "Insert into table" component.
For full details about all changes, check out the trac roadmap for DataCleaner 2.4.2, AnalyzerBeans 0.10 and MetaModel 2.2.1.
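
The lazy loading of job-level metrics mentioned above can be illustrated with a minimal Python sketch. This is not DataCleaner's actual (Java) implementation; the class `JobMetrics` and the `count_rows` callback are hypothetical names chosen for the example.

```python
from functools import cached_property


class JobMetrics:
    """Hypothetical holder for job-level metrics (illustrative names only)."""

    def __init__(self, count_rows):
        # count_rows stands in for a potentially expensive operation,
        # for example a COUNT(*) query against the source table.
        self._count_rows = count_rows
        self.calls = 0

    @cached_property
    def estimated_row_count(self):
        # Runs only on first access; the result is cached afterwards.
        self.calls += 1
        return self._count_rows()


metrics = JobMetrics(lambda: 1000)
calls_before = metrics.calls          # nothing has been computed yet
first = metrics.estimated_row_count   # triggers the computation
second = metrics.estimated_row_count  # served from the cache
```

A caller that never reads `estimated_row_count` (such as a command line run) never pays for the computation, which is the point of the change described in the release notes.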

2012-01-02 : DataCleaner 2.4.1 released

As our New Year's present to all of you, we have a new release of DataCleaner. DataCleaner 2.4.1 is largely a release of bugfixes and minor feature enhancements.

Here's an overview of the improvements we've made:

Feature enhancements:
  • Batch loading features were greatly improved when writing data to database tables. Expect to see many orders of magnitude improvements here.
  • Writing data has been made more conveniently available by adding the options to the window menu.
  • You can now easily rename components of a job by double clicking their tabs.
  • The Javascript transformer now has syntax coloring, so that your Javascripts are easier to inspect and modify.
  • When reading from and writing to the same datastore (eg. the DataCleaner staging area) we've made sure that the table cache of that datastore is refreshed. Previously some scenarios allowed you to see an out-of-date view of the tables.
  • A potential deadlock when starting up the application was solved. This deadlock was a consequence of an issue in the JVM, but we worked around it by synchronizing all calls to the particular API in Java.
The full list is also available on the DataCleaner 2.4.1 milestone in the roadmap.

The 2.4.1 release should work as a drop-in replacement of DataCleaner 2.4, so we encourage everyone to upgrade. Get it on the downloads page. Happy new year.

2011-12-14 : Easy as DataCleaner 2.4!

Merry Christmas! Today we announce the release of DataCleaner 2.4, which marks a huge joint effort by the community and the team at Human Inference to bring together the best ideas of both open source and cloud-based Data Quality.

Here's what's new in DataCleaner 2.4:

EasyDataQuality integration

With DataCleaner 2.4 we've made an alliance with the newly launched EasyDQ.com service, which offers cloud-based Data Quality services. The services provided are:
  • Duplicate detection (aka. Deduplication or Fuzzy matching of records), which is free to use for up to 500,000 values.
  • Address data validation and cleansing. This allows you to check if addresses exist, if they are correctly formatted and even to suggest corrections in case you have mistakes.
  • Name data validation and cleansing. With the Name service, EasyDQ does not only format your names consistently, but also checks for misspellings and interprets the name parts.
  • Email and phone validation and cleansing. These services provide checking of email and phone data, making sure that email domains exist, that country codes are correct and much more.
No, these are not open source services, but they are offered at a reasonable price as well as a free starter package, and we thoroughly believe that the integration allows DataCleaner to become a much better tool for those who want it.

New analysis job components

Many of DataCleaner's users have reported that they use DataCleaner as a lightweight ETL tool. This is because we currently support basic reading, transformation and writing capabilities. With 2.4 we've added a few crucial components to add to this use-case where you want to do ad-hoc transformations, data quality checks and actually write the data back to your database.
  • Table lookup which allows you to look up any number of values based on any number of conditions. The lookup component has an intelligent caching mechanism and is highly performant. (Docs).
  • Insert into table is a new option when writing data. With this option we are making it possible for DataCleaner to not only produce new files, but also to insert records into existing databases. That makes it a much more flexible writing option.

MongoDB support! And a few more...

Another theme in DataCleaner 2.4 is support for the popular NoSQL database MongoDB. The support is offered both as a profiling service (eg. reading and analyzing data), but ALSO for writing data to MongoDB collections, using the Insert into Table component, which makes DataCleaner the first open source tool that offers data flow modelling and ETL functionality for MongoDB! We also improved on a few other datastores:
  • Support for MongoDB datastores, which are both readable and writable with DataCleaner. MongoDB uses a schemaless design principle, so you have the choice of either letting DataCleaner auto-detect a virtual schema, or define it yourself. (Docs).
  • Added more configuration options to Fixed width value files. Specifically, there is now the option to specify header line number.
  • Added support for custom table mapping of XML structures. For large XML files this is a recommended approach, since with a fixed table model, DataCleaner can do SAX-based XML parsing which is much less memory intensive and a lot faster. (Docs).
  • The Command Line Interface (Docs) has been further improved, by allowing you to inject job variables from the command line, which makes it possible to parameterize jobs and thereby reuse jobs for different purposes.
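
To illustrate why a fixed table mapping makes SAX-based XML parsing possible, here is a minimal Python sketch using the standard `xml.sax` module. It is only an illustration of the general technique, not DataCleaner's implementation; the `RowHandler` class, the `row_tag` and `columns` parameters and the sample document are all assumptions made for the example.

```python
import xml.sax


class RowHandler(xml.sax.ContentHandler):
    """Builds one row per <row_tag> element, keeping only the mapped columns."""

    def __init__(self, row_tag, columns):
        super().__init__()
        self.row_tag = row_tag
        self.columns = columns
        self.rows = []
        self._row = None
        self._column = None

    def startElement(self, name, attrs):
        if name == self.row_tag:
            self._row = {c: "" for c in self.columns}
        elif self._row is not None and name in self.columns:
            self._column = name

    def characters(self, content):
        # May be called several times per text node, so append.
        if self._column is not None:
            self._row[self._column] += content

    def endElement(self, name):
        if name == self._column:
            self._column = None
        elif name == self.row_tag:
            # A streaming consumer would process the row here instead of
            # collecting it; the list is kept only to show the result.
            self.rows.append(self._row)
            self._row = None


XML = (b"<people>"
       b"<person><name>Ann</name><city>Oslo</city></person>"
       b"<person><name>Bo</name><city>Rome</city><note>x</note></person>"
       b"</people>")

handler = RowHandler(row_tag="person", columns=["name", "city"])
xml.sax.parseString(XML, handler)
rows = handler.rows
```

Because the handler knows in advance which element is a row and which elements are columns, it never needs the whole document tree in memory.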

Besides these points, a few bugs were fixed and some minor features added. For a full list of changes, check out the DataCleaner 2.4 milestone description in trac.

We hope you enjoy DataCleaner 2.4. We built it to be used, so go grab it right away on the downloads page!

2011-09-29 : DataCleaner 2.3 has been released!

Today we announce the release of DataCleaner 2.3. It contains new functionality, usability improvements and technical changes that make it even more useful for your data quality work. Curious? Just read on!

International data support
  • If you are working with international data, then you might have different character sets in your data, for example Chinese or Hebrew. We added the Character set distribution analyzer, which is a profiling option that lets you figure out which character sets are used in your data.
  • Working with data containing different character sets can be problematic. Using the new Transliterate transformer you can now transliterate strings from different writing systems to Latin characters.
  • There is also a new webcast demonstration, focusing on the international data capabilities of DataCleaner 2.3 in the documentation section.
Grouping of analysis results by a secondary column
  • The Pattern analyzer is now able to group patterns based on a secondary column. This is useful for analyses like:
    • Get patterns of phone numbers, grouped by country.
    • Get patterns of email username based on email domain.
  • Something similar has been done for the Value Distribution analyzer; this allows for analyses such as:
    • Are all city names distinct, when grouped by postal code?
    • What is the distribution of gender within particular customer types?
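
As a rough illustration of grouping patterns by a secondary column, consider the Python sketch below. The pattern notation (letters become `a`, digits become `9`) and the function names are assumptions made for this example; the actual Pattern finder is more sophisticated.

```python
from collections import Counter, defaultdict


def to_pattern(value):
    """Map letters to 'a' and digits to '9'; keep other characters as-is."""
    out = []
    for ch in value:
        if ch.isalpha():
            out.append("a")
        elif ch.isdigit():
            out.append("9")
        else:
            out.append(ch)
    return "".join(out)


def patterns_by_group(rows, value_key, group_key):
    """Count patterns of value_key separately for each value of group_key."""
    groups = defaultdict(Counter)
    for row in rows:
        groups[row[group_key]][to_pattern(row[value_key])] += 1
    return groups


rows = [
    {"country": "DK", "phone": "12 34 56 78"},
    {"country": "DK", "phone": "87 65 43 21"},
    {"country": "NL", "phone": "+31 20 1234567"},
]
result = patterns_by_group(rows, "phone", "country")
```

Here `result["DK"]` holds a single pattern seen twice, while `result["NL"]` holds a different pattern, which is the kind of per-country overview the grouped analysis is meant to give.
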
Improved charts
  • The Pattern finder results can now be shown in a chart. This makes the distribution visible and shows how much of a "long tail" of patterns there is.
  • The output of the value distribution analyzer has been improved in a couple of areas:
    • The readability of the chart has been improved.
    • It shows the total number of rows and the distinct count over these rows: the number of different values that exist in the rows. This helps in figuring out how often duplicate values exist.
    • If there are empty strings, we use the <BLANK> keyword for it, so that it is easier to recognize them.
  • Next to the already existing output formats (CSV files and H2 datastores) we added writing output to Excel spreadsheets.
  • After writing to a datastore, it is now possible to preview the output, so that you can check whether the output matches your expectations.
  • It is now also possible to add the output as a new datastore, so that it can be used as input for a new job.
Other improvements
  • Documentation has been generally improved. In particular, logging and command line interface descriptions have been added.
  • The extension mechanism has been improved by modularizing several pieces of the application and introducing Google Guice as a generally available dependency injection framework for extension developers.
  • And of course we did more than twenty small improvements and bug fixes.
We hope you enjoy the new version of DataCleaner, which you can get a copy of on the downloads page.

2011-08-11 : Check out the Regex Parser extension!

As stated earlier, Human Inference is dedicated to delivering a rich set of extensions to the DataCleaner community, and we are also seeing third-party interest in contributing to the ExtensionSwap.

Today we've published a new extension which many DataCleaner users will hopefully find useful: The Regex parser.

With this extension you can easily implement your own parsing logic around regular expressions. The idea is that you use a regular expression to identify groups in your strings. These substring groups are extracted from the original value and isolated so you can process them individually. A quite nice application of DataCleaner's transformer mechanism!
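
The idea of extracting regular expression groups as separate values can be sketched in a few lines of Python. This is only an illustration of the concept; the function name `regex_parse` and the sample expression are hypothetical and not taken from the extension itself.

```python
import re


def regex_parse(pattern, value):
    """Return the captured groups of the first match, or None values if no match."""
    compiled = re.compile(pattern)
    match = compiled.search(value)
    if match is None:
        # One None per group, so the number of output columns stays stable.
        return [None] * compiled.groups
    return list(match.groups())


# A postal code (digits plus letters) followed by a city name.
parts = regex_parse(r"(\d{4})\s?([A-Z]{2})\s+(.+)", "1234 AB Amsterdam")
no_match = regex_parse(r"(\d{4})\s?([A-Z]{2})", "n/a")
```

Each group then behaves like an output column of a transformer and can be profiled or cleansed on its own.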

For more information on how to create your own extensions, please refer to the DataCleaner develop page.

2011-06-27 : DataCleaner 2.2: Profiling everywhere

DataCleaner 2.2 has been released as of today! This is an exciting new version of our Data Quality Analysis (DQA) and Data Profiling application that is now a lot more extensible, embeddable and compliant with new datastores.

Here's a summary of the news in this release:

  • The main driver for this release has been a story about extensibility. While releasing the application we are simultaneously releasing a new DataCleaner website which features an important new area: The ExtensionSwap. The idea of the ExtensionSwap is to allow sharing of extensions to DataCleaner and installation simply by clicking a button in the browser!
  • The DataCleaner extension API has been improved a lot in this release, making it possible to create your own transformers, analyzers and filters. If you feel your extensions could be of interest to other users, please share them on the ExtensionSwap, where we provide a channel for you to easily distribute them to thousands of users. The Extension API and the ExtensionSwap are further explained in our new videos for developers and other techies with an interest.
  • We are also releasing a set of initial extensions on the ExtensionSwap: The HIquality Contacts for DataCleaner extension which provides advanced Name, Phone and Email cleansing, based on Human Inference's natural language processing DQ web services. We are also shipping a sample extension which will serve as an example for developers wanting to try out extension development themselves. In the coming months we will make sure to post even more extensions originating from our internal portfolio of tools that we use at Human Inference's knowledge gathering teams.
  • In addition to extensibility we are also focusing on embeddability. We want to be able to embed DataCleaner easily into other applications to make profiling and data analysis possible anywhere! We've created a new bootstrapping API which allows applications to bundle DataCleaner and bootstrap it with a dynamic configuration or run it in a "single datastore mode", where the application is tuned towards just inspecting a single datastore (typically defined by the application that embeds DataCleaner). We already have some really interesting cases of embedding DataCleaner in the works - both in other open source applications as well as commercial applications.

  • We've added support for analyzing SAS data sets. This is something we're quite proud of as we are, to our knowledge, the first major open source application to provide such functionality, ultimately liberating a lot of SAS users. The SAS interoperability part was created as a separate project, SassyReader, so we expect to see adoption in DataCleaner's complementary open source communities soon too!
  • We've also added support for another type of datastore: Fixed width files. Fixed width files are text files where each column has a fixed width. There is no separator or quote character as in CSV files; instead all lines are equal in length and each line is tokenized according to a set of value lengths.
  • An option to "fail on inconsistencies" was added to CSV file and fixed width file datastores. These flags add a format integrity check when using these text file based datastores.
  • A bug was fixed, which caused CSV separator settings not to be retained in the user interface, when editing a CSV datastore.
  • Japanese and other characters are now supported in the user interface. This "bug" was a matter of investigating available fonts on the system and selecting a font that can render the particular characters. On most modern systems there will be capable fonts available, but on some Unix and Linux branches there might still be limitations.
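
The tokenization of fixed width lines and the "fail on inconsistencies" check described in the list above can be sketched as follows. This is a minimal Python illustration under the assumption that the check simply compares the line length with the sum of the value lengths; the function and parameter names are hypothetical.

```python
def tokenize_fixed_width(line, widths, fail_on_inconsistencies=False):
    """Split a line into values according to a list of column widths."""
    expected = sum(widths)
    if fail_on_inconsistencies and len(line) != expected:
        raise ValueError(f"Expected {expected} characters, got {len(line)}")
    values, pos = [], 0
    for width in widths:
        # Padding whitespace is trimmed from each value.
        values.append(line[pos:pos + width].strip())
        pos += width
    return values


widths = [10, 8, 3]
line = "Ann".ljust(10) + "Oslo".ljust(8) + "042"
values = tokenize_fixed_width(line, widths)
```

Without the flag, a line that is too short simply yields empty trailing values; with the flag set, the same line raises an error instead.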

Other improvements
  • The documentation section has been updated! Ever since the initial 2.0 release the documentation has been far behind, but we've finally managed to get it up to date. There are still pieces missing in the docs, but it should definitely be useful for basic usage as well as a reference for most topics.
  • Application startup time was improved by parallelizing the configuration loading and by delaying the initialization of those parts of the configuration that are not needed for the initial window display.
  • The phonetic similarity finder analyzer has been removed from the main distribution, as this was quite experimental and serves mostly as a proof of concept and an appetizer to the community to create more advanced matching analyzers. You can now find and install the phonetic similarity finder on the ExtensionSwap.
  • Handling of cancelled or erroneous jobs was improved and the user interface responds more correctly by disabling buttons and progress indicators, if a job has stopped.
  • Fixed a few minor UI issues pertaining to table sizing and use of scrollbars.

2011-05-16 : DataCleaner 2.1.1 is here!

Another release of DataCleaner sees the light of day today! Although this is not a major release, but a minor one, it does ship some quite nice stabilizing improvements and minor enhancements to the UI.

Enhancements in 2.1.1:
  • Added a search/filtering text field on the datastores list. This enables you to quickly find your datastore if you have registered more datastores than available on the screen.
  • Reference data for country codes was added to the standard distribution; thanks go to Graham Rhind for providing these.
  • Added a horizontal scroll bar to the data previewing windows if there are more than 10 columns.
  • Ability to add an extension package with new functionality in the Options dialog at runtime. More focus on extensions will follow in the upcoming releases.
  • We've exposed an early preview of our Command-Line Interface (CLI) by allowing you to invoke the application with the "-usage" parameter which will show the CLI options.
  • Added number formatting options to the "Convert to Number" transformer.
Bugfixes in 2.1.1:
  • Fixed an out-of-memory issue when querying tables with a LOT of columns (150+).
  • Fixed an issue that caused the "Limit analysis" check box to not be checked correctly when a job was re-opened after saving.
  • Not really a bugfix as it was never an official feature, but now we support restoring user preferences (the userpreferences.dat file) from previous versions of DataCleaner.
Thanks to everyone involved in the making of this release of DataCleaner.

DataCleaner 2.1.1 is available as a traditional download or as a Java Web Start application on the downloads page. Keep in touch with your feedback to the application on the forums.

2011-04-04 : DataCleaner 2.1 adds charts, stoppable jobs, database drivers and unifies the UI

We're happy to announce the release of DataCleaner 2.1! This is a quite significant release and something that we hope users will recognize as a step forward from the 2.0 versions.

The major news in DataCleaner 2.1 are:
  • There was a lot of work done on the user interface (see media page):
    • We decided to remove the left-hand side window containing environment configuration options.
    • Instead all these options have now been moved to the job building window so the user only has to focus on a single window for all the interactions needed to build a job.
    • The welcome/login dialog has also been removed in favor of a more discrete panel that can be pulled in or hidden from the main window.
    • Datastore selection and management is considered the first activity in the application, which is why it is also the first step to handle in the main window.
  • You can now stop jobs in case you decide to change something before they are done.
  • Bar and line charts were added to a lot of the analysis result screens, including String analyzer, Number analyzer, Date/time analyzer and Weekday distribution (see media page).
  • All "preview data" windows now contain paging controls so you can move backwards and forwards in the data set.
  • Most common database drivers (MySQL, PostgreSQL, Oracle, MS SQL Server and Sybase) have been added to a default set of drivers.
  • Configuration of the Quick analysis function in the Options dialog.
  • Various minor bugfixes.
  • Transformer for extracting date parts (year, month, day etc.) from date columns.
We hope you enjoy DataCleaner 2.1. Please head over to the downloads page to get it!

2011-03-07 : DataCleaner 2.0.2 released

Eobjects.org and its contributors are pleased to announce that DataCleaner 2.0.2 has just been released.

DataCleaner 2.0.2 is a minor, but not unimportant, release containing a few bugfixes and a set of 8 feature enhancements:
  • Tabs and buttons in the workbench are disabled when no source columns have been selected.
  • A special widget has been added to the "Source" tab, making it very easy to apply row count based sampling of the input data.
  • When possible, filters now have the ability to optimize the query of a job (aka. Push-down optimization). This was implemented for the "Max rows", "Equals" and "Not null" filters.
  • The growing amount of transformers caused a long list in the "Add transformer" popup. Therefore transformers are now grouped by category and displayed accordingly.
  • The visualization of execution flow now allows removing column items and filter outcome items, making the graph more comprehensible, especially for very large jobs.
  • The "Coalesce string" transformer now has a "Consider empty strings as null" flag, which is particularly useful when dealing with CSV files.
  • Text-based dictionaries and synonym catalogs will get their cached values flushed, if the file they read from changes.
  • The "Convert to date" transformer now includes the ability to specify your own date masks, if date strings require it.
  • A bug was fixed when passing null values to the email standardizer.
  • A bug was fixed pertaining to proper presentation of "mixed" tokens in the Pattern finder.
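
The push-down optimization mentioned in the list above means that a filter becomes part of the query instead of being applied to rows after they are read. The Python sketch below illustrates the idea with an in-memory SQLite database; it is not DataCleaner's implementation, and the `build_query` function, its parameters and the `LIMIT` syntax (which differs between databases) are assumptions made for the example.

```python
import sqlite3


def build_query(table, columns, not_null=None, equals=None, max_rows=None):
    """Fold simple filters into the SELECT instead of filtering rows afterwards.

    Table and column names are interpolated directly, so this sketch is
    only suitable for trusted identifiers.
    """
    sql = f"SELECT {', '.join(columns)} FROM {table}"
    where, params = [], []
    if not_null:
        where.append(f"{not_null} IS NOT NULL")      # "Not null" filter
    if equals:
        column, values = equals                      # "Equals" filter
        where.append(f"{column} IN ({', '.join('?' for _ in values)})")
        params.extend(values)
    if where:
        sql += " WHERE " + " AND ".join(where)
    if max_rows is not None:
        sql += " LIMIT ?"                            # "Max rows" filter
        params.append(max_rows)
    return sql, params


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [("Ann", "DK"), (None, "DK"), ("Bo", "NL"), ("Cy", "DK")],
)
sql, params = build_query("customers", ["name"], not_null="name",
                          equals=("country", ["DK"]), max_rows=1)
rows = conn.execute(sql, params).fetchall()
conn.close()
```

The database returns a single row here, so the rows that the filters would have discarded are never transferred to the application at all.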

With these improvements in place we see that DataCleaner 2.x is really catching on and we're very pleased with the quality and pace of improvements we are seeing. Go to the Downloads page right away to grab the new version.

2011-02-21 : DataCleaner 2.0.1 released

Since the release of DataCleaner 2.0, we've seen a renewed interest and a lot of activity around eobjects.org, DataCleaner and Human Inference. We're happy to get all this valuable feedback, and it has also meant that there was some low-hanging fruit, as well as a few very minor bugs, that we could easily address in the existing DataCleaner 2.0 release. This is why, already a week after 2.0 was released, we're releasing an update: 2.0.1.

The update consists of minor updates:
  • Filter outcomes were added to the flow visualization.
  • A bug was fixed in the widget for selecting the tokenizer's separators.
  • The "Equals" filter can now have multiple values to compare with.
  • Some minor cosmetic improvements.
For more detail, take a look at the milestone contents at Trac.

DataCleaner 2.0.1 is available at the downloads page and the update has also been automatically applied to our Java Web Start users.

2011-02-13 : Watch out, dirty data! DataCleaner 2.0 is in town!

The Open Source software community eobjects.org is happy to announce the release of DataCleaner 2.0. This release marks the biggest advance in technology and features for the DataCleaner platform throughout the history of the project.

Amongst exciting new features in DataCleaner 2.0 are:
  • Data transformations, allowing you to preprocess, extract, refine, combine and calculate data items as a part of your data profiling jobs.
  • Filtering, sampling and subflow management, allowing you to define criteria to exclude and include particular items of data.
  • Richer reporting with charts, graphs, navigation trees and more.
  • A bunch of new data quality functions for date gap analysis, phonetic similarity finding, synonym lookups and more.
  • More configuration options and added data quality measures for existing data quality functions like the Pattern finder, String analyzer and more.
  • Reusable profiling jobs, where you define your processing flow once and consequently run it on any data.
  • Support for MS Excel 2007+ spreadsheets.
For more information about what’s new in DataCleaner 2.0, see the full list of new features in DataCleaner 2.0.

Today it was also announced that Human Inference, the European data quality authority, has finished their acquisition of the eobjects.org site, to actively enter the market for entry-level Open Source data quality products. All projects on eobjects.org will remain open source and the benefit for the community and the products are apparent. The release of DataCleaner 2.0 is the first visible outcome of the acquisition, resulting from several months of intense cooperation between Human Inference and the community members, to put together a state-of-the-art data profiling application.

For more information about the eobjects.org acquisition, see the press release on the Human Inference website.

Times are really exciting in the eobjects.org community these days. We hope you’re all as enthusiastic about the new DataCleaner 2.0 as we are. The application is ready for download and for immediate launch through Java Web Start, so visit the DataCleaner website now.

2010-05-15 : DataCleaner 1.5.4 released with dBase and MS Access support

Here it is: DataCleaner 1.5.4 :)

Although this release is a minor release it contains a few exciting features and fixes:
  • We've updated the MetaModel version to 1.2 which adds support for two new datastores:
    • dBase databases (.dbf files)
    • MS Access databases (.mdb files)
  • We've fixed a bug pertaining to text-file dictionary "file not found" errors.
  • A lot of the other underlying libraries have been updated, providing improvements to performance and stability.
Head on over to the downloads page to grab the new DataCleaner.

2009-10-18 : DataCleaner 1.5.3 released

After much waiting, we are finally ready to release DataCleaner 1.5.3. Here's the wrap-up on what's been going on:
  • The MetaModel dependency has been upgraded to version 1.1.8, which means:
    • Improved Excel spreadsheet support
    • Improved SQL Server support
    • Improved performance for CSV files
  • Fixed a bug that caused certain database connection errors to be ignored in terms of user feedback.
  • Fixed a bug that caused re-opening of database dictionaries to throw a NullPointerException.
  • Fixed a bug related to dictionary lookups of null values.
  • Added support for Teradata databases.
  • Added connection templates for SQL Server connections.
  • Added support for selection of custom encodings when reading CSV files.
  • Fixed a minor bug relating to reading files on the classpath when running in Java WebStart mode (which manifested in an exception thrown when clicking on "About DataCleaner").
So as you can see, it's been a mix of minor bugfixes and a couple of improvements to compatibility and performance regarding certain datastores. We hope you enjoy this new release of DataCleaner. As always, you can ...Let us know what you think!

2009-09-08 : New book on Open Source Business Intelligence tells the DataCleaner-story

About half a year ago we received an exciting inquiry from Jos van Dongen on behalf of him and his co-author Roland Bouman, telling us that they were writing a new book about Open Source Business Intelligence and in particular Pentaho-based solutions. And for this they were looking into DataCleaner for the data profiling section of the book!

The book is now out! It's called "Pentaho Solutions" and it's published by Wiley Publishing. You can read about it and buy it on their website as well.

The book contains a walkthrough for building a data warehouse using Open Source tools and in doing so applying DataCleaner for the important job of profiling and validation.

We congratulate Roland Bouman and Jos van Dongen for their great work to promote Open Source Business Intelligence and thank them for mentioning DataCleaner while they're at it!

2009-07-14 : eobjects.org announces Open Source data quality with DataCleaner 1.5.2

Dear DataCleaner users,

We are happy to announce the release of DataCleaner 1.5.2. Users of DataCleaner 1.5.0 or 1.5.1 won't be able to see a lot of changes in the user interface, but this release actually holds quite a lot of improvements “beneath the surface”:
  • The most notable improvement is in the Value Distribution Profile. Previously this profile consumed quite a lot of memory which could lead to out-of-memory errors in extreme cases. This has been fixed by using on-disk caching with Berkeley DB when necessary.
  • Another notable feature is that we can now distribute DataCleaner as a single JAR file. This means that we will be serving the application as a Java WebStart application (ie. run it as if it's an online application) and we are also considering other distribution options.
  • When starting the application, it automatically downloads regular expressions from the RegexSwap.
  • A bug in regards to matching number-based columns in dictionaries was reported and fixed.
  • A bug in regards to invalid characters in XML-export formats was reported and fixed.
  • When opening files, we are now ignoring suffix case so that .CSV files can be opened as well as .csv.
  • The number of columns shown in the preview window is automatically restricted if there are too many to show on a single screen.
You can download DataCleaner from the downloads page or you can use our new feature: Get it via Java WebStart!
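
To give an idea of what on-disk caching for a value distribution looks like, here is a minimal Python sketch. It swaps in SQLite (from the Python standard library) for Berkeley DB and always writes to disk, whereas the release notes describe using the disk only when necessary; the function name and table layout are assumptions made for the example.

```python
import os
import sqlite3
import tempfile


def value_distribution_on_disk(values, db_path):
    """Count occurrences per value in an on-disk table rather than in a dict."""
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS counts (value TEXT PRIMARY KEY, n INTEGER)")
        for value in values:
            # Make sure a row exists for the value, then increment its counter.
            conn.execute("INSERT OR IGNORE INTO counts VALUES (?, 0)", (value,))
            conn.execute("UPDATE counts SET n = n + 1 WHERE value = ?", (value,))
        conn.commit()
        # Returned as a dict only for demonstration; a real consumer would
        # iterate over the table instead of loading it into memory.
        return dict(conn.execute("SELECT value, n FROM counts"))
    finally:
        conn.close()


with tempfile.TemporaryDirectory() as tmp:
    counts = value_distribution_on_disk(
        ["DK", "NL", "DK", "DK"], os.path.join(tmp, "counts.db"))
```

Memory use stays roughly constant regardless of how many distinct values the column contains, at the cost of slower counting.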

This release underlines the ongoing evolution of DataCleaner to be a more and more professionally capable data profiler and data quality tool. Seeing that DataCleaner is being used in large corporations worldwide, I wish to address some thoughts that I have been having and that I know users are pondering: How do you best combine the low adoption cost of Open Source applications like DataCleaner with the high flexibility that most commercial business software provides? To service this need we've opened up a new division of the company that I work with, Lund&Bendsen. Whether you need to deploy DataCleaner to high-scale installations, integrate the application with your existing systems, develop customized profiles and validation rules, or satisfy other enterprise needs, we offer you first-class services and in-depth expertise you won't find anywhere else.

To cut to the chase: DataCleaner 1.5.2 is here and we wish to extend the community development with a professional effort. So don't hesitate to let us know if you see an opportunity to invest. Adding value by targeting your use of the product is in the interest of customer, developer and community alike, and this is the reason our business is there.

To all you non-business users out there: Sorry for the obvious commercial rant and we hope you all enjoy the newest DataCleaner release.

Best regards,
Kasper Sørensen
Founder of eobjects.org and the DataCleaner project

2009-04-20 : DataCleaner 1.5.1 released

We're happy to announce the release of DataCleaner version 1.5.1. This release is a minor release, nevertheless containing a few nice features - especially for the users who are enjoying the exporting features that were introduced in 1.5:
  • An additional HTML export format has been added to the built-in export formats (usable when exporting Profiler results in the desktop app and when executing the runjob command-line tool).
  • The export format is now choosable directly in the desktop app.
  • Four new measures were added to the String Analysis profile: avg. chars and max/min/avg white spaces.
The new version of DataCleaner is (as always) downloadable for free on the downloads page and feedback from users is also greatly appreciated. Post your comments and questions at our discussion forum.
We hope that you all enjoy DataCleaner 1.5.1.
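
For readers wondering what the four new String Analysis measures express, the following Python sketch computes them for a small list of values. The function name and the result keys are hypothetical, and the sketch assumes a non-empty list of non-null strings; how the actual profile treats nulls is not covered here.

```python
def whitespace_measures(values):
    """Compute average length and max/min/avg whitespace count per value."""
    lengths = [len(v) for v in values]
    spaces = [sum(1 for ch in v if ch.isspace()) for v in values]
    return {
        "avg_chars": sum(lengths) / len(values),
        "max_white_spaces": max(spaces),
        "min_white_spaces": min(spaces),
        "avg_white_spaces": sum(spaces) / len(values),
    }


measures = whitespace_measures(["John Doe", "Ann", "Mary Jo Kay"])
```

A high maximum combined with a low average, for instance, hints that a few values contain several words while most contain just one.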

2009-03-15 : DataCleaner 1.5 released!

"Finally!" one might say. And this is definitely what is going through my head right as I write this news-item. Finally, DataCleaner 1.5 has been released! Once again the effort to bring about the best open source data quality solution is bearing fruit.

The new release is definitely one of the most significant ones in the history of DataCleaner. The overall goal of the release has been to step up from the shadows of the "small tools" pool and mark DataCleaner as an enterprise-ready application for profiling and validating datastores of all kinds - both in scheduled mode, on servers and in an intuitive desktop environment.

For those of you with an interest in every little detail about this release, please feel free to review the complete list of changes - for everyone else, here's the recap:
  • Change of license to LGPL.
  • Multi-threaded execution of Profiler and Validator.
  • Command line (batch) execution of DataCleaner tasks.
  • More elaborate status information during profiler and validator execution.
  • New profile: Date mask matcher.
  • New profile: Regex matcher.
  • Load regex from the online RegexSwap repository.
  • Automatic download and install of popular database drivers.
  • More file types supported (.dat, .txt)
  • XML file support improved (.xml)
  • Memory improvements in Time analysis profile.
  • Improved logging when running profiling and validation.
  • Information schema provided for file-based datastores.
  • Lazy-loading of columns in datastore-tree.
We hope you enjoy the new DataCleaner 1.5! Now go over and download it right away.
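
The idea behind a date mask matcher can be sketched in Python as follows: each value is tested against a set of date masks and the matches are counted per mask. This is only an illustration; the function name is hypothetical, and the masks use Python's `strptime` notation rather than the Java-style masks that DataCleaner itself works with.

```python
from datetime import datetime


def match_date_masks(values, masks):
    """Count, per mask, how many values parse successfully with that mask."""
    counts = {mask: 0 for mask in masks}
    counts["no match"] = 0
    for value in values:
        matched = False
        for mask in masks:
            try:
                datetime.strptime(value, mask)
            except ValueError:
                continue
            counts[mask] += 1
            matched = True
        if not matched:
            counts["no match"] += 1
    return counts


counts = match_date_masks(
    ["2009-03-15", "15/03/2009", "03/15/2009", "n/a"],
    ["%Y-%m-%d", "%d/%m/%Y"],
)
```

In this example the value `03/15/2009` ends up under "no match" because 15 is not a valid month for the day/month/year mask, which is exactly the kind of inconsistency such a profile is meant to surface.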

2009-02-12 : Data quality pro launches DataCleaner articles

Things are starting to shape up for the big release of DataCleaner 1.5. We are starting off with a bit of excitement in the data quality community.

Probably the most dedicated online magazine about data quality, data quality pro, has launched a series of articles about profiling, validating and comparing data with DataCleaner. So far an introductory tutorial (including a complete and realistic example data-set) and a background article/interview have been published.

We hope that you will enjoy the articles and we thank data quality pro for their great interest in our community.

2009-02-10 : First commercial support company for DataCleaner and MetaModel

Today we are announcing the first company, Lund&Bendsen, to officially support DataCleaner and MetaModel on a commercial level. These eobjects.org projects are, as you know, independent projects that are run with the community in mind. But as the projects grow and companies pick them up for use in a commercial setting, we also welcome third-party commercial support to help spread the projects to environments where community-based support is insufficient.

Lund&Bendsen is a Danish company with strong expertise in Java development and training. Their service offerings include training, customization, integration and enhancement of DataCleaner and MetaModel, so if your company is considering applying DataCleaner you might be interested in hiring some professionals to aid you in the process.

Over time more companies are expected to join in on commercial support for the eobjects.org projects. Keep up to date on the DataCleaner support page and don't hesitate to contact us for any inquiries in this regard either.

2009-01-26 : Independent analysis firm points at DataCleaner for open source data quality

The Technology Evaluation Centers (TEC) have published an interesting, unbiased and independent analysis of the market for Open Source business intelligence products. We are delighted to see that the article features a section about data quality and that TEC points at DataCleaner as a competent choice among the open source products:
In such situations, where the vendor does not support a specific functionality,
organizations can look to complementary open source solutions; the DataCleaner
project from eobjects.org, for instance, provides functionality to help profile
data and monitor data quality. It also points to a significant advantage with
open source applications: the fact that software is developed by the community
and for the community makes it much simpler to share innovative solutions
quickly and seamlessly.
You can read the whole article by Anna Mallikarjunan from TEC by going to their website (user registration is required).

2009-01-22 : Another release candidate (2) of DataCleaner 1.5 ready for download

Another batch of updates, fixes and improvements for the upcoming DataCleaner release is ready. This time it's Release Candidate 2 offering a preview of what's to come in DataCleaner 1.5.
The main changes since Release Candidate 1 are multithreaded execution, the command line interface (runjob.sh / runjob.cmd), some UI updates and a few bugfixes. Go download the release candidate and use it as an opportunity to influence the development process by posting your comments on the DataCleaner forum.

2009-01-12 : Release Candidate 1 of DataCleaner 1.5 out

After working hard for a couple of days to implement substantial new features regarding integration of eobjects services and automatic download and install of popular database drivers, a new release candidate of DataCleaner is ready!
We hope that a lot of people will use the release candidate and provide feedback for further development towards the 1.5 final release.

2009-01-09 : A few screenshots of recent development

I've spent the last couple of days implementing a couple of cool enhancements to the DataCleaner desktop-application:
  • Automatic download and install of popular database drivers. This comes along with template connection strings in the "Open database" dialog. This will hopefully make it much easier for less experienced users to set up a connection to their database of choice.
  • Direct integration with the new RegexSwap system so that the regexes that you post online will be accessible from within the desktop-application.
Screenshots have been posted to the screenshots page.

Wait for DataCleaner 1.5 for these features or build it yourself to check them out now.

2009-01-05 : DataCleaner launches new regex sharing subsite - RegexSwap

Only a few days after the launch of the new DataCleaner website, we are once again ready with new exciting features. This time we are launching the first edition of our new regular expression (regex) sharing subsite called "RegexSwap".

RegexSwap is a specialized forum for sharing, categorizing, commenting and voting on regular expressions that can be used in DataCleaner and other regex-based applications. It is really easy to post your own regular expressions, test them online on the website, comment and vote on the regexes that you have found useful. In time the next releases of DataCleaner will also take advantage of this online "always up to date" regex resource and offer direct integration with RegexSwap.

RegexSwap is still in beta but is ready at a functional level, which is why we are launching it publicly now. It will receive dedicated attention in the weeks and months to come.

2009-01-02 : A new website for DataCleaner

Dear everybody,

As a special Christmas present we have been working hard to design a new website for DataCleaner! Hopefully you will all enjoy the new site, which has been designed to further support our community and let it grow by incorporating more features to socialize and share ideas online. So go visit it now at the new URL:
Among the new features is a more personal profile system which is linked to some of the communities that our users already use frequently, namely LinkedIn and SourceForge. We have a whole new media section with cool screenshots and webcasts. We are also redesigning our mailing list structure. Instead of the single mailing list that we have been using so far, we are launching new "announcement" and "dev" mailing lists.

Our goal is to continuously launch new features on the website. The first one is a user survey to gain better insight into the minds of our users and community, so be sure to fill it out. In the future we will add more exciting features such as online sharing of regular expressions and reference data for DataCleaner dictionaries.

The old website will continue to exist, but primarily as a wiki and bugtracking system. During the next couple of days we will be editing the wiki pages to make them more suitable for wiki-style editing (by everyone) as opposed to the former read-only strategy.

We hope you like our Christmas present and that you will let us know, and we wish you all a great 2009. Without a doubt, it will bring exciting times for DataCleaner and the DataCleaner community.

2008-10-13 : DataCleaner 1.5 "snapshot" released

As we're moving steadily along towards the release of DataCleaner 1.5 we are fixing a few bugs and enhancing a lot of features. Since practically nothing has changed in a way that could destabilize the application since the 1.4 release, we would like to release our work already. So today we're releasing DataCleaner 1.5 "snapshot". This also marks the first release under our new LGPL license.

Here are the changes from 1.4 so far:
  • Change of license to LGPL.
  • New profile: Date mask matcher.
  • New profile: Regex matcher.
  • More file types supported (.dat, .txt)
  • XML file support improved (.xml)
Although this is in principle a development/beta release, we feel that it would be worth working with for most of your profiling needs. So... Go on, download it, tell us what you think and we'll see you around!

2008-10-06 : Eobjects announces change in preferred license

We've made a decision in principle at eobjects.org to change the preferred license of our projects from the Apache License 2.0 to the Lesser General Public License (LGPL).

The main difference between the two licenses is that the LGPL requires any modifications to be contributed back to the Open Source community (i.e. licensed under a similar license; LGPL or GPL). The eobjects.org projects are gaining the obvious advantages of the LGPL by ensuring that improvements are submitted back to the projects. This also means that we don't risk anyone selling modified versions of our projects. It is still just as appropriate to use the projects as a part of commercial applications, but any modifications must be contributed back to the community.
Initially this change in license will affect the two flagship projects of eobjects.org: DataCleaner and MetaModel. This means that the next versions of these projects (DataCleaner 1.5 and MetaModel 1.1 respectively) will be LGPL licensed. Also, new projects will be LGPL licensed unless special circumstances suggest otherwise.

2008-09-26 : Go watch the new appetizer webcast of DataCleaner 1.4

We've just uploaded a webcast of the new DataCleaner 1.4 which provides a long-awaited update for the old 0.4 webcasts!

Go enjoy the webcast - and be sure to download the newest version of DataCleaner. Over and out!

2008-09-16 : Two new releases planned for DataCleaner

After some consideration of the future of DataCleaner, we've updated the roadmap to reflect our current plans for the direction of development. We are planning on releasing DataCleaner 1.4 by the end of the month, and after that two new milestones have been added:
  • DataCleaner 1.5: The main focus of this release is to provide a command line interface for our data quality framework. This means that users will be able to easily create batch jobs that they can schedule using their favorite scheduler. Other features will also include Pattern Finder improvements and a couple of new profiles.
  • DataCleaner 1.6: We have a lot of suggestions that have been filling up our backlog. DataCleaner 1.6 will be all about getting everybody's needs into the application before we get ready to begin the webapp. Some of the exciting features of DataCleaner 1.6 will be relationship profiling and exporting of results.

2008-08-26 : Development/snapshot release of DataCleaner 1.4

We've released a development/snapshot release of DataCleaner 1.4 in order to get early reactions to all the improvements and new features, as well as to support our users with up-to-date functionality. In my opinion the development release is just as stable and "safe to use" as 1.3, but of course it lacks a bit of the manual testing that we put into the real releases.

You can download the development release at our SourceForge download site.

Here's a short list of fixes since DataCleaner 1.3:

  • Better memory handling and garbage collection
  • Reference columns in drill-to-details windows
  • Better error handling when loading schemas
  • Quoting of string values in visualized tables (in order to distinguish empty strings and white spaces)
  • New profile: Value Distribution, which is an improved version of the Repeated Values profile. The Value Distribution profile has an option to configure the top/bottom n values to include in the result.
  • Better control of profile result column width.
  • Bugfix: Copy to clipboard functions now work properly.
  • Bugfix: Scrollbars added to visualized tables.
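
The top/bottom n option of the Value Distribution profile mentioned above can be pictured with a short Python sketch. This is only an assumption about the general idea, not the profile's actual code; the function name `value_distribution` and its parameters are made up for the example.

```python
from collections import Counter


def value_distribution(values, top_n=None, bottom_n=None):
    """Count occurrences of each value and keep the top/bottom n of them."""
    ranked = Counter(values).most_common()  # (value, count), most frequent first
    if top_n is None and bottom_n is None:
        # No limit configured: report the complete distribution.
        return {"top": ranked, "bottom": []}
    top = ranked[:top_n] if top_n else []
    bottom = ranked[-bottom_n:] if bottom_n else []
    return {"top": top, "bottom": bottom}


countries = ["DK", "DK", "DK", "US", "US", "NL"]
dist = value_distribution(countries, top_n=1, bottom_n=1)
print(dist)
```

Note that in this sketch the top and bottom lists can overlap when n is large compared to the number of distinct values.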

Take a look at the roadmap for more current developments of DataCleaner.