Today we announce the release of DataCleaner 2.3. It contains new functionality, usability improvements and technical changes that make it even more useful for your data quality work. Curious? Just read on!
International data support
- If you are working with international data, then you might have different character sets in your data, for example Chinese or Hebrew. We added the Character set distribution analyzer, which is a profiling option that lets you figure out which character sets are used in your data.
- Working with data containing different character sets can be problematic. Using the new Transliterate transformer you can now transliterate strings from different writing systems to Latin characters.
- A new webcast demonstration in the documentation section focuses on the international data capabilities of DataCleaner 2.3.
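To give a feel for what a character set distribution tells you, here is a minimal Python sketch. It is not DataCleaner's implementation: the functions `script_of` and `character_set_distribution` are hypothetical names, and the first word of each character's Unicode name is used only as a rough stand-in for its writing system.

```python
import unicodedata
from collections import Counter


def script_of(ch):
    """Rough script label: the first word of the character's Unicode name."""
    try:
        return unicodedata.name(ch).split()[0]
    except ValueError:  # characters without a name, e.g. control codes
        return "UNKNOWN"


def character_set_distribution(values):
    """Count, per script label, how many values contain a letter of that script."""
    counts = Counter()
    for value in values:
        # Collect the distinct script labels among the letters of this value.
        scripts = {script_of(ch) for ch in value if ch.isalpha()}
        counts.update(scripts)
    return counts


# Illustrative sample data, not taken from DataCleaner.
names = ["Anna", "שלום", "北京", "Zoë"]
distribution = character_set_distribution(names)
print(distribution)
```

A value that mixes writing systems is counted once for each of them, which makes columns with mixed content easy to spot.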
Grouping of analysis results by a secondary column
- The Pattern finder is now able to group patterns based on a secondary column. This is useful for analyses like:
  - Get patterns of phone numbers, grouped by country.
  - Get patterns of email usernames, grouped by email domain.
- Something similar has been done for the Value Distribution analyzer; this allows for analyses such as:
  - Are all city names distinct, when grouped by postal code?
  - What is the distribution of gender within particular customer types?
- The Pattern finder results can now be shown in a chart. This makes the distribution visible and shows how much of a "long tail" of patterns there is.
- The output of the Value Distribution analyzer has been improved in a couple of areas:
  - The readability of the chart has been improved.
  - It now shows the total number of rows and the distinct count, that is, the number of different values in those rows. This helps in figuring out how often duplicate values occur.
  - Empty strings are now shown with the <BLANK> keyword, so that they are easier to recognize.
- Next to the already existing output formats (CSV files and H2 datastores) we added writing output to Excel spreadsheets.
- After writing to a datastore, it is now possible to preview the output, so that you can check whether it matches your expectations.
- It is now also possible to add the output as a new datastore, so that it can be used as input for a new job.
- Documentation has been generally improved. In particular, logging and command line interface descriptions have been added.
- The extension mechanism has been improved by modularizing several pieces of the application and introducing Google Guice as a generally available dependency injection framework for extension developers.
- And of course we made more than twenty small improvements and bug fixes.
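For readers who want to picture the grouping of patterns by a secondary column described earlier in this announcement, the following Python sketch shows the idea on phone numbers grouped by country. It is a simplified illustration under our own assumptions, not the Pattern finder's actual algorithm: `pattern_of` and `patterns_by_group` are hypothetical helpers, and the pattern notation (`9` for digits, `a`/`A` for letters) is chosen only for this example.

```python
from collections import Counter, defaultdict


def pattern_of(value):
    """Replace digits with '9' and letters with 'a' or 'A'; keep everything else."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A" if ch.isupper() else "a")
        else:
            out.append(ch)
    return "".join(out)


def patterns_by_group(rows, value_key, group_key):
    """Count the patterns of one column separately for each value of a second column."""
    grouped = defaultdict(Counter)
    for row in rows:
        grouped[row[group_key]][pattern_of(row[value_key])] += 1
    return grouped


# Illustrative sample data, not taken from DataCleaner.
rows = [
    {"phone": "+45 12 34 56 78", "country": "DK"},
    {"phone": "+45 87 65 43 21", "country": "DK"},
    {"phone": "12345678", "country": "DK"},
    {"phone": "(555) 123-4567", "country": "US"},
]
result = patterns_by_group(rows, "phone", "country")
for country, patterns in result.items():
    print(country, dict(patterns))
```

Counting patterns per group rather than over the whole column keeps a format that is normal in one country from looking like an outlier just because another country dominates the data.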
We hope you enjoy the new version of DataCleaner, which you can download from the downloads page.