DataCleaner 2.3 has been released!
International data support
- If you are working with international data, then you might have different character sets in your data, for example Chinese or Hebrew. We added the Character set distribution analyzer, which is a profiling option that lets you figure out which character sets are used in your data.
- Working with data containing different character sets can be problematic. Using the new Transliterate transformer you can now transliterate strings from different writing systems to Latin characters.
- A new webcast demonstration in the documentation section focuses on the international data capabilities of DataCleaner 2.3.
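To make the idea of a character set distribution concrete, here is a minimal Python sketch. It is not DataCleaner's implementation: the function names are hypothetical, and the script of a character is only approximated by the first word of its Unicode character name.

```python
import unicodedata
from collections import Counter


def script_of(ch):
    """Approximate a character's script by the first word of its Unicode name."""
    try:
        return unicodedata.name(ch).split()[0]
    except ValueError:
        # Some code points have no name in the Unicode database.
        return "UNKNOWN"


def charset_distribution(values):
    """Count, per script, how many values contain at least one letter of it."""
    counts = Counter()
    for value in values:
        # Only letters are considered; digits, spaces and punctuation are skipped.
        scripts = {script_of(ch) for ch in value if ch.isalpha()}
        counts.update(scripts)
    return counts


# Illustrative sample data: Latin, Hebrew (written as escapes), Chinese, and a mixed value.
names = [
    "Anna",
    "\u05e9\u05dc\u05d5\u05dd",
    "北京",
    "Tel Aviv \u05ea\u05dc",
]
distribution = charset_distribution(names)
```

With this sample, the mixed value is counted under both LATIN and HEBREW, which is the kind of finding that tells you a column needs attention before further processing.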
Grouping of analysis results by a secondary column
- The Pattern finder can now group patterns based on a secondary column. This is useful for analyses like:
  - Get patterns of phone numbers, grouped by country.
  - Get patterns of email usernames, grouped by email domain.
- The Value distribution analyzer supports the same kind of grouping; this allows for analyses such as:
  - Are all city names distinct, when grouped by postal code?
  - What is the distribution of gender within particular customer types?
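The grouping described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the pattern notation (letters become `a` or `A`, digits become `9`) and the function names are hypothetical and do not reproduce DataCleaner's actual pattern syntax.

```python
import re
from collections import Counter, defaultdict


def to_pattern(value):
    """Reduce a string to a simple pattern: letters become 'A'/'a', digits '9'."""
    pattern = re.sub(r"[A-Z]", "A", value)
    pattern = re.sub(r"[a-z]", "a", pattern)
    return re.sub(r"[0-9]", "9", pattern)


def patterns_by_group(rows, value_key, group_key):
    """Count patterns of one column separately for each value of a second column."""
    grouped = defaultdict(Counter)
    for row in rows:
        grouped[row[group_key]][to_pattern(row[value_key])] += 1
    return grouped


# Illustrative sample data: phone numbers with a country column.
rows = [
    {"phone": "+45 1234 5678", "country": "DK"},
    {"phone": "+45 8765 4321", "country": "DK"},
    {"phone": "12345678", "country": "DK"},
    {"phone": "+31 20 123 4567", "country": "NL"},
]
result = patterns_by_group(rows, "phone", "country")
```

Grouping by country keeps the Danish and Dutch formats apart, so an unusual pattern within one country stands out instead of being lost in the overall list.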
Improved charts
- The Pattern finder results can now be shown in a chart. This makes the distribution visible and shows how much of a "long tail" of patterns there is.
- The output of the Value distribution analyzer has been improved in several areas:
  - The chart is easier to read.
  - It shows the total number of rows and the distinct count over these rows, that is, the number of different values that exist in the rows. This helps in figuring out how often duplicate values occur.
  - Empty strings are shown with the <BLANK> keyword, so that they are easier to recognize.
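As a rough illustration of how the total row count, the distinct count and the <BLANK> label relate to each other, consider the following Python sketch. The function name and the shape of the returned summary are assumptions made for this example, not DataCleaner's output format.

```python
from collections import Counter

BLANK = "<BLANK>"


def value_distribution(values):
    """Summarize a list of strings: total rows, distinct count, count per value."""
    # Empty strings are relabeled so that they are easy to spot in the result.
    counts = Counter(BLANK if value == "" else value for value in values)
    return {"total": len(values), "distinct": len(counts), "counts": counts}


# Illustrative sample data with two empty strings.
cities = ["Copenhagen", "Amsterdam", "", "Copenhagen", ""]
summary = value_distribution(cities)
```

Here the total is 5 and the distinct count is 3, so the gap between the two numbers shows at a glance that some values are repeated.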
Output
- In addition to the existing output formats (CSV files and H2 datastores), output can now be written to Excel spreadsheets.
- After writing to a datastore, it is now possible to preview the output, so that you can check whether it matches your expectations.
- It is now also possible to add the output as a new datastore, so that it can be used as input for a new job.
Other improvements
- Documentation has been generally improved. In particular, logging and command line interface descriptions have been added.
- The extension mechanism has been improved by modularizing several pieces of the application and introducing Google Guice as a generally available dependency injection framework for extension developers.
- And of course we made more than twenty small improvements and bug fixes.
We hope you enjoy the new version of DataCleaner, which you can download from the downloads page.