We encourage you to check out the DataCleaner forum, where you can also share your own news and experiences with DataCleaner.
2012-01-24 - by kasper.
We've just released DataCleaner version 2.4.2, which is a bugfix and minor enhancements release. Please update to this latest version, which has a whole bunch of items fixed:
- Database connections can now specify whether multiple connections may be made. This solves an issue with databases that do not allow multiple connections, and a potential application halt when no more connections were available.
- There's now a separate distribution of DataCleaner specific for Mac OS. Using this version of DataCleaner you'll see a much nicer OS integration than previously.
- Performance of the engine has been improved by providing some job-level metrics as lazy loaded values. For instance, the estimated row count is now lazy loaded, so in situations where this metric is not needed (eg. the command line interface and embedded use of DataCleaner), it will not be calculated.
- The command line interface now has additional options to save the results of an analysis to a file, given a variety of output formats. Saved files can later be opened in the User Interface, allowing for a DIY data quality monitoring solution (see Kasper Sørensen's blog for more details).
- An issue with correct prefixing of table names in INSERT statements was fixed in the downstream dependencies for the "Insert into table" component.
For full details about all changes, check out the trac roadmap for DataCleaner 2.4.2, AnalyzerBeans 0.10 and MetaModel 2.2.1.
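The lazy loading of job-level metrics described above can be illustrated with a small sketch. This is not DataCleaner's actual implementation; the `LazyMetric` class and its method names are hypothetical and only show the general idea that an expensive value (such as an estimated row count) is computed on first access and never computed if nobody asks for it.

```java
import java.util.function.Supplier;

// Hypothetical helper, not part of the DataCleaner API:
// wraps an expensive computation and runs it at most once, on first access.
final class LazyMetric<T> implements Supplier<T> {

    private final Supplier<T> loader;
    private T value;
    private boolean loaded;

    LazyMetric(Supplier<T> loader) {
        this.loader = loader;
    }

    // Computes the value on the first call and returns the cached value afterwards.
    @Override
    public synchronized T get() {
        if (!loaded) {
            value = loader.get();
            loaded = true;
        }
        return value;
    }

    // Tells whether the expensive computation has been triggered yet.
    public synchronized boolean isLoaded() {
        return loaded;
    }
}
```

A caller such as a command line run that never reads the metric never pays for the computation, while a user interface that displays it pays only once.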
2012-01-02 - by kasper.
As our New Year's present to all of you, we have a new release of DataCleaner. DataCleaner 2.4.1 is largely a release of bugfixes and minor feature enhancements.
Here's an overview of the improvements we've made:
Feature enhancements:
- Batch loading features were greatly improved when writing data to database tables. Expect to see improvements of many orders of magnitude here.
- Options for writing data are now more conveniently available from the window menu.
- You can now easily rename components of a job by double clicking their tabs.
- The JavaScript transformer now has syntax coloring, so that your scripts are easier to inspect and modify.
Bugfixes:
- When reading from and writing to the same datastore (eg. the DataCleaner staging area) we've made sure that the table cache of that datastore is refreshed. Previously some scenarios allowed you to see an out-of-date view of the tables.
- A potential deadlock when starting up the application was solved. This deadlock was a consequence of an issue in the JVM, but we worked around it by synchronizing all calls to the particular API in Java.
The full list is also available on the DataCleaner 2.4.1 milestone in the roadmap.
The 2.4.1 release should work as a drop-in replacement of DataCleaner 2.4, so we encourage everyone to upgrade. Get it on the downloads page. Happy new year.
2011-12-14 - by kasper.
Merry Christmas! Today we announce the release of DataCleaner 2.4, which marks a huge joint effort by the community and the team at Human Inference to bring together the best ideas of both open source and cloud-based Data Quality.
Here's what's new in DataCleaner 2.4:
EasyDataQuality integration
With DataCleaner 2.4 we've made an alliance with the newly launched EasyDQ.com service, which offers cloud-based Data Quality services. The services provided are:
- Duplicate detection (aka. Deduplication or Fuzzy matching of records), which is free to use for up to 500,000 values.
- Address data validation and cleansing. This allows you to check if addresses exist, if they are correctly formatted and even to suggest corrections in case you have mistakes.
- Name data validation and cleansing. With the Name service, EasyDQ does not only format your names consistently, but also checks for misspellings and interprets the name parts.
- Email and phone validation and cleansing. These services provide checking of email and phone data, making sure that email domains exist, that country codes are correct and much more.
No, these are not open source services, but they are offered at a reasonable price as well as a free starter package, and we thoroughly believe that the integration allows DataCleaner to become a much better tool for those who want it.
New analysis job components
Many of DataCleaner's users have reported that they use DataCleaner as a lightweight ETL tool. This is because we currently support basic reading, transformation and writing capabilities. With 2.4 we've added a few crucial components to support this use case, where you want to do ad-hoc transformations, data quality checks and actually write the data back to your database.
- Table lookup which allows you to look up any number of values based on any number of conditions. The lookup component has an intelligent caching mechanism and is highly performant. (Docs).
- Insert into table is a new option when writing data. With this option we are making it possible for DataCleaner to not only produce new files, but also to insert records into existing databases. That makes it a much more flexible writing option.
MongoDB support! And a few more...
Another theme in DataCleaner 2.4 is support for the popular NoSQL database MongoDB. The support is offered both as a profiling service (eg. reading and analyzing data), but ALSO for writing data to MongoDB collections, using the Insert into Table component, which makes DataCleaner the first open source tool that offers data flow modelling and ETL functionality for MongoDB! We also improved on a few other datastores:
- Support for MongoDB datastores, which are both readable and writable with DataCleaner. MongoDB uses a schemaless design principle, so you have the choice of either letting DataCleaner auto-detect a virtual schema, or define it yourself. (Docs).
- Added more configuration options to Fixed width value files. Specifically, there is now the option to specify header line number.
- Added support for custom table mapping of XML structures. For large XML files this is a recommended approach, since with a fixed table model, DataCleaner can do SAX-based XML parsing which is much less memory intensive and a lot faster. (Docs).
- The Command Line Interface (Docs) has been further improved, by allowing you to inject job variables from the command line, which makes it possible to parameterize jobs and thereby reuse jobs for different purposes.
Besides these points, a few bugs were fixed and some minor features added. For a full list of changes, check out the DataCleaner 2.4 milestone description in trac.
We hope you enjoy DataCleaner 2.4. We built it to be used, so go grab it right away on the downloads page!
2011-11-30 - by kasper.
There's a new and nice extension ready for you at the ExtensionSwap: The network tools extension.
Network tools can be used to work with IP addresses in data, resolve hostnames and more. Give it a look if you're dealing with network addresses (or eg. email addresses, website visitors etc.) in your data.
2011-09-29 - by emil.
Today we announce the release of DataCleaner 2.3. It contains new functionality, usability improvements and technical changes that make it even more useful for your data quality work. Curious? Just read on!
International data support
- If you are working with international data, then you might have different character sets in your data, for example Chinese or Hebrew. We added the Character set distribution analyzer, which is a profiling option that lets you figure out which character sets are used in your data.
- Working with data containing different character sets can be problematic. Using the new Transliterate transformer you can now transliterate strings from different writing systems to Latin characters.
- There is also a new webcast demonstration, focusing on the international data capabilities of DataCleaner 2.3 in the documentation section.
Grouping of analysis results by a secondary column
- The Pattern analyzer is now able to group patterns based on a secondary column. This is useful for analyses like:
- Get patterns of phone numbers, grouped by country.
- Get patterns of email username based on email domain.
- Something similar has been done for the Value Distribution analyzer; this allows for analyses such as:
- Are all city names distinct, when grouped by postal code?
- What is the distribution of gender within particular customer types?
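To make the idea of grouping by a secondary column concrete, here is a minimal sketch of a value distribution computed per group. It is an illustration only, not how the Value Distribution analyzer is implemented; the class name, method name and the two-element row layout (group value first, analyzed value second) are assumptions made for this example.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration, not DataCleaner code:
// counts how often each value occurs within each group.
final class GroupedValueDistribution {

    // Each row is {groupValue, analyzedValue}, for example {postalCode, cityName}.
    static Map<String, Map<String, Integer>> distribute(List<String[]> rows) {
        Map<String, Map<String, Integer>> result = new TreeMap<>();
        for (String[] row : rows) {
            String group = row[0];
            String value = row[1];
            // Create the per-group counter map on demand, then increment the count.
            result.computeIfAbsent(group, g -> new TreeMap<>())
                  .merge(value, 1, Integer::sum);
        }
        return result;
    }
}
```

With rows of postal code and city name, a group whose inner map holds more than one key points to a postal code that appears with several different city spellings.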
Improved charts
- The Pattern finder results can now be shown in a chart. This makes the distribution visible and shows how much of a "long tail" of patterns there is.
- The output of the value distribution analyzer has been improved in a couple of areas:
- The readability of the chart has been improved.
- It shows the total number of rows and the distinct count over these rows: the number of different values that exist in the rows. This helps in figuring out how often duplicate values exist.
- If there are empty strings, we use the <BLANK> keyword for it, so that it is easier to recognize them.
Output
- Next to the already existing output formats (CSV files and H2 datastores) we added writing output to Excel spreadsheets.
- After writing to a datastore, it is now possible to preview the output, so that you can check whether the output is according to your expectations.
- It is now also possible to add the output as a new datastore, so that it can be used as input for a new job.
Other improvements
- Documentation has been generally improved. In particular, logging and command line interface descriptions have been added.
- The extension mechanism has been improved by modularizing several pieces of the application and introducing Google Guice as a generally available dependency injection framework for extension developers.
- And of course we made more than twenty small improvements and bug fixes.
We hope you enjoy the new version of DataCleaner, which you can get a copy of on the downloads page.
2011-08-11 - by kasper.
As stated earlier, Human Inference is dedicated to delivering a rich set of extensions to the DataCleaner community, and we are also seeing third-party interest in contributing to the ExtensionSwap.
Today we've published a new extension which many DataCleaner users will hopefully find useful: The Regex parser.
With this extension you can easily implement your own parsing logic around regular expressions. The idea is that you use a regular expression to identify groups in your strings. These substring groups are extracted from the original value and isolated so you can process them individually. A quite nice application of DataCleaner's transformer mechanism!
For more information on how to create your own extensions, please refer to the DataCleaner develop page.
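The group extraction idea behind the Regex parser can be sketched with the standard java.util.regex classes. This is not the extension's source code; the `RegexGroupExtractor` class and its method are hypothetical names used only to show how capture groups in a pattern turn one string into several isolated substrings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration, not the Regex parser extension itself.
final class RegexGroupExtractor {

    // Returns the capture groups of the pattern when the whole value matches,
    // or an empty list when the value is null or does not match.
    static List<String> extractGroups(Pattern pattern, String value) {
        List<String> groups = new ArrayList<>();
        if (value == null) {
            return groups;
        }
        Matcher matcher = pattern.matcher(value);
        if (matcher.matches()) {
            // Group 0 is the entire match, so start at group 1.
            for (int i = 1; i <= matcher.groupCount(); i++) {
                groups.add(matcher.group(i));
            }
        }
        return groups;
    }
}
```

For example, the pattern `(\d{4})-(\d{2})-(\d{2})` splits a date string into separate year, month and day parts that can then be processed individually.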
2011-06-27 - by manuel.
DataCleaner 2.2 has been released as of today! This is an exciting new version of our Data Quality Analysis (DQA) and Data Profiling application that is now a lot more extensible, embeddable and compliant with new datastores.
Here's a summary of the news in this release:
Extensibility
- The main driver for this release has been a story about extensibility. While releasing the application we are simultaneously releasing a new DataCleaner website which features an important new area: The ExtensionSwap. The idea of the ExtensionSwap is to allow sharing of extensions to DataCleaner and installation simply by clicking a button in the browser!
- The DataCleaner extension API has been improved a lot in this release, making it possible to create your own transformers, analyzers and filters. If you feel your extensions could be of interest to other users, please share them on the ExtensionSwap and we will provide a channel for you to easily distribute them to thousands of users. The Extension API and the ExtensionSwap are further explained in our new webcast demonstration for developers and other techies with an interest.
- We are also releasing a set of initial extensions on the ExtensionSwap: The HIquality Contacts for DataCleaner extension which provides advanced Name, Phone and Email cleansing, based on Human Inference's natural language processing DQ web services. We are also shipping a sample extension which will serve as an example for developers wanting to try out extension development themselves. In the coming months we will make sure to post even more extensions originating from our internal portfolio of tools that we use at Human Inference's knowledge gathering teams.
- In addition to extensibility we are also focusing on embeddability. We want to be able to embed DataCleaner easily into other applications to make profiling and data analysis possible anywhere! We've created a new bootstrapping API which allows applications to bundle DataCleaner and bootstrap it with a dynamic configuration or run it in a "single datastore mode", where the application is tuned towards just inspecting a single datastore (typically defined by the application that embeds DataCleaner). We already have some really interesting cases of embedding DataCleaner in the works - both in other open source applications as well as commercial applications.
Compatibility
- We've added support for analyzing SAS data sets. This is something we're quite proud of as we are, to our knowledge, the first major open source application to provide such functionality, ultimately liberating a lot of SAS users. The SAS interoperability part was created as a separate project, SassyReader, so we expect to see adoption in DataCleaner's complementary open source communities soon too!
- We've also added support for another type of datastore: Fixed width files. Fixed width files are text files where each column has a fixed width. There is no separator or quote character, as in CSV files; instead, all lines are equal in length and each line is tokenized according to a set of value lengths.
- An option to "fail on inconsistencies" was added to CSV file and fixed width file datastores. These flags add a format integrity check when using these text file based datastores.
- A bug was fixed, which caused CSV separator settings not to be retained in the user interface, when editing a CSV datastore.
- Japanese and other characters are now supported in the user interface. This "bug" was a matter of investigating available fonts on the system and selecting a font that can render the particular characters. On most modern systems there will be capable fonts available, but on some Unix and Linux branches there might still be limitations.
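The fixed width tokenization mentioned in the list above can be sketched as follows. This is an illustration under stated assumptions, not DataCleaner's or MetaModel's reader: the class and method names are hypothetical, values are trimmed of padding spaces, and the "fail on inconsistencies" flag is interpreted here simply as a check that the line length equals the sum of the value lengths.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of fixed width parsing, not the actual datastore code.
final class FixedWidthTokenizer {

    // Splits a line into values using the given column widths.
    // When failOnInconsistencies is true, a line of unexpected length is rejected.
    static List<String> tokenize(String line, int[] widths, boolean failOnInconsistencies) {
        int expectedLength = 0;
        for (int width : widths) {
            expectedLength += width;
        }
        if (failOnInconsistencies && line.length() != expectedLength) {
            throw new IllegalArgumentException(
                    "Expected line length " + expectedLength + " but got " + line.length());
        }
        List<String> values = new ArrayList<>();
        int offset = 0;
        for (int width : widths) {
            // Clamp to the line length so short lines yield empty values in lenient mode.
            int start = Math.min(offset, line.length());
            int end = Math.min(offset + width, line.length());
            values.add(line.substring(start, end).trim());
            offset += width;
        }
        return values;
    }
}
```

In lenient mode a short line still produces one value per column, with empty strings for the missing part, whereas the strict mode surfaces the format problem immediately.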
Other improvements
- The documentation section has been updated! Ever since the initial 2.0 release the documentation has been far behind, but we've finally managed to get it up to date. There are still pieces missing in the docs, but it should definitely be useful for basic usage as well as a reference for most topics.
- Application startup time was improved by parallelizing the configuration loading and by delaying the initialization of those parts of the configuration that are not needed for the initial window display.
- The phonetic similarity finder analyzer has been removed from the main distribution, as this was quite experimental and serves mostly as a proof of concept and an appetizer to the community to create more advanced matching analyzers. You can now find and install the phonetic similarity finder on the ExtensionSwap.
- Handling of cancelled or erroneous jobs was improved, and the user interface responds more correctly by disabling buttons and progress indicators if a job has stopped.
- Fixed a few minor UI issues pertaining to table sizing and use of scrollbars.
2011-05-16 - by kasper.
Another release of DataCleaner sees the light of day today! Although this is not a major release, but a minor one, it does ship some quite nice stabilizing improvements and minor enhancements to the UI.
Enhancements in 2.1.1:
- Added a search/filtering text field on the datastores list. This enables you to quickly find your datastore if you have registered more datastores than available on the screen.
- Reference data for country codes was added to the standard distribution. Thanks go to Graham Rhind for providing it.
- Added a horizontal scroll bar to the data previewing windows if there are more than 10 columns.
- Ability to add an extension package with new functionality in the Options dialog at runtime. More focus on extensions will follow in the upcoming releases.
- We've exposed an early preview of our Command-Line Interface (CLI) by allowing you to invoke the application with the "-usage" parameter which will show the CLI options.
- Added number formatting options to the "Convert to Number" transformer.
Bugfixes in 2.1.1:
- Fixed an out-of-memory issue when querying tables with a LOT of columns (150+).
- Fixed an issue that caused the "Limit analysis" check box not to be checked correctly when a job was re-opened after saving.
- Not really a bugfix as it was never an official feature, but now we support restoring user preferences (the userpreferences.dat file) from previous versions of DataCleaner.
Thanks to everyone involved in the making of this release of DataCleaner.
DataCleaner 2.1.1 is available as a traditional download or as a Java Web Start application on the downloads page. Keep in touch with your feedback to the application on the forums.
2011-04-04 - by kasper.
We're happy to announce the release of DataCleaner 2.1! This is a quite significant release and something that we hope users will recognize as a step forward from the 2.0 versions.
The major news in DataCleaner 2.1 are:
- There was a lot of work done on the user interface (see media page):
- We decided to remove the left-hand side window containing environment configuration options.
- Instead all these options have now been moved to the job building window so the user only has to focus on a single window for all the interactions needed to build a job.
- The welcome/login dialog has also been removed in favor of a more discreet panel that can be pulled in or hidden from the main window.
- Datastore selection and management is considered the first activity in the application, which is why it is also the first step to handle in the main window.
- You can now stop jobs in case you decide to change something before it is done.
- Bar and line charts were added to a lot of the analysis result screens, including String analyzer, Number analyzer, Date/time analyzer and Weekday distribution (see media page).
- All "preview data" windows now contain paging controls so you can move backwards and forwards in the data set.
- Most common database drivers (MySQL, PostgreSQL, Oracle, MS SQL Server and Sybase) have been added to a default set of drivers.
- Configuration of the Quick analysis function in the Options dialog.
- Various minor bugfixes.
- Transformer for extracting date parts (year, month, day etc.) from date columns.
We hope you enjoy DataCleaner 2.1. Please head over to the downloads page to get it!
2011-03-07 - by ankitk.
Eobjects.org and its contributors are pleased to announce that DataCleaner 2.0.2 has just been released.
DataCleaner 2.0.2 is a minor, but not unimportant, release containing a few bugfixes and a set of 8 feature enhancements:
- Tabs and buttons in the workbench are disabled when no source columns have been selected.
- A special widget has been added to the "Source" tab, making it very easy to apply row count based sampling of the input data.
- When possible, filters now have the ability to optimize the query of a job (aka. Push-down optimization). This was implemented for the "Max rows", "Equals" and "Not null" filters.
- The growing number of transformers caused a long list in the "Add transformer" popup. Therefore transformers are now grouped by category and displayed accordingly.
- The visualization of execution flow now allows removing column items and filter outcome items, making the graph more comprehensible, especially for very large jobs.
- The "Coalesce string" transformer now has a "Consider empty strings as null" flag, which is particularly useful when dealing with CSV files.
- Text-based dictionaries and synonym catalogs will get their cached values flushed, if the file they read from changes.
- The "Convert to date" transformer now includes the ability to specify your own date masks, if date strings require it.
- A bug was fixed when passing null values to the email standardizer.
- A bug was fixed pertaining to proper presentation of "mixed" tokens in the Pattern finder.
With these improvements in place we see that DataCleaner 2.x is really catching on and we're very pleased with the quality and pace of improvements we are seeing. Go to the Downloads page right away to grab the new version.
2011-02-21 - by kasper.
Since the release of DataCleaner 2.0, we've seen a renewed interest and a lot of activity around eobjects.org, DataCleaner and Human Inference. We're happy to get all this valuable feedback, and it has also meant that there was some low-hanging fruit, as well as a few very minor bugs, that we could easily address in the existing DataCleaner 2.0 release. This is why, already a week after 2.0 was released, we're releasing an update: 2.0.1.
The update consists of minor changes:
- Filter outcomes were added to the flow visualization.
- A bug was fixed in the widget for selecting the tokenizer's separators.
- The "Equals" filter can now have multiple values to compare with.
- Some minor cosmetic improvements.
For more detail, take a look at the milestone contents at Trac.
DataCleaner 2.0.1 is available at the downloads page and the update has also been automatically applied to our Java Web Start users.
2011-02-13 - by kasper.
The Open Source software community eobjects.org is happy to announce the release of DataCleaner 2.0. This release marks the biggest advance in technology and features for the DataCleaner platform throughout the history of the project.
Amongst exciting new features in DataCleaner 2.0 are:
- Data transformations, allowing you to preprocess, extract, refine, combine and calculate data items as a part of your data profiling jobs.
- Filtering, sampling and subflow management, allowing you to define criteria to exclude and include particular items of data.
- Richer reporting with charts, graphs, navigation trees and more.
- A bunch of new data quality functions for date gap analysis, phonetic similarity finding, synonym lookups and more.
- More configuration options and added data quality measures for existing data quality functions like the Pattern finder, String analyzer and more.
- Reusable profiling jobs, where you define your processing flow once and consequently run it on any data.
- Support for MS Excel 2007+ spreadsheets.
Today it was also announced that Human Inference, the European data quality authority, has finished its acquisition of the eobjects.org site, to actively enter the market for entry-level Open Source data quality products. All projects on eobjects.org will remain open source and the benefits for the community and the products are apparent. The release of DataCleaner 2.0 is the first visible outcome of the acquisition, resulting from several months of intense cooperation between Human Inference and the community members, to put together a state-of-the-art data profiling application.
For more information about the eobjects.org acquisition, see the press release on the Human Inference website.
Times are really exciting in the eobjects.org community these days. We hope you’re all as enthusiastic about the new DataCleaner 2.0 as we are. The application is ready for download and for immediate launch through Java Web Start, so visit the DataCleaner website now.
2010-05-15 - by kasper.
Here it is: DataCleaner 1.5.4 :)
Although this release is a minor release it contains a few exciting features and fixes:
- We've updated the MetaModel version to 1.2 which adds support for two new datastores:
- dBase databases (.dbf files)
- MS Access databases (.mdb files)
- We've fixed a bug pertaining to text-file dictionary "file not found" errors.
- A lot of the other underlying libraries have been updated, providing improvements to performance and stability.
Head on over to the downloads page to grab the new DataCleaner.
2009-10-18 - by kasper.
After much waiting, we are finally ready to release DataCleaner 1.5.3. Here's the wrap-up on what's been going on:
- The MetaModel dependency has been upgraded to version 1.1.8, which means:
- Improved Excel spreadsheet support
- Improved SQL Server support
- Improved performance for CSV files
- Fixed a bug that caused certain database connection errors to be ignored in terms of user feedback.
- Fixed a bug that caused re-opening of database dictionaries to throw a NullPointerException.
- Fixed a bug related to dictionary lookups of null values.
- Added support for Teradata databases.
- Added connection templates for SQL Server connections.
- Added support for selection of custom encodings when reading CSV files.
- Fixed a minor bug relating to reading files on the classpath when running in Java WebStart mode (which manifested in an exception thrown when clicking on "About DataCleaner").
So as you can see, it's been a mix of minor bugfixes and a couple of improvements to compatibility and performance regarding certain datastores. We hope you enjoy this new release of DataCleaner. As always, you can ...
Let us know what you think!
2009-09-08 - by kasper.
About half a year ago we received an exciting inquiry from Jos van Dongen on behalf of him and his co-author Roland Bouman, telling us that they were writing a new book about Open Source Business Intelligence and in particular Pentaho-based solutions. And for this they were looking into DataCleaner for the data profiling section of the book!
The book is now out! It's called "Pentaho Solutions" and it's published by Wiley Publishing. You can read about it and buy it on their website as well.
The book contains a walkthrough for building a data warehouse using Open Source tools and in doing so applying DataCleaner for the important job of profiling and validation.
We congratulate Roland Bouman and Jos van Dongen for their great work to promote Open Source Business Intelligence and thank them for mentioning DataCleaner while they're at it!
2009-07-14 - by kasper.
Dear DataCleaner users,
We are happy to announce the release of DataCleaner 1.5.2. Users of DataCleaner 1.5.0 or 1.5.1 won't be able to see a lot of changes in the user interface, but this release actually holds quite a lot of improvements “beneath the surface”:
- The most notable improvement is in the Value Distribution Profile. Previously this profile consumed quite a lot of memory, which could lead to out-of-memory errors in extreme cases. This has been fixed by using on-disk caching with Berkeley DB when necessary.
- Another notable feature is that we can now distribute DataCleaner as a single JAR file. This means that we will be serving the application as a Java WebStart application (ie. run it as if it's an online application) and we are also considering other distribution options.
- When starting the application, it automatically downloads regular expressions from the RegexSwap.
- A bug in regards to matching number-based columns in dictionaries was reported and fixed.
- A bug in regards to invalid characters in XML-export formats was reported and fixed.
- When opening files, we are now ignoring suffix case so that .CSV files can be opened as well as .csv.
- The number of columns shown in the preview window is automatically restricted if there are too many to show on a single screen.
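To illustrate the on-disk caching idea behind the Value Distribution fix, here is a minimal sketch. It is not DataCleaner's implementation: DataCleaner is a Java application and uses Berkeley DB, whereas this sketch substitutes `sqlite3` from the Python standard library. The class name `SpillingValueCounter` and the `max_in_memory` threshold are hypothetical, and values are assumed to be non-null strings.

```python
import os
import sqlite3
import tempfile
from collections import Counter


class SpillingValueCounter:
    """Counts values in memory and moves the counts to disk past a threshold."""

    def __init__(self, db_path, max_in_memory=1000):
        self.max_in_memory = max_in_memory  # hypothetical tuning knob
        self.memory = Counter()
        self.db = sqlite3.connect(db_path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS counts "
            "(value TEXT PRIMARY KEY, n INTEGER NOT NULL)"
        )

    def add(self, value):
        self.memory[value] += 1
        # Spill once the number of distinct in-memory values exceeds the limit.
        if len(self.memory) > self.max_in_memory:
            self._spill()

    def _spill(self):
        for value, n in self.memory.items():
            cur = self.db.execute(
                "UPDATE counts SET n = n + ? WHERE value = ?", (n, value)
            )
            if cur.rowcount == 0:  # value not yet on disk
                self.db.execute(
                    "INSERT INTO counts (value, n) VALUES (?, ?)", (value, n)
                )
        self.db.commit()
        self.memory.clear()

    def result(self):
        # Flush what is left in memory, then read the merged counts from disk.
        self._spill()
        return dict(self.db.execute("SELECT value, n FROM counts"))

    def close(self):
        self.db.close()


with tempfile.TemporaryDirectory() as tmp:
    counter = SpillingValueCounter(os.path.join(tmp, "counts.db"), max_in_memory=2)
    for v in ["a", "b", "a", "c", "a", "b"]:
        counter.add(v)
    distribution = counter.result()
    counter.close()
```

The point of the design is that memory use is bounded by the threshold rather than by the number of distinct values in the column, at the cost of extra disk I/O for columns with many distinct values.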
You can download DataCleaner from the downloads page or you can use our new feature: Get it via Java WebStart!
This release underlines the ongoing evolution of DataCleaner into a more and more professionally capable data profiler and data quality tool. Seeing that DataCleaner is being used in large corporations worldwide, I wish to address some thoughts that I have been having and that I know users are pondering: How do you best combine the low adoption cost of Open Source applications like DataCleaner with the high flexibility that most commercial business software provides? To service this need we've opened up a new division of the company that I work with, Lund&Bendsen. Whether you need to deploy DataCleaner to high-scale installations, integrate the application with your existing systems, develop customized profiles and validation rules, or satisfy other enterprise needs, we offer you first-class services and in-depth expertise you won't find anywhere else.
To cut to the chase: DataCleaner 1.5.2 is here and we wish to extend the community development with a professional effort. So don't hesitate to let us know if you see an opportunity to invest. Adding value by targeting your use of the product is in the interest of customer, developer and community alike, and this is the reason our business is there.
To all you non-business users out there: Sorry for the obvious commercial rant and we hope you all enjoy the newest DataCleaner release.
Best regards,
Kasper Sørensen
Founder of eobjects.org and the DataCleaner project
2009-04-20 - by kasper.
no user comments
We're happy to announce the release of DataCleaner version 1.5.1. This is a minor release, but it nevertheless contains a few nice features, especially for the users who are enjoying the exporting features that were introduced in 1.5:
- An additional HTML export format has been added to the built-in export formats (usable when exporting Profiler results in the desktop app and when executing the runjob command-line tool).
- The export format can now be selected directly in the desktop app.
- Four new measures were added to the String Analysis profile: avg. chars and max/min/avg white spaces.
- Fill out our online user survey, or
- Post your comments and questions at our discussion forum.
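As a rough illustration of the four new String Analysis measures mentioned above (average characters and max/min/average white spaces), here is a minimal sketch. The function name and result keys are hypothetical, and the exact definitions DataCleaner uses (for example how nulls or tabs are treated) are assumptions.

```python
def string_analysis(values):
    """Return average character count and max/min/average white space count."""
    # Assumption: null values are skipped rather than counted as empty strings.
    strings = [v for v in values if v is not None]
    if not strings:
        return None
    char_counts = [len(s) for s in strings]
    # Assumption: any whitespace character (space, tab, ...) counts.
    space_counts = [sum(1 for ch in s if ch.isspace()) for s in strings]
    return {
        "avg_chars": sum(char_counts) / len(strings),
        "max_white_spaces": max(space_counts),
        "min_white_spaces": min(space_counts),
        "avg_white_spaces": sum(space_counts) / len(strings),
    }


measures = string_analysis(["John Doe", "Jane", "A B C", None])
```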
2009-03-15 - by kasper.
1 user comment(s)
"Finally!" one might say. And this is definitely what is going through my head right as I write this news item. Finally, DataCleaner 1.5 has been released! Once again the effort to bring about the best open source data quality solution is bearing fruit.
The new release is definitely one of the most significant ones in the history of DataCleaner. The overall goal of the release has been to step out of the shadows of the "small tools" pool and mark DataCleaner as an enterprise-ready application for profiling and validating datastores of all kinds, whether in scheduled mode on servers or in an intuitive desktop environment.
For those of you with an interest in every little detail about this release, please feel free to review the complete list of changes - for everyone else, here's the recap:
- Change of license to LGPL.
- Multi-threaded execution of Profiler and Validator.
- Command line (batch) execution of DataCleaner tasks.
- More elaborate status information during profiler and validator execution.
- New profile: Date mask matcher.
- New profile: Regex matcher.
- Load regex from the online RegexSwap repository.
- Automatic download and install of popular database drivers.
- More file types supported (.dat, .txt)
- XML file support improved (.xml)
- Memory improvements in Time analysis profile.
- Improved logging when running profiling and validation.
- Information schema provided for file-based datastores.
- Lazy-loading of columns in datastore-tree.
We hope you enjoy the new DataCleaner 1.5! Now go over and download it right away.
2009-02-12 - by kasper.
1 user comment(s)
Things are starting to shape up for the big release of DataCleaner 1.5. We are starting off with a bit of excitement around in the data quality community.
Probably the most dedicated online magazine about data quality, Data Quality Pro, has launched a series of articles about profiling, validating and comparing data with DataCleaner. So far an introductory tutorial (including a complete and realistic example data set) and a background article/interview have been published:
- Learn how to profile and validate data (for free) using DataCleaner
- Interview with Kasper Sørensen, creator of DataCleaner
We hope that you will enjoy the articles and we thank Data Quality Pro for their great interest in our community.
2009-02-10 - by kasper.
no user comments
Today we are announcing the first company, Lund&Bendsen, to officially support DataCleaner and MetaModel on a commercial level. These eobjects.org projects are, as you know, independent projects that are run with the community in mind. But as the projects grow and companies start using them in a commercial setting, we also welcome third-party commercial support to help spread the projects to environments where community-based support is insufficient.
Lund&Bendsen is a Danish company with strong expertise in Java development and training. Their service offerings include training, customization, integration and enhancement of DataCleaner and MetaModel, so if your company is considering applying DataCleaner, you might be interested in hiring some professionals to aid you in the process.
Over time more companies are expected to join in on commercial support for the eobjects.org projects. Keep up to date on the DataCleaner support page and don't hesitate to contact us for any inquiries in this regard either.
2009-01-26 - by kasper.
no user comments
The Technology Evaluation Centers (TEC) have published an interesting, unbiased and independent analysis of the market for Open Source business intelligence products. We are delighted to see that the article features a section about data quality and that TEC points at DataCleaner as a competent choice among the open source products:
"In such situations, where the vendor does not support a specific functionality, organizations can look to complementary open source solutions; the DataCleaner project from eobjects.org, for instance, provides functionality to help profile data and monitor data quality. It also points to a significant advantage with open source applications: the fact that software is developed by the community and for the community makes it much simpler to share innovative solutions quickly and seamlessly."
You can read the whole article by Anna Mallikarjunan from TEC by going to their website (user registration is required).
2009-01-22 - by kasper.
no user comments
Another batch of updates, fixes and improvements for the upcoming DataCleaner release is ready. This time it's Release Candidate 2 offering a preview of what's to come in DataCleaner 1.5.
The main changes since Release Candidate 1 are multithreaded execution, the command line interface (runjob.sh / runjob.cmd), some UI updates and a few bugfixes. Go download the release candidate and use it as an opportunity to influence the development process by posting your comments on the DataCleaner forum.
- DataCleaner download site: http://datacleaner.eobjects.org/downloads
2009-01-12 - by kasper.
no user comments
After working hard for a couple of days to implement substantial new features regarding integration of eobjects services and automatic download and install of popular database drivers, a new release candidate of DataCleaner is ready!
We hope that a lot of people will use the release candidate and provide feedback for further development towards the 1.5 final release.
- DataCleaner download site: http://datacleaner.eobjects.org/downloads
2009-01-09 - by kasper.
no user comments
I've spent the last couple of days implementing a couple of cool enhancements to the DataCleaner desktop-application:
- Automatic download and install of popular database drivers, along with template connection strings in the "Open database" dialog. This will hopefully make it much easier for less experienced users to set up a connection to their database of choice.
- Direct integration with the new RegexSwap system so that the regexes that you post online will be accessible from within the desktop-application.
Screenshots have been posted to the media page.
Wait for DataCleaner 1.5 for these features or build it yourself to check them out now.
2009-01-05 - by kasper.
2 user comment(s)
Only a few days after the launch of the new DataCleaner website, we are once again ready with new exciting features. This time we are launching the first edition of our new regular expression (regex) sharing subsite called "RegexSwap".
RegexSwap is a specialized forum for sharing, categorizing, commenting and voting on regular expressions that can be used in DataCleaner and other regex-based applications. It is really easy to post your own regular expressions, test them online on the website, comment and vote on the regexes that you have found useful. In time the next releases of DataCleaner will also take advantage of this online "always up to date" regex resource and offer direct integration with RegexSwap.
RegexSwap is still in beta but is ready at a functional level, which is why we are launching it publicly now. It will receive dedicated attention in the weeks and months to come.
2009-01-02 - by kasper.
1 user comment(s)
Dear everybody,
As a special Christmas present we have been working hard to design a new website for DataCleaner! Hopefully you will all enjoy the new site, which has been designed to further support our community and let it grow by incorporating more features to socialize and share ideas online. So go visit it now at the new URL:
Among the new features are a more personal profile system which is linked to some of the communities that our users already use frequently, namely LinkedIn and SourceForge. We have a whole new media section with cool screenshots and webcasts. We are also redesigning our mailing list structure. Instead of the single mailing list that we have been using so far, we are launching new "announcement" and "dev" mailing lists.
Our goal is to continuously launch new features on the website, the first being a user survey to gain better insight into the minds of our users and community. So be sure to fill it out. In the future we will add more exciting features such as online sharing of regular expressions and reference data for DataCleaner dictionaries.
The old website will continue to exist, but primarily as a wiki and bug tracking system. During the next couple of days we will be editing the wiki pages to make them more suitable for wiki-style editing (by everyone) as opposed to the former read-only strategy.
We hope you like our Christmas present and that you will let us know, and we wish you all a great 2009. Without a doubt, it will bring exciting times for DataCleaner and the DataCleaner community.
2008-10-13 - by kasper.
no user comments
As we're moving steadily along towards the release of DataCleaner 1.5 we are fixing a few bugs and enhancing a lot of features. This leads to the desire to release our work since practically nothing has undergone changes that could destabilize the application since the 1.4 release. So today we're releasing DataCleaner 1.5 "snapshot". This also marks the first release under our new LGPL license.
Here are the changes from 1.4 so far:
- Change of license to LGPL.
- New profile: Date mask matcher.
- New profile: Regex matcher.
- More file types supported (.dat, .txt)
- XML file support improved (.xml)
Although this is in principle a development/beta release, we feel that it would be worth working with for most of your profiling needs. So... Go on, download it, tell us what you think and we'll see you around!
2008-10-06 - by kasper.
no user comments
We've made a principal decision at eobjects.org to change the preferred license of our projects from the Apache License 2.0 to the Lesser General Public License (LGPL).
The main difference between the two licenses is that the LGPL requires any modifications to be contributed back to the Open Source community (i.e. licensed under a similar license; LGPL or GPL). The eobjects.org projects gain the obvious advantages of the LGPL by ensuring that improvements are submitted back to the projects. This also means that we don't risk anyone selling modified versions of our projects. It is still just as appropriate to use the projects as a part of commercial applications, but any modifications must be contributed back to the community.
- The Apache License 2.0: http://www.apache.org/licenses/LICENSE-2.0
- Lesser General Public License: http://www.gnu.org/licenses/lgpl-3.0.txt
Initially this change in license will affect the two flagship projects of eobjects.org: DataCleaner and MetaModel. This means that the next versions of these projects (DataCleaner 1.5 and MetaModel 1.1 respectively) will be LGPL licensed. Also, new projects will be LGPL licensed unless special circumstances suggest otherwise.
2008-09-26 - by kasper.
1 user comment(s)
We've just uploaded a webcast of the new DataCleaner 1.4, which provides a long-awaited update for the old 0.4 webcasts!
Go enjoy the webcast, and be sure to download the newest version of DataCleaner. Over and out!
2008-09-21 - by kasper.
no user comments
I'm pleased to announce the release of DataCleaner 1.4! This is a release that we feel will satisfy a lot of users with improvements and fixes for a lot of issues. Here's a very short compilation of changes; for more details, take a look at the roadmap.
- Replaced "Repeated values" profile with better and more advanced "Value distribution" profile.
- Dictionary matcher drill-to-details options.
- New application logo.
- Lots of small bugfixes and UI beautifications.
- Lots of sample dictionaries and regexes.
We hope you enjoy the new version of DataCleaner. Get it now!
2008-09-16 - by kasper.
no user comments
After some considerations about the future of DataCleaner, we've updated the roadmap to reflect our current plans for the direction of development. We are planning on releasing DataCleaner 1.4 by the end of the month and after that two new milestones have been added:
- DataCleaner 1.5: The main focus of this release is to provide a command line interface for our data quality framework. This means that users will be able to easily create batch jobs that they can schedule using their favorite scheduler. Other features will also include Pattern Finder improvements and a couple of new profiles.
- DataCleaner 1.6: We have a lot of suggestions that have been filling up our backlog. DataCleaner 1.6 will be all about getting everybody's needs into the application before we get ready to begin the webapp. Some of the exciting features of DataCleaner 1.6 will be relationship profiling and exporting of results.
2008-09-08 - by kasper.
no user comments
Great news everybody. The Open Source Days '08 conference in Copenhagen will feature a so-called Lightning Speak by Kasper Sørensen on the topic of DataCleaner and the eobjects.org community.
We're really happy to get the message of DataCleaner out to more people and a conference like this is an ideal spot for demonstrations, discussions and experiences. Read more about the lightning speak at Kasper's blog:
Update: The presentation is over and you can now also read the retrospective at Kasper's blog:
- http://kasper.eobjects.dk/2008/10/fast-as-lightning.html
2008-08-26 - by kasper.
no user comments
We've released a development/snapshot release of DataCleaner 1.4 in order to get early reactions for all the improvements and new features as well as supporting our users with up to date functionality. In my own opinion the development release is just as stable and "safe to use" as 1.3, but of course it lacks a bit of the manual testing that we put into the real releases.
You can download the development release at our SourceForge download site.
Here's a short list of fixes since DataCleaner 1.3:
* Better memory handling and garbage collection
* Reference columns in drill-to-details windows
* Better error handling when loading schemas
* Quoting of string values in visualized tables (in order to distinguish empty strings and white spaces)
* New profile: Value Distribution, which is an improved version of the Repeated Values profile. The Value Distribution profile has an option to configure the top/bottom n values to include in the result.
* Better control of profile result column width.
* Bugfix: Copy to clipboard functions now work properly.
* Bugfix: Scrollbars added to visualized tables.
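To make the "top/bottom n values" option of the Value Distribution profile concrete, here is a minimal sketch of the idea. It is an illustration only, not DataCleaner's code (DataCleaner is a Java application); the function name and the `top_n` / `bottom_n` parameters are hypothetical.

```python
from collections import Counter


def value_distribution(values, top_n=None, bottom_n=None):
    """Return (top, bottom) lists of (value, count) pairs."""
    ranked = Counter(values).most_common()  # most frequent first
    # Without a limit, report every value in the top list.
    top = ranked[:top_n] if top_n is not None else ranked
    # The bottom list is only produced when a limit is configured.
    bottom = ranked[::-1][:bottom_n] if bottom_n is not None else []
    return top, bottom


countries = ["DK", "DK", "DK", "SE", "SE", "NO"]
top, bottom = value_distribution(countries, top_n=1, bottom_n=1)
```

Limiting the result to the most and least frequent values keeps the output readable for columns with many distinct values, which is presumably the motivation for the option.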
Take a look at the roadmap for more current developments of DataCleaner.
