back to forum.

Topic: Data quality best practices

Topic by
kasper

2008-10-18
11:51

Data quality best practices

Yesterday I was taking a look at some white papers from some of the proprietary data quality solutions, namely these two papers on best practices:

I was thinking that I wanted to have a look at their recommended approach to data quality to evaluate whether or not DataCleaner could do the job in their best practice fashion. Here are a couple of features that I stumbled upon and want to ask everyone about:

sampling - the principle of taking out a "piece of the cake" to run profiling on just that piece as a seperate analysis. Do you guys need such a feature? It would be rather easy to build into DataCleaner I think, but it just never occurred to me as interesting to filter the data that you're profiling. But I guess it could be useful?

join testing - while this is something that is possible in DataCleaner by using dictionaries, it's not something that we have approached directly as a distinct feature. And by incorporating joins within the same data source we could also do it quite a lot faster than using the dictionary approach. So what do you guys think? Is this a feature that is important?

metadata management - I would personally hate to implement our own metadata management system as that would just be another information silo with no connection to the outside world. So here's a question to everybody - are there any good Open Source metadata management components or applications out there that have good compatibility etc.?

issue management - There's a mentioning of issue management tools to handle the lifecycle of data quality issues. This is a very interesting feature I think but I've never seen it in practice. Which systems do data warehouse developers use to track issues? And should we provide integration with some of these systems? (For example, a right click option in the profiling results called "mark measure as issue in [issue tracker]").

These are some loose ideas for the vision of DataCleaner and I can't promise that we will do them all but they definately caught MY interest... So let's hear it if you think it's something we should put into the roadmap! All in all I must say that comparing DataCleaner to proprietary data quality solutions is quite a pleasure... We're doing quite alright! ;-)

Reply by
beno

2008-10-18
17:22
I vote for the sampling/filtering functionality. A couple of my tables contains a "validated" flag where only the records that have been validated are really interesting to profile because the other ones are known to have data quality issues. So it's not so much because I want to just profile a "sample", but because only a subset of the records have real relevance.

However, I've made an easy little workaround by just creating a view that does the filtering and then I just profile the view...

Reply by
benbor

2008-10-22
20:24
I will refer to each item on your list.

sampling - I had to use this feature ''once'' (I was using DataFlux for profiling), when I had a very large data source (all the credit details of one of the largest Canadian banks) to profile. So I used the sampling feature to run profiling on 10% of the data, assuming (correctly) that this will identify most of the problems with the data. I later ran full profiling, which took a whole weekend to run. The 10% sample took a few hours and gave me the results I needed. I therefore vote to have sampling as a feature of DataCleaner. It should allow you to select a percentage, then use a random or linear (this could be a user choice) algorithm to select a sample of records to profile.

Join Testing - This is an essential feature of any data profiling tool. I use it often to find, for example, all the transactions whose currency is not in the currency table. It is possible to do this with dictionaries, but it could (and in my view even should) be made into an explicit Join Testing feature.

Metadata Management - I think that currently DataCleaner is a ''data profiling'' tool, not a ''data cleaning'' tool. I therefore see no reason to include metadata management.
I would use metadata management tools when I need to ensure that my users can understand the meaning and derivation of the data that I present to them. They need to know what I "did" to the data that they see, so that they can decide if they can trust the data. So when I use data-cleansing tools, I would like the cleansing process to be visible to the users, therefore I would include the cleansing in the description of the data-derivation process (together with the extracting, loading and transformations that are part of the same process).
I would therefore not expect DataCleaner to have any Metadata Management facilities until the tools starts actually changing the data. When this happens, the best way would be to find some industry standard way to notify an existing Metadata tool of the work done in DataCleaner.

Issue Management - again, not an important feature for me. I'd like to see DataCleaner concentrate on the important features of profiling and cleansing. I can't see the industry deciding on a single approach to Issue Management, therefore any solution that we may implement would be useful only to a very small percentage of the users of DataCleaner.

Ben

Reply by
kasper

2008-10-23
15:41
Based on your responses I think that we should add tickets for DataCleaner 1.6 for sampling and join testing. So I will add tickets in a sec.

Regarding sampling I think we will have to provide three types of sampling to support all your needs:
  • Simple "max rows" sampling. This will be both the easiest and probably the most performant way to make a sample.
  • Random selection sampling. This is not hard to implement but I fear that it will be a pretty wasteful approach performance-wise because AFAIK the only way to do this is to perform a "full" query and then do the randomizing client-side.
  • Based on a filter. In this scenario the user either writes an SQL WHERE-clause (for database-based datastores) or makes a MetaModel-based filter (which can transform to SQL too) based on some kind of filter-dialog.

Reply by
jvdongen

2009-03-03
18:25
Hi Kasper,

a bit late on this subject, but more export options would be great. E.g. when I find exceptions and drill down to see the details, it would be nice if I could save these details in a CSV file for further reference or store them in an exeption table. Maybe this is already part of the profile result persistency you're working on. And dependency testing, both between columns and between tables would be a great extension of DataCleaner.

Reply by
jvdongen

2009-03-03
18:40
Oops; excuse me for being stupid. Right clicking gives me all the save options I need... Nevertheless, the dependency check request still stands.

Reply by
fmarin

2009-07-01
03:24
Thanks for all the good work with datacleaner and looking forward to new features, In 1.6. I know this is a late response but just writing to confirm that I too would love the sampling and join features, I'm currently working around it by creating views specifically for datacleaner. Thanks!!!

You need to be logged in to participate

In order to post your own comments on this topic, you need to be logged in.

Username:

Log in by clicking the login link at the top of the screen

 

Go back to forum.