Topic: Standard Measures: Distinct Count

Topic by
dhartford

2010-05-24
23:45

Standard Measures: Distinct Count

Hey all,
Hopefully this hasn't already been asked and I just couldn't find it --

Is there a way to add 'distinct count' (or duplicate count, whatever verbiage) to the Standard Measure profile under the 'Row count'?

Thanky!
-D
Reply by
dhartford

2010-05-24
23:49
Shoot, there is a similar topic here - http://datacleaner.eobjects.org/topic/8/Duplicate-records

Duplicate and distinct are pretty much the same thing, though (one measures the 'unique count', the other measures how many are duplicates), and I'd want this for the same reasons as described there.
Reply by
kasper

2010-05-25
08:32
Ah yes, it looks like this feature has been overlooked for some time.

When you say distinct count, do you mean on a per-column basis or a per-table basis? I can imagine that both measures will be of some value (e.g. "how many distinct first names are there in the person-table" and "how many records are exactly the same in the person-table"). But which one do you need?
Reply by
dhartford

2010-05-25
14:39
For my needs, distinct on a per-column basis would be the important one for the Standard Measures section ("how many distinct first names"). The reason I would like this in Standard Measures is that distinct is one of the first things I look at when evaluating data.

The "Value Distribution" profile gives some good information on duplicates already (top/bottoms), and could include a more general metric of 'number of records that are duplicates', which would be more useful in this profile if you want to look more closely at this kind of information anyway.
Reply by
dhartford

2010-05-25
14:49
Honestly -- I think 'Distinct' probably does fit better in the Value Distribution profile.

It wouldn't be the first place I personally would look, but it makes better sense there.
Reply by
kasper

2010-05-25
14:53
Oh, but if you disable the top/bottom properties of the value distribution profile, it will give you a complete listing of all values and their distributions! At the bottom of that list you will find "distinct values" :)
Reply by
dhartford

2010-05-25
15:07
Ah, I see that now!

But it doesn't look like the right number I'm looking for - it doesn't include the 'unique duplicates'.

For example, after pulling the data into a database, a distinct on Address_field gives 9942, but the DataCleaner profile returns 5333 out of 20,000 (because there are a lot of duplicates that aren't included).

In the same dataset, a different field has 19842 SQL distincts out of 20,000, while DataCleaner shows 19764.

I think it's close, but not quite what I was hoping for :-(
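
One way the two numbers could diverge, as a minimal sketch: it assumes the profile's figure counts only values that occur exactly once, while SQL's COUNT(DISTINCT ...) counts every different value. The table name `person`, the column `address_field`, and the sample rows are made up for illustration and use an in-memory SQLite database rather than the real dataset.

```python
import sqlite3

# Hypothetical sample data: 7 rows, 4 different addresses.
rows = [
    ("1 Main St",), ("1 Main St",),
    ("2 Oak Ave",),
    ("3 Elm Rd",), ("3 Elm Rd",), ("3 Elm Rd",),
    ("4 Pine Ln",),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (address_field TEXT)")
conn.executemany("INSERT INTO person VALUES (?)", rows)

# Every different value, whether it repeats or not.
sql_distinct = conn.execute(
    "SELECT COUNT(DISTINCT address_field) FROM person"
).fetchone()[0]

# Only values that appear exactly once (no duplicates at all).
occurring_once = conn.execute(
    "SELECT COUNT(*) FROM ("
    "  SELECT address_field FROM person"
    "  GROUP BY address_field HAVING COUNT(*) = 1"
    ")"
).fetchone()[0]

conn.close()
print(sql_distinct, occurring_once)
```

If the assumption holds, the gap between the two results is the number of values that have at least one duplicate.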
Reply by
dhartford

2010-05-25
15:14
hmm...to be more precise, adding two more measures:

rowcount:20,000

Address_field:

Bottom: 5333

[new]Distinct Count: 9942

[new]Duplicated Count: 4609

Does that sound reasonably useful to other people?
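
To make the proposed measures concrete, here is a rough sketch of how they relate, assuming "Bottom" is the number of values that occur exactly once, "Distinct Count" is the number of different values, and "Duplicated Count" is the number of different values that occur more than once. That reading matches the figures above (9942 - 5333 = 4609). The function name `column_measures` and the sample column are hypothetical, not part of DataCleaner.

```python
from collections import Counter


def column_measures(values):
    """Compute the proposed per-column measures for an iterable of values."""
    counts = Counter(values)  # value -> number of occurrences
    distinct_count = len(counts)
    # Values that appear exactly once (assumed meaning of "Bottom").
    unique_count = sum(1 for n in counts.values() if n == 1)
    # Different values that appear more than once.
    duplicated_count = distinct_count - unique_count
    return {
        "row_count": sum(counts.values()),
        "unique_count": unique_count,
        "distinct_count": distinct_count,
        "duplicated_count": duplicated_count,
    }


# Hypothetical sample column.
measures = column_measures(["a", "a", "b", "c", "c", "c", "d"])
print(measures)
```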
Reply by
kasper

2010-05-25
15:22
It does sound very useful to me, but I would probably not vote for putting those measures in the Standard Measures profile, because that profile currently works by iterating through all records (i.e. it would have to hold all values in memory). The Value Distribution profile is much better at handling this kind of issue because it has support for on-disk storage.

(This is a general design issue with DataCleaner that is being addressed with a redesign of some of the core components of the "engine" in DataCleaner, in a pilot project called analyzerbeans)

For more info:
* the wiki page for analyzerbeans: http://www.eobjects.dk/trac/wiki/AnalyzerBeans
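
The memory concern can be illustrated with a small sketch that keeps the per-value counts in a file instead of in memory. This is not how the Value Distribution profile is implemented; SQLite is swapped in here purely as a stand-in for "on-disk storage", and the function name `count_values_on_disk`, the table `value_counts`, and the sample values are all made up.

```python
import os
import sqlite3
import tempfile


def count_values_on_disk(values, db_path):
    """Stream values into an on-disk table of counts, then read the measures."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS value_counts ("
        "  value TEXT PRIMARY KEY, n INTEGER NOT NULL)"
    )
    for v in values:
        # Create the row on first sight, then bump its counter.
        conn.execute(
            "INSERT OR IGNORE INTO value_counts (value, n) VALUES (?, 0)", (v,)
        )
        conn.execute("UPDATE value_counts SET n = n + 1 WHERE value = ?", (v,))
    conn.commit()
    distinct = conn.execute("SELECT COUNT(*) FROM value_counts").fetchone()[0]
    duplicated = conn.execute(
        "SELECT COUNT(*) FROM value_counts WHERE n > 1"
    ).fetchone()[0]
    conn.close()
    return distinct, duplicated


with tempfile.TemporaryDirectory() as tmp:
    # An iterator stands in for a record stream read one row at a time.
    stream = iter(["a", "a", "b", "c", "c", "c", "d"])
    distinct, duplicated = count_values_on_disk(
        stream, os.path.join(tmp, "counts.db")
    )

print(distinct, duplicated)
```

Only one value is held in Python at a time; the set of seen values lives in the database file, which is the trade-off being described (slower per row, but not bounded by memory).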
Reply by
dhartford

2010-05-25
15:28
Putting those in the Value Distribution profile actually sounds great to me, once thought-through! :-)
