Topic: Is DataCleaner right for me?
Hello community,
I come from a social science background and am new to data cleaning and deduplication. I am seeking to identify duplicate cases in a single database using 4 variables: a unique ID #, birthdate, gender, and race. The problem is that these values are self-reported and entered by multiple time-crunched agencies, which means a lot of missing data and invalid entries. I am hoping to use these 4 measures to create some index of confidence in each matching case and to create a column containing a value that identifies duplicate cases. Is DataCleaner a program I should invest my efforts into to achieve this goal? Any information and recommendations are greatly appreciated! -Nick
Hi Nick,
As I see it, there are two tasks you need to do: first clean up the dirty/missing values, and then deduplicate once the data has been improved.
For the first task, I definitely think DataCleaner is the right tool. You can use synonyms for replacing dirty values with correct ones. You can use value distribution and pattern finder for identifying those dirty values, and you can use e.g. date mask matching for cleaning up your birthdates.
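To make the cleaning idea concrete, here is a minimal sketch in plain Python of what synonym replacement and date mask matching amount to. It is not DataCleaner's implementation; the synonym table, the list of date masks, and the function names are all assumptions made up for illustration.

```python
from datetime import datetime

# Hypothetical synonym table: maps dirty spellings to one canonical value.
GENDER_SYNONYMS = {"m": "M", "male": "M", "f": "F", "female": "F"}

# Hypothetical date masks, tried in order; adjust to the formats in your data.
DATE_MASKS = ["%Y-%m-%d", "%m/%d/%Y", "%d.%m.%Y"]


def clean_gender(raw):
    """Return the canonical gender code, or None if missing or unrecognized."""
    if raw is None:
        return None
    return GENDER_SYNONYMS.get(raw.strip().lower())


def clean_birthdate(raw):
    """Return the date as ISO text if it matches any mask, else None."""
    if not raw:
        return None
    for mask in DATE_MASKS:
        try:
            return datetime.strptime(raw.strip(), mask).date().isoformat()
        except ValueError:
            continue  # this mask did not fit, try the next one
    return None
```

Values that come back as None are the ones worth reviewing by hand, since they matched neither a known synonym nor a known date mask.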
For the second task, there are a few options in DataCleaner:
- For strict matching: Concatenate all your values (maybe not the unique id) and point the concatenated value to the Value Distribution analyzer.
- For simple fuzzy matching: Get the "phonetic similarity finder" extension. This matcher only supports phonetic matching so it might not fit the bill so well, but maybe you should try it.
- For contextual fuzzy matching: We will soon be releasing a Human Inference duplicate detection analyzer, which will fit that bill completely, but you will probably need to wait a few weeks for it!
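The strict-matching option above can be sketched outside DataCleaner roughly like this. This is only an illustration in plain Python, assuming each record is a dict with the hypothetical field names `birthdate`, `gender`, and `race`; inside DataCleaner the Value Distribution analyzer does the counting for you.

```python
from collections import Counter


def match_key(record, fields=("birthdate", "gender", "race")):
    """Concatenate the chosen fields into one key; None if any field is missing."""
    values = [record.get(f) for f in fields]
    if any(v is None or v == "" for v in values):
        return None
    return "|".join(str(v).strip().lower() for v in values)


def flag_duplicates(records):
    """Return one boolean per record: True when its key occurs more than once."""
    keys = [match_key(r) for r in records]
    counts = Counter(k for k in keys if k is not None)
    return [k is not None and counts[k] > 1 for k in keys]
```

Records with a missing field get no key and are never flagged, which is a deliberate choice in this sketch: with only three attributes, treating blanks as equal would produce many false matches.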
So in short: Yes I think DataCleaner is a good tool for what you want. Keep us updated with your experiences and tell us if there's anything you think should be improved.
Nick,
In the next few weeks, we plan to release HiQuality Worldwide. This is a cloud-based deduplication engine that will certainly be capable of finding duplicate entries in your data.
We will bring HiQuality Worldwide to you as an extension to DataCleaner. It will be very easy to use.
If you want, I will gladly keep you posted.
kind regards,
Hans Drexler
Thank you for your replies, Kasper and Drexler. It sounds like you are both speaking about the same extension that will be coming in the next few weeks. Will this analyzer be freeware?
Drexler, the unique ID I am working with is the driver's license number, so I would be hesitant to move my data off my hard disk. Am I correct in assuming my data would need to be uploaded to the net to use HiQuality?
Thanks again for your help. I am currently watching the Webcasts and reading the DataCleaner documentation.
Nick,
Yes, Kasper and I are indeed talking about the same product.
The deduplication tool will not be entirely freeware, although there will be a free version that can be used for small tasks.
Yes, the deduplication tool will run on a cloud server. Data transfer is always over SSL-secured connections, but you would need to trust us not to look at your data. We will not.
Please feel free to ask more, if you like the idea.
Hi Nick,
DataCleaner 2.4 has just been released TODAY. It includes the all-new Duplicate detection analyzer, which does the job! :-) It is free to use, but requires registration on www.easydq.com, since the deduplication server was built and deployed there.