Topic: Duplicate records
Hi DataCleaner people... I'm wondering if you're planning to include support for finding duplicate records in a database? At my company we need a tool to find such records, which is the reason I'm asking.
By the way, I really like the product other than this "missing feature" :)
This feature has not been included in the roadmap, but I see no reason why not! And you're right, it seems like a very relevant thing. So I will add a ticket for it right away.
Very good. Looking forward to trying it out.
Duplicates in names and addresses are one of the most prominent data quality issues around.
There are several different approaches to finding duplicates in names and addresses:
Synonyms: This is in my eyes the most basic approach. You keep a list that maps common variants of a word (misspellings, nicknames and so on) to a standard form. This approach of course depends on heavy maintenance and must be reworked for every language/country. It actually works better with English than with the other Germanic languages, where words are concatenated (like ‘Main Street’ being ‘Mainstreet’).
Match codes: These range from very simple ones to more sophisticated ones, going from ignoring vowels, through Soundex and Metaphone (for English), to proprietary schemes of all kinds. In my eyes match codes work OK for selecting candidates for matching, but fall a bit short when it comes to actually settling the case.
Algorithms: A complex algorithm seems to be the only reliable way to actually settle whether two differently spelled records represent the same real-world entity. You have to deal with truncations, non-phonetic typos, rearranged words and letters and all that jazz.
Probabilistic learning: This is at bottom a variation of synonyms, but the collection is built not through up-front maintenance but from users' actual decisions when they verify automatic matches. You may register the frequency and context of the decisions and combine them with match codes and algorithms. This of course requires a substantial collection of decisions.
All approaches may of course be combined – and parsing and standardisation (a topic of its own) is often a supplementary method when you dedupe.
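To make the combination concrete, here is a minimal Python sketch of two of the approaches above working together: a crude vowel-dropping match code to select candidates, then an edit-distance comparison to settle each pair. This is only an illustration, not how DataCleaner does it. The function names, the match code rules and the 0.8 threshold are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations

VOWELS = set("aeiou")


def match_code(name: str) -> str:
    """Crude match code (assumed rules): keep letters only, lowercase,
    drop vowels after the first letter, collapse repeated letters."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    code = [letters[0]]
    for c in letters[1:]:
        if c in VOWELS:
            continue
        if c != code[-1]:
            code.append(c)
    return "".join(code)


def levenshtein(a: str, b: str) -> int:
    """Classic edit distance: insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance scaled to 0..1, where 1.0 means identical."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


def find_duplicates(names, threshold=0.8):
    """Group names by match code, then compare pairs within each group."""
    blocks = defaultdict(list)
    for name in names:
        blocks[match_code(name)].append(name)
    pairs = []
    for group in blocks.values():
        for a, b in combinations(group, 2):
            if similarity(a, b) >= threshold:
                pairs.append((a, b))
    return pairs


names = ["Main Street", "Mainstreet", "Peterson", "Petersen", "Pat Rosen"]
print(find_duplicates(names))
# [('Main Street', 'Mainstreet'), ('Peterson', 'Petersen')]
```

Note that 'Pat Rosen' gets the same match code as 'Peterson' (ptrsn), so it is selected as a candidate, but the edit-distance step rejects it. That is the "selecting candidates" versus "settling the case" split in practice.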
Hi SantaClaus
Thanks for the input, those are some interesting approaches and I have added them (or rather, this discussion thread) to the ticket description (Ticket #105) for us to use when implementing the duplicate records profile.
Santa, just wanted you to know that we now have a phonetic similarity finder in the latest development version of DataCleaner 2.0. It uses a combination of Soundex, Refined Soundex and Metaphone. Let me know if this is something you're interested in looking deeper into!
I'm looking into data cleansing dealing primarily with East-Asian names. I read that Soundex is not successful for this kind of job. I am wondering if this extension has been tested successfully on non-English names?
Sorry, just checking for a quick solution to the problem I'm looking at :D
Hi wsquare,
Soundex is indeed focused on English names. However, there is also a transliterator in DataCleaner, and you might see whether that serves your needs. In addition, you can try Soundex over the transliterated names.
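As a rough illustration of the "Soundex over transliterated names" idea, here is a small Python sketch. It is not DataCleaner's transliterator or its Soundex implementation. The `fold_diacritics` helper is a hypothetical stand-in that only strips accents from Latin-script names, so names written in other scripts would need a real transliteration step first. The Soundex function follows the standard American Soundex rules.

```python
import unicodedata

# Standard Soundex digit groups; vowels, y, h and w have no digit.
CODES = {}
for group, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                     ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in group:
        CODES[ch] = digit


def fold_diacritics(text: str) -> str:
    """Stand-in for transliteration (assumption): decompose characters
    and drop the combining accent marks. Latin script only."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def soundex(name: str) -> str:
    """American Soundex: first letter plus three digits, zero padded."""
    letters = [c for c in fold_diacritics(name).lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "hw":
            continue  # h and w do not separate letters with the same digit
        digit = CODES.get(c, "")
        if digit and digit != prev:
            result += digit
        prev = digit  # a vowel resets prev, so a repeated digit counts again
    return (result + "000")[:4]


print(soundex("Nguyễn"), soundex("Nguyen"))  # N250 N250
print(soundex("Zhang"), soundex("Chang"))    # Z520 C520
```

The second line of output shows the kind of limitation wsquare mentions: two romanizations that can stand for the same name get different codes because Soundex always keeps the first letter. So it is worth testing on a sample of your own data before relying on it.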