Topic: Duplicate records

Topic by
beno

2008-03-23
12:56

Duplicate records

Hi DataCleaner people... I'm wondering if you're planning to include support for finding duplicate records in a database? At my company we need a tool to find such records, which is the reason I'm asking.
Reply by
beno

2008-03-23
12:57
By the way, I really like the product other than this "missing feature" :)
Reply by
kasper

2008-03-24
12:13
This feature has not been included in the roadmap, but I see no reason why not! And you're right, it seems like a very relevant thing. So I will add a ticket for it right away.
Reply by
beno

2008-03-24
23:03
Very good. Looking forward to trying it out.
Reply by
SantaClaus

2008-08-01
05:00
Duplicates in names and addresses are one of the most prominent data quality issues around.

There are a couple of different approaches to finding duplicates in names and addresses:

Synonyms: In my eyes this is the most basic approach. You keep a list of common substitutions for words, such as common misspellings, nicknames and so on. This approach of course depends on heavy maintenance and must be reworked for every language/country. It also works better with English than with the other Germanic languages, which use concatenated words (like ‘Main Street’ becoming ‘Mainstreet’).
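
To make the synonym approach concrete, here is a minimal sketch in Python. The table entries and the function name `normalize` are invented for illustration; a real list would be far larger and maintained per language/country.

```python
# Hypothetical synonym table (illustrative entries only).
SYNONYMS = {
    "bob": "robert",
    "bill": "william",
    "st": "street",
    "str": "street",
    "rd": "road",
}

def normalize(text):
    """Lowercase, drop '.' and ',', and map each word through the synonym table."""
    words = text.lower().replace(".", " ").replace(",", " ").split()
    return " ".join(SYNONYMS.get(word, word) for word in words)

# Two spellings of the same record normalize to the same string.
print(normalize("Bob Smith, 12 Main St.") == normalize("Robert Smith 12 Main Street"))

# Concatenated words are not caught by a word-for-word table.
print(normalize("Mainstreet") == normalize("Main Street"))
```

The second comparison fails to match, which is the concatenated-word weakness mentioned above.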

Match codes: These range from very simple ones to more sophisticated ones, from ignoring vowels through Soundex and Metaphone (for English) to proprietary schemes of all kinds. In my eyes match codes work OK for selecting candidates for matching, but fall a bit short when it comes to actually settling the case.
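
As an illustration of match codes used for candidate selection, the sketch below writes out plain American Soundex by hand and groups names that share a code. The function names `soundex` and `candidate_groups` are made up for this example and are not part of any particular tool.

```python
from collections import defaultdict

# Soundex digit for each coded consonant; vowels, y, h and w have no digit.
CODES = {}
for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")]:
    for ch in letters:
        CODES[ch] = digit

def soundex(name):
    """Return the 4-character American Soundex code of a name."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate two equal codes
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code is kept
    return (result + "000")[:4]

def candidate_groups(names):
    """Group names by match code; only groups with 2+ names are candidates."""
    groups = defaultdict(list)
    for name in names:
        groups[soundex(name)].append(name)
    return [group for group in groups.values() if len(group) > 1]

print(candidate_groups(["Smith", "Smyth", "Jones", "Robert", "Rupert"]))
```

Note that 'Robert' and 'Rupert' end up in the same group even though they are different names, which is why a match code is only good for picking candidates.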

Algorithms: A complex algorithm seems to be the only reliable way to actually settle whether two differently spelled records represent the same real-world entity. You have to deal with truncations, non-phonetic typos, rearranged words and letters, and all that jazz.
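
A real matching algorithm is much more involved than this, but as a rough sketch of the idea: sort the words so that rearranged words do not matter, then score the remaining character-level differences with `difflib.SequenceMatcher` from the Python standard library. The function name `similarity` and any threshold you pick on its score are assumptions for illustration.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Return a score between 0 and 1; word order is ignored."""
    # Sorting the lowercased words makes 'Smith John' equal to 'John Smith'.
    key_a = " ".join(sorted(a.lower().split()))
    key_b = " ".join(sorted(b.lower().split()))
    # ratio() is 2 * matched characters / total characters of both strings.
    return SequenceMatcher(None, key_a, key_b).ratio()

print(similarity("John Smith", "Smith John"))   # rearranged words
print(similarity("Jon Smith", "John Smith"))    # non-phonetic typo
print(similarity("John Smith", "Mary Jones"))   # different people
```

In practice you would run something like this only on the candidate pairs selected by a match code, since comparing every record with every other record gets expensive quickly.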

Probabilistic learning: This is at bottom a variation of synonyms, but the collection is built not by up-front maintenance but by recording users' actual decisions when they verify automatic matches. You may register the frequency and context of the decisions and combine them with match codes and algorithms. This of course requires a substantial collection of decisions.
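
One very simplified way to picture this: keep a count of how often users confirmed or rejected a proposed pair, and treat the pair as a learned synonym once it has enough votes. The class name `DecisionLog` and the thresholds `min_votes` and `min_share` are hypothetical, and context is left out entirely.

```python
from collections import Counter

class DecisionLog:
    """Counts user verdicts on proposed matches between two values."""

    def __init__(self, min_votes=3, min_share=0.8):
        self.min_votes = min_votes    # decisions needed before trusting a pair
        self.min_share = min_share    # fraction of those that must be 'same'
        self.confirmed = Counter()
        self.rejected = Counter()

    @staticmethod
    def _key(a, b):
        # Order-independent, case-insensitive key for a pair of values.
        return tuple(sorted((a.lower(), b.lower())))

    def record(self, a, b, is_same):
        """Store one user decision about whether a and b mean the same thing."""
        counter = self.confirmed if is_same else self.rejected
        counter[self._key(a, b)] += 1

    def is_learned_synonym(self, a, b):
        """True once enough users have confirmed the pair often enough."""
        key = self._key(a, b)
        yes = self.confirmed[key]
        total = yes + self.rejected[key]
        return total >= self.min_votes and yes / total >= self.min_share

log = DecisionLog()
for _ in range(3):
    log.record("Bob", "Robert", True)
print(log.is_learned_synonym("robert", "bob"))
print(log.is_learned_synonym("Bob", "Rob"))
```

With only a handful of decisions nothing is learned, which is the "substantial collection" problem in a nutshell.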

All approaches may of course be combined, and parsing and standardisation (a topic of its own) are often a supplementary step when you dedupe.

Reply by
kasper

2008-08-01
08:22
Hi SantaClaus,

Thanks for the input, those are some interesting approaches and I have added them (or rather, this discussion thread) to the ticket description (Ticket #105) for us to use when implementing the duplicate records profile.
