Topic: Fuzzy Matching discussion
Copied from SantaClaus post in http://datacleaner.eobjects.org/topic/8/Duplicate-records:
Duplicates in names and addresses are one of the most prominent
data quality issues around. There are several different approaches
to finding duplicates in names and addresses:
Synonyms: This is in my eyes the most basic approach. You keep a list
that maps common variants of a word (misspellings, nicknames and so on)
to one form. This approach of course depends on heavy maintenance and
must be reworked for every language/country. It actually works better
with English than with the other Germanic languages, where words are
concatenated (like ‘Main Street’ being ‘Mainstreet’).
Match codes: These range from very simple to more sophisticated ones,
from ignoring vowels, through soundex and metaphone (for English), to
proprietary schemes of all kinds. In my eyes match codes work OK for
selecting candidates for matching, but fall a bit short when it comes
to actually settling the case.
Algorithms: A complex algorithm seems to be the only reliable way to
actually settle whether two differently spelled records represent the
same real-world entity. You have to deal with truncations, non-phonetic
typos, rearranged words and letters and all that jazz.
Probabilistic learning: This is at bottom a variation of synonyms,
but the collection is built not through up-front maintenance but from
users' actual decisions when verifying automatic matches. You may
register the frequency and context of the decisions and combine them
with match codes and algorithms. This of course requires a substantial
collection of decisions.
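As a rough illustration of the probabilistic learning idea above, here is a minimal Python sketch that counts reviewers' accept/reject verdicts per value pair and turns them into a smoothed match probability. The class name DecisionLog, the prior of 0.5 and its weight are all assumptions made for this example; nothing here comes from DataCleaner or any other tool, and the "context" of a decision is left out entirely.

```python
from collections import Counter


class DecisionLog:
    """Collects reviewers' accept/reject verdicts on proposed value pairs."""

    def __init__(self):
        self.accepted = Counter()
        self.rejected = Counter()

    @staticmethod
    def _key(a, b):
        # Order-independent, case-insensitive key for a pair of values.
        return tuple(sorted((a.strip().lower(), b.strip().lower())))

    def record(self, a, b, is_match):
        """Store one human verdict for the pair (a, b)."""
        target = self.accepted if is_match else self.rejected
        target[self._key(a, b)] += 1

    def match_probability(self, a, b, prior=0.5, prior_weight=2):
        """Smoothed share of 'match' verdicts; equals the prior for unseen pairs."""
        key = self._key(a, b)
        yes, no = self.accepted[key], self.rejected[key]
        return (yes + prior * prior_weight) / (yes + no + prior_weight)


log = DecisionLog()
for _ in range(8):
    log.record("Bob", "Robert", True)   # eight reviewers accepted the pair
log.record("Bob", "Robert", False)      # one reviewer rejected it

print(log.match_probability("robert", "BOB"))  # well-supported pair
print(log.match_probability("Bob", "Bill"))    # never reviewed: falls back to the prior
```

The smoothing keeps a single early verdict from deciding everything, which is one way to read the remark that this approach needs a substantial collection before it becomes useful.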
All approaches may of course be combined – and parsing and standardisation
(a topic of its own) is often a supplementary method when you dedupe.
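To make the combination concrete, here is a small Python sketch, under stated assumptions, that chains three of the approaches: a synonym table for standardisation, a vowel-ignoring match code to select candidates, and a comparison algorithm (difflib's SequenceMatcher from the standard library) to settle each candidate pair. The synonym table, the sample records and the 0.9 threshold are invented for illustration; a real setup would tune all of them per language and per data set.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical synonym table; a real one needs per-language maintenance.
SYNONYMS = {"bob": "robert", "bill": "william", "st": "street", "str": "street"}


def standardise(value):
    """Lower-case, drop punctuation, and map each word through the synonym table."""
    cleaned = "".join(c if c.isalnum() or c.isspace() else " " for c in value.lower())
    return [SYNONYMS.get(word, word) for word in cleaned.split()]


def match_code(words):
    """Very simple match code: sorted words, vowels ignored after the first letter."""
    def squeeze(word):
        return word[0] + "".join(c for c in word[1:] if c not in "aeiouy")
    return " ".join(sorted(squeeze(word) for word in words))


def similarity(words_a, words_b):
    """Word-order-insensitive similarity in [0, 1]."""
    a, b = " ".join(sorted(words_a)), " ".join(sorted(words_b))
    return SequenceMatcher(None, a, b).ratio()


def find_duplicates(records, threshold=0.9):
    """Group records by match code, then score only the pairs inside each group."""
    blocks = defaultdict(list)
    for record in records:
        words = standardise(record)
        blocks[match_code(words)].append((record, words))
    matches = []
    for members in blocks.values():
        for (rec_a, words_a), (rec_b, words_b) in combinations(members, 2):
            score = similarity(words_a, words_b)
            if score >= threshold:
                matches.append((rec_a, rec_b, round(score, 3)))
    return matches


records = [
    "Bob Smith, 12 Main St",
    "Smith Robert, 12 Main Street",
    "Robert Smyth, 12 Main Street",
    "Alice Jones, 4 Oak Road",
]
for pair in find_duplicates(records):
    print(pair)
```

The match code only decides which records get compared at all, so it can be crude; the final verdict comes from the similarity score. Requiring identical match codes is a strict choice, and typos in consonants would slip past it, which is where the stronger match codes and algorithms come in.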
I'm bringing up this topic just to throw in my two coppers regarding what DataCleaner does very well and very quickly, which is profiling data, and the challenge of 'scrubbing' data.
All the approaches above are indeed quite common, and will *always* be customized depending on which identity fields you will have available to work with.
Add up name parsers that understand prefixes, suffixes and hyphenated names, soundex/metaphone for comparison (not entirely accurate, mind you, just helpful), nickname aliasing, address standardization and lookup/address-moved algorithms, and it is a really complex scope.
Scrubbing data with these sometimes heavyweight approaches should be the area of specialized tools and/or integration with your favorite ETL tool (particularly if you can write custom components for your SSIS or, in my case, Pentaho Data Integration ETL job).
I agree that some of these features are out of scope for DataCleaner, but I think it's always interesting to see if some of these quite hard tasks can be simplified. Personally I don't think the overall challenge of Data Quality is an "our tools are not advanced enough" issue, but rather an "our tools are hard to figure out" issue :)
PS: I took the liberty of formatting your post/quote :)
Just wanted you to know that we now have a phonetic similarity finder in the latest development version of DataCleaner 2.0. It uses a combination of soundex, refined soundex and metaphone. Let me know if this is something you're interested in looking deeper into!
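For readers who want a feel for what phonetic grouping does, here is a minimal Python sketch of classic American Soundex used to group similar-sounding values. This is not DataCleaner's implementation, which is not shown in this thread; refined soundex and metaphone are left out, and the function names and sample values are made up for the example.

```python
from collections import defaultdict

# Letter-to-digit table of classic American Soundex.
_CODES = {}
for digit, letters in enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1):
    for letter in letters:
        _CODES[letter] = str(digit)


def soundex(name):
    """Four-character Soundex code; non-ASCII letters are ignored in this sketch."""
    letters = [c for c in name.lower() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    previous = _CODES.get(letters[0], "")
    for c in letters[1:]:
        digit = _CODES.get(c, "")
        if digit and digit != previous:
            code += digit
        # 'h' and 'w' do not separate letters with the same code; vowels do.
        if c not in "hw":
            previous = digit
    return (code + "000")[:4]


def group_by_sound(values):
    """Return only the Soundex groups that contain more than one value."""
    groups = defaultdict(list)
    for value in values:
        groups[soundex(value)].append(value)
    return {code: names for code, names in groups.items() if len(names) > 1}


names = ["Robert", "Rupert", "Rubin", "Smith", "Smyth", "Jones"]
print(group_by_sound(names))
# {'R163': ['Robert', 'Rupert'], 'S530': ['Smith', 'Smyth']}
```

Note that Robert and Rupert land in the same group, which shows why phonetic codes are better at proposing candidates than at settling whether two values are really the same.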


