Topic: 2.1.1 Pattern Finder/predefined token

Topic by
dhartford

2011-05-27
18:01

Hey guys,
Whipped this up to do a quick analysis of some data (because DataCleaner is my go-to tool for the few times I need to do this), and tried out the Pattern Finder:

1) Predefined token name 'expected-format'

2) Gave a predefined token regex

3) Run (it's that easy, I love it...)

Unfortunately, the analysis results show a '9' prefixed to the token no matter what I do:

'9[expected-format]'

It did happen to find one record that matched expected-format with a number in front of it, so I know it 'worked'; I just didn't want the prefixed number. :-)




Reply by
dhartford

2011-05-27
18:10
Or, maybe I should say, my 'expected-format' should have covered 80% of the scenarios (this is a true regex with a .* at the end), but instead I'm getting a lot of the normal '99/99/99 aaaaaaa, aaaa' style formats.

mm/dd/yy <name text that doesn't matter> is what I'm checking, as I know some of the dates are poor.

regex:
[0-1][0-9]/[0-3][0-9]/[0-1][0-9] .*
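
To illustrate why a value can show up as '9[expected-format]', here is a minimal Python sketch of the difference between a regex matching somewhere inside a value and matching the whole value. It uses Python's `re` module rather than DataCleaner's own matching, and the function name `match_kind` and the sample values are made up for the example, so treat it as an approximation of the behavior described above.

```python
import re

# The regex from the post: mm/dd/yy, a space, then any text.
EXPECTED_FORMAT = re.compile(r"[0-1][0-9]/[0-3][0-9]/[0-1][0-9] .*")

def match_kind(value: str) -> str:
    """Say whether the regex covers the whole value, part of it, or nothing."""
    if EXPECTED_FORMAT.fullmatch(value):
        return "full"
    if EXPECTED_FORMAT.search(value):
        return "partial"
    return "none"

print(match_kind("01/01/01 Smith, John"))   # full
print(match_kind("101/01/01 Smith, John"))  # partial: a leading digit is left over
print(match_kind("1/01/01 Smith, John"))    # none
```

A partial match leaves one digit in front of the matched text, which is consistent with a result written as a '9' followed by the token name.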

Reply by
kasper

2011-07-31
05:30
Hi dhartford. I'm trying to understand whether you're requesting a change here or just sharing your experience.
Would you like predefined tokens to be applied only if they match the complete string?
Actually, then I think it should be something else: a predefined pattern! A token is only one of the components of a pattern, and the idea of predefined tokens is that you can name certain parts of the patterns that occur in various situations. For example, titles and salutations in name fields.

Reply by
dhartford

2011-08-01
05:25
A predefined pattern might be what I'm trying to say. Actually, yes, that's exactly what I'm trying to say. Thanks for clarifying the difference between pattern and token.

So some tests/examples: given a predefined pattern ".*" named "allmatch", and that being the only predefined pattern, the report should *only* show allmatch.

Given a predefined pattern named "mytestpattern" with the regex:
[0-1][0-9]/[0-3][0-9]/[0-1][0-9] .*

01/01/01 this is a test:
mytestpattern

1/01/01 this is a test:
9/99/99 aaaa aa a aaaa (doesn't match, so a new pattern is defined)

22/22/22 this is a test:
99/99/99 aaaa aa a aaaa (doesn't match, so a new pattern is defined)

12/22/11 this is a test:
mytestpattern
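
The expected results above can be reproduced with a small Python sketch. It is not how DataCleaner's Pattern Finder is implemented; the helper names `symbolic` and `describe` are invented here, and the fallback is deliberately simplified (every digit becomes 9 and every letter becomes a, with no case handling).

```python
import re

# Predefined patterns by name; the regex is the one given in this post.
PREDEFINED = {
    "mytestpattern": re.compile(r"[0-1][0-9]/[0-3][0-9]/[0-1][0-9] .*"),
}

def symbolic(value: str) -> str:
    """Simplified discovered pattern: digits -> 9, letters -> a, rest kept."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("a")
        else:
            out.append(ch)
    return "".join(out)

def describe(value: str) -> str:
    """Return the predefined pattern name on a whole-value match, else a discovered pattern."""
    for name, regex in PREDEFINED.items():
        if regex.fullmatch(value):
            return name
    return symbolic(value)

for value in ["01/01/01 this is a test", "1/01/01 this is a test",
              "22/22/22 this is a test", "12/22/11 this is a test"]:
    print(f"{value}: {describe(value)}")
```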

Reply by
kasper

2011-08-01
05:25
Here's what I suggest:

1) Create all your "predefined patterns" as Regex String Patterns in Reference Data -> String patterns.

2) Go to the "Filters" tab and add a "String pattern match" filter.

3) Configure this filter to use all your string patterns. Use the "ANY" match criteria.

4) Add a pattern finder analyzer

5) Click the pattern finder analyzer and click the "no filter requirement" button and select "String pattern match -> INVALID".

Then you will get only the patterns that do not match your predefined patterns!

Reply by
dhartford

2011-08-01
05:25
Ah, ok - that's a good workaround. Unfortunately, I was hoping to, for lack of a better word, report on the number matched (and which regex/pattern matched) and which ones needed new/custom matches.

I think of it as unit-testing the data quality: we expect 80% to be fine with this one regex pattern, get another 5-10% with another regex, and any remainders fail the test (either because the data is bad, or because it is valid but is missing an appropriate regex), with enough information to take action on them.
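
As a rough sketch of that kind of report, written in plain Python rather than as a DataCleaner job: each value is counted against the first pattern that matches it completely, and everything else is collected for follow-up. The `profile` function, the sample values, and the second pattern ("iso-date") are hypothetical and only there to make the example runnable.

```python
import re
from collections import Counter

# Ordered (name, regex) pairs. "mytestpattern" is the regex from this thread;
# "iso-date" is a hypothetical second pattern added for illustration.
PATTERNS = [
    ("mytestpattern", re.compile(r"[0-1][0-9]/[0-3][0-9]/[0-1][0-9] .*")),
    ("iso-date", re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2} .*")),
]

def profile(values):
    """Count values per first fully matching pattern and collect the leftovers."""
    counts = Counter()
    unmatched = []
    for value in values:
        for name, regex in PATTERNS:
            if regex.fullmatch(value):
                counts[name] += 1
                break
        else:
            # No predefined pattern matched: bad data or a missing regex.
            unmatched.append(value)
    return counts, unmatched

values = [
    "01/01/01 this is a test",
    "12/22/11 this is a test",
    "2011-05-27 this is a test",
    "1/01/01 this is a test",
    "22/22/22 this is a test",
]
counts, unmatched = profile(values)
for name, _ in PATTERNS:
    print(f"{name}: {counts[name]} of {len(values)}")
print("unmatched:", unmatched)
```

The unmatched list is the part that "fails the test": each entry is either bad data or a sign that another pattern needs to be added.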

Reply by
kasper

2011-08-01
05:25
Funny, that's also how I often think of data profiling (the unit-testing metaphor).

If you want the report as well, simply add the "Matching analyzer" too. It will show you the matches for each individual string pattern.
