Topic 2.1.1 Pattern Finder/predefined token

dhartford started the topic:
2011-05-27 18:01


Hey guys,
Whipped this up to do a quick analysis of some data (because DataCleaner is my go-to tool for the few times I need to do this), and tried out the Pattern Finder:

1) Predefined token name 'expected-format'

2) Gave it a predefined token regex

3) run (it's that easy, I love it...)

Unfortunately, the analysis results show a prefixed '9' no matter what I do:


The format happened to find one record that matched the expected-format with a prefixed number too, so I know it 'worked', just didn't want the prefix number. :-)

dhartford replied:
2011-05-27 18:10
Or, maybe I should say, my 'expected-format' should have covered 80% of the scenarios (this is a true regex w/ a .* at the end), but instead I'm getting a lot of the normal '99/99/99 aaaaaaa, aaaa' style formats.

mm/dd/yy <name text that doesn't matter> is what I'm checking, as I know some of the dates are poor.

[0-1][0-9]/[0-3][0-9]/[0-1][0-9] .*

deleted user replied:
2011-07-31 05:30
Hi dhartford. I'm trying to understand if you're requesting a change here, or just sharing experience?
Would you like predefined tokens to only be applied if it matches the complete string?
Actually, then I think it should be something else: a predefined pattern! A token is only one of the components of a pattern, and the idea of predefined tokens is that you can name certain parts of the patterns that occur in various situations, for example titles and salutations in name fields.
dhartford replied:
2011-08-01 05:25
A predefined pattern might be what I'm trying to say. Actually, yes, that's exactly what I'm trying to say. Thanks for clarifying the difference between a pattern and a token.

So some tests/examples: given a predefined pattern ".*" named "allmatch", and with that as the only predefined pattern, the report should *only* show allmatch.

Given a predefined pattern named "mytestpattern" with this regex:
[0-1][0-9]/[0-3][0-9]/[0-1][0-9] .*

01/01/01 this is a test: mytestpattern

1/01/01 this is a test:
9/99/99 aaaa aa a aaaa (doesn't match, so a new pattern is defined)

22/22/22 this is a test:
99/99/99 aaaa aa a aaaa (doesn't match, so a new pattern is defined)

12/22/11 this is a test:
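
For illustration, the behaviour requested in these examples could be sketched in Python roughly as follows. This is not how DataCleaner implements the Pattern Finder; the names `PREDEFINED_PATTERNS`, `generic_pattern` and `classify` are made up for the sketch, and it assumes a predefined pattern has to match the complete value.

```python
import re

# Hypothetical stand-in for a "predefined pattern": a name plus a regex that
# must match the complete value (not just one token inside it).
PREDEFINED_PATTERNS = {
    "mytestpattern": re.compile(r"[0-1][0-9]/[0-3][0-9]/[0-1][0-9] .*"),
}


def generic_pattern(value: str) -> str:
    """Simplified pattern discovery: digits become 9, letters become a."""
    return "".join(
        "9" if ch.isdigit() else "a" if ch.isalpha() else ch
        for ch in value
    )


def classify(value: str) -> str:
    """Return the predefined pattern name, or fall back to a generic pattern."""
    for name, regex in PREDEFINED_PATTERNS.items():
        if regex.fullmatch(value):
            return name
    return generic_pattern(value)


for sample in [
    "01/01/01 this is a test",
    "1/01/01 this is a test",
    "22/22/22 this is a test",
]:
    print(f"{sample}: {classify(sample)}")
```

The first sample is reported under the predefined name, while the other two fall through to the digit/letter patterns shown in the examples above.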

dhartford replied:
2011-08-01 05:25
Ah, ok, that's a good workaround. Unfortunately, I was hoping to, for lack of a better word, report on the number matched (and which regex/pattern matched) and which ones needed new/custom matches.

I think of it as unit-testing the data quality: we expect 80% to be fine with this one regex pattern, get another 5-10% with another regex, and then any remainders fail the test (either because the data is bad, or because it is valid but missing an appropriate regex), and we have enough information to take action on them.
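
A rough Python sketch of that "unit-testing" report, outside of DataCleaner: each value is counted under the first expected regex it fully matches, and everything else is collected as a remainder to act on. The pattern names, the second regex and the sample rows are assumptions made up for the example.

```python
import re
from collections import Counter

# Hypothetical pattern names and regexes, for illustration only.
EXPECTED_PATTERNS = [
    ("date-then-name", re.compile(r"[0-1][0-9]/[0-3][0-9]/[0-1][0-9] .*")),
    ("short-month", re.compile(r"[0-9]/[0-3][0-9]/[0-1][0-9] .*")),
]


def coverage_report(values):
    """Count values per first matching pattern; the rest are 'unmatched'."""
    counts = Counter()
    unmatched = []
    for value in values:
        for name, regex in EXPECTED_PATTERNS:
            if regex.fullmatch(value):
                counts[name] += 1
                break
        else:
            # No expected pattern matched: bad data or a missing regex.
            counts["unmatched"] += 1
            unmatched.append(value)
    return counts, unmatched


rows = [
    "01/01/01 this is a test",
    "12/22/11 this is a test",
    "1/01/01 this is a test",
    "22/22/22 this is a test",
]
counts, unmatched = coverage_report(rows)
for name, count in counts.items():
    print(f"{name}: {count} ({100 * count / len(rows):.0f}%)")
print("needs attention:", unmatched)
```

The `unmatched` list is the part that "fails the test": those rows either contain bad data or need another regex added to `EXPECTED_PATTERNS`.
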
deleted user replied:
2011-08-01 05:25
Here's what I suggest:

1) Create all your "predefined patterns" as Regex String Patterns in Reference Data -> String patterns.

2) Go to the "Filters" tab and add a "String pattern match" filter.

3) Configure this filter to use all your string patterns. Use the "ANY" match criteria.

4) Add a pattern finder analyzer

5) Click the pattern finder analyzer and click the "no filter requirement" button and select "String pattern match -> INVALID".

Then you will get only the patterns that do not match your predefined patterns!
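
To make the effect of these steps concrete, here is a small Python sketch of the same idea outside of DataCleaner: values matching ANY of the string patterns count as VALID, and only the INVALID remainder is profiled. The helper names and sample rows are assumptions, full-string matching is assumed, and `rough_pattern` is only a crude imitation of what the pattern finder produces.

```python
import re
from collections import Counter

# Hypothetical stand-ins for the regex string patterns in the reference data.
STRING_PATTERNS = [
    re.compile(r"[0-1][0-9]/[0-3][0-9]/[0-1][0-9] .*"),
]


def matches_any(value):
    """Mimic the 'ANY' match criteria: VALID if at least one pattern matches."""
    return any(regex.fullmatch(value) for regex in STRING_PATTERNS)


def rough_pattern(value):
    """Crude pattern discovery: digits become 9, letters become a."""
    return "".join(
        "9" if ch.isdigit() else "a" if ch.isalpha() else ch for ch in value
    )


def profile_invalid(values):
    """Profile only the values the filter would route to INVALID."""
    invalid = [v for v in values if not matches_any(v)]
    return Counter(rough_pattern(v) for v in invalid)


rows = [
    "01/01/01 this is a test",
    "1/01/01 this is a test",
    "2/02/02 that is a test",
    "22/22/22 this is a test",
]
for pattern, count in profile_invalid(rows).most_common():
    print(count, pattern)
```

The first row matches the predefined pattern and is filtered out, so only the patterns of the three remaining rows show up in the result.
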
deleted user replied:
2011-08-01 05:25
Funny, that's also how I often think of data profiling (the unit-testing metaphor).

If you want the report as well, simply use the "Matching analyzer" too. This will show you the matches for each individual string pattern.