Topic: challenge: date overlap/gap analysis
challenge: date overlap/gap analysis
Hey all,
After talking with some data people, I found a common problem that is in need of a good solution.
There are times when a table may have a natural key and a 'start' date and a 'stop' date that will have multiple entries.
In these times, it may be expected that the date ranges are not allowed to overlap, or other times not have gaps, or sometimes both.
Also, the 'allowance' variance could allow for the 'same' start/stop dates to be considered overlap, while other times 1 day between start/stop (or 3 days for a weekend) is acceptable and should not be considered a gap.
So, is this analysis problem something Datacleaner can help identify and flush out!
example data that may or may not have overlap/gap dates depending on variance threshold (Based on date opposed to date-time):
user|start|stop|job
fred|1/1/2000|1/1/2001|intern
fred|1/1/2001|1/1/2002|helpdesk
fred|1/2/2002|1/1/2003|appsupport
fred|12/31/2002|1/1/2004|specialist
===============
Just throwing that out there to see if it is a common problem for other people and/or an exciting challenge for Datacleaner!
-D
After talking with some data people, I found a common problem that is in need of a good solution.
There are times when a table may have a natural key and a 'start' date and a 'stop' date that will have multiple entries.
In these times, it may be expected that the date ranges are not allowed to overlap, or other times not have gaps, or sometimes both.
Also, the 'allowance' variance could allow for the 'same' start/stop dates to be considered overlap, while other times 1 day between start/stop (or 3 days for a weekend) is acceptable and should not be considered a gap.
So, is this analysis problem something Datacleaner can help identify and flush out!
example data that may or may not have overlap/gap dates depending on variance threshold (Based on date opposed to date-time):
user|start|stop|job
fred|1/1/2000|1/1/2001|intern
fred|1/1/2001|1/1/2002|helpdesk
fred|1/2/2002|1/1/2003|appsupport
fred|12/31/2002|1/1/2004|specialist
===============
Just throwing that out there to see if it is a common problem for other people and/or an exciting challenge for Datacleaner!
-D
Hi dhartford,
This is an interesting problem and I don't think it will be too hard to fix. Or at least I think it's a good example of something that DataCleaner should be able to do!
Maybe you can code it as a Javascript validation rule in which case I hope you'll share the code. In the mean time I will register a ticket for a specialized component for this type of analysis.
This is an interesting problem and I don't think it will be too hard to fix. Or at least I think it's a good example of something that DataCleaner should be able to do!
Maybe you can code it as a Javascript validation rule in which case I hope you'll share the code. In the mean time I will register a ticket for a specialized component for this type of analysis.
Hi again,
I was just thinking about this and I wanted to say that my suggestion about doing it as a Javascript validation is not possible. The current Javascript validation only supports validating a single row at a time and I guess the validation logic of the gap analysis spans several rows. So wait up for a proper solution to the problem :-)
I was just thinking about this and I wanted to say that my suggestion about doing it as a Javascript validation is not possible. The current Javascript validation only supports validating a single row at a time and I guess the validation logic of the gap analysis spans several rows. So wait up for a proper solution to the problem :-)
Thank you for double checking the Javascript validation approach, I probably wouldn't have been able to look into it until later this weekend.
I was trying to be as concise in the needs as I could, but I may have missed a couple of key pieces so adding them as early as possible:
*There is the opportunity for multiple fields to represent the natural key, this wasn't explicit earlier.
*the overlap/gap is over any time-based datatype, so it may be day-only date, date-time to the second, date-time to the millisecond, etc.
It's always nice to hear someone get excited about a good challenge :-)
I was trying to be as concise in the needs as I could, but I may have missed a couple of key pieces so adding them as early as possible:
*There is the opportunity for multiple fields to represent the natural key, this wasn't explicit earlier.
*the overlap/gap is over any time-based datatype, so it may be day-only date, date-time to the second, date-time to the millisecond, etc.
It's always nice to hear someone get excited about a good challenge :-)
Hehe I think this is a perfect case of a type of analysis for the new analysis engine, AnalyzerBeans, which I anticipate will soon be integrated into DataCleaner. With AnalyzerBeans it is possible to mix and match transformer components so that your "added requirements" is merely an issue of
1) Adding a "concatenate transformer" for the composite keys and pipe the output of this transformer to the input of the "key column" of the gap analysis analysis.
2) Add "convert to datetime" transformers to date-columns that doesn't comply to the same format. This will convert their values so that their representation within the application is the same.
1) Adding a "concatenate transformer" for the composite keys and pipe the output of this transformer to the input of the "key column" of the gap analysis analysis.
2) Add "convert to datetime" transformers to date-columns that doesn't comply to the same format. This will convert their values so that their representation within the application is the same.
Hi dhartford,
I now have a working solution for your problem. It's implemented in a new tool called AnalyzerBeans which will in time be integrated with DataCleaner. Here's how you go about using it, I hope you'll give me some feedback...
Check out the source and compile it:
Afterwards, run the example that I've created for you:
(I'm assuming you have the tools and skills to do this because I saw your linkedin profile and it looks as if this should familiar to you :))
I now have a working solution for your problem. It's implemented in a new tool called AnalyzerBeans which will in time be integrated with DataCleaner. Here's how you go about using it, I hope you'll give me some feedback...
Check out the source and compile it:
svn co http://eobjects.org/svn/AnalyzerBeans/trunk/ AnalyzerBeansNow wait for it to compile.
cd AnalyzerBeans
mvn install
Afterwards, run the example that I've created for you:
java -jar target/AnalyzerBeans.jar -conf examples/conf.xml -job examples/date_analysis_job.xmlYou'll see a lot of output because the example database doesn't contain a really "good" dataset for the use case you're proposing. So what I suggest is:
- Edit examples/conf.xml and insert a new datastore with the connection configuration that you need.
- Edit examples/date_analysis_job.xml and adjust the datastore ref and the input columns in the beginning of the file.
(I'm assuming you have the tools and skills to do this because I saw your linkedin profile and it looks as if this should familiar to you :))
very quick turnaround, very nice!
unfortunately, I have some (good) life events going on over the next couple of weeks, but will be anxious to try this out -- I'll forward this to a couple of other guys for them to take a look as well.
unfortunately, I have some (good) life events going on over the next couple of weeks, but will be anxious to try this out -- I'll forward this to a couple of other guys for them to take a look as well.
I'm back, and I like what I'm seeing!
Although the ORDERS doesn't exactly express what would be the intent, as a datasource using the ORDERS.STATUS column as the 'natural key' it is sufficient to show that yes, this is indeed working as I would expect!
Probably would need to add in the job.xml file a *Technicalkey column* input so when it identifies gap/overlaps it can map to the unique key (versus the natural key used to group the dates).
Although the ORDERS doesn't exactly express what would be the intent, as a datasource using the ORDERS.STATUS column as the 'natural key' it is sufficient to show that yes, this is indeed working as I would expect!
Probably would need to add in the job.xml file a *Technicalkey column* input so when it identifies gap/overlaps it can map to the unique key (versus the natural key used to group the dates).
Maybe you can archieve this by concatenating two columns and then using this concatenated value as the "Group column". This can be archieved with something like this:
<transformation>and then later:
<transformer>
<descriptor ref="Concatenator"/>
<input ref="my_natural_key" />
<input ref="my_tech_key" />
<output id="my_composite_key">
</transformer>
</transformation>
<input name="Group column" ref="my_composite_key" />Can you try this?
Hey dhartford,
Just finished implementing the visual display of the results for the Date Gap Analyzer. You can see a screenshot here:
http://kasper.eobjects.org/2010/12/yadc2s.html
Is this what you expected?
Just finished implementing the visual display of the results for the Date Gap Analyzer. You can see a screenshot here:
http://kasper.eobjects.org/2010/12/yadc2s.html
Is this what you expected?
Hi Darren,
Did you try out the new DataCleaner 2.0? It features the date gap analyzer that you mention, so maybe you'd like to comment on it?
Did you try out the new DataCleaner 2.0? It features the date gap analyzer that you mention, so maybe you'd like to comment on it?
Log in by clicking the login link at the top of the screen
Go back to forum.


