Topic: Simplify usage of filtering datasets
While preparing some analysis jobs I stumbled upon a small but important scenario:
whenever I want to filter data before doing any analysis, it seems this cannot be done within one workflow.
In fact I have to define a first job that filters the data and then create a second one that picks up the filtered dataset from the previous job, regardless of whether I do it with JavaScript or with the standard features.
Why can't this be done in the very first step? It would be even better if the filter could be performed on the database side, by generating SQL that already includes the filter condition. That should also give a performance boost.
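As a rough illustration of the database-side filtering wished for here, the sketch below uses Python's built-in sqlite3 module rather than DataCleaner itself. The table name `source`, the `COUNTRY` column and the helper `value_distribution_sql` are made up for the example; only the condition ATTRIBUTE_1 = "1" comes from the post.

```python
import sqlite3


def value_distribution_sql(table, column, filter_column):
    """Build a query that filters and counts values in a single statement.

    Hypothetical helper: identifiers are interpolated directly, so they must
    come from trusted metadata. The filter value is a bound parameter.
    """
    return (
        f'SELECT "{column}", COUNT(*) FROM "{table}" '
        f'WHERE "{filter_column}" = ? '
        f'GROUP BY "{column}" ORDER BY "{column}"'
    )


# Small in-memory table standing in for the real source table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source (ATTRIBUTE_1 TEXT, COUNTRY TEXT)")
conn.executemany(
    "INSERT INTO source VALUES (?, ?)",
    [("1", "DE"), ("1", "DK"), ("0", "DE"), ("1", "DE")],
)

# The database applies the filter, so unfiltered rows never leave it.
query = value_distribution_sql("source", "COUNTRY", "ATTRIBUTE_1")
distribution = conn.execute(query, ("1",)).fetchall()
print(distribution)
```

With the filter in the WHERE clause, no intermediate datastore is needed and only the aggregated counts are transferred.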
If I use the available functions, this is what I enter:
* select source table
* define filter if ATTRIBUTE_1 = "1"
* select "VALID"
* write to datastore
* add analyzer and select value distribution
So my intention was to select a source, keep only the records with ATTRIBUTE_1 = "1", and run the value distribution on just this subset of data.
What happens within DataCleaner:
* use the filter
* the datastore will be populated
* so now I have to create another job, as there is no way to tell DataCleaner to use the datastore for the next analysis step (the value distribution)
What worries me is that I am forced to create several job specifications and to start them in sequence myself. It would be a great help if I could specify such dependencies and steps within the same job.
Of course I can set all of this up in one job in DataCleaner, but then it stores the subset in the datastore while the following value distribution step still runs on the unfiltered original table.
But maybe I have missed something important when I created my analysis steps.
One addition to this topic: the same applies if I want to combine several filter steps and have them performed in one go.
Currently this can only be done by taking one step at a time: save the result in a datastore, then perform the next filter step on the newly created datastore, and so on.
OK, I just found a solution myself (more by accident, and I think I need to do some more detailed testing):
when specifying the filter, the "outcomes" box should not be used if you want to base your analysis on the result of the filter itself.
So this was the wrong turn I made. Just select the attribute and specify the filter condition.
The next step is to select the required analysis step and then click the "(no filter requirement)" button in the upper right corner of the analysis window.
This then offers a choice of VALID or INVALID. After selecting one of them, the analysis is performed only on the filtered subset, without any step in between.
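To make the idea of a filter requirement concrete, here is a minimal Python sketch of an analyzer bound to one filter outcome. It is a conceptual model only, not DataCleaner's API: `equals_filter`, `value_distribution` and the `COUNTRY` column are hypothetical names chosen for the example.

```python
from collections import Counter


def equals_filter(column, expected):
    """Return a filter that labels each record VALID or INVALID."""
    def categorize(record):
        return "VALID" if record.get(column) == expected else "INVALID"
    return categorize


def value_distribution(records, column, row_filter=None, requirement=None):
    """Count the values of `column` in one pass over the records.

    If `requirement` is set, only records whose filter outcome equals it are
    counted; with no requirement, every record is counted.
    """
    counts = Counter()
    for record in records:
        if requirement is not None and row_filter(record) != requirement:
            continue
        counts[record[column]] += 1
    return counts


records = [
    {"ATTRIBUTE_1": "1", "COUNTRY": "DE"},
    {"ATTRIBUTE_1": "1", "COUNTRY": "DK"},
    {"ATTRIBUTE_1": "0", "COUNTRY": "DE"},
    {"ATTRIBUTE_1": "1", "COUNTRY": "DE"},
]

attribute_filter = equals_filter("ATTRIBUTE_1", "1")
valid_only = value_distribution(records, "COUNTRY", attribute_filter, "VALID")
unfiltered = value_distribution(records, "COUNTRY")
print(valid_only)
print(unfiltered)
```

The filter and the analyzer run over the same stream of records, so nothing is written to an intermediate datastore.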
Copyright 2011 © eobjects.org
Hi Christian,
You are right in what you figured out :) And if the chain of filters permits it, it will also optimize the query used to run the analysis ("push down optimization")!
But it is interesting that you struggled to figure this functionality out. I guess the reason it is so "intricate" is that we actually support filtering not only for subsetting but also for splitting, so that you can bind different analyzers to different filter outcomes. Clicking the "visualize" button greatly clarifies what is bound, IMO.
If you have suggestions as to how we can improve the UI, though, I am all ears.
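The difference between subsetting and splitting can be sketched in a few lines of Python. This is only a conceptual model under assumed names (`categorize`, `split_value_distribution`, the `COUNTRY` column), not how DataCleaner implements it: each record is routed to the analyzer bound to its filter outcome, so both outcomes are analyzed in the same pass.

```python
from collections import Counter, defaultdict


def categorize(record):
    """Hypothetical filter: VALID when ATTRIBUTE_1 equals "1"."""
    return "VALID" if record.get("ATTRIBUTE_1") == "1" else "INVALID"


def split_value_distribution(records, row_filter, column):
    """Keep one value distribution per filter outcome, filled in one pass."""
    distributions = defaultdict(Counter)
    for record in records:
        outcome = row_filter(record)
        distributions[outcome][record[column]] += 1
    return dict(distributions)


records = [
    {"ATTRIBUTE_1": "1", "COUNTRY": "DE"},
    {"ATTRIBUTE_1": "1", "COUNTRY": "DK"},
    {"ATTRIBUTE_1": "0", "COUNTRY": "DE"},
    {"ATTRIBUTE_1": "1", "COUNTRY": "DE"},
]

by_outcome = split_value_distribution(records, categorize, "COUNTRY")
print(by_outcome["VALID"])
print(by_outcome["INVALID"])
```

Subsetting is then just the special case where only one of the outcomes has an analyzer bound to it.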


