Prevent post-processor from running multiple times when datasource has multiple source defs in project.xml?


Sam Hokin-3
Hi, devs. It's me again. Here's another one. I've got four source entries in my project.xml that all come from the same datasource,
but using different processors, which need to be run separately for merging reasons. I also have a single post-processor that I'd
like to run once. (Since all the data sources are run before post-processing, it doesn't make any sense to run a post-processor more
than once.)

However, since there are four entries in project.xml (chado-genomics, chado-genetics, chado-go and chado-featureprop), the
corresponding post-processor is run four consecutive times by the do-sources post-process task. (Fortunately in this case it's a
quick post-process, but if it took many hours, it would be hugely annoying that it runs four times. And yes, I know, I can create
some sort of flag or test so the post-processor exits out if it's already been run, but I'm trying to make InterMine better here, or
at least my use of it.)
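
The flag-or-test workaround mentioned above could look roughly like the following. This is only a sketch and not InterMine code: `RunOnceGuard` and the marker-file path are hypothetical names, and it assumes that all four invocations see the same filesystem. The marker file would have to be deleted before each new build, otherwise the post-processor would be skipped on every later build as well.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper: runs a task only if a marker file does not exist yet. */
final class RunOnceGuard {
    private final Path marker;

    RunOnceGuard(Path marker) {
        this.marker = marker;
    }

    /**
     * Runs the task the first time this is called for a given marker path.
     * Returns true if the task ran, false if it was skipped.
     */
    boolean runOnce(Runnable task) {
        try {
            // createFile fails if the file already exists, so only the
            // first caller gets past this line.
            Files.createFile(marker);
        } catch (FileAlreadyExistsException e) {
            return false; // an earlier invocation already claimed the marker
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        task.run();
        return true;
    }
}
```

A post-processor would wrap its real work in `runOnce(...)`: the first of the four invocations does the work and the other three return immediately. Note that the marker is created before the task runs, so a failed run is not retried until the marker is removed.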

I don't see any way of limiting the times the post-processor is run. It looks to me like this must happen with FlyMine's chado-db
FlyBasePostProcess for the same reason: FlyMine has many chado-db sources defined in their project.xml.

Here's my DATASOURCE/project.properties which defines the source-related post-processor:

compile.dependencies = intermine/objectstore/main,bio/core/main,\
                        intermine/integrate/main, \
                        bio/sources/legfed/main

have.db.tgt = true
converter.class = org.intermine.bio.dataconversion.ChadoDBConverter
postprocessor.class = org.intermine.bio.postprocess.LegfedPostProcess

And here's the project.xml which results in do-sources running LegfedPostProcess four times:

     <!-- chado genomics - has merge priority, so run first -->
     <source name="chado-genomics" type="legfed" dump="true">
       <property name="source.db.name" value="tripal"/>
       <property name="organisms" value="3885 3398"/>
       <property name="dataSetTitle" value="LIS Phaseolus vulgaris (3885) data"/>
       <property name="dataSourceName" value="LIS Tripal database"/>
       <property name="converter.class" value="org.intermine.bio.dataconversion.ChadoDBConverter"/>
       <property name="processors" value="org.intermine.bio.dataconversion.SequenceProcessor"/>
     </source>

     <!-- chado genetics -->
     <source name="chado-genetics" type="legfed" dump="true">
       <property name="source.db.name" value="tripal"/>
       <property name="organisms" value="3885 3398"/>
       <property name="dataSetTitle" value="LIS Phaseolus vulgaris (3885) data"/>
       <property name="dataSourceName" value="LIS Tripal database"/>
       <property name="converter.class" value="org.intermine.bio.dataconversion.ChadoDBConverter"/>
       <property name="processors" value="org.intermine.bio.dataconversion.GeneticProcessor"/>
     </source>

     <!-- chado GO annotation -->
     <source name="chado-go" type="legfed" dump="true">
       <property name="source.db.name" value="tripal"/>
       <property name="organisms" value="3885 3398"/>
       <property name="dataSetTitle" value="LIS Phaseolus vulgaris (3885) data"/>
       <property name="dataSourceName" value="LIS Tripal database"/>
       <property name="converter.class" value="org.intermine.bio.dataconversion.ChadoDBConverter"/>
       <property name="processors" value="org.intermine.bio.dataconversion.GOProcessor"/>
     </source>

     <!-- chado featureprop attributes -->
     <source name="chado-featureprop" type="legfed" dump="true">
       <property name="source.db.name" value="tripal"/>
       <property name="organisms" value="3885 3398"/>
       <property name="dataSetTitle" value="LIS Phaseolus vulgaris (3885) data"/>
       <property name="dataSourceName" value="LIS Tripal database"/>
       <property name="converter.class" value="org.intermine.bio.dataconversion.ChadoDBConverter"/>
       <property name="processors" value="org.intermine.bio.dataconversion.FeaturePropProcessor"/>
     </source>

Is there any way to tell InterMine to only run LegfedPostProcess once, even though its data source appears four times in project.xml??

_______________________________________________
dev mailing list
[hidden email]
http://mail.intermine.org/cgi-bin/mailman/listinfo/dev
Re: Prevent post-processor from running multiple times when datasource has multiple source defs in project.xml?

Sergio Contrino
dear sam,
would it be reasonable to add a specific post-process for your chado
data, and add it to the list of post-processes at the end of your
project file (while removing the post-processes involved in the
do-sources one)?
depending on what your post-process does, you could refer to different
cases already in the repository.
if this is not reasonable, i'll make a ticket.
thanks!
sergio


On 14/04/16 22:47, Sam Hokin wrote:

> Is there any way to tell InterMine to only run LegfedPostProcess once,
> even though its data source appears four times in project.xml??

--
sergio contrino                  InterMine, University of Cambridge
https://sergiocontrino.github.io           http://www.intermine.org

Re: Prevent post-processor from running multiple times when datasource has multiple source defs in project.xml?

Sam Hokin-3
That's certainly an option, Sergio, but this post-processor specifically uses the data model additions that are defined in my
datasource (bio/sources/legfed/legfed_additions.xml). It would be of no use to the general public as a standalone processor. It's a
perfect example of a datasource-specific post-processor that should be run by do-sources.

So, I can get by, but my issue is at the app design level: I think running a post-processor once for every data source
integration entry in project.xml is bad design. It goes against the idea of being able to use a data source with multiple
integration processors (I have a total of eight so far), since one would never want to run a post-processor more than once.
To me, it makes sense to define the post-processor under the legfed datasource and for it to be run once from do-sources.
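
To make the proposal concrete, here is a rough sketch of the behaviour I have in mind: collect the post-processor class once per source type, however many source entries share that type. This is not how do-sources is actually implemented; `PostProcessorPlanner` and its inputs are hypothetical stand-ins for whatever the build reads from project.xml and each source's project.properties.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: decide which post-processors to run, each at most once. */
final class PostProcessorPlanner {

    /**
     * @param sourceTypes         the type attribute of each source entry, in project.xml order
     * @param postProcessorByType source type mapped to its postprocessor.class (absent if none)
     * @return the distinct post-processor class names, in first-seen order
     */
    static Set<String> plan(List<String> sourceTypes, Map<String, String> postProcessorByType) {
        // LinkedHashSet drops duplicates while keeping the order of first appearance.
        Set<String> planned = new LinkedHashSet<>();
        for (String type : sourceTypes) {
            String className = postProcessorByType.get(type);
            if (className != null) {
                planned.add(className);
            }
        }
        return planned;
    }
}
```

With my four entries, all of type legfed, the plan would contain org.intermine.bio.postprocess.LegfedPostProcess exactly once.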

Like I said, it seems to me that FlyMine's chado-db post-processor, FlyBasePostProcess.java, must get run up to 6 times since that
datasource has 6 processors (FlyBaseProcessor.java, ModEncodeFeatureProcessor.java, ModEncodeMetaDataProcessor.java,
SequenceProcessor.java, StockProcessor.java, WormBaseProcessor.java) which could be run separately. So, I'd have thought that you
folks would have already run into this.

Sorry about the verbose justification, but yes, I think this issue deserves a ticket. :)

Cheers!
Sam

On 04/18/2016 08:25 AM, sergio contrino wrote:
> dear sam,
> would it be reasonable to add a specific post-process for your chado data, and add it to the list of post-processes at the end of
> your project file (while removing the post-processes involved in the do-sources one)?
> depending on what your post-process does, you could refer to different cases already in the repository.
> if this is not reasonable, i'll make a ticket.
> thanks!
> sergio


