Any way to pass an input file to a post-process task?

Sam Hokin-3
Hi, devs. I'm writing a post-processor that takes an input file (interpro.xml) and adds a bunch of data from that to proteins and
protein domains (which I already have from a different data source). I see that PostProcessOperationsTask.java has a method to set
an output file (setOutputFile), presumably from project.xml, but there is none to set an input file. I naively added a setter to do
so, but it does not work when I use:

       <property name="input.file" location="/home/intermine/data/interpro/interpro.xml"/>

in project.xml. The setter that I added is simply:

     /**
      * Set the value of inputFile
      *
      * @param inputFile an input file for operations that require one
      */
     public void setInputFile(File inputFile) {
         this.inputFile = inputFile;
     }

Any suggestions? I thought I'd ask before I start digging deeper.
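
For illustration only, here is a minimal sketch of how a dotted property name such as input.file could be mapped onto a JavaBean setter by reflection. It is an assumption that the build does anything like this for post-process tasks; this is not InterMine's or Ant's actual code, and PropertySetterSketch, DemoTask, setterName and applyProperty are hypothetical names. The point is that a setter is only called if something in the build looks it up and invokes it.

```java
import java.io.File;
import java.lang.reflect.Method;

public class PropertySetterSketch {

    /** Hypothetical stand-in for a task that exposes a File setter. */
    public static class DemoTask {
        private File inputFile;

        public void setInputFile(File inputFile) {
            this.inputFile = inputFile;
        }

        public File getInputFile() {
            return inputFile;
        }
    }

    /** Turn a dotted property name such as "input.file" into "setInputFile". */
    public static String setterName(String property) {
        StringBuilder sb = new StringBuilder("set");
        for (String part : property.split("\\.")) {
            if (part.isEmpty()) {
                continue; // ignore empty segments such as in "a..b"
            }
            sb.append(Character.toUpperCase(part.charAt(0))).append(part.substring(1));
        }
        return sb.toString();
    }

    /** Call the matching File setter on target; return false if there is none. */
    public static boolean applyProperty(Object target, String property, String location) {
        try {
            Method setter = target.getClass().getMethod(setterName(property), File.class);
            setter.invoke(target, new File(location));
            return true;
        } catch (ReflectiveOperationException e) {
            // No such setter, or it could not be invoked.
            return false;
        }
    }

    public static void main(String[] args) {
        DemoTask task = new DemoTask();
        boolean applied = applyProperty(task, "input.file",
                "/home/intermine/data/interpro/interpro.xml");
        System.out.println(applied + " " + task.getInputFile());
    }
}
```

If the build never performs a lookup like applyProperty for a given property, adding the setter alone has no effect, which would match the behaviour described above.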
_______________________________________________
dev mailing list
[hidden email]
https://lists.intermine.org/mailman/listinfo/dev

Re: Any way to pass an input file to a post-process task?

Justin Clark-Casey-2
Hi Sam,

Are you trying to add a post processing operation to a <source> or a new
<post-process> step in <post-processing>?

PostProcessOperationsTask handles <post-process> steps (and hardcodes
all possible steps such as create-references which is really quite
shocking).

However, do you really want post-processing on an individual <source>
(confusing, I know)?  This is done by providing a class in that
source's directory that extends org.intermine.postprocess.PostProcessor.
This should accept the properties in <source> as usual (as described at
[1]).  One existing example is BioPAXPostProcess.java.

[1]
https://intermine.readthedocs.io/en/latest/database/data-sources/custom/
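
The shape described above can be sketched without the InterMine jars. In the block below, PostProcessorStandIn is a hypothetical stand-in for org.intermine.postprocess.PostProcessor (the real class's constructor and object-store plumbing are left out), and InterproPostProcessSketch with its setInputFile setter is an assumed name. Whether a given <source> property actually reaches such a setter depends on the build, so treat this as a sketch rather than a working source.

```java
import java.io.File;

public class SourcePostProcessSketch {

    /** Hypothetical stand-in for org.intermine.postprocess.PostProcessor. */
    public abstract static class PostProcessorStandIn {
        public abstract void postProcess();
    }

    /** A source-specific post-processor that expects an input file. */
    public static class InterproPostProcessSketch extends PostProcessorStandIn {
        private File inputFile;
        private boolean ran = false;

        // Assumed to be called by the build with a property taken from <source>.
        public void setInputFile(File inputFile) {
            this.inputFile = inputFile;
        }

        public boolean hasRun() {
            return ran;
        }

        @Override
        public void postProcess() {
            if (inputFile == null) {
                // Fail loudly instead of silently skipping the step.
                throw new IllegalStateException("inputFile has not been set");
            }
            // Real work (reading the file, updating stored objects) would go here.
            ran = true;
        }
    }

    public static void main(String[] args) {
        InterproPostProcessSketch pp = new InterproPostProcessSketch();
        pp.setInputFile(new File("/home/intermine/data/interpro/interpro.xml"));
        pp.postProcess();
        System.out.println("ran: " + pp.hasRun());
    }
}
```

Checking for a missing file inside postProcess makes a property that was never passed through show up as an error rather than as a step that quietly does nothing.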

On 2016-07-20 21:45, Sam Hokin wrote:

> Hi, devs. I'm writing a post-processor that takes an input file
> (interpro.xml) and adds a bunch of data from that to proteins and
> protein domains (which I already have from a different data source). I
> see that PostProcessOperationsTask.java has a method to set an output
> file (setOutputFile), presumably from project.xml, but there is none
> to set an input file. I naively added a setter to do so, but it does
> not work when I use:
>
>       <property name="input.file"
> location="/home/intermine/data/interpro/interpro.xml"/>
>
> in project.xml. The setter that I added is simply:
>
>     /**
>      * Set the value of inputFile
>      *
>      * @param inputFile an input file for operations that require one
>      */
>     public void setInputFile(File inputFile) {
>         this.inputFile = inputFile;
>     }
>
> Any suggestions? I thought I'd ask before I start digging deeper.

Re: Any way to pass an input file to a post-process task?

Sam Hokin-3
Yeah, I could stuff it into a source-specific post-process task for do-sources, but I have many processors for this given source and
do-sources ends up running the post-processor for every one (previous post to this list).

Also, it's actually a pretty generic post-processor: it just associates Interpro data with protein domains based on their ID (PF*,
SM*, TIGR*, PIRSF*, GENE3D*), which is what I happen to get from my chado data source. In addition to associating the Interpro
accession, I use this to associate domain names (e.g. "TIR") with the protein domains, which I didn't have otherwise and is
something biologists would want to search on.
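
The lookup side of such a post-processor could be sketched as follows. This is a minimal sketch, assuming interpro.xml contains <interpro id="..."> entries whose <member_list> holds <db_xref dbkey="..." name="..."/> elements; the element and attribute names, the sample record in main, and the class name InterproLookupSketch are assumptions for illustration and have not been checked against the real file.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class InterproLookupSketch {

    /** Map each member database key (e.g. a PF* ID) to {InterPro accession, member name}. */
    public static Map<String, String[]> parse(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Map<String, String[]> byDomainId = new HashMap<>();
            NodeList entries = doc.getElementsByTagName("interpro");
            for (int i = 0; i < entries.getLength(); i++) {
                Element entry = (Element) entries.item(i);
                String accession = entry.getAttribute("id");
                // Only look at cross-references listed under member_list.
                NodeList lists = entry.getElementsByTagName("member_list");
                for (int j = 0; j < lists.getLength(); j++) {
                    NodeList xrefs = ((Element) lists.item(j)).getElementsByTagName("db_xref");
                    for (int k = 0; k < xrefs.getLength(); k++) {
                        Element xref = (Element) xrefs.item(k);
                        byDomainId.put(xref.getAttribute("dbkey"),
                                new String[] {accession, xref.getAttribute("name")});
                    }
                }
            }
            return byDomainId;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse InterPro XML", e);
        }
    }

    public static void main(String[] args) {
        // Illustrative sample record, not copied from the real file.
        String xml = "<interprodb>"
                + "<interpro id='IPR000157'>"
                + "<member_list><db_xref db='PFAM' dbkey='PF01582' name='TIR'/></member_list>"
                + "</interpro>"
                + "</interprodb>";
        String[] hit = parse(xml).get("PF01582");
        System.out.println(hit[0] + " " + hit[1]);
    }
}
```

Keying the map by the member database ID means each existing protein domain needs only a single lookup by its own identifier.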

And yes, it is shocking that I've had to hack PostProcessOperationsTask rather than edit a mine-specific XML file. I've got all
sorts of stuff floating under bio/postprocess now that I'd prefer to have in an external place like bio/sources/*. Basically, I
think the architecture of sources should have been repeated for post-processors.

It seems to me that a post-processor could very well want an input file specified like I do. I expect it's just an oversight because
one hasn't been needed yet. I can hack the code, but I thought I'd find out what's up first.

On 07/20/2016 03:55 PM, [hidden email] wrote:

> Hi Sam,
>
> Are you trying to add a post processing operation to a <source> or a new <post-process> step in <post-processing>?
>
> PostProcessOperationsTask handles <post-process> steps (and hardcodes all possible steps such as create-references which is really
> quite shocking).
>
> However, do you really want post-processing on an individual source <source> (confusing, I know).  This is done by providing a class
> in that source's directory that extends org.intermine.postprocess.PostProcessor.  This should accept the properties in <source> as
> usual (as described at [1]).  One existing example is BioPAXPostProcess.java
>
> [1] https://intermine.readthedocs.io/en/latest/database/data-sources/custom/

Re: Any way to pass an input file to a post-process task?

joe carlson
Just my 2 cents of advice:

You probably should put the InterPro accession information into your chado database by introducing another feature_dbxref for the domain. You only need to load it once rather than every time you build a mine, and it’s there forever in your reference database. In my mine, loading is a very long process and anything that allows me to do work in advance is a good thing.

Joe

> On Jul 20, 2016, at 3:45 PM, Sam Hokin <[hidden email]> wrote:
>
> Yeah, I could stuff it into a source-specific post-process task for do-sources, but I have many processors for this given source and do-sources ends up running the post-processor for every one (previous post to this list).
>
> Also, it's actually a pretty generic post-processor: it just associates Interpro data with protein domains based on their ID (PF*, SM*, TIGR*, PIRSF*, GENE3D*), which is what I happen to get from my chado data source. In addition to associating the Interpro accession, I use this to associate domain names (e.g. "TIR") with the protein domains, which I didn't have otherwise and is something biologists would want to search on.
>
> And yes, it is shocking that I've had to hack PostProcessOperationsTask rather than edit a mine-specific XML file. I've got all sorts of stuff floating under bio/postprocess now that I'd prefer to have in an external place like bio/sources/*. Basically, I think the architecture of sources should have been repeated for post-processors.
>
> It seems to me that a post-processor could very well want an input file specified like I do. I expect it's just an oversight because one hasn't been needed yet. I can hack the code, but I thought I'd find out what's up first.

Re: Any way to pass an input file to a post-process task?

Julie Sullivan-2
Hi Sam,

We already have a loader for interpro.xml, use that?

http://intermine.readthedocs.io/en/latest/database/data-sources/library/proteins/interpro/

You want to avoid loading new data in the post-processing stage, as you
want to include these data in the keyword search etc.

Julie

On 07/20/2016 09:45 PM, Sam Hokin wrote:

> Hi, devs. I'm writing a post-processor that takes an input file
> (interpro.xml) and adds a bunch of data from that to proteins and
> protein domains (which I already have from a different data source). I
> see that PostProcessOperationsTask.java has a method to set an output
> file (setOutputFile), presumably from project.xml, but there is none to
> set an input file. I naively added a setter to do so, but it does not
> work when I use:
>
>       <property name="input.file"
> location="/home/intermine/data/interpro/interpro.xml"/>
>
> in project.xml. The setter that I added is simply:
>
>     /**
>      * Set the value of inputFile
>      *
>      * @param inputFile an input file for operations that require one
>      */
>     public void setInputFile(File inputFile) {
>         this.inputFile = inputFile;
>     }
>
> Any suggestions? I thought I'd ask before I start digging deeper.

Re: Any way to pass an input file to a post-process task?

Sam Hokin-3
Yeah, the interpro loader doesn't quite fit the bill. I only want to fill some attributes in the protein domains that are loaded
from my chado database, partly for esoteric design reasons. I'm not adding any new items, so in a sense it's not unlike other
post-processors like CreateReferences. Don't want to store interpro records in my mine. It's sort of between a data source (adding
new data) and a post-processor (not creating any new items).

Anyway, just thought I'd ask. Not a huge deal to leave the file name hardcoded. :)

On 07/26/2016 02:02 AM, Julie Sullivan wrote:

> Hi Sam,
>
> We already have a loader for interpro.xml, use that?
>
> http://intermine.readthedocs.io/en/latest/database/data-sources/library/proteins/interpro/
>
> You want to avoid loading new data in the post-processing stage, as you want to include these data in the keyword search etc.
>
> Julie

Re: Any way to pass an input file to a post-process task?

Joel Richardson-2

You could also do this as a separate source that runs after and adds the
extra data.
And if all you're doing is setting simple fields in existing objects, the
easiest way is to do it as a large-item-xml source.
Then your job is to generate the ItemXML-formatted input file. To set the
"foo" attribute of protein "Q8VBZ1", you'd generate a record like:
        <item class="Protein" id="1001_1">
            <attribute name="primaryAccession" value="Q8VBZ1"/>
            <attribute name="foo" value="bar"/>
        </item>
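
Generating such records can be sketched in a few lines of Java. This is a minimal sketch that only reproduces the <item>/<attribute> shape shown above; the class name ItemXmlSketch and its methods are hypothetical, and any enclosing root element or other requirements of the real ItemXML format are left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ItemXmlSketch {

    /** Escape characters that are unsafe inside a double-quoted XML attribute. */
    public static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    /** Render one item with its attributes in the order they were added. */
    public static String item(String className, String id, Map<String, String> attributes) {
        StringBuilder sb = new StringBuilder();
        sb.append("<item class=\"").append(escape(className))
          .append("\" id=\"").append(escape(id)).append("\">\n");
        for (Map.Entry<String, String> e : attributes.entrySet()) {
            sb.append("    <attribute name=\"").append(escape(e.getKey()))
              .append("\" value=\"").append(escape(e.getValue())).append("\"/>\n");
        }
        sb.append("</item>\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps primaryAccession ahead of foo in the output.
        Map<String, String> attrs = new LinkedHashMap<>();
        attrs.put("primaryAccession", "Q8VBZ1");
        attrs.put("foo", "bar");
        System.out.print(item("Protein", "1001_1", attrs));
    }
}
```

Escaping the values matters here because names pulled from an external file may contain ampersands or quotes.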


Joel

--
Joel E. Richardson, Ph.D.
Sr. Research Scientist
Mouse Genome Informatics
The Jackson Laboratory
600 Main Street
Bar Harbor, Maine 04609
207-288-6435
[hidden email]





On 7/26/16, 8:59 AM, "dev on behalf of Sam Hokin"
<[hidden email] on behalf of [hidden email]> wrote:

>Yeah, the interpro loader doesn't quite fit the bill. I only want to fill
>some attributes in the protein domains that are loaded
>from my chado database, partly for esoteric design reasons. I'm not
>adding any new items, so in a sense it's not unlike other
>post-processors like CreateReferences. Don't want to store interpro
>records in my mine. It's sort of between a data source (adding
>new data) and a post-processor (not creating any new items).
>
>Anyway, just thought I'd ask. Not a huge deal to leave the file name
>hardcoded. :)

Re: Any way to pass an input file to a post-process task?

Sam Hokin-3
Thanks, Joel. My goal isn't to get the data into the mine by hook or by crook. I can do that a lot of ways. My goal is to write data
sources and post-processors that import data as it comes down the pipe, without scripts or anything else to reformat it into a new,
acceptable format. I don't want workflows. I can pull the Interpro data directly from the Interpro file URL; my post-processor can
then use that to update the ProteinDomain fields in the object store. No intervention required.

It took me about an hour to write a PostProcessor to update ProteinDomain fields from Interpro. Yes, that could be a data source,
but I don't feel like reloading the protein domains from chado along with interpro data from Interpro (and then merging them) when I
can work with the existing ProteinDomain data, already imported by a generic chado importer, and then spin a post-process that takes
a minute or two to yank the extra data from Interpro.
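
The update step described here, filling fields on domains that already exist without creating any new ones, can be sketched as below. The Domain class is a hypothetical in-memory stand-in for a stored ProteinDomain, its field names are assumptions, and the lookup map is assumed to go from a domain ID to an {InterPro accession, name} pair; a real post-processor would read the objects from the object store and write them back.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DomainEnrichSketch {

    /** Hypothetical stand-in for a ProteinDomain already loaded from another source. */
    public static class Domain {
        public final String primaryIdentifier;
        public String interproId;
        public String name;

        public Domain(String primaryIdentifier) {
            this.primaryIdentifier = primaryIdentifier;
        }
    }

    /**
     * Fill interproId and name on existing domains that have a match in the lookup.
     * Never adds a domain; returns how many were updated.
     */
    public static int enrich(List<Domain> domains, Map<String, String[]> lookup) {
        int updated = 0;
        for (Domain d : domains) {
            String[] hit = lookup.get(d.primaryIdentifier);
            if (hit == null) {
                continue; // no InterPro entry for this domain, leave it untouched
            }
            d.interproId = hit[0];
            d.name = hit[1];
            updated++;
        }
        return updated;
    }

    public static void main(String[] args) {
        List<Domain> domains = new ArrayList<>();
        domains.add(new Domain("PF01582"));
        domains.add(new Domain("SM00000")); // made-up ID with no match
        Map<String, String[]> lookup = new HashMap<>();
        lookup.put("PF01582", new String[] {"IPR000157", "TIR"}); // illustrative values
        System.out.println("updated " + enrich(domains, lookup) + " of " + domains.size());
    }
}
```

Because enrich only iterates over the domains it is given, lookup entries with no matching domain are ignored and no extra records end up in the mine.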

So, it's not an InterMine-compliant solution. But it's a good solution, IMO. :)

On 07/26/2016 07:39 AM, Joel Richardson wrote:

>
> You could also do this as a separate source that runs after and adds the
> extra data.
> And if all you're doing is setting simple fields in existing objects, the
> easiest way is to do it as a large-item-xml source.
> Then your job is to generate the ItemXML-formatted input file. To set the
> "foo" attribute of protein "Q8VBZ1", you'd generate a record like:
> <item class="Protein" id="1001_1">
> <attribute name="primaryAccession" value="Q8VBZ1"/>
> <attribute name="foo" value="bar" />
> </item>
>
>
> Joel
>