Goals for pipeline reproducibility

Chris Hemmerich

We are preparing our first update to the prokaryotic annotation pipeline
for ISGA, which means it's time to set the level of reproducibility we
will provide for old pipelines. I'm looking for feedback on how much
others are providing - especially for public services like CloVR and DIAG,
but also for internal installations.

I've identified the following problem areas:

* input

We save user input with ISGA. It would be nice for ISGA to have a sharing
mechanism, but users can share the file themselves.

* pipeline templates

These seem trivial to save. A good naming scheme is important though.

* components

Rather than maintain component versions within an Ergatis installation, we
would have to maintain multiple Ergatis installations. A number of
projects with different Ergatis backends sharing the same web frontend
seems doable.

* database versions

This is doable by storing all versions of a db and referencing explicit
versions in project.config. However, this means every db update becomes a
new pipeline version and Ergatis project. Also, if you're on cheapo NFS
like us, you want these databases on local disk so space could become an
issue.

* executable versions

This is the same as database versions. Combining the two means you
probably need a formalized, scheduled pipeline release system where you
incorporate updates on a regular schedule.

* libraries and OS issues

Eventually versions of programs are going to get difficult to run or won't
compile on new architectures. I'm hesitant to make long term commitments
against this because of the uncertainty. With CloVR, it seems technically
straightforward as long as someone is willing to pay to let VMs and tagged
databases accumulate. Eventually the images may no longer run on Amazon.
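
One way to picture the problem areas above is to treat every pinned ingredient (template, Ergatis installation, database versions, executable versions) as one release record, so that changing any of them yields a new release identifier. The Python below is a minimal sketch under that assumption; the class name, its fields, and the version labels are hypothetical and are not part of ISGA or Ergatis.

```python
import hashlib
import json
from dataclasses import dataclass, asdict, field


@dataclass(frozen=True)
class PipelineRelease:
    """Everything that must stay fixed for a run to be repeatable (hypothetical fields)."""
    template: str                                     # pipeline template name
    ergatis_install: str                              # versioned Ergatis install directory
    databases: dict = field(default_factory=dict)     # database name -> version label
    executables: dict = field(default_factory=dict)   # program name -> version label

    def fingerprint(self) -> str:
        # Serialize with sorted keys so identical pins always hash identically.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]


# Illustrative values only; none of these labels come from a real installation.
old = PipelineRelease(
    template="prokaryotic-annotation",
    ergatis_install="/usr/local/projects/ergatis/package-v1r26",
    databases={"nr": "2010-08"},
    executables={"blastp": "2.2.24"},
)
new = PipelineRelease(
    template="prokaryotic-annotation",
    ergatis_install="/usr/local/projects/ergatis/package-v1r26",
    databases={"nr": "2010-09"},  # only the database version changed
    executables={"blastp": "2.2.24"},
)

print(old.fingerprint() != new.fingerprint())  # a db update alone gives a new release id
```

Storing such a record next to each run would make it possible to tell later which combination of pins produced a result, at the cost of keeping every referenced version available on disk.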


Is this in line with what others are doing? Am I missing anything or going
overboard?

Thanks much,

  Chris

------------------------------------------------------------------------------
This SF.net Dev2Dev email is sponsored by:

Show off your parallel programming skills.
Enter the Intel(R) Threading Challenge 2010.
http://p.sf.net/sfu/intel-thread-sfd
_______________________________________________
Ergatis-users mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/ergatis-users

Re: Goals for pipeline reproducibility

Joshua Orvis
Putting comments below where I can provide feedback.  If you'd like to talk about any of these over the phone, send me an e-mail and I can give you my # to call (if you don't still have it already).

Joshua



On Tue, Aug 31, 2010 at 10:23 AM, Chris Hemmerich <[hidden email]> wrote:

> We are preparing our first update to the prokaryotic annotation pipeline
> for ISGA, which means it's time to set the level of reproducibility we
> will provide for old pipelines. I'm looking for feedback on how much
> others are providing - especially for public services like CloVR and DIAG,
> but also for internal installations.
>
> I've identified the following problem areas:
>
> * input
>
> We save user input with ISGA. It would be nice for ISGA to have a sharing
> mechanism, but users can share the file themselves.
>
> * pipeline templates
>
> These seem trivial to save. A good naming scheme is important though.
>
> * components
>
> Rather than maintain component versions within an Ergatis installation, we
> would have to maintain multiple Ergatis installations. A number of
> projects with different Ergatis backends sharing the same web frontend
> seems doable.


This has actually been how we've managed this issue from the very start.  We create directories like:

/usr/local/projects/ergatis/package-v1r26
/usr/local/projects/ergatis/package-v1r27
/usr/local/projects/ergatis/package-v1r28
/usr/local/projects/ergatis/package-devel
/usr/local/projects/ergatis/package-latest

Each of the numbered directories is an actual release, 'package-devel' is where developers manually edit code for testing, and 'package-latest' is an auto-install from a nightly cron off the SVN trunk.  When a new project is created, the project.conf file points to one of the numbered release directories and usually stays on that version unless there's a reason to change it.  This means that if we want to change something about how the results are stored (in files or the database), we don't have to go back through every previous project and update them all too.  Older projects continue to use the older code version, and new projects use newer code.  What matters most is that each project remains internally consistent and not broken.
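
As an illustration of picking a release for a new project, the Python sketch below selects the newest numbered directory and skips 'package-devel' and 'package-latest'. It assumes that a name like 'package-v1r26' means version 1, revision 26 and that both parts sort numerically; the function name is hypothetical and nothing here is Ergatis code.

```python
import re

# Matches only numbered releases such as "package-v1r26".
RELEASE_RE = re.compile(r"^package-v(\d+)r(\d+)$")


def newest_release(dir_names):
    """Return the highest numbered release name, ignoring devel/latest."""
    numbered = []
    for name in dir_names:
        match = RELEASE_RE.match(name)
        if match:
            version = (int(match.group(1)), int(match.group(2)))
            numbered.append((version, name))
    if not numbered:
        raise ValueError("no numbered release directories found")
    # Tuples compare by (version, revision), so max() picks the newest release.
    return max(numbered)[1]


names = [
    "package-v1r26",
    "package-v1r27",
    "package-v1r28",
    "package-devel",
    "package-latest",
]
print(newest_release(names))  # package-v1r28
```

Comparing the parsed integers rather than the raw strings avoids ordering 'v1r9' after 'v1r28'.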
 
> * database versions
>
> This is doable by storing all versions of a db and referencing explicit
> versions in project.config. However, this means every db update becomes a
> new pipeline version and Ergatis project. Also, if you're on cheapo NFS
> like us, you want these databases on local disk so space could become an
> issue.


I guess it's not completely clear to me what's in these databases .... the products of your annotation pipeline runs?  If so, the same versioning info I described above applies (at least if you're using Ergatis to store and access your Chado data).  We've had several schema updates over the years that give a new database version, but since the data access API is versioned right along with the database on disk, we can update only the projects we choose to and the rest still work.

 
> * executable versions
>
> This is the same as database versions. Combining the two means you
> probably need a formalized, scheduled pipeline release system where you
> incorporate updates on a regular schedule.


Updates don't necessarily need to be on a schedule, but yes, the formal release process is always good.

 
> * libraries and OS issues
>
> Eventually versions of programs are going to get difficult to run or won't
> compile on new architectures. I'm hesitant to make long term commitments
> against this because of the uncertainty. With CloVR, it seems technically
> straightforward as long as someone is willing to pay to let VMs and tagged
> databases accumulate. Eventually the images may no longer run on Amazon.


This is certainly true about the program versions.  Gene finders, etc. change output formats as major versions are released, which breaks a lot of things.  We handle this by organizing our software into directories with version numbers attached, and then the values in the Ergatis software.config file point to these specific versions so you always know which are being used (rather than relying on the PATH or other ENV variables).
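
To illustrate the point about not relying on PATH, the Python sketch below scans an INI-style config and reports every entry whose value is not an absolute path, since a bare program name would be resolved through PATH at run time. The section names, key names, and install paths in the sample are hypothetical and simplified; the real software.config key syntax may differ.

```python
import configparser
import posixpath

# Hypothetical, simplified stand-in for a software.config file.
SAMPLE = """
[component glimmer3]
GLIMMER3_EXEC = /usr/local/packages/glimmer-3.02/bin/glimmer3

[component hmmpfam]
HMMPFAM_EXEC = hmmpfam
"""


def unpinned_entries(config_text):
    """Return (section, key, value) for entries that are not absolute paths."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep key names exactly as written
    parser.read_string(config_text)
    problems = []
    for section in parser.sections():
        for key, value in parser.items(section):
            # A bare name like "hmmpfam" would be looked up via PATH.
            if not posixpath.isabs(value):
                problems.append((section, key, value))
    return problems


print(unpinned_entries(SAMPLE))
# [('component hmmpfam', 'HMMPFAM_EXEC', 'hmmpfam')]
```

A check like this could run as part of a release process, so that an unpinned executable is caught before a pipeline version is published.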
 

> Is this in line with what others are doing? Am I missing anything or going
> overboard?
>
> Thanks much,
>
>  Chris


Re: Goals for pipeline reproducibility

Chris Hemmerich

Joshua,

Thanks for the response. It's good to hear I'm going down a proven path.
We are already maintaining distinct Ergatis installations and database
versions. (I wasn't clear there: I meant databases we search against, e.g.
nr, rather than result sets.) We install programs using encap, so we have
distinct directories for each version. We've been using the convenience
links generated in /usr/local/bin, but I will change that.
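
As a sketch of moving from convenience links to explicit versions, the Python below follows a link to the versioned file it points at, which is the path a config file would then record. It assumes the convenience links are ordinary symlinks into per-version directories; the directory layout, program name, and version in the demo are hypothetical and are built in a temporary directory.

```python
import os
import tempfile


def pinned_path(link_path):
    """Follow symlinks so the versioned file is recorded, not the link."""
    return os.path.realpath(link_path)


def demo():
    # Build a throwaway layout: encap/<pkg>-<version>/bin/<prog> plus a bin/ link.
    with tempfile.TemporaryDirectory() as root:
        root = os.path.realpath(root)
        real = os.path.join(root, "encap", "blast-2.2.24", "bin", "blastall")
        os.makedirs(os.path.dirname(real))
        open(real, "w").close()

        bin_dir = os.path.join(root, "bin")
        os.makedirs(bin_dir)
        link = os.path.join(bin_dir, "blastall")
        os.symlink(real, link)

        # Report the resolved target relative to the temporary root.
        return os.path.relpath(pinned_path(link), root)


print(demo())
```

Resolving the links once, at release time, means later package upgrades that repoint the links do not silently change which version an existing pipeline runs.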

I don't have any additional questions right now, but will let you know if
I hit any surprises.

Cheers,

  Chris

On Wed, 1 Sep 2010, Joshua Orvis wrote:

> Putting comments below where I can provide feedback.  If you'd like to talk
> about any of these over the phone, send me an e-mail and I can give you my #
> to call (if you don't still have it already).
>
> Joshua
>
>
>
> On Tue, Aug 31, 2010 at 10:23 AM, Chris Hemmerich
> <[hidden email]> wrote:
>
>>
>> We are preparing our first update to the prokaryotic annotation pipeline
>> for ISGA, which means it's time to set the level of reproducibility we
>> will provide for old pipelines. I'm looking for feedback on how much
>> others are providing - especially for public services like CloVR and DIAG,
>> but also for internal installations.
>>
>> I've identified the following problem areas:
>>
>> * input
>>
>> We save user input with ISGA. It would be nice for ISGA to have a sharing
>> mechanism, but users can share the file themselves.
>>
>> * pipeline templates
>>
>> These seem trivial to save. A good naming scheme is important though.
>>
>> * components
>>
>> Rather than maintain component versions within an Ergatis installation, we
>> would have to maintain multiple Ergatis installations. A number of
>> projects with different Ergatis backends sharing the same web frontend
>> seems doable.
>>
>>
> This has actually been how we've managed this issue from the very start.  We
> create directories like:
>
> /usr/local/projects/ergatis/package-v1r26
> /usr/local/projects/ergatis/package-v1r27
> /usr/local/projects/ergatis/package-v1r28
> /usr/local/projects/ergatis/package-devel
> /usr/local/projects/ergatis/package-latest
>
> Each of the numbered directories is an actual release, 'package-devel'
> is where developers manually edit code for testing, and 'package-latest' is
> an auto-install from a nightly cron off the SVN trunk.  When a new project
> is created, the project.conf file points to one of the numbered release
> directories and usually stays on that version unless there's a reason to
> change it.  This makes it so that if we want to change something about how
> the results are stored (in files or the database) we don't have to pass
> through all previous projects ever worked on and update them all too.  Older
> projects continue to use the older code version, and new projects newer
> code.  It's mostly important that they remain internally consistent and not
> broken.
>
>
>> * database versions
>>
>> This is doable by storing all versions of a db and referencing explicit
>> versions in project.config. However, this means every db update becomes a
>> new pipeline version and Ergatis project. Also, if you're on cheapo NFS
>> like us, you want these databases on local disk so space could become an
>> issue.
>>
>>
> I guess it's not completely clear to me what's in these databases .... the
> products of your annotation pipeline runs?  If so, the same versioning info
> I described above applies (at least if you're using Ergatis to store and
> access your Chado data).  We've had several schema updates over the years
> that give a new database version but since the data access API is versioned
> right along with the database on disk we can update only the projects we
> choose to and the rest still work.
>
>
>
>> * executable versions
>>
>> This is the same as database versions. Combining the two means you
>> probably need a formalized, scheduled pipeline release system where you
>> incorporate updates on a regular schedule.
>>
>>
> Updates don't necessarily need to be on a schedule, but yes, the formal
> release process is always good.
>
>
>
>> * libraries and OS issues
>>
>> Eventually versions of programs are going to get difficult to run or won't
>> compile on new architectures. I'm hesitant to make long term commitments
>> against this because of the uncertainty. With CloVR, it seems technically
>> straightforward as long as someone is willing to pay to let VMs and tagged
>> databases accumulate. Eventually the images may no longer run on Amazon.
>>
>>
> This is certainly true about the program versions.  Gene finders, etc.
> change output formats as major versions are released, which breaks a lot of
> things.  We handle this by organizing our software into directories with
> version numbers attached and then the values in the Ergatis software.config
> file point to these specific versions so you always know which are being used
> (rather than relying on the PATH or other ENV variables).
>
>
>>
>> Is this in line with what others are doing? Am I missing anything or going
>> overboard?
>>
>> Thanks much,
>>
>>  Chris