Best practices multiple gene-prediction pipelines.

Best practices multiple gene-prediction pipelines.

Michael Dondrup
Dear all,

I have a question about best practices for storing multiple gene predictions for the
same genome in a Chado database. We have two automatic gene predictions for our genome, one from Augustus and one from
Maker. Both seem to have their pros and cons, so we would like to keep both for now and display them
as separate annotation tracks in GBrowse. Each pipeline outputs a GFF3 file, and the two files use almost the same
SO terms (e.g. transcript, gene, CDS, exon, …).

I have imported one file, but I am afraid that when I import the second one the
predictions will get mixed up and will be distinguishable by name only. Or is the source field of the GFF file stored in
Chado? What are the best practices for storing different sets of annotations that share the same sequence types,
possibly using different databases?

Thank you very much for your comments.

Regards

Michael Dondrup
Postdoctoral researcher
The Sea Lice Research Centre
Department of Informatics
University of Bergen
Thormøhlensgate 55, N-5008 Bergen,
Norway


------------------------------------------------------------------------------
This SF email is sponsored by:
Try Windows Azure free for 90 days Click Here
http://p.sf.net/sfu/sfd2d-msazure
_______________________________________________
Gmod-schema mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/gmod-schema

Re: Best practices multiple gene-prediction pipelines.

Scott Cain
Hi Michael,

Yes, the source field is stored in Chado (I think it is stored in
dbxref.accession and is linked to the feature via feature_dbxref, but
I'd have to look it up to be sure). In any event, you can use the
source in a GBrowse track stanza for a Chado database in exactly the same
way as you would for a Bio::DB::GFF or Bio::DB::SeqFeature::Store
database.

Scott





--
------------------------------------------------------------------------
Scott Cain, Ph. D.                                   scott at scottcain dot net
GMOD Coordinator (http://gmod.org/)                     216-392-3087
Ontario Institute for Cancer Research


Re: Best practices multiple gene-prediction pipelines.

Siddhartha Basu
Hi,
As Scott said, this is exactly the data model (feature -> feature_dbxref
-> dbxref -> db) that we have used at dictyBase for the last 8 years to store multiple gene
predictions, which eventually get displayed in GBrowse as separate
tracks. The source name is stored in the dbxref.accession column, and each such
dbxref row references a db table row whose db.name column has the value
'GFF_source'. That single db row is reused for every
source name stored in the dbxref table.
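
To make the join path above concrete, here is a minimal sketch in Python. It is only an illustration: it uses an in-memory SQLite database with heavily simplified stand-ins for the four Chado tables (a real Chado instance runs on PostgreSQL and these tables have more columns and constraints), and the feature names and the helper functions build_demo and features_for_source are hypothetical.

```python
import sqlite3

# Simplified stand-ins for four Chado tables. Table and key column names
# follow Chado, but everything else is trimmed down for the demo.
SCHEMA = """
CREATE TABLE db (db_id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE dbxref (
    dbxref_id INTEGER PRIMARY KEY,
    db_id INTEGER REFERENCES db(db_id),
    accession TEXT
);
CREATE TABLE feature (feature_id INTEGER PRIMARY KEY, uniquename TEXT);
CREATE TABLE feature_dbxref (
    feature_id INTEGER REFERENCES feature(feature_id),
    dbxref_id INTEGER REFERENCES dbxref(dbxref_id)
);
"""

# feature -> feature_dbxref -> dbxref -> db, restricted to the 'GFF_source' db row.
SOURCE_QUERY = """
SELECT f.uniquename
FROM feature f
JOIN feature_dbxref fd ON fd.feature_id = f.feature_id
JOIN dbxref x ON x.dbxref_id = fd.dbxref_id
JOIN db d ON d.db_id = x.db_id
WHERE d.name = 'GFF_source' AND x.accession = ?
ORDER BY f.uniquename
"""


def build_demo():
    """Create the demo tables and load three made-up gene features."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    # One db row named 'GFF_source' is shared by all source names.
    conn.execute("INSERT INTO db VALUES (1, 'GFF_source')")
    # One dbxref row per distinct source name.
    conn.executemany("INSERT INTO dbxref VALUES (?, 1, ?)",
                     [(1, "AUGUSTUS"), (2, "maker")])
    conn.executemany("INSERT INTO feature VALUES (?, ?)",
                     [(1, "aug_gene_1"), (2, "aug_gene_2"), (3, "maker_gene_1")])
    conn.executemany("INSERT INTO feature_dbxref VALUES (?, ?)",
                     [(1, 1), (2, 1), (3, 2)])
    return conn


def features_for_source(conn, source):
    """Return the uniquenames of all features tagged with the given GFF source."""
    return [row[0] for row in conn.execute(SOURCE_QUERY, (source,))]
```

Calling features_for_source(build_demo(), "maker") returns only the feature that was linked to the 'maker' dbxref row, which is how two prediction sets can share one database and still be told apart.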

thanks,
-siddhartha



Re: Best practices multiple gene-prediction pipelines.

Michael Dondrup
Thank you, Scott and Siddhartha, for your help.

In fact I should have looked more closely or just tried it out before
asking, because it is documented and it works. For what it's worth,
I imported both predictions from GFF3 using gmod_bulk_load_gff3.pl;
one file has 'maker' in the source column and the other has 'AUGUSTUS'.
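
Before loading, it can help to confirm which source values a GFF3 file actually carries, since that string (column 2) is what the tracks are later separated on. The following is a small sketch, not part of any GMOD tool: the function name count_by_source and the demo records are made up for illustration.

```python
from collections import Counter


def count_by_source(gff3_lines, feature_type="gene"):
    """Count features of one type per value of the GFF3 source column (column 2)."""
    counts = Counter()
    for line in gff3_lines:
        line = line.rstrip("\n")
        # Skip blank lines, directives (##...) and comments (#...).
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        # A GFF3 feature line has exactly nine tab-separated columns;
        # anything else (e.g. embedded FASTA) is ignored here.
        if len(cols) != 9:
            continue
        if cols[2] == feature_type:
            counts[cols[1]] += 1
    return counts


# Made-up records standing in for lines from the two prediction files.
DEMO = [
    "##gff-version 3",
    "scaffold1\tAUGUSTUS\tgene\t100\t900\t.\t+\t.\tID=g1",
    "scaffold1\tAUGUSTUS\tmRNA\t100\t900\t.\t+\t.\tID=g1.t1;Parent=g1",
    "scaffold1\tmaker\tgene\t120\t950\t.\t+\t.\tID=m1",
]
```

For a real file, passing an open file handle works the same way, e.g. count_by_source(open(path)) with path pointing at one of the GFF3 files. Note that the source values are case-sensitive strings, so 'AUGUSTUS' and 'Augustus' would be counted separately.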

Then in GBrowse, I configured two subtracks for the gene prediction
track by adding:
[genepred]
category     = Genes:Coding
feature      = gene
glyph        = gene
### add this to separate predictions
subtrack select = Pipeline source_tag
subtrack table  = :Augustus AUGUSTUS ;
                  :EBI maker
subtrack select labels = AUGUSTUS "Augustus" ;
                         maker "EBI"
[….]

Best
Michael


