Howto annotate multipart CDS, IDs or not?

Michael Dondrup-3
Hi,

I have a GFF3 file with spliced transcripts and therefore multipart CDS features. I wish to import them into Chado and
then display them in GBrowse2. I am unsure how to annotate the CDS parts so that users can
retrieve the joined coding sequence for each gene. Should all parts of a CDS share a single ID,
or is that not necessary?
For example:

LSalAtl2s1      .       CDS     188834  189007  .       +       0       ID=EMLSAP00000003688-CDS;Parent=EMLSAT00000003688;protein_id=EMLSAP00000003688;rank=1
LSalAtl2s1      .       CDS     189059  189382  .       +       0       ID=EMLSAP00000003688-CDS;Parent=EMLSAT00000003688;protein_id=EMLSAP00000003688;rank=2

Also, I noticed that my types are "protein_coding_gene" and "transcript" instead of "gene" and "mRNA". Does GBrowse2 rely on these types anywhere
other than the "gene" glyph? In other words, would you recommend that I change those types?

Thank you very much. An example gene model follows below.
Michael


LSalAtl2s1      ensembl protein_coding_gene     169164  204974  .       +       .       ID=EMLSAG00000003688;biotype=protein_coding;description=maker-LSalAtl2s1-augustus-gene-2.21;logic_name=ensemblgenomes
LSalAtl2s1      ensembl transcript      169164  204974  .       +       .       ID=EMLSAT00000003688;Parent=EMLSAG00000003688;biotype=protein_coding;logic_name=ensemblgenomes
LSalAtl2s1      .       exon    169164  169208  .       +       .       ID=EMLSAE00000016048;Parent=EMLSAT00000003688;ensembl_end_phase=-1;ensembl_phase=-1;rank=1
LSalAtl2s1      .       five_prime_UTR  169164  188833  .       +       .       Parent=EMLSAT00000003688;
LSalAtl2s1      .       CDS     188834  189007  .       +       0       Parent=EMLSAT00000003688;protein_id=EMLSAP00000003688;rank=1
LSalAtl2s1      .       CDS     189059  189382  .       +       0       Parent=EMLSAT00000003688;protein_id=EMLSAP00000003688;rank=2
LSalAtl2s1      .       CDS     198618  198759  .       +       0       Parent=EMLSAT00000003688;protein_id=EMLSAP00000003688;rank=3
LSalAtl2s1      .       CDS     204808  204974  .       +       0       Parent=EMLSAT00000003688;protein_id=EMLSAP00000003688;rank=4
LSalAtl2s1      .       exon    188822  189007  .       +       .       ID=EMLSAE00000016049;Parent=EMLSAT00000003688;ensembl_phase=-1;rank=2
LSalAtl2s1      .       exon    189059  189382  .       +       .       ID=EMLSAE00000016050;Parent=EMLSAT00000003688;rank=3
LSalAtl2s1      .       exon    198618  198759  .       +       .       ID=EMLSAE00000016051;Parent=EMLSAT00000003688;ensembl_end_phase=1;rank=4
LSalAtl2s1      .       exon    204808  204974  .       +       .       ID=EMLSAE00000016052;Parent=EMLSAT00000003688;ensembl_phase=1;rank=5



_______________________________________________
Gmod-gbrowse mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/gmod-gbrowse

Re: Howto annotate multipart CDS, IDs or not?

Scott Cain
Hi Michael,

For loading into Chado, I believe the loader ignores the IDs of CDS features and only pays attention to the Parent (this is a problem for polycistronic genes though--bummer). The loader will convert the CDS features into a polypeptide feature that starts at the start of the CDS translation and ends at the end of translation. The loader will also create exon features if they aren't already present in the GFF file, but yours appears to have them (supply the --noexon flag to the loader to suppress its creating exons). Also supply the "-inferCDS 1" option in the database connection stanza in the GBrowse config (see the 07.chado.conf sample config for an example).
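
To make the polypeptide span concrete, here is a minimal Python sketch that groups CDS parts by their Parent and reports the overall start and end for each. It only illustrates the behaviour described above and is not the loader's actual code; the function name polypeptide_spans is made up for this example, and it assumes tab-separated GFF3 lines.

```python
from collections import defaultdict


def polypeptide_spans(gff_lines):
    """Return {parent_id: (start, end)} spanning all CDS parts of each parent.

    Hypothetical helper for illustration only; assumes tab-separated GFF3.
    """
    parts = defaultdict(list)
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue  # only CDS rows matter here
        # Column 9 holds key=value pairs separated by semicolons.
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        parent = attrs.get("Parent")
        if parent:
            parts[parent].append((int(cols[3]), int(cols[4])))
    return {
        parent: (min(s for s, _ in segs), max(e for _, e in segs))
        for parent, segs in parts.items()
    }


# The four CDS parts from the example gene model in this thread.
ATTRS = "Parent=EMLSAT00000003688;protein_id=EMLSAP00000003688;rank="
EXAMPLE = [
    "\t".join(["LSalAtl2s1", ".", "CDS", start, end, ".", "+", "0", ATTRS + rank])
    for start, end, rank in [
        ("188834", "189007", "1"),
        ("189059", "189382", "2"),
        ("198618", "198759", "3"),
        ("204808", "204974", "4"),
    ]
]

spans = polypeptide_spans(EXAMPLE)
print(spans)
```

For the example gene model this yields a single span from 188834 to 204974 for EMLSAT00000003688, regardless of whether the CDS lines carry an ID.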

For GBrowse, none of the "specialty" glyphs know about protein_coding_gene features (the only one that could be expected to is gene.pm), but there are specialty glyphs that know about transcript features (transcript.pm would be one of those). If you want to use the gene glyph, your options are either to modify the GFF before you load it, converting the protein_coding_gene features, or to subclass the gene.pm glyph to add support for protein_coding_gene features.
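
The first option (modifying the GFF before loading) could be sketched in Python roughly as follows. This is an illustration under assumptions, not a GMOD tool: the retype_line function and TYPE_MAP table are made up for this example, the input is assumed to be tab-separated GFF3, and "transcript" is only renamed to "mRNA" when the line carries biotype=protein_coding, since other transcript types should not become mRNA.

```python
# Hypothetical mapping from the types in the file to the types GBrowse glyphs expect.
TYPE_MAP = {"protein_coding_gene": "gene", "transcript": "mRNA"}


def retype_line(line):
    """Return the GFF3 line with column 3 renamed according to TYPE_MAP."""
    if line.startswith("#") or not line.strip():
        return line  # leave directives, comments and blank lines alone
    newline = "\n" if line.endswith("\n") else ""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        return line  # not a feature row; pass through unchanged
    # Assumption: only protein-coding transcripts are safe to call mRNA.
    if cols[2] == "transcript" and "biotype=protein_coding" not in cols[8].split(";"):
        return line
    cols[2] = TYPE_MAP.get(cols[2], cols[2])
    return "\t".join(cols) + newline


gene_line = "\t".join([
    "LSalAtl2s1", "ensembl", "protein_coding_gene", "169164", "204974",
    ".", "+", ".", "ID=EMLSAG00000003688;biotype=protein_coding",
]) + "\n"
transcript_line = "\t".join([
    "LSalAtl2s1", "ensembl", "transcript", "169164", "204974", ".", "+", ".",
    "ID=EMLSAT00000003688;Parent=EMLSAG00000003688;biotype=protein_coding",
]) + "\n"

print(retype_line(gene_line), end="")
print(retype_line(transcript_line), end="")
```

Applied line by line to the whole file, this leaves exon, CDS and UTR rows untouched and changes only the type column of the gene and transcript rows.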

Scott



--
------------------------------------------------------------------------
Scott Cain, Ph. D.                                   scott at scottcain dot net
GMOD Coordinator (http://gmod.org/)                     216-392-3087
Ontario Institute for Cancer Research

Re: Howto annotate multipart CDS, IDs or not?

Michael Dondrup-3
Hi Scott,

Thank you for the answer. I renamed the features and everything looks as it should; the "polypeptide" DNA sequence also seems to
be correctly composed of multiple segments. The only thing I am
not completely happy with is the automatic names for the polypeptides, which look like polypeptide-auto557281,
while my CDS features carry protein_id attributes such as protein_id=EMLSAP00000003688. Would it be possible to
make the bulk loader use the protein ID as the polypeptide name instead?
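
Whether the bulk loader can do this is not settled in this thread. As a starting point for any renaming, the relationship between transcripts and protein IDs can at least be read from the GFF itself. The Python sketch below builds that lookup table; the function name protein_ids_by_transcript is made up for this example, tab-separated GFF3 is assumed, and how the table would be applied to the loaded polypeptide features is deliberately left open.

```python
def protein_ids_by_transcript(gff_lines):
    """Return {transcript_id: protein_id} collected from CDS rows.

    Hypothetical helper for illustration only; assumes tab-separated GFF3.
    """
    mapping = {}
    for line in gff_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue  # skip comments, blank lines and non-CDS features
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        parent = attrs.get("Parent")
        protein_id = attrs.get("protein_id")
        if parent and protein_id:
            # Every part of a multipart CDS repeats the same protein_id,
            # so the first occurrence is kept.
            mapping.setdefault(parent, protein_id)
    return mapping


cds_rows = [
    "\t".join(["LSalAtl2s1", ".", "CDS", "188834", "189007", ".", "+", "0",
               "Parent=EMLSAT00000003688;protein_id=EMLSAP00000003688;rank=1"]),
    "\t".join(["LSalAtl2s1", ".", "CDS", "189059", "189382", ".", "+", "0",
               "Parent=EMLSAT00000003688;protein_id=EMLSAP00000003688;rank=2"]),
]

lookup = protein_ids_by_transcript(cds_rows)
print(lookup)
```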

Michael


