Bug uploading gene models with gmod_bulk_load_gff3.pl

lpritc@scri.ac.uk
Hi,

I have two gene models that are almost equivalent, and when added together
the upload script seems to generate an additional polypeptide that shouldn't
be there.  Maybe this will be fixed in the next release - I've not spotted
this problem on the lists, though.

The models are:

#gff-version 3
SuperContig8    genezilla    gene    574304    575522    .    +    .    Name=801852;ID=801852
SuperContig8    genezilla    mRNA    574304    575522    .    +    .    ID=801852:mrna;Parent=801852
SuperContig8    genezilla    exon    574304    574690    23.36    +    0    ID=801852:exon1;Parent=801852:mrna
SuperContig8    genezilla    CDS    574304    574690    23.36    +    0    ID=801852:cds1;Parent=801852:mrna
SuperContig8    genezilla    exon    574826    574935    27.52    +    0    ID=801852:exon2;Parent=801852:mrna
SuperContig8    genezilla    CDS    574826    574935    27.52    +    0    ID=801852:cds2;Parent=801852:mrna
SuperContig8    genezilla    exon    575004    575031    26.17    +    2    ID=801852:exon3;Parent=801852:mrna
SuperContig8    genezilla    CDS    575004    575031    26.17    +    2    ID=801852:cds3;Parent=801852:mrna
SuperContig8    genezilla    exon    575217    575522    25.71    +    0    ID=801852:exon4;Parent=801852:mrna
SuperContig8    genezilla    CDS    575217    575522    25.71    +    0    ID=801852:cds4;Parent=801852:mrna
SuperContig8    cegma    gene    574304    575522    .    +    .    Name=801851;ID=801851
SuperContig8    cegma    mRNA    574304    575522    .    +    .    ID=801851:mrna;Parent=801851
SuperContig8    cegma    exon    574304    574690    116.79    +    0    ID=801851:exon1;Parent=801851:mrna
SuperContig8    cegma    CDS    574304    574690    116.79    +    0    ID=801851:cds1;Parent=801851:mrna
SuperContig8    cegma    exon    574826    574935    42.45    +    0    ID=801851:exon2;Parent=801851:mrna
SuperContig8    cegma    CDS    574826    574935    42.45    +    0    ID=801851:cds2;Parent=801851:mrna
SuperContig8    cegma    exon    575004    575148    66.41    +    1    ID=801851:exon3;Parent=801851:mrna
SuperContig8    cegma    CDS    575004    575148    66.41    +    1    ID=801851:cds3;Parent=801851:mrna
SuperContig8    cegma    exon    575217    575522    108.98    +    0    ID=801851:exon4;Parent=801851:mrna
SuperContig8    cegma    CDS    575217    575522    108.98    +    0    ID=801851:cds4;Parent=801851:mrna

The only differences (apart from the IDs/sources) are in the score column,
and in the phase and end point of one exon (I know that there's something funny
about the phases here, but that's a separate issue and doesn't affect this
problem).
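
The structural difference between the two models can be checked mechanically. The following is a minimal Python sketch, not part of the original report: it compares the exon lines of the two models, ignoring IDs, sources and scores. The `Feature`, `parse` and `structural_diff` names are hypothetical, and the sketch splits on whitespace because the lines as posted are space-separated (a real GFF3 file is tab-delimited).

```python
from collections import namedtuple

# Only the fields that describe structure; source, score and IDs are ignored.
Feature = namedtuple("Feature", "type start end phase")

# Exon lines of both models, as posted (space-separated here, tabs in a real file).
GFF3 = """\
SuperContig8 genezilla exon 574304 574690 23.36 + 0 ID=801852:exon1;Parent=801852:mrna
SuperContig8 genezilla exon 574826 574935 27.52 + 0 ID=801852:exon2;Parent=801852:mrna
SuperContig8 genezilla exon 575004 575031 26.17 + 2 ID=801852:exon3;Parent=801852:mrna
SuperContig8 genezilla exon 575217 575522 25.71 + 0 ID=801852:exon4;Parent=801852:mrna
SuperContig8 cegma exon 574304 574690 116.79 + 0 ID=801851:exon1;Parent=801851:mrna
SuperContig8 cegma exon 574826 574935 42.45 + 0 ID=801851:exon2;Parent=801851:mrna
SuperContig8 cegma exon 575004 575148 66.41 + 1 ID=801851:exon3;Parent=801851:mrna
SuperContig8 cegma exon 575217 575522 108.98 + 0 ID=801851:exon4;Parent=801851:mrna
"""


def parse(text):
    """Return {source: [Feature, ...]} in file order, skipping comments."""
    models = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        _seqid, source, ftype, start, end, _score, _strand, phase, _attrs = \
            line.split(None, 8)
        models.setdefault(source, []).append(
            Feature(ftype, int(start), int(end), phase))
    return models


def structural_diff(a, b):
    """Pair features by position and keep the pairs that are not identical."""
    return [(x, y) for x, y in zip(a, b) if x != y]


models = parse(GFF3)
diffs = structural_diff(models["genezilla"], models["cegma"])
for left, right in diffs:
    print(left, "vs", right)
```

On the lines above, the only pair reported is the third exon, which ends at 575031 with phase 2 in the genezilla model and at 575148 with phase 1 in the cegma model.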

When loaded into the current release of CHADO with:

gmod_bulk_load_gff3.pl --gfffile test.gff3 --organism "Hyaloperonospora arabidopsidis EMOY2" --dbname oomycete_reference --dbuser ****** --dbpass ****** --dbhost localhost --noexon

The two models are far from equivalent when viewed in GBROWSE v2.13 (see
attachments).  The genezilla model displays as expected, but there is an
extra polypeptide generated for the cegma model that spans the full mRNA
(feature_id:175303, here):

oomycete_reference=> SELECT feature_overlaps(175297);
 feature_overlaps
----------------------------------------------------------------------------
 (175291,,29,801852,801852,,,,806,f,f,"2010-08-05 14:14:49.902149","2010-08-05 14:14:49.902149","'801852':1,2")
 (175292,,29,801852:mrna,801852:mrna,,,,336,f,f,"2010-08-05 14:14:49.902149","2010-08-05 14:14:49.902149","'801852':1,3 'mrna':2,4")
 (175293,,29,801852:exon1,801852:exon1,,,,249,f,f,"2010-08-05 14:14:49.902149","2010-08-05 14:14:49.902149","'801852':1,3 'exon1':2,4")
 (175294,,29,801852:exon2,801852:exon2,,,,249,f,f,"2010-08-05 14:14:49.902149","2010-08-05 14:14:49.902149","'801852':1,3 'exon2':2,4")
 (175295,,29,801852:exon3,801852:exon3,,,,249,f,f,"2010-08-05 14:14:49.902149","2010-08-05 14:14:49.902149","'801852':1,3 'exon3':2,4")
 (175296,,29,801852:exon4,801852:exon4,,,,249,f,f,"2010-08-05 14:14:49.902149","2010-08-05 14:14:49.902149","'801852':1,3 'exon4':2,4")
 (175297,,29,801851,801851,,,,806,f,f,"2010-08-05 14:14:49.902149","2010-08-05 14:14:49.902149","'801851':1,2")
 (175298,,29,801851:mrna,801851:mrna,,,,336,f,f,"2010-08-05 14:14:49.902149","2010-08-05 14:14:49.902149","'801851':1,3 'mrna':2,4")
 (175299,,29,801851:exon1,801851:exon1,,,,249,f,f,"2010-08-05 14:14:49.902149","2010-08-05 14:14:49.902149","'801851':1,3 'exon1':2,4")
 (175300,,29,801851:exon2,801851:exon2,,,,249,f,f,"2010-08-05 14:14:49.902149","2010-08-05 14:14:49.902149","'801851':1,3 'exon2':2,4")
 (175301,,29,801851:exon3,801851:exon3,,,,249,f,f,"2010-08-05 14:14:49.902149","2010-08-05 14:14:49.902149","'801851':1,3 'exon3':2,4")
 (175302,,29,801851:exon4,801851:exon4,,,,249,f,f,"2010-08-05 14:14:49.902149","2010-08-05 14:14:49.902149","'801851':1,3 'exon4':2,4")
 (175303,,29,polypeptide-auto175303,auto175303,,,,206,f,f,"2010-08-05 14:14:49.902149","2010-08-05 14:14:49.902149","'auto175303':3,4 'polypeptid':2 'polypeptide-auto175303':1")
 (175304,,29,polypeptide-auto175304,auto175304,,,,206,f,f,"2010-08-05 14:14:49.902149","2010-08-05 14:14:49.902149","'auto175304':3,4 'polypeptid':2 'polypeptide-auto175304':1")
 (175305,,29,polypeptide-auto175305,auto175305,,,,206,f,f,"2010-08-05 14:14:49.902149","2010-08-05 14:14:49.902149","'auto175305':3,4 'polypeptid':2 'polypeptide-auto175305':1")
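
The surplus can be read off the rows above by tallying type_id values. This is a rough Python sketch rather than anything the loader provides: it assumes, from the uniquenames in this output only, that type_id 336 is mRNA and 206 is polypeptide in this particular database, and that one polypeptide per mRNA is the expected result. The `ROWS` list is transcribed by hand from the query output.

```python
from collections import Counter

# (feature_id, uniquename, type_id), transcribed from the feature_overlaps rows.
ROWS = [
    (175291, "801852", 806),
    (175292, "801852:mrna", 336),
    (175293, "801852:exon1", 249),
    (175294, "801852:exon2", 249),
    (175295, "801852:exon3", 249),
    (175296, "801852:exon4", 249),
    (175297, "801851", 806),
    (175298, "801851:mrna", 336),
    (175299, "801851:exon1", 249),
    (175300, "801851:exon2", 249),
    (175301, "801851:exon3", 249),
    (175302, "801851:exon4", 249),
    (175303, "auto175303", 206),
    (175304, "auto175304", 206),
    (175305, "auto175305", 206),
]

# Assumption: these cvterm ids are inferred from the uniquenames above and
# will differ in another Chado instance.
MRNA_TYPE_ID = 336
POLYPEPTIDE_TYPE_ID = 206

counts = Counter(type_id for _feature_id, _uniquename, type_id in ROWS)

# Assumption: each mRNA should yield exactly one polypeptide.
surplus = counts[POLYPEPTIDE_TYPE_ID] - counts[MRNA_TYPE_ID]
print("mRNAs:", counts[MRNA_TYPE_ID])
print("polypeptides:", counts[POLYPEPTIDE_TYPE_ID])
print("surplus polypeptides:", surplus)
```

With two mRNAs and three polypeptides in the output, the tally leaves one polypeptide unaccounted for.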

When viewed in GBROWSE with -inferCDS, as you can see from the attachment,
this extra polypeptide shows up as an additional feature in the gene track,
and as two extra reading frames where the CDS are automatically inferred.

It doesn't matter in which order the gene models are presented in the GFF3 file;
the cegma model (lowest-valued ID) is the one with the error.

The problem also occurs if I provide duplicate genezilla gene models with
differing IDs - the lowest-valued ID has the error.

The problem disappears if I add the models separately.
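
For the two-source case, loading separately could be prepared with something like the sketch below. It is only an illustration under assumptions: `split_by_source` is a hypothetical helper that groups lines on column 2, so it would not separate duplicate models that share a source, and it splits on whitespace to match the lines as posted. The header it writes uses the two-hash form of the version directive given in the GFF3 specification.

```python
def split_by_source(gff3_text):
    """Return {source: gff3_text} with one self-contained GFF3 text per source."""
    header = "##gff-version 3"  # directive form used by the GFF3 specification
    groups = {}
    for line in gff3_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # drop blank lines and comments/directives; header is re-added
        source = line.split(None, 2)[1]  # column 2 of the feature line
        groups.setdefault(source, [header]).append(line)
    return {src: "\n".join(lines) + "\n" for src, lines in groups.items()}


SAMPLE = """\
#gff-version 3
SuperContig8 genezilla gene 574304 575522 . + . Name=801852;ID=801852
SuperContig8 genezilla mRNA 574304 575522 . + . ID=801852:mrna;Parent=801852
SuperContig8 cegma gene 574304 575522 . + . Name=801851;ID=801851
SuperContig8 cegma mRNA 574304 575522 . + . ID=801851:mrna;Parent=801851
"""

parts = split_by_source(SAMPLE)
for source, text in sorted(parts.items()):
    print(source, "->", len(text.splitlines()) - 1, "feature lines")
```

Each resulting text could then be written to its own file and passed to the loader in a separate run.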

Is this a known issue that might go away with next week's release?

Cheers,

L.


--
Dr Leighton Pritchard MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:[hidden email]       w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C       tel:+44(0)1382 562731 x2405



______________________________________________________
SCRI, Invergowrie, Dundee, DD2 5DA.  
The Scottish Crop Research Institute is a charitable company limited by guarantee.
Registered in Scotland No: SC 29367.
Recognised by the Inland Revenue as a Scottish Charity No: SC 006662.


DISCLAIMER:

This email is from the Scottish Crop Research Institute, but the views expressed by the sender are not necessarily the views of SCRI and its subsidiaries.  This email and any files transmitted with it are confidential to the intended recipient at the e-mail address to which it has been addressed.  It may not be disclosed or used by any other than that addressee.
If you are not the intended recipient you are requested to preserve this confidentiality and you must not use, disclose, copy, print or rely on this e-mail in any way. Please notify [hidden email] quoting the name of the sender and delete the email from your system.

Although SCRI has taken reasonable precautions to ensure no viruses are present in this email, neither the Institute nor the sender accepts any responsibility for any viruses, and it is your responsibility to scan the email and the attachments (if any).
______________________________________________________
_______________________________________________
Gmod-schema mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/gmod-schema

Attachment: Screen shot 2010-08-05 at Thursday, August 5 14.08.09.png (41K)

Re: Bug uploading gene models with gmod_bulk_load_gff3.pl

Scott Cain
Hi Leighton,

Do you have "ignore_sub_part = polypeptide" in the track stanza?  I
realize that doesn't solve the phantom polypeptide directly.  Also,
could you send along that GFF sample as an attachment; I'd like to try
a few things on it.

Thanks,
Scott




--
------------------------------------------------------------------------
Scott Cain, Ph. D.                                   scott at scottcain dot net
GMOD Coordinator (http://gmod.org/)                     216-392-3087
Ontario Institute for Cancer Research
