[Gmod-schema] GFF3: Phase problems in CDS features

[Gmod-schema] GFF3: Phase problems in CDS features

lpritc@scri.ac.uk
Hi,

I've had a look through the archives at SourceForge, and found an email
describing a similar problem, by Erika Sallett
(http://sourceforge.net/mailarchive/forum.php?thread_name=1251738304.3814.331.camel@lipmCinfoES&forum_name=song-devel),
but even after the correction I think there may still be a problem with the
GFF3 example at http://www.sequenceontology.org/gff3.shtml, and that this
might be feeding through to other applications.

Considering cds00004 in the example from
http://www.sequenceontology.org/gff3.shtml:

ctg123 . CDS  3391  3902  .  +  0  ID=cds00004;Parent=mRNA00003;Name=edenprotein.4
ctg123 . CDS  5000  5500  .  +  1  ID=cds00004;Parent=mRNA00003;Name=edenprotein.4
ctg123 . CDS  7000  7600  .  +  2  ID=cds00004;Parent=mRNA00003;Name=edenprotein.4

The first exon has length (3902-3391+1) = 512 = 2mod3, so the following exon
should have phase 1 (as pointed out by Erika, and later corrected).

But the second exon has length (5500-5000+1) = 501 = 0mod3; this exon must
then contain the one base that completes the codon begun in the first exon,
plus the first two bases of the next split codon, and so the third exon
should have phase 1, rather than phase 2.  The general rule being (where CDS
are indexed in 5'->3' sequential order):

Phase of CDS_1 = 0

Phase of CDS_n = (phase of CDS_{n-1} - length of CDS_{n-1})mod3, n > 1

[or alternatively:

Phase(CDS_n) = -(sum(len(CDS_k), k=1..n-1))mod3, n > 1]

In both cases, the nominal phase of CDS_{n+1} where n=(number of CDS) should
be zero, as a check.
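
The recurrence, the closed form and the end-of-CDS check above can be sketched in Python as follows. This is only an illustration: the function names are made up here, the CDS segments are assumed to be given as (start, end) pairs already in 5'->3' order, and the first CDS is assumed to start on a codon boundary.

```python
def cds_length(start, end):
    """Length of a feature with 1-based, inclusive GFF3 coordinates."""
    return end - start + 1

def phases_recurrence(segments):
    """Phase of each CDS via phase_n = (phase_{n-1} - len_{n-1}) mod 3."""
    phases = []
    phase = 0  # assumption: CDS_1 starts at the first base of a codon
    for start, end in segments:
        phases.append(phase)
        phase = (phase - cds_length(start, end)) % 3
    return phases

def phases_closed_form(segments):
    """Phase of each CDS via phase_n = -(sum of preceding lengths) mod 3."""
    phases = []
    preceding = 0  # total length of the CDS segments seen so far
    for start, end in segments:
        phases.append((-preceding) % 3)
        preceding += cds_length(start, end)
    return phases

def nominal_next_phase(segments):
    """Phase a hypothetical CDS_{n+1} would have; 0 when the total is 0mod3."""
    total = sum(cds_length(start, end) for start, end in segments)
    return (-total) % 3

# cds00004 from the GFF3 example page
eden = [(3391, 3902), (5000, 5500), (7000, 7600)]
print(phases_recurrence(eden))   # [0, 1, 1]
print(phases_closed_form(eden))  # [0, 1, 1]
print(nominal_next_phase(eden))  # 0
```

Both forms agree on phase 1 for the third segment, and the nominal phase after the last segment is zero because the three lengths sum to 1614 = 0mod3.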

The problem seems to extend farther than the example on that page, though -
I've seen the same problem in the MAKER output used for the GMOD/CHADO
example tutorial at http://gmod.org/wiki/Chado_Tutorial.  The example data
in the file

ftp://ftp.gmod.org/pub/gmod/Courses/2009/SummerSchoolEurope/GMOD_sample_data.gff.zip

(this example is GMOD/CHADO and not SO, so I'm cross-posting) contains an
entry for PYU1_G008865, which has seven exon/CDS features with Parent
PYU1_T008883.  The CDS features for this entry in the file are as described
below:

scf1117875581239    maker    CDS    13459    13616    .    +    0    ID=PYU1_T008883:cds:1;Parent=PYU1_T008883;
scf1117875581239    maker    CDS    13643    13808    .    +    1    ID=PYU1_T008883:cds:2;Parent=PYU1_T008883;
scf1117875581239    maker    CDS    13893    13952    .    +    2    ID=PYU1_T008883:cds:3;Parent=PYU1_T008883;
scf1117875581239    maker    CDS    14050    14313    .    +    0    ID=PYU1_T008883:cds:4;Parent=PYU1_T008883;
scf1117875581239    maker    CDS    14404    19177    .    +    0    ID=PYU1_T008883:cds:5;Parent=PYU1_T008883;
scf1117875581239    maker    CDS    19264    19706    .    +    2    ID=PYU1_T008883:cds:6;Parent=PYU1_T008883;
scf1117875581239    maker    CDS    19842    20093    .    +    1    ID=PYU1_T008883:cds:7;Parent=PYU1_T008883;

Now, the CDS features have the following lengths mod3, and the corresponding
phases:

cds:1 length=158=2mod3,  so cds:2 should have phase (0-2)mod3=1
cds:2 length=166=1mod3,  so cds:3 should have phase (1-1)mod3=0
cds:3 length=60=0mod3,   so cds:4 should have phase (0-0)mod3=0
cds:4 length=264=0mod3,  so cds:5 should have phase (0-0)mod3=0
cds:5 length=4774=1mod3, so cds:6 should have phase (0-1)mod3=2
cds:6 length=443=2mod3,  so cds:7 should have phase (2-2)mod3=0

- these recalculated phases are exactly those seen in an ARTEMIS
visualisation of this gene model.
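
The same comparison can be scripted. The sketch below is illustrative only: `compare_phases` is a made-up name, the lines are whitespace-separated here for readability (a real GFF3 file is tab-separated), and all CDS lines are assumed to belong to one '+' strand transcript listed in 5'->3' order.

```python
# CDS lines of PYU1_T008883 as recorded in the sample data.
MAKER_CDS = """\
scf1117875581239 maker CDS 13459 13616 . + 0 ID=PYU1_T008883:cds:1;Parent=PYU1_T008883;
scf1117875581239 maker CDS 13643 13808 . + 1 ID=PYU1_T008883:cds:2;Parent=PYU1_T008883;
scf1117875581239 maker CDS 13893 13952 . + 2 ID=PYU1_T008883:cds:3;Parent=PYU1_T008883;
scf1117875581239 maker CDS 14050 14313 . + 0 ID=PYU1_T008883:cds:4;Parent=PYU1_T008883;
scf1117875581239 maker CDS 14404 19177 . + 0 ID=PYU1_T008883:cds:5;Parent=PYU1_T008883;
scf1117875581239 maker CDS 19264 19706 . + 2 ID=PYU1_T008883:cds:6;Parent=PYU1_T008883;
scf1117875581239 maker CDS 19842 20093 . + 1 ID=PYU1_T008883:cds:7;Parent=PYU1_T008883;
"""

def compare_phases(gff_text):
    """Yield (feature ID, recorded phase, recomputed phase) per CDS line."""
    expected = 0  # assumption: the first CDS starts on a codon boundary
    for line in gff_text.splitlines():
        fields = line.split()
        if len(fields) != 9 or fields[2] != "CDS":
            continue  # skip anything that is not a 9-column CDS line
        start, end, recorded = int(fields[3]), int(fields[4]), int(fields[7])
        feature_id = fields[8].split(";")[0].split("=", 1)[1]
        yield feature_id, recorded, expected
        # phase of the next CDS = (this phase - this length) mod 3
        expected = (expected - (end - start + 1)) % 3

for feature_id, recorded, expected in compare_phases(MAKER_CDS):
    flag = "" if recorded == expected else "  <-- differs"
    print(f"{feature_id}: recorded={recorded} recomputed={expected}{flag}")
```

On this data only cds:3 and cds:7 are flagged; the other five recorded phases already agree with the recalculated ones.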

The recorded phases in the original (pre-Erika) example, and in the MAKER
example above appear to have been calculated by the following rule:

Phase of CDS_1 = 0

Phase of CDS_n = -(sum(phase_k, k=1..n-1) + sum(len(CDS_k), k=1..n-1))mod3, n > 1

n      length(mod3)     phase_n    phase_{n+1}
===    ============     =======    ==========
1      158(2)            0          -(0+2)=1mod3
2      166(1)            1          -(1+1+0+2)=2mod3
3      60(0)             2          -(2+0+1+1+0+2)=0mod3
4      264(0)            0          -(0+0+2+0+1+1+0+2)=0mod3
5      4774(1)           0          -(0+1+0+0+2+0+1+1+0+2)=2mod3
6      443(2)            2          -(2+2+0+1+0+0+2+0+1+1+0+2)=1mod3
     
and I think they are incorrect, as can be seen by visualisation of this
feature, and considering that the first two CDS have combined length
158+166=324=0mod3, so the third CDS must start with the first base of a
codon, and so have phase 0.
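
As a quick cross-check of that diagnosis, the suspected rule and the proposed rule can be run side by side on the seven CDS lengths. This is a sketch only: the function names are invented for illustration, and the suspected rule is an inference from the recorded numbers, not something taken from the MAKER source.

```python
def suspected_phases(lengths):
    """Rule that also sums the earlier phases (the one believed to be wrong)."""
    phases = []
    phase_sum = 0   # sum of phases of the preceding CDS segments
    length_sum = 0  # sum of lengths of the preceding CDS segments
    for length in lengths:
        phase = (-(phase_sum + length_sum)) % 3
        phases.append(phase)
        phase_sum += phase
        length_sum += length
    return phases

def proposed_phases(lengths):
    """Rule phase_n = (phase_{n-1} - len_{n-1}) mod 3, with phase_1 = 0."""
    phases = []
    phase = 0
    for length in lengths:
        phases.append(phase)
        phase = (phase - length) % 3
    return phases

# Lengths of PYU1_T008883 cds:1 to cds:7
lengths = [158, 166, 60, 264, 4774, 443, 252]
print(suspected_phases(lengths))  # [0, 1, 2, 0, 0, 2, 1]
print(proposed_phases(lengths))   # [0, 1, 0, 0, 0, 2, 0]
```

The first list matches the phases recorded in the file and the second matches the recalculated ones, which is consistent with the explanation given above.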

Cheers,

L.

--
Dr Leighton Pritchard MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:[hidden email]       w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C       tel:+44(0)1382 562731 x2405


______________________________________________________
SCRI, Invergowrie, Dundee, DD2 5DA.  
The Scottish Crop Research Institute is a charitable company limited by guarantee.
Registered in Scotland No: SC 29367.
Recognised by the Inland Revenue as a Scottish Charity No: SC 006662.


DISCLAIMER:

This email is from the Scottish Crop Research Institute, but the views expressed by the sender are not necessarily the views of SCRI and its subsidiaries.  This email and any files transmitted with it are confidential to the intended recipient at the e-mail address to which it has been addressed.  It may not be disclosed or used by any other than that addressee.
If you are not the intended recipient you are requested to preserve this confidentiality and you must not use, disclose, copy, print or rely on this e-mail in any way. Please notify [hidden email] quoting the name of the sender and delete the email from your system.

Although SCRI has taken reasonable precautions to ensure no viruses are present in this email, neither the Institute nor the sender accepts any responsibility for any viruses, and it is your responsibility to scan the email and the attachments (if any).
______________________________________________________

------------------------------------------------------------------------------

_______________________________________________
Gmod-schema mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/gmod-schema

Re: [SO-devel] GFF3: Phase problems in CDS features

Barry Moore
Thanks for the detailed and helpful e-mail Leighton.  I'll have a look at this tonight and make the necessary corrections to the GFF3 spec.  I'll also forward to the MAKER list.

B

On May 18, 2010, at 3:18 AM, Leighton Pritchard wrote:

> [...]

Re: [SO-devel] GFF3: Phase problems in CDS features

Lincoln Stein
In reply to this post by lpritc@scri.ac.uk
I think you are right. The GFF3 example needs to be fixed yet again.

Lincoln

On Tue, May 18, 2010 at 2:18 AM, Leighton Pritchard <[hidden email]> wrote:
[...]



--
Lincoln D. Stein
Director, Informatics and Biocomputing Platform
Ontario Institute for Cancer Research
101 College St., Suite 800
Toronto, ON, Canada M5G0A3
416 673-8514
Assistant: Renata Musa <[hidden email]>


Re: [SO-devel] GFF3: Phase problems in CDS features

Barry Moore
Thanks again Leighton,

I agree with the changes that you have suggested to the GFF3 specification and have updated it accordingly - http://www.sequenceontology.org/resources/gff3.html. I checked the spec carefully, and I'm pretty sure we've got all of that type of error now.

For those in the Perl world, the phase of the NEXT CDS in a 5' -> 3' series - as suggested by Leighton - looks like this:

$next_phase = ($this_cds_phase - $this_cds_length) % 3;
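
One detail worth keeping in mind when porting that line: the left operand is usually negative, so the result depends on how the language defines `%`. Perl's `%` (without `use integer`) and Python's `%` both return a value in 0..2 for a positive modulus, whereas a C-style truncating remainder returns a negative number. A small Python sketch, using `math.fmod` to stand in for the truncating behaviour (the variable names are just for illustration):

```python
import math

phase, length = 0, 158  # cds:1 of the MAKER example

# Floored modulo: the result takes the sign of the divisor, so it is 0, 1 or 2.
floored = (phase - length) % 3

# Truncating remainder: the result takes the sign of the dividend.
truncated = math.fmod(phase - length, 3)

# Adding 3 and reducing again repairs a truncated result.
repaired = (int(truncated) + 3) % 3

print(floored, truncated, repaired)  # 1 -2.0 1
```

In a language with truncating remainder, the usual workaround is the last form, i.e. `((a - b) % 3 + 3) % 3`.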

Carson, I made a change to line 692 of /maker/lib/Dumper/GFF/GFFV3.pm to correct what appears to be an error in phase calculation there.  I've committed that change, but can you confirm that I did the right thing there next time you update?

Dave, I think that the following Perl should fix up the example data from the Chado tutorial:

perl -i.bak -F'\t' -lane 'BEGIN{$p=0}if($F[2] eq "CDS"){$F[7] = $p;$p=($p-($F[4]-$F[3]+1))%3}else{$p=0};print join "\t", @F' GMOD_sample_data.gff
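
Spelled out, the one-liner walks the file once, overwrites column 8 of every CDS line with a running phase, and resets that phase to 0 at any non-CDS line. A rough Python rendering of the same logic is below; `rephase` is a made-up name, it returns new lines rather than editing the file in place, and it inherits the same assumptions: tab-separated columns, and the CDS lines of each transcript contiguous and already in 5'->3' order (so '+' strand features listed by ascending start).

```python
def rephase(lines):
    """Return GFF3 lines with column 8 recomputed for each run of CDS lines."""
    out = []
    phase = 0
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 8 and fields[2] == "CDS":
            fields[7] = str(phase)
            length = int(fields[4]) - int(fields[3]) + 1
            phase = (phase - length) % 3
        else:
            phase = 0  # any non-CDS line starts a new run of CDS features
        out.append("\t".join(fields))
    return out

example = [
    "##gff-version 3",
    "scf1117875581239\tmaker\tCDS\t13459\t13616\t.\t+\t0\tID=PYU1_T008883:cds:1;Parent=PYU1_T008883;",
    "scf1117875581239\tmaker\tCDS\t13643\t13808\t.\t+\t1\tID=PYU1_T008883:cds:2;Parent=PYU1_T008883;",
    "scf1117875581239\tmaker\tCDS\t13893\t13952\t.\t+\t2\tID=PYU1_T008883:cds:3;Parent=PYU1_T008883;",
]
for line in rephase(example)[1:]:
    print(line.split("\t")[7])  # prints 0, 1, 0 on separate lines
```

Because the running phase is reset by any intervening line, a file in which other features are interleaved between the CDS lines of one transcript would need the features grouped first.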

Barry

On May 24, 2010, at 7:30 PM, Lincoln Stein wrote:

[...]

Re: [SO-devel] GFF3: Phase problems in CDS features

lpritc@scri.ac.uk
Hi,

On Tuesday 25/05/2010, 08:10, "Barry Moore"
<[hidden email]> wrote:

> Thanks again Leighton,
 
[...]

> On May 24, 2010, at 7:30 PM, Lincoln Stein wrote:
>
>> I think you are right. The GFF3 example needs to be fixed yet again.

You're welcome - I'm just glad it's useful, and to be of some help.

Cheers,

L.
 


Re: [maker-devel] [SO-devel] GFF3: Phase problems in CDS features

Carson Hinton Holt
In reply to this post by Barry Moore
Thanks Barry, I added another fix for the minus strand as well.  Also thank you Leighton, MAKER now should produce the correct phase.
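
The minus-strand change itself is not shown in this thread. As a hedged illustration of why the strand matters: phase is counted from the 5' end of each CDS, and on the '-' strand the 5'-most segment is the one with the highest coordinates, so the segments have to be walked in descending order. The sketch below assumes records are simple (parent, start, end, strand) tuples; the function name and the minus-strand transcript `tx_minus` are hypothetical.

```python
from collections import defaultdict

def recompute_phases(records):
    """Map (parent, start, end) -> phase, walking each parent's CDS 5'->3'."""
    by_parent = defaultdict(list)
    for parent, start, end, strand in records:
        by_parent[parent].append((start, end, strand))

    phases = {}
    for parent, group in by_parent.items():
        minus = group[0][2] == "-"
        # 5'->3' order: ascending start on '+', descending end on '-'
        ordered = sorted(group, key=lambda seg: seg[1] if minus else seg[0],
                         reverse=minus)
        phase = 0  # assumption: the 5'-most CDS starts on a codon boundary
        for start, end, _strand in ordered:
            phases[(parent, start, end)] = phase
            phase = (phase - (end - start + 1)) % 3
    return phases

records = [
    ("mRNA00003", 3391, 3902, "+"),
    ("mRNA00003", 5000, 5500, "+"),
    ("mRNA00003", 7000, 7600, "+"),
    ("tx_minus", 100, 199, "-"),  # hypothetical minus-strand transcript
    ("tx_minus", 300, 400, "-"),
]
for key, phase in recompute_phases(records).items():
    print(key, phase)
```

For the hypothetical `tx_minus`, the segment 300..400 is 5'-most and gets phase 0; it is 101 bases long (2mod3), so the segment 100..199 gets phase 1.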

Carson

On 5/25/10 1:10 AM, "Barry Moore" <bmoore@...> wrote:

Thanks again Leighton,

I agree with the changes that you have suggested to the GFF3 specification and have updated it accordingly - http://www.sequenceontology.org/resources/gff3.html. I checked the spec carefully, and I'm pretty sure we've got all of that type of error now.

For those in the perl world the phase of the NEXT CDS in a 5' -> 3' series - as suggested by Leighton - looks like this:

$next_phase = ($this_cds_phase - $this_cds_length) % 3;

Carson, I made a change to line 692 of /maker/lib/Dumper/GFF/GFFV3.pm to correct what appears to be an error in phase calculation there.  I've committed that change, but can you confirm that I did right thing there next time you update.

Dave, I think that the following perl should fix up the example data from the Chado tutorial:

perl -i.bak -F'\t' -lane 'BEGIN{$p=0}if($F[2] eq "CDS"){$F[7] = $p;$p=($p-($F[4]-$F[3]+1))%3}else{$p=0};print join "\t", @F' GMOD_sample_data.gff

Barry

On May 24, 2010, at 7:30 PM, Lincoln Stein wrote:

I think you are right. The GFF3 example needs to be fixed yet again.

Lincoln

On Tue, May 18, 2010 at 2:18 AM, Leighton Pritchard <lpritc@...> wrote:
Hi,

I've had a look through the archives at SourceForge, and found an email
describing a similar problem, by Erika Sallett
(http://sourceforge.net/mailarchive/forum.php?thread_name=1251738304.3814.331.camel@lipmCinfoES&forum_name=song-devel), but even after the correction I
think there may still be a problem with the GFF3 example at
http://www.sequenceontology.org/gff3.shtml, and that this might be feeding
through to other applications.

Considering cds00004 in the example from
http://www.sequenceontology.org/gff3.shtml:

ctg123  .  CDS  3391  3902  .  +  0  ID=cds00004;Parent=mRNA00003;Name=edenprotein.4
ctg123  .  CDS  5000  5500  .  +  1  ID=cds00004;Parent=mRNA00003;Name=edenprotein.4
ctg123  .  CDS  7000  7600  .  +  2  ID=cds00004;Parent=mRNA00003;Name=edenprotein.4

The first exon has length (3902-3391+1) = 512 = 2mod3, so the following exon
should have phase 1 (as pointed out by Erika, and later corrected).

But the second exon has length (5500-5000+1) = 501 = 0mod3; this exon
therefore begins with the trailing base of the codon split across the first
intron and ends with the first two bases of its last codon, and so the third
exon should have phase 1, rather than phase 2.  The general rule being (where
CDS are indexed in 5'->3' sequential order):

Phase of CDS_1 = 0

Phase of CDS_n = (phase of CDS_{n-1} - length of CDS_{n-1})mod3, n > 1

[or alternatively:

Phase(CDS_n) = -(sum(len(CDS_k)), k=1..n-1)mod3, n > 1]

In both cases, the nominal phase of CDS_{n+1} where n=(number of CDS) should
be zero, as a check.
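
As a minimal sketch of this rule (the function name `cds_phases` is made up for illustration and is not from any existing tool), the recurrence and the final check can be written in Python as follows, assuming the CDS segments are supplied as (start, end) pairs in 5'->3' order with GFF3's 1-based inclusive coordinates:

```python
def cds_phases(cds_coords):
    """Return (phases, check) for CDS segments given as (start, end) in 5'->3' order."""
    phases = []
    phase = 0  # Phase of CDS_1 = 0
    for start, end in cds_coords:
        phases.append(phase)
        length = end - start + 1  # inclusive coordinates
        # Phase of CDS_n = (phase of CDS_{n-1} - length of CDS_{n-1}) mod 3
        phase = (phase - length) % 3
    # 'phase' is now the nominal phase of CDS_{n+1}; 0 for a complete CDS
    return phases, phase


# cds00004 from the example above
phases, check = cds_phases([(3391, 3902), (5000, 5500), (7000, 7600)])
print(phases, check)  # [0, 1, 1] 0
```

The final value of 0 is the check described above: the three segments together contain a whole number of codons.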

The problem seems to extend farther than the example on that page, though -
I've seen the same problem in the MAKER output used for the GMOD/CHADO
example tutorial at http://gmod.org/wiki/Chado_Tutorial.  The example data
in the file

ftp://ftp.gmod.org/pub/gmod/Courses/2009/SummerSchoolEurope/GMOD_sample_data.gff.zip

(this example is GMOD/CHADO and not SO, so I'm cross-posting) contains an
entry for PYU1_G008865, which has seven exon/CDS features with ID
PYU1_T008883.  The CDS features for this entry in the file are as described
below:

scf1117875581239    maker    CDS    13459    13616    .    +    0    ID=PYU1_T008883:cds:1;Parent=PYU1_T008883;
scf1117875581239    maker    CDS    13643    13808    .    +    1    ID=PYU1_T008883:cds:2;Parent=PYU1_T008883;
scf1117875581239    maker    CDS    13893    13952    .    +    2    ID=PYU1_T008883:cds:3;Parent=PYU1_T008883;
scf1117875581239    maker    CDS    14050    14313    .    +    0    ID=PYU1_T008883:cds:4;Parent=PYU1_T008883;
scf1117875581239    maker    CDS    14404    19177    .    +    0    ID=PYU1_T008883:cds:5;Parent=PYU1_T008883;
scf1117875581239    maker    CDS    19264    19706    .    +    2    ID=PYU1_T008883:cds:6;Parent=PYU1_T008883;
scf1117875581239    maker    CDS    19842    20093    .    +    1    ID=PYU1_T008883:cds:7;Parent=PYU1_T008883;

Now, the CDS features have the following lengths mod3, and the corresponding
phases:

cds:1 length=158=2mod3,  so cds:2 should have phase (0-2)mod3=1
cds:2 length=166=1mod3,  so cds:3 should have phase (1-1)mod3=0
cds:3 length=60=0mod3,   so cds:4 should have phase (0-0)mod3=0
cds:4 length=264=0mod3,  so cds:5 should have phase (0-0)mod3=0
cds:5 length=4774=1mod3, so cds:6 should have phase (0-1)mod3=2
cds:6 length=443=2mod3,  so cds:7 should have phase (2-2)mod3=0

- these recalculated phases are exactly those seen in an ARTEMIS
visualisation of this gene model.
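
The recalculation above can be reproduced with the alternative (cumulative-length) form of the rule. The following is a small Python sketch under the same assumptions as before (segments in 5'->3' order, inclusive coordinates); the helper name `phases_from_cumulative_length` is hypothetical:

```python
def phases_from_cumulative_length(lengths):
    """Phase(CDS_n) = -(sum of the lengths of CDS_1..CDS_{n-1}) mod 3."""
    phases = []
    total = 0  # combined length of all preceding CDS segments
    for length in lengths:
        phases.append((-total) % 3)
        total += length
    return phases


# CDS coordinates for PYU1_T008883, as listed above
coords = [
    (13459, 13616), (13643, 13808), (13893, 13952), (14050, 14313),
    (14404, 19177), (19264, 19706), (19842, 20093),
]
lengths = [end - start + 1 for start, end in coords]
print(lengths)                                 # [158, 166, 60, 264, 4774, 443, 252]
print(phases_from_cumulative_length(lengths))  # [0, 1, 0, 0, 0, 2, 0]
print(sum(lengths) % 3)                        # 0
```

The last line is the check from the general rule: the total CDS length is a multiple of 3, so the nominal phase of an eighth segment would be zero.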

The recorded phases in the original (pre-Erika) example, and in the MAKER
example above appear to have been calculated by the following rule:

Phase of CDS_1 = 0

Phase of CDS_n = -(sum(phase_k, k=1..n-1) + sum(len(CDS_k), k=1..n-1))mod3, n > 1

n      length(mod3)     phase_n    phase_{n+1}
===    ============     =======    ==========
1      158(2)            0          -(0+2)=1mod3
2      166(1)            1          -(1+1+0+2)=2mod3
3      60(0)             2          -(2+0+1+1+0+2)=0mod3
4      264(0)            0          -(0+0+2+0+1+1+0+2)=0mod3
5      4774(1)           0          -(0+1+0+0+2+0+1+1+0+2)=2mod3
6      443(2)            2          -(2+2+0+1+0+0+2+0+1+1+0+2)=1mod3

and I think they are incorrect, as can be seen by visualisation of this
feature, and considering that the first two CDS have combined length
158+166=324=0mod3, so the third CDS must start with the first base of a
codon, and so have phase 0.
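
To make the comparison concrete, the suspected faulty rule can be sketched in Python and run on the same seven segment lengths. This is only a reconstruction of the rule as guessed above, not code taken from MAKER or the specification, and the function name `suspected_faulty_phases` is invented for the illustration:

```python
def suspected_faulty_phases(lengths):
    """Phase of CDS_n = -(sum of earlier phases + sum of earlier lengths) mod 3."""
    phases = []
    for n in range(len(lengths)):
        total = sum(phases[:n]) + sum(lengths[:n])
        phases.append((-total) % 3)
    return phases


lengths = [158, 166, 60, 264, 4774, 443, 252]  # PYU1_T008883 CDS lengths
print(suspected_faulty_phases(lengths))  # [0, 1, 2, 0, 0, 2, 1]
```

The output matches the phases recorded in the MAKER file (0, 1, 2, 0, 0, 2, 1) rather than the recalculated ones (0, 1, 0, 0, 0, 2, 0), which is consistent with the earlier phases having been added into the running total by mistake.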

Cheers,

L.

--
Dr Leighton Pritchard MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:lpritc@... <[hidden email]>        w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp <http://www.scri.ac.uk/staff/leightonpritchardgpg/pgp> : 0xFEFC205C       tel:+44(0)1382 562731 x2405



------------------------------------------------------------------------------

_______________________________________________
SOng-devel mailing list
SOng-devel@...
https://lists.sourceforge.net/lists/listinfo/song-devel




Re: [maker-devel] [SO-devel] GFF3: Phase problems in CDS features

Mark Yandell
Thanks Guys!

--mark

Mark Yandell
Associate Professor of Human Genetics
Eccles Institute of Human Genetics
University of Utah
15 North 2030 East, Room 2100
Salt Lake City, UT 84112-5330
ph:801-587-7707
________________________________________
From: [hidden email] [[hidden email]] On Behalf Of Carson Holt
Sent: Tuesday, May 25, 2010 10:53 AM
To: Barry Moore; SO developers
Cc: [hidden email]; gmod-schema
Subject: Re: [maker-devel] [SO-devel] GFF3: Phase problems in CDS features

Thanks Barry, I added another fix for the minus strand as well.  Also thank you Leighton, MAKER now should produce the correct phase.

Carson
