Inconsistency uploading/representing multipart features

lpritc@scri.ac.uk
Hi,

I've come across an inconsistency when loading multipart features into CHADO
from GFF3.

The GFF3 spec (http://www.sequenceontology.org/gff3.shtml) allows several
parts of the same feature to be defined by giving them the same ID, e.g.:

ctg123 . cDNA_match 1050  1500  5.8e-42  +  . ID=match00001;Target=cdna0123 12 462
ctg123 . cDNA_match 5000  5500  8.1e-43  +  . ID=match00001;Target=cdna0123 463 963
ctg123 . cDNA_match 7000  9000  1.4e-40  +  . ID=match00001;Target=cdna0123 964 2964

And this is represented correctly in GBROWSE for example, when using the
in-memory adaptor for a GFF3 file directly.

However, when uploading features that are structured like this into CHADO
with gmod_bulk_load_gff3.pl, the IDs of sub-parts of features are changed to
include the feature_id, to avoid clashes:

ID=match00001
ID=match00001-1
ID=match00001-2

And this breaks the relationship between the features, as far as the schema
is concerned.
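
A quick way to see which features in a file would be affected is to count how often each ID occurs. The following is a minimal Python sketch, not part of the GMOD tools: the function name find_multipart_ids is made up for illustration, and it assumes well-formed, tab-separated, nine-column GFF3 with no URL-escaped separators inside attribute values.

```python
from collections import Counter


def find_multipart_ids(gff3_lines):
    """Return a dict of IDs that occur on more than one feature line."""
    counts = Counter()
    for line in gff3_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        columns = line.split("\t")
        if len(columns) != 9:
            continue  # not a feature line
        for attribute in columns[8].split(";"):
            key, _, value = attribute.partition("=")
            if key.strip() == "ID":
                counts[value.strip()] += 1
    return {feature_id: n for feature_id, n in counts.items() if n > 1}


example = [
    "ctg123\t.\tcDNA_match\t1050\t1500\t5.8e-42\t+\t.\tID=match00001;Target=cdna0123 12 462",
    "ctg123\t.\tcDNA_match\t5000\t5500\t8.1e-43\t+\t.\tID=match00001;Target=cdna0123 463 963",
    "ctg123\t.\tcDNA_match\t7000\t9000\t1.4e-40\t+\t.\tID=match00001;Target=cdna0123 964 2964",
]
print(find_multipart_ids(example))
```

Any ID reported here is one the loader would split into separately named features.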

For features like match, I can break these down into match_part subfeatures
to get around this.  But if there are other features that, for example, span
exons but can't really be argued to have any representation in the introns,
and that also do not have subfeatures already defined in the SO, this is a
problem.

Is there a recommended way to get around this that I've missed, somewhere?

I guess that, eventually, gmod_bulk_load_gff3.pl will support this part of
the spec directly, but in the meantime are there any good ideas out there
for what I could do (other than point GBROWSE to a GFF3 copy of the data -
we'd like to have the annotation in CHADO for use with ARTEMIS)?

Cheers,

L.

--
Dr Leighton Pritchard MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:[hidden email]       w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C       tel:+44(0)1382 562731 x2405


______________________________________________________
SCRI, Invergowrie, Dundee, DD2 5DA.  
The Scottish Crop Research Institute is a charitable company limited by guarantee.
Registered in Scotland No: SC 29367.
Recognised by the Inland Revenue as a Scottish Charity No: SC 006662.


DISCLAIMER:

This email is from the Scottish Crop Research Institute, but the views expressed by the sender are not necessarily the views of SCRI and its subsidiaries.  This email and any files transmitted with it are confidential to the intended recipient at the e-mail address to which it has been addressed.  It may not be disclosed or used by any other than that addressee.
If you are not the intended recipient you are requested to preserve this confidentiality and you must not use, disclose, copy, print or rely on this e-mail in any way. Please notify [hidden email] quoting the name of the sender and delete the email from your system.

Although SCRI has taken reasonable precautions to ensure no viruses are present in this email, neither the Institute nor the sender accepts any responsibility for any viruses, and it is your responsibility to scan the email and the attachments (if any).
______________________________________________________

_______________________________________________
Gmod-schema mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/gmod-schema

Re: Inconsistency uploading/representing multipart features

Scott Cain
Hi Leighton,

While it is possible that the bulk loader will be modified to support
this part of the GFF3 spec, it is not on my immediate timeline.  This
is at the very end of the (rather long) perldoc for
gmod_bulk_load_gff3.pl:

Grouping features by ID
           The GFF3 specification allows features like CDSes and match_parts
           to be grouped together by sharing the same ID.  This loader does
           not support this method of grouping.  Instead the parent feature
           must be explicitly created before the parts and the parts must
           refer to the parent with the Parent tag.

So your GFF should look like this:

ctg123 . cDNA_match 1050  9000  .  +  . ID=match00001
ctg123 . match_part 1050  1500  5.8e-42  +  . Parent=match00001;Target=cdna0123 12 462
ctg123 . match_part 5000  5500  8.1e-43  +  . Parent=match00001;Target=cdna0123 463 963
ctg123 . match_part 7000  9000  1.4e-40  +  . Parent=match00001;Target=cdna0123 964 2964
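
The rewrite above can be automated. What follows is only a sketch in Python under stated assumptions, not something shipped with the bulk loader: the helper names split_attributes and to_parent_child are hypothetical, the input is assumed to contain only tab-separated, nine-column feature lines (no comments or directives), and output is grouped by ID rather than kept in the original line order.

```python
from collections import OrderedDict


def split_attributes(column9):
    """Parse 'key=value;key=value' into an ordered dict (no URL decoding)."""
    pairs = OrderedDict()
    for item in column9.split(";"):
        if item:
            key, _, value = item.partition("=")
            pairs[key] = value
    return pairs


def to_parent_child(gff3_lines, child_type="match_part"):
    """Rewrite features that share an ID as one parent plus Parent-linked parts."""
    groups = OrderedDict()
    for line in gff3_lines:
        columns = line.rstrip("\n").split("\t")
        attributes = split_attributes(columns[8])
        groups.setdefault(attributes.get("ID"), []).append((columns, attributes))

    output = []
    for feature_id, members in groups.items():
        if feature_id is None or len(members) == 1:
            # Features without an ID, or with a unique ID, pass through unchanged.
            output.extend("\t".join(columns) for columns, _ in members)
            continue
        first = members[0][0]
        start = min(int(columns[3]) for columns, _ in members)
        end = max(int(columns[4]) for columns, _ in members)
        # The parent spans all parts; score and phase are left undefined (".").
        output.append("\t".join([first[0], first[1], first[2], str(start),
                                 str(end), ".", first[6], ".", "ID=" + feature_id]))
        for columns, attributes in members:
            rest = ";".join(k + "=" + v for k, v in attributes.items() if k != "ID")
            column9 = "Parent=" + feature_id + (";" + rest if rest else "")
            output.append("\t".join(columns[:2] + [child_type] + columns[3:8] + [column9]))
    return output


shared_id_lines = [
    "ctg123\t.\tcDNA_match\t1050\t1500\t5.8e-42\t+\t.\tID=match00001;Target=cdna0123 12 462",
    "ctg123\t.\tcDNA_match\t5000\t5500\t8.1e-43\t+\t.\tID=match00001;Target=cdna0123 463 963",
    "ctg123\t.\tcDNA_match\t7000\t9000\t1.4e-40\t+\t.\tID=match00001;Target=cdna0123 964 2964",
]
for converted in to_parent_child(shared_id_lines):
    print(converted)
```

The child_type argument is there because match_part only suits match features; other parent types would need a different child term.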

Scott


On Thu, Jun 17, 2010 at 10:39 AM, Leighton Pritchard <[hidden email]> wrote:

> [...]



--
------------------------------------------------------------------------
Scott Cain, Ph. D.                                   scott at scottcain dot net
GMOD Coordinator (http://gmod.org/)                     216-392-3087
Ontario Institute for Cancer Research


Re: Inconsistency uploading/representing multipart features

lpritc@scri.ac.uk
Hi Scott,

On Thursday, 17 June 2010, at 16:01, "Scott Cain" <[hidden email]> wrote:

> While it is possible that the bulk loader will be modified to support
> this part of the GFF3 spec, it is not on my immediate time line.  This
> is at the very end of the (rather long) perldoc for
> gmod_bulk_load_gff3.pl:
>
> Grouping features by ID
>            The GFF3 specification allows features like CDSes and match_parts
>            to be grouped together by sharing the same ID.  This loader does
>            not support this method of grouping.  Instead the parent feature
>            must be explicitly created before the parts and the parts must
>            refer to the parent with the Parent tag.

Ah... I missed that - thanks for the pointer.  I wouldn't think that that's
entirely trivial to sort out, so I'll not hold my breath ;)
 
> So your GFF should look like this:
>
> ctg123 . cDNA_match 1050  9000  .  +  . ID=match00001
> ctg123 . match_part 1050  1500  5.8e-42  +  . Parent=match00001;Target=cdna0123 12 462
> ctg123 . match_part 5000  5500  8.1e-43  +  . Parent=match00001;Target=cdna0123 463 963
> ctg123 . match_part 7000  9000  1.4e-40  +  . Parent=match00001;Target=cdna0123 964 2964

That's what I've been doing for matches - even for Pfam hits (which could
arguably be described otherwise...), and it works just fine for those.

The specific problem I've been having trouble getting my head around has
been for predictions like TM domains and other annotated regions that derive
from the polypeptide sequence and can readily span exons.  I'm less
comfortable with classifying those as a match type when there's an SO term
dedicated to them, and no clear 'query' term as there is for Pfam domains
(for example: transmembrane_polypeptide_region which has no obvious
children).  I'm also interested in how automatically to populate the
feature_relationship table to link the annotated feature and the parent gene
in these cases, while still allowing for display in GBROWSE, but that's for
another mailing list post, I think.

Cheers,

L.


More on multipart features

lpritc@scri.ac.uk
Hi,

Apologies for labouring the point, but I'm still having some conceptual
trouble with this.  I'd like to get our local CHADO/ARTEMIS/GBROWSE
implementation as good as it can be made before putting it out to the
biologists in anger - particularly as I'll have to explain/justify the way
annotations are stored to them.

Thanks again for your answer yesterday, Scott, but I was hoping that there
might be another answer than to stick only with match/match_part features.

My problems are as follows:

1) CHADO enforces that the polypeptide which is calculated from exon
features is a single featureloc, with fmin=start codon and fmax=end codon.
This is fine as a policy but doesn't represent multi-exon polypeptides well
in ARTEMIS, and looks a bit odd in GBROWSE.  We can ignore this for the most
part, and work with the CDS/exon features in those two systems, but I still
find it misleading (as do our biologists) to visualise the polypeptide that
derives from the CDS as a single contiguous region including introns.  The
same is true for features, such as transmembrane_polypeptide_region that are
a single contiguous region of polypeptide, but span two genomic locations.

2) Even if CHADO didn't enforce the above, I can't see an obvious way to
group features in CHADO that are on the same level in the SO, unless the SO
gives them an obvious parent that is recognised by software that sits on top
of CHADO.  We can't associate features by ID (as in GFF3) as there is a
constraint on feature.uniquename, and we can't associate a single feature
with multiple featureloc rows - at least not as primary locations.  We must
then make individual subfeatures and give each one its own featureloc.
'Grouping' is then by the parent feature.  However, this has inconsistent
results in - for example - GBROWSE, depending on the choice of SO term.  The
Pfam domain represented as a match/match_part tree:

supercont1.1    PFam:PI_T30-4_FINAL_CALLGENES_3    match    40699    41485    1.10e-24    +    .    ID=PITG_00004:pfam_Broad:2
supercont1.1    PFam:PI_T30-4_FINAL_CALLGENES_3    match_part    40699    40771    1.10e-24    +    .    ID=PITG_00004:pfam_Broad:2:2;Parent=PITG_00004:pfam_Broad:2
supercont1.1    PFam:PI_T30-4_FINAL_CALLGENES_3    match_part    40827    41485    1.10e-24    +    .    ID=PITG_00004:pfam_Broad:2:3;Parent=PITG_00004:pfam_Broad:2

can be displayed as two linked regions in GBROWSE, but a
polypeptide_domain/polypeptide_region tree

supercont1.1    PFam:PI_T30-4_FINAL_CALLGENES_3    polypeptide_domain    40699    41485    1.10e-24    +    .    ID=PITG_00004:pfam_Broad:x
supercont1.1    PFam:PI_T30-4_FINAL_CALLGENES_3    polypeptide_region    40699    40771    1.10e-24    +    .    ID=PITG_00004:pfam_Broad:x:2;Parent=PITG_00004:pfam_Broad:x
supercont1.1    PFam:PI_T30-4_FINAL_CALLGENES_3    polypeptide_region    40827    41485    1.10e-24    +    .    ID=PITG_00004:pfam_Broad:x:3;Parent=PITG_00004:pfam_Broad:x

cannot.  They don't seem to be considered as connected terms.  This could be
a problem for terms like transmembrane_polypeptide_region or
helix_turn_helix which don't have a declared, specific SO part structure and
can span multiple sections of the reference sequence.

There are a number of other SO terms to which this kind of situation could
reasonably apply, such as more or less everything under polypeptide_region,
and algorithmic predictions of protein domains - and these don't seem to fit
the match/match_part system, either.

I can't be the only person trying to store this kind of annotation in CHADO,
but I'm willing to believe that I'm the only one having problems ;)  So -
how are other users getting around this issue?  Does anyone else even think
it's a problem?  Am I missing something obvious?  I've been through the
wiki, including the best practices page, but I've not found any examples of
this - I'd welcome pointers to more information, also.

Thanks,

L.


Re: More on multipart features

lpritc@scri.ac.uk
Aha!

Eventually it's clicked, I think.  But it also took a little more
modification of the CHADO adaptor for GBROWSE.

If I have, say, a TM domain that spans exons, I can set the parent feature
to be a single contiguous transmembrane_polypeptide_region, and then add
polypeptide_region features as children to this, where they coincide with
the exons.
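
The approach just described amounts to clipping the contiguous region against the exon coordinates. A rough Python sketch of that step follows, with assumptions flagged: the function names, the source label "example", the feature ID "example_tm_1" and the exon coordinates are all invented for illustration (the exons were chosen so that the children reproduce the 40699-40771 and 40827-41485 parts seen earlier in this thread). Coordinates are 1-based and inclusive, as in GFF3.

```python
def clip_to_exons(region_start, region_end, exons):
    """Return the parts of [region_start, region_end] that overlap each exon."""
    parts = []
    for exon_start, exon_end in sorted(exons):
        start = max(region_start, exon_start)
        end = min(region_end, exon_end)
        if start <= end:  # keep only exons that actually overlap the region
            parts.append((start, end))
    return parts


def region_with_children(seqid, source, parent_type, feature_id,
                         region_start, region_end, exons, strand="+",
                         child_type="polypeptide_region"):
    """Build GFF3 rows: one contiguous parent plus one child per exon overlap."""
    def row(feature_type, start, end, attributes):
        return "\t".join([seqid, source, feature_type, str(start), str(end),
                          ".", strand, ".", attributes])

    rows = [row(parent_type, region_start, region_end, "ID=" + feature_id)]
    parts = clip_to_exons(region_start, region_end, exons)
    for number, (start, end) in enumerate(parts, start=1):
        attributes = "ID={0}:{1};Parent={0}".format(feature_id, number)
        rows.append(row(child_type, start, end, attributes))
    return rows


hypothetical_exons = [(40500, 40771), (40827, 41600)]  # invented coordinates
tm_rows = region_with_children("supercont1.1", "example",
                               "transmembrane_polypeptide_region",
                               "example_tm_1", 40699, 41485, hypothetical_exons)
for tm_row in tm_rows:
    print(tm_row)
```

Each child carries its own unique ID, so nothing here relies on the shared-ID grouping that the loader does not support.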

That worked in GFF3/in-memory adaptor for GBROWSE, but not when uploaded to
CHADO.  The adaptor was reporting a method 'phase' error.  This turned out
to be due to my using the inferCDS option.  When this is set, at line 1104
in Feature.pm of the adaptor a test is made for whether a feature starts
with the term 'exon' or 'polypeptide' and, if it does, the feature is placed
in an array of such features; this array is used to infer the locations of
the CDS/UTR features.  However, plenty of other SO terms start with
'polypeptide', even if inferring CDS doesn't make any sense for them, and
changing this line to

      if ($inferCDS && ($feat->type =~ /exon/ or $feat->type =~ /polypeptide$/ )) {

so that only polypeptide (and not polypeptide_region etc.) features are used
for the inference allows for the display of child regions of features with
the correct SO term as parents with GBROWSE and CHADO.
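
The effect of the added '$' anchor can be checked outside the adaptor. The sketch below uses Python's re module as a stand-in for the Perl match, which behaves the same way for these simple patterns. It assumes that the unmodified test was an unanchored match on 'polypeptide'; the original line is not quoted in this thread, so that pattern is a guess, and the function name is made up.

```python
import re

UNANCHORED = re.compile(r"polypeptide")   # assumed form of the original test
ANCHORED = re.compile(r"polypeptide$")    # modified test: type must end in 'polypeptide'


def used_for_cds_inference(feature_type, polypeptide_pattern):
    """Mimic the condition: the type matches /exon/ or the polypeptide pattern."""
    return bool(re.search(r"exon", feature_type)
                or polypeptide_pattern.search(feature_type))


so_types = ["exon", "polypeptide", "polypeptide_region", "polypeptide_domain",
            "transmembrane_polypeptide_region"]
for so_type in so_types:
    print(so_type,
          used_for_cds_inference(so_type, UNANCHORED),
          used_for_cds_inference(so_type, ANCHORED))
```

With the anchored pattern only exon and polypeptide are selected. Note that /exon/ itself is still unanchored, so any type name containing 'exon' would continue to be picked up.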

With apologies for bugging you...

L.
