|
Hi,
I've come across an inconsistency when loading multipart features into CHADO from GFF3.

The GFF3 spec (http://www.sequenceontology.org/gff3.shtml) allows for defining several parts to the same feature by giving them the same ID, e.g.:

ctg123 . cDNA_match 1050 1500 5.8e-42 + . ID=match00001;Target=cdna0123 12 462
ctg123 . cDNA_match 5000 5500 8.1e-43 + . ID=match00001;Target=cdna0123 463 963
ctg123 . cDNA_match 7000 9000 1.4e-40 + . ID=match00001;Target=cdna0123 964 2964

And this is represented correctly in GBROWSE, for example, when using the in-memory adaptor for a GFF3 file directly.

However, when uploading features that are structured like this into CHADO with gmod_bulk_load_gff3.pl, the IDs of sub-parts of features are changed to include the feature_id, to avoid clashes:

ID=match00001
ID=match00001-1
ID=match00001-2

And this breaks the relationship between the features, as far as the schema is concerned.

For features like match, I can break these down into match_part subfeatures to get around this. But if there are other features that, for example, span exons but can't really be argued to have any representation in the introns, and that also do not have subfeatures already defined in the SO, this is a problem.

Is there a recommended way to get around this that I've missed, somewhere?

I guess that, eventually, gmod_bulk_load_gff3.pl will support this part of the spec directly, but in the meantime are there any good ideas out there for what I could do (other than point GBROWSE to a GFF3 copy of the data - we'd like to have the annotation in CHADO for use with ARTEMIS)?

Cheers,

L.

--
Dr Leighton Pritchard MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:[hidden email] w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405

______________________________________________________
SCRI, Invergowrie, Dundee, DD2 5DA.
The Scottish Crop Research Institute is a charitable company limited by guarantee.
Registered in Scotland No: SC 29367.
Recognised by the Inland Revenue as a Scottish Charity No: SC 006662.

DISCLAIMER:

This email is from the Scottish Crop Research Institute, but the views expressed by the sender are not necessarily the views of SCRI and its subsidiaries. This email and any files transmitted with it are confidential to the intended recipient at the e-mail address to which it has been addressed. It may not be disclosed or used by any other than that addressee.
If you are not the intended recipient you are requested to preserve this confidentiality and you must not use, disclose, copy, print or rely on this e-mail in any way. Please notify [hidden email] quoting the name of the sender and delete the email from your system.

Although SCRI has taken reasonable precautions to ensure no viruses are present in this email, neither the Institute nor the sender accepts any responsibility for any viruses, and it is your responsibility to scan the email and the attachments (if any).
______________________________________________________

_______________________________________________
Gmod-schema mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/gmod-schema
|
|
Hi Leighton,
While it is possible that the bulk loader will be modified to support this part of the GFF3 spec, it is not on my immediate time line. This is at the very end of the (rather long) perldoc for gmod_bulk_load_gff3.pl:

  Grouping features by ID
    The GFF3 specification allows features like CDSes and match_parts
    to be grouped together by sharing the same ID. This loader does
    not support this method of grouping. Instead the parent feature
    must be explicitly created before the parts and the parts must
    refer to the parent with the Parent tag.

So your GFF should look like this:

ctg123 . cDNA_match 1050 9000 . + . ID=match00001
ctg123 . match_part 1050 1500 5.8e-42 + . Parent=match00001;Target=cdna0123 12 462
ctg123 . match_part 5000 5500 8.1e-43 + . Parent=match00001;Target=cdna0123 463 963
ctg123 . match_part 7000 9000 1.4e-40 + . Parent=match00001;Target=cdna0123 964 2964

Scott

On Thu, Jun 17, 2010 at 10:39 AM, Leighton Pritchard <[hidden email]> wrote:
> Is there a recommended way to get around this that I've missed, somewhere?

--
------------------------------------------------------------------------
Scott Cain, Ph. D. scott at scottcain dot net
GMOD Coordinator (http://gmod.org/) 216-392-3087
Ontario Institute for Cancer Research
|
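For anyone with a lot of GFF3 that groups parts by shared ID, the rewrite Scott describes can be scripted. The following is only a rough Python sketch, not part of gmod_bulk_load_gff3.pl or any GMOD tool. The function name split_shared_ids and the part_type parameter are made up for illustration. It assumes tab-separated columns (as the GFF3 spec requires), that every input line is a feature line carrying an ID attribute, and that the parent should span the lowest start to the highest end of its parts with score and phase set to ".", as in Scott's example.

```python
def split_shared_ids(gff_lines, part_type="match_part"):
    """Rewrite features that share an ID as one parent plus Parent-linked parts.

    Hypothetical helper: assumes tab-separated feature lines, each with an ID.
    """
    groups = {}  # ID -> list of (columns, attributes), in first-seen order
    for line in gff_lines:
        cols = line.rstrip("\n").split("\t")
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if kv)
        groups.setdefault(attrs["ID"], []).append((cols, attrs))

    out = []
    for feature_id, rows in groups.items():
        if len(rows) == 1:
            # An ID used only once needs no rewriting.
            out.append("\t".join(rows[0][0]))
            continue
        first = rows[0][0]
        start = min(int(cols[3]) for cols, _ in rows)
        end = max(int(cols[4]) for cols, _ in rows)
        # Parent feature: created before its parts, carries only the ID.
        out.append("\t".join(
            first[:3] + [str(start), str(end), ".", first[6], ".", f"ID={feature_id}"]
        ))
        for cols, attrs in rows:
            # Each part drops its ID and points at the parent instead.
            rest = ";".join(f"{k}={v}" for k, v in attrs.items() if k != "ID")
            new_attrs = f"Parent={feature_id}" + (";" + rest if rest else "")
            out.append("\t".join(cols[:2] + [part_type] + cols[3:8] + [new_attrs]))
    return out


if __name__ == "__main__":
    example = [
        "ctg123\t.\tcDNA_match\t1050\t1500\t5.8e-42\t+\t.\tID=match00001;Target=cdna0123 12 462",
        "ctg123\t.\tcDNA_match\t5000\t5500\t8.1e-43\t+\t.\tID=match00001;Target=cdna0123 463 963",
        "ctg123\t.\tcDNA_match\t7000\t9000\t1.4e-40\t+\t.\tID=match00001;Target=cdna0123 964 2964",
    ]
    for row in split_shared_ids(example):
        print(row)
```

The part type is left as a parameter because match_part only makes sense for match features; which SO term is appropriate for the parts of other feature types is a separate judgement call.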
|
Hi Scott,
On 17/06/2010 Thursday, June 17, 16:01, "Scott Cain" <[hidden email]> wrote:

> While it is possible that the bulk loader will be modified to support
> this part of the GFF3 spec, it is not on my immediate time line. This
> is at the very end of the (rather long) perldoc for
> gmod_bulk_load_gff3.pl:
>
> Grouping features by ID
> The GFF3 specification allows features like CDSes and match_parts
> to be grouped together by sharing the same ID. This loader does
> not support this method of grouping. Instead the parent feature
> must be explicitly created before the parts and the parts must
> refer to the parent with the Parent tag.

Ah... I missed that - thanks for the pointer. I wouldn't think that that's entirely trivial to sort out, so I'll not hold my breath ;)

> So your GFF should look like this:
>
> ctg123 . cDNA_match 1050 9000 . + . ID=match00001
> ctg123 . match_part 1050 1500 5.8e-42 + . Parent=match00001;Target=cdna0123 12 462
> ctg123 . match_part 5000 5500 8.1e-43 + . Parent=match00001;Target=cdna0123 463 963
> ctg123 . match_part 7000 9000 1.4e-40 + . Parent=match00001;Target=cdna0123 964 2964

That's what I've been doing for matches - even for Pfam hits (which could arguably be described otherwise...), and it works just fine for those.

The specific problem I've been having trouble getting my head around has been for predictions like TM domains and other annotated regions that derive from the polypeptide sequence and can readily span exons. I'm less comfortable with classifying those as a match type when there's an SO term dedicated to them, and no clear 'query' term as there is for Pfam domains (for example: transmembrane_polypeptide_region which has no obvious children).

I'm also interested in how automatically to populate the feature_relationship table to link the annotated feature and the parent gene in these cases, while still allowing for display in GBROWSE, but that's for another mailing list post, I think.

Cheers,

L.
--
Dr Leighton Pritchard MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:[hidden email] w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405
|
|
Hi,
Apologies for labouring the point, but I'm still having some conceptual trouble with this. I'd like to get our local CHADO/ARTEMIS/GBROWSE implementation as good as it can be made before putting it out to the biologists in anger - particularly as I'll have to explain/justify the way annotations are stored to them.

Thanks again for your answer yesterday, Scott, but I was hoping that there might be another answer than to stick only with match/match_part features. My problems are as follows:

1) CHADO enforces that the polypeptide which is calculated from exon features is a single featureloc, with fmin=start codon and fmax=end codon. This is fine as a policy but doesn't represent multi-exon polypeptides well in ARTEMIS, and looks a bit odd in GBROWSE. We can ignore this for the most part, and work with the CDS/exon features in those two systems, but I still find it misleading (as do our biologists) to visualise the polypeptide that derives from the CDS as a single contiguous region including introns. The same is true for features, such as transmembrane_polypeptide_region, that are a single contiguous region of polypeptide, but span two genomic locations.

2) Even if CHADO didn't enforce the above, I can't see an obvious way to group features in CHADO that are on the same level in the SO, unless the SO gives them an obvious parent that is recognised by software that sits on top of CHADO. We can't associate features by ID (as in GFF3) as there is a constraint on feature.uniquename, and we can't associate a single feature with multiple featureloc rows - at least not as primary locations. We must then make individual subfeatures and give each one its own featureloc. 'Grouping' is then by the parent feature. However, this has inconsistent results in - for example - GBROWSE, depending on the choice of SO term.

The Pfam domain represented as a match/match_part tree:

supercont1.1 PFam:PI_T30-4_FINAL_CALLGENES_3 match 40699 41485 1.10e-24 + . ID=PITG_00004:pfam_Broad:2
supercont1.1 PFam:PI_T30-4_FINAL_CALLGENES_3 match_part 40699 40771 1.10e-24 + . ID=PITG_00004:pfam_Broad:2:2;Parent=PITG_00004:pfam_Broad:2
supercont1.1 PFam:PI_T30-4_FINAL_CALLGENES_3 match_part 40827 41485 1.10e-24 + . ID=PITG_00004:pfam_Broad:2:3;Parent=PITG_00004:pfam_Broad:2

can be displayed as two linked regions in GBROWSE, but a polypeptide_domain/polypeptide_region tree

supercont1.1 PFam:PI_T30-4_FINAL_CALLGENES_3 polypeptide_domain 40699 41485 1.10e-24 + . ID=PITG_00004:pfam_Broad:x
supercont1.1 PFam:PI_T30-4_FINAL_CALLGENES_3 polypeptide_region 40699 40771 1.10e-24 + . ID=PITG_00004:pfam_Broad:x:2;Parent=PITG_00004:pfam_Broad:x
supercont1.1 PFam:PI_T30-4_FINAL_CALLGENES_3 polypeptide_region 40827 41485 1.10e-24 + . ID=PITG_00004:pfam_Broad:x:3;Parent=PITG_00004:pfam_Broad:x

cannot. They don't seem to be considered as connected terms. This could be a problem for terms like transmembrane_polypeptide_region or helix_turn_helix which don't have a declared, specific SO part structure and can span multiple sections of the reference sequence.

There are a number of other SO terms to which this kind of situation could reasonably apply, such as more or less everything under polypeptide_region, and algorithmic predictions of protein domains - and these don't seem to fit the match/match_part system, either.

I can't be the only person trying to store this kind of annotation in CHADO, but I'm willing to believe that I'm the only one having problems ;)

So - how are other users getting around this issue? Does anyone else even think it's a problem? Am I missing something obvious? I've been through the wiki, including the best practices page, but I've not found any examples of this - I'd welcome pointers to more information, also.

Thanks,

L.
--
Dr Leighton Pritchard MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:[hidden email] w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405
|
|
Aha!
Eventually it's clicked, I think. But it also took a little more modification of the CHADO adaptor for GBROWSE.

If I have, say, a TM domain that spans exons, I can set the parent feature to be a single contiguous transmembrane_polypeptide_region, and then add polypeptide_region features as children to this, where they coincide with the exons. That worked in GFF3/in-memory adaptor for GBROWSE, but not when uploaded to CHADO. The adaptor was reporting a method 'phase' error.

This turned out to be due to my using the inferCDS option. When this is set, at line 1104 in Feature.pm of the adaptor a test is made for whether a feature starts with the term 'exon' or 'polypeptide' and, if it does, the feature is placed in an array of such features; this array is used to infer the locations of the CDS/UTR features. However, plenty of other SO terms start with 'polypeptide', even if inferring CDS doesn't make any sense for them, and changing this line to

if ($inferCDS && ($feat->type =~ /exon/ or $feat->type =~ /polypeptide$/ )) {

so that only polypeptide (and not polypeptide_region etc.) features are used for the inference allows for the display of child regions of features with the correct SO term as parents with GBROWSE and CHADO.

With apologies for bugging you...

L.

--
Dr Leighton Pritchard MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:[hidden email] w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405
|
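To see which SO term names the anchored pattern lets through, here is a small Python mirror of the patched Perl test. It is only an illustration, not adaptor code. The pre-patch line is not quoted in the thread, so used_before_patch is a guess based on the description ("starts with the term 'exon' or 'polypeptide'"), and both function names are made up for this sketch.

```python
import re


def used_for_cds_inference(so_type: str) -> bool:
    """Python equivalent of the patched test: /exon/ or /polypeptide$/."""
    return bool(re.search(r"exon", so_type) or re.search(r"polypeptide$", so_type))


def used_before_patch(so_type: str) -> bool:
    """Assumed stand-in for the original test (prefix match on 'polypeptide')."""
    return bool(re.search(r"exon", so_type) or re.search(r"^polypeptide", so_type))


if __name__ == "__main__":
    terms = [
        "exon",
        "polypeptide",
        "polypeptide_region",
        "polypeptide_domain",
        "transmembrane_polypeptide_region",
    ]
    for term in terms:
        # Columns: SO term, assumed old behaviour, patched behaviour.
        print(term, used_before_patch(term), used_for_cds_inference(term))
```

With the `$` anchor, only type names that end in "polypeptide" are collected for CDS inference, so child terms such as polypeptide_region are left alone. Note that the unanchored /exon/ half of the test still matches any type name containing "exon".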
