How to make sequenceExporter display polypeptide amino acid sequences (like Uniprot proteins)?


Sam Hokin-3
Hiya, devs. I thought this would be an easy tweak, but I've reached the point where I'd like some help. I've loaded a bunch of data
in from a tripal.chado database, which works fine overall. However, when I try to view the FASTA for a polypeptide,
sequenceExporter throws the following exception:

org.biojava.bio.symbol.IllegalSymbolException: This tokenization doesn't contain character: 'I'
        at org.biojava.bio.seq.io.CharacterTokenization.parseTokenChar(CharacterTokenization.java:175)
        at org.biojava.bio.seq.io.CharacterTokenization$TPStreamParser.characters(CharacterTokenization.java:246)
        at org.biojava.bio.symbol.SimpleSymbolList.<init>(SimpleSymbolList.java:178)
        at org.biojava.bio.seq.DNATools.createDNA(DNATools.java:204)
        at org.intermine.bio.web.biojava.BioSequenceFactory.make(BioSequenceFactory.java:111)

Clearly InterMine is treating the polypeptide sequence as nucleotides: the trace goes through DNATools.createDNA, which rejects 'I'.
This doesn't happen for proteins imported from UniProt. But I'm at a loss as to how a UniProt-loaded Protein class knows that it
contains a full amino acid sequence while my chado-db-loaded Polypeptide class is treated as DNA.

My so_terms "polypeptide" and "polypeptide_domain" entries result in the following simple Polypeptide definition:

<class name="Polypeptide" extends="SequenceFeature" is-interface="true">
     <collection name="polypeptideDomains" referenced-type="PolypeptideDomain" reverse-reference="polypeptide"/>
</class>

I'd have hoped that InterMine would know from SO that a polypeptide is an amino acid sequence, but no go.

So, I presume that I should expand the class definition in chado-db_additions.xml, just as uniprot_additions.xml has an expanded
protein definition:

<class name="Protein" is-interface="true">
     <attribute name="isFragment" type="java.lang.Boolean"/>
     <attribute name="isUniprotCanonical" type="java.lang.Boolean"/>
     <attribute name="uniprotAccession" type="java.lang.String"/>
     <attribute name="uniprotName" type="java.lang.String"/>
     <attribute name="ecNumber" type="java.lang.String"/>
     <reference name="canonicalProtein" referenced-type="Protein" reverse-reference="isoforms" />
     <collection name="ecNumbers" referenced-type="ECNumber" reverse-reference="proteins" />
     <collection name="comments" referenced-type="Comment" />
     <collection name="components" referenced-type="Component" reverse-reference="protein" />
     <collection name="keywords" referenced-type="OntologyTerm" />
     <collection name="features" referenced-type="UniProtFeature" reverse-reference="protein" />
     <collection name="proteinDomains" referenced-type="ProteinDomain" reverse-reference="proteins" />
     <collection name="isoforms" referenced-type="Protein" reverse-reference="canonicalProtein"/>
</class>

But it's a total mystery to me how InterMine gathers from that (if from that) that a Protein sequence contains all the amino acids
rather than ACGT. So, please enlighten me! Thanks!!!

_______________________________________________
dev mailing list
[hidden email]
http://mail.intermine.org/cgi-bin/mailman/listinfo/dev

Re: How to make sequenceExporter display polypeptide amino acid sequences (like Uniprot proteins)?

vkrishna
Hi Sam,

I believe the UniProt FASTA source, which loads sequences from flat files, knows it is dealing with protein sequences because a property called fasta.sequenceType is set to "protein" in project.xml. Here is the corresponding config in FlyMine: https://github.com/intermine/intermine/blob/beta/flymine/project.xml#L66

This then tells the underlying BioJava SequenceFactory to treat the incoming sequences as amino acid residues rather than nucleotides.
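
For illustration, a FASTA source entry in project.xml with that property set might look roughly like the sketch below. Only fasta.sequenceType and its "protein" value come from this thread; the source name, taxon ID, data directory, and file name are placeholder assumptions, so check the linked FlyMine project.xml for the real values. Note that this configures the fasta source; whether the chado-db source or sequenceExporter honors an equivalent setting is not established here.

```xml
<!-- Sketch only: placeholder values, not copied from FlyMine's project.xml -->
<source name="uniprot-fasta" type="fasta">
  <!-- assumed organism: taxon ID 7227 is used here purely as an example -->
  <property name="fasta.taxonId" value="7227"/>
  <!-- model class that the loaded sequences are attached to -->
  <property name="fasta.className" value="org.intermine.model.bio.Protein"/>
  <!-- the property discussed above: treat residues as amino acids, not nucleotides -->
  <property name="fasta.sequenceType" value="protein"/>
  <!-- hypothetical data location and file name -->
  <property name="src.data.dir" location="/data/uniprot/current"/>
  <property name="fasta.includes" value="example_proteins.fasta"/>
</source>
```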

Hope this steers you in the right direction.

Thank you.

Regards,
Vivek

> On Oct 5, 2015, at 8:50 PM, Sam Hokin <[hidden email]> wrote:
