Extracting gene sequences from GFF3 and FASTA library for Multi-FASTA loading into Tripal


Collett, James R-2
Hello Everyone,
 
I’ve got Tripal up and running, and now I have some sequence data and MAKER output files from a collaborator that I would like to load into Chado.  Following the tutorial (v0.3b), I was able to load my GFF3 file which has contig, gene, exon, CDS, and mRNA features and a total of 91120 lines.  I’ve also been able to load the FASTA library of masked scaffolds, which I successfully synced to the contig features in Tripal.  I was also able to load the transcripts FASTA library that MAKER outputs and sync the transcript sequences with the mRNA features in Tripal.
 
I would now like to load and sync gene sequences in Tripal, but these sequences will need to be extracted from the scaffolds (contigs) FASTA library according to their coordinates for their corresponding gene features in the GFF3 file.  Is there a script somewhere in my installed Tripal, Chado, or MAKER distributions that can do this for me?
 
Also, I’m having a bit of trouble finding the right Multi-FASTA loader settings and regular expressions to correctly load my library of translated protein sequences into Tripal in order to create and sync protein feature nodes. I’ve assumed that the workflow for this operation will be quite similar to that shown in the tutorial, wherein the newly created protein features (and their sequences) are associated with their respective mRNA features.  The names and IDs of the mRNA features in the GFF3 file are identical and look like this: “CLAGR_000001-RA”.  The protein FASTA header lines have the same exact name as their respective mRNA features (“>CLAGR_000001-RA”).  Can you suggest the proper settings and regular expressions to use in this case?
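The naming assumption described above (every protein FASTA header equals the ID of an mRNA feature in the GFF3 file) can be checked before loading. The following is a minimal Python sketch, not part of Tripal, Chado, or MAKER; the function names, the contig name, and the gene ID in the sample data are hypothetical, and only the "CLAGR_000001-RA" naming pattern comes from the message.

```python
def mrna_ids(gff3_lines):
    """Collect the ID attribute of every mRNA feature in a GFF3 file."""
    ids = set()
    for line in gff3_lines:
        if line.startswith("##FASTA"):
            break  # an embedded sequence section ends the feature table
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "mRNA":
            continue
        for field in cols[8].split(";"):
            key, _, value = field.partition("=")
            if key.strip() == "ID":
                ids.add(value.strip())
    return ids


def fasta_names(fasta_lines):
    """Return the first whitespace-delimited token of each FASTA header."""
    return [line[1:].split()[0]
            for line in fasta_lines
            if line.startswith(">") and line[1:].strip()]


# Hypothetical sample data; with real files, pass open file handles instead.
gff3 = [
    "##gff-version 3\n",
    "contig1\tmaker\tgene\t3\t10\t.\t+\t.\tID=CLAGR_000001\n",
    "contig1\tmaker\tmRNA\t3\t10\t.\t+\t.\tID=CLAGR_000001-RA;Parent=CLAGR_000001\n",
]
proteins = [">CLAGR_000001-RA\n", "MKV\n", ">CLAGR_000002-RA\n", "MST\n"]

known = mrna_ids(gff3)
unmatched = [name for name in fasta_names(proteins) if name not in known]
print(unmatched)  # protein headers with no mRNA feature of the same ID
```

An empty list means every protein header has an mRNA feature with the same ID, which is the situation the loader settings in this thread rely on.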
 
Thanks,
 
Jim
 
__________________________________________________
James R. Collett, Ph.D.
Senior Scientist
Chemical and Biological Process Development Group
Energy and Environment Directorate
 
Pacific Northwest National Laboratory
902 Battelle Boulevard
P.O. Box 999, MSIN P8-60
Richland, WA  99352 USA
 
 
 

_______________________________________________
Gmod-tripal mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/gmod-tripal

Re: Extracting gene sequences from GFF3 and FASTA library for Multi-FASTA loading into Tripal

Stephen Ficklin-2
Hi Jim,

Great job on getting all your data loaded!  

If your goal is simply to show the gene sequence on the gene page, then Tripal can do that without uploading any gene sequences.  If you still have the default sidebar, you should see the gene sequence by clicking the link 'gene Colored Sequence'.  Tripal will automatically pull out the sequence based on the gene coordinates on the chromosome (or scaffold or pseudomolecule) and display it.  It also tries to do some highlighting of sub-features, which doesn't quite look its best for a gene (we need to do a bit more work on that; it looks great for an mRNA feature).

So, if you don't like the way that looks, you'll have to upload the gene sequences as you suggested.  Tripal can't generate that sequence file for you.  But if you have a GBrowse instance for your data, I believe you can use your GBrowse to download an entire track (e.g. the gene track) in FASTA format.  You can then load that file into Tripal for the gene sequences.

For loading your protein FASTA file try these settings:
1.  sequence type:  protein
2.  match type name:  unique name
3.  No need to provide any regular expressions for name and uniquename
4.  Relationship Type:  'produced_by'
5.  Regular expression (for matching the parent):  I don't think you need one, but if it complains you can use this:  ^.*$
6.  Parent Type: mRNA

So, a brief explanation.  It is okay to have two features with identical names so long as they are of different types.  In your case, you have mRNA and protein features that have the same name.  Since your FASTA file headers look like this, '>CLAGR_000001-RA', you don't have anything on the line but the feature name.  So, you don't need any regular expressions to pull out a separate name and uniquename.

However, you do want these protein sequences to have links from your mRNA pages, so that after you sync your proteins folks can get to the protein pages from the mRNA page.  The relationship fields do this for you.  The relationship type indicates that the protein is derived from a parent of type mRNA, and the regular expression is used to find the parent by using information in the header of your FASTA file.  You don't have any separate parent information in your FASTA header, but fortunately the mRNA (parent) and protein are named the same, so we can take advantage of that to match the parent.  I don't think you need a regular expression for that match, but just in case (I can't remember exactly), I provided one that should match the entire definition line.
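To illustrate what the suggested pattern does, here is a small Python sketch. It only demonstrates how `^.*$` behaves on a definition line; it does not reproduce the loader's internals (Tripal is written in PHP), and the function name and the second header are hypothetical.

```python
import re

# The pattern suggested above: it matches the whole definition line.
PARENT_RE = re.compile(r"^.*$")


def parent_name(header_line):
    """Return the text the pattern captures from a FASTA header line."""
    definition = header_line.lstrip(">").strip()
    match = PARENT_RE.match(definition)
    return match.group(0) if match else None


# Header from this thread: the whole line is the mRNA name.
print(parent_name(">CLAGR_000001-RA"))

# Hypothetical header with extra text: the whole line is captured,
# which would no longer equal the mRNA name.
print(parent_name(">CLAGR_000001-RA some description"))
```

Because the headers here contain nothing but the feature name, matching the entire line yields exactly the parent mRNA name. A header carrying extra text would need a narrower pattern.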

By the way, you mentioned that you were following the v0.3b tutorial.  Are you using Tripal v0.3b or v0.3.1b?  If v0.3b, I would recommend updating, as we have fixed several bugs in the newest version.

Hope that helps,
Stephen

 

On 11/3/2011 9:05 PM, Collett, James R wrote:

Re: Extracting gene sequences from GFF3 and FASTA library for Multi-FASTA loading into Tripal

Stephen Ficklin-2
Hi Jim,

My apologies, I think I misinformed you about GBrowse.  Unless someone else on the list knows and can correct me, I wasn't able to pull out the gene sequences from GBrowse as I told you.  But I did find a nifty Perl script that can use a GFF file with gene features and a FASTA file of the chromosomes and pull out a FASTA file of whatever type you want:

http://insects.eugenes.org/species/data/work/motifs/cdsgff2seq.pl

Stephen

On 11/4/2011 10:27 AM, Stephen Ficklin wrote:

Re: Extracting gene sequences from GFF3 and FASTA library for Multi-FASTA loading into Tripal

Scott Cain
Hi Stephen,

You are right that there is not a built-in way to dump out gene sequences, but it would be trivial to write a plugin that does it.  Bundled with GBrowse now is a protein sequence dumper that extracts the sequences for the genes in the current view, translates them, and prints them out as FASTA.  That said, doing it with a command-line tool is probably better.
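As a rough idea of what such a command-line extraction involves, here is a minimal Python sketch. It is not the Perl script linked earlier in the thread and not a GBrowse or Tripal feature. It assumes standard nine-column GFF3 with 1-based inclusive coordinates; the function names, contig name, and gene IDs are hypothetical.

```python
def read_fasta(lines):
    """Parse FASTA lines into a dict of {sequence name: sequence}."""
    seqs, name = {}, None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            seqs[name] = []
        elif name is not None:
            seqs[name].append(line)
    return {n: "".join(parts) for n, parts in seqs.items()}


COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def revcomp(seq):
    """Reverse-complement a DNA string (other characters are left as-is)."""
    return seq.translate(COMPLEMENT)[::-1]


def gene_records(gff3_lines, scaffolds, feature_type="gene"):
    """Yield (ID, sequence) for each feature of the requested type."""
    for line in gff3_lines:
        if line.startswith("##FASTA"):
            break
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != feature_type:
            continue
        seqid, start, end, strand = cols[0], int(cols[3]), int(cols[4]), cols[6]
        attrs = dict(f.split("=", 1) for f in cols[8].split(";") if "=" in f)
        # GFF3 is 1-based inclusive; Python slices are 0-based half-open.
        seq = scaffolds[seqid][start - 1:end]
        if strand == "-":
            seq = revcomp(seq)
        yield attrs.get("ID", f"{seqid}:{start}-{end}"), seq


# Hypothetical sample data; with real files, pass open file handles instead.
scaffolds = read_fasta([">contig1\n", "AACCGGTTAACC\n"])
gff3 = [
    "##gff-version 3\n",
    "contig1\tmaker\tgene\t3\t8\t.\t+\t.\tID=gene_fwd\n",
    "contig1\tmaker\tgene\t7\t12\t.\t-\t.\tID=gene_rev\n",
]
for name, seq in gene_records(gff3, scaffolds):
    print(f">{name}\n{seq}")
```

The output is a multi-FASTA stream whose header lines are the gene IDs from the GFF3 file, so the headers line up with the gene features already loaded.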

Scott


On Sat, Nov 5, 2011 at 10:07 AM, Stephen Ficklin <[hidden email]> wrote:

--
------------------------------------------------------------------------
Scott Cain, Ph. D.                                   scott at scottcain dot net
GMOD Coordinator (http://gmod.org/)                     216-392-3087
Ontario Institute for Cancer Research
