Re: Annotation question - B. glabrata.

Monica Munoz-Torres
Hello Michael, - and everyone on the list - :) 

Sending an email directly to me works well. Unfortunately I have been out of the office due to a family emergency and conferences in the last few weeks, so it is likely that your email hid in my list of "unread" messages. Thanks for your patience. 

I checked the scaffold you mentioned (LGUN_random_Scaffold2198) and saw that VectorBase has enabled the User-created Annotation (Uc-A) area on this region. That was the first hurdle, at least in one of your previous emails, and I am glad it is now fixed. I know that VectorBase is working very hard to make the best possible resources available to the research community, and they will eventually begin migrating their data to newer versions of Apollo, which will improve the performance of their instances and likely reduce the hiccups you and others may have encountered. 

Although Colin offered information regarding how other projects have successfully added customizations to their Apollo instances, unfortunately VB is not yet one of these teams. In this case it is necessary to find alternative ways to "create" an exon and a gene model where none has been predicted. 

I used the sequence you provided in an earlier email to query the snail genome and found two high-scoring segment pairs (HSPs; I imagine you have already seen these). 

 Inline image 1

As you have pointed out, it is not possible to drag these elements into the Uc-A area to create a new annotation. There are no predicted models in the region, and none of the sequences from Diptera species, ticks or lice align to this region either. Without any elements on the evidence tracks you are left with the need to artificially create an annotation, IF and only IF there is evidence in support of this gene model.

In this case, adding the track labeled "generic RNAseq for gene prediction" shows RNAseq data in support of transcription in this region. Although the actual reads for this RNAseq data set are not available as tracks, you can see the coverage in the region, including the regions where we found the HSPs. 

Yellow shading highlights the results of the BLAT search. Blue coverage from the "generic RNAseq" track is shown below. 

     Inline image 9   Inline image 8

From here on, I would like to kindly remind you - and others reading this - that the steps I have taken below are drastic measures for conducting an annotation. You should always stick to the experimental and alignment data available to you; doing so keeps your use of Apollo simple and effective. Only in the absence of all other data should you take these more complex steps.

With that in mind... 

So, using the coordinates in the results of the BLAT search, I drafted the coordinates for a gene model containing two exons that span the region highlighted in the BLAT results. These coordinates were placed in a "Generic Feature Format Version 3 (GFF3)" text file. Apollo is able to read this type of file locally, and you can visualize the data contained in it in the form of genome elements. I used the conventions (flavor) of GFF3 that the version of Apollo at VectorBase uses (this *should* not vary too much from GFF3 elsewhere), and I named the genes following the convention that the VB Apollo instance has implemented. (GFF file also attached). 

This is what the gene coordinates look like in a GFF3 file.

Inline image 13
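
For anyone reading this without the screenshot, here is a minimal Python sketch of how a two-exon model like this could be written out as GFF3 text. Only the scaffold name comes from this thread. The coordinates, strand, IDs, and the "manual" source label are hypothetical placeholders, not the values in the attached file. The VectorBase Apollo instance may also expect slightly different feature types or attributes.

```python
# Sketch: write a two-exon gene model as GFF3 (gene -> mRNA -> exon).
# All coordinates, IDs and the "manual" source label are hypothetical.

def gene_model_to_gff3(seqid, exons, strand, gene_id, source="manual"):
    """Return GFF3 text for one gene with a single transcript.

    exons: list of (start, end) tuples, 1-based inclusive, as in GFF3.
    """
    exons = sorted(exons)
    start, end = exons[0][0], exons[-1][1]
    mrna_id = gene_id + "-RA"

    def row(ftype, s, e, attrs):
        # Nine tab-separated columns: seqid, source, type, start, end,
        # score, strand, phase, attributes.
        return "\t".join([seqid, source, ftype, str(s), str(e),
                          ".", strand, ".", attrs])

    lines = [
        "##gff-version 3",
        row("gene", start, end, "ID=" + gene_id),
        row("mRNA", start, end, "ID=%s;Parent=%s" % (mrna_id, gene_id)),
    ]
    for i, (s, e) in enumerate(exons, 1):
        attrs = "ID=%s-E%d;Parent=%s" % (mrna_id, i, mrna_id)
        lines.append(row("exon", s, e, attrs))
    return "\n".join(lines) + "\n"


gff3_text = gene_model_to_gff3(
    "LGUN_random_Scaffold2198",
    exons=[(1000, 1120), (1500, 1650)],  # placeholder HSP coordinates
    strand="+",
    gene_id="draft_gene_1",
)
print(gff3_text)
```

Saving the printed text to a file ending in .gff gives you something you can try loading through the "File" > "Open" dialog described next.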

I used the "File" Menu in Apollo to select the option to "Open" and load the local files. 

Inline image 14


And I loaded the GFF3 from my computer using the dialog box. Click on "Select Files...", choose your file, and load. Once it loaded, I clicked on the "Open" rectangular button at the bottom of the box. 

Inline image 15

This is the result: a track is now visible and labeled with the name of the file I used. You can then select that gene model (by double-clicking the genomic element) and drag it up into the Uc-A. You can also change the name of the file to make the label shrink - later versions of Apollo allow you to temporarily remove all the track labels. 

Inline image 11

As expected, it is evident that the results of the BLAT search produced aligned regions that do not necessarily correspond with the exact coordinates of exons with canonical boundaries, so it will be necessary to correct those boundaries. Later versions of Apollo can do this automatically. For the time being, remember the rule of thumb for "canonical splice sites". In case you need a refresher, this is what they will look like in Apollo:

Inline image 16 
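
As a text version of that refresher: on the coding strand, a canonical intron starts with GT (the donor site) and ends with AG (the acceptor site). The small Python sketch below checks this for a gene on the + strand. The toy sequence and exon coordinates are invented for illustration, and the function names are my own, not part of Apollo.

```python
# Sketch: check introns for canonical GT...AG boundaries (+ strand only).
# The sequence and exon coordinates below are made up for illustration.

def intron_boundaries(seq, exons):
    """Yield (intron_start, intron_end, donor, acceptor) for each intron.

    seq: forward-strand DNA string.
    exons: 1-based inclusive (start, end) tuples for a + strand gene.
    """
    exons = sorted(exons)
    for (_, left_end), (right_start, _) in zip(exons, exons[1:]):
        intron = seq[left_end:right_start - 1]  # bases between the two exons
        yield (left_end + 1, right_start - 1,
               intron[:2].upper(), intron[-2:].upper())


def is_canonical(donor, acceptor):
    """True when the intron begins with GT and ends with AG."""
    return donor == "GT" and acceptor == "AG"


toy_seq = "ATGGCC" + "GTAAGTTTTCAG" + "GGATAA"  # exon + intron + exon
toy_exons = [(1, 6), (19, 24)]
for start, end, donor, acceptor in intron_boundaries(toy_seq, toy_exons):
    print(start, end, donor, acceptor, is_canonical(donor, acceptor))
```

For a gene on the - strand you would first reverse-complement the sequence (or, equivalently, look for CT...AC on the forward strand).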


To see where to start with the edits, I retrieved the resulting amino acid sequence without modifying any of the boundaries and used it to query NCBI's non-redundant protein collection on their website (http://blast.ncbi.nlm.nih.gov/Blast.cgi). A small fragment of the conserved domain from the "Variant erythrocyte surface antigen-1" family is visible, and the amino acid string that I used has significant sequence similarity to the sequences available for metallothionein from other organisms. 

Inline image 17

Evidently this is only a small fragment of a protein. You can tell that there are at least two other regions with RNAseq coverage (according to the "generic" track), so I would start investigating whether the protein can be extended to cover those regions, and what the results of those edits are. You will have to risk getting a little "creative" with how you add exons to this gene model because, as noted above, there is no evidence that you can "drag up" into the Uc-A area. For example, I would duplicate the first gene model, delete one of the exons, and modify the other one to match the coordinates of the areas with RNAseq coverage around these two HSPs, and so on. 

Inline image 18
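
If you can get per-base coverage values for the region (an assumption; on this instance only the coverage track is displayed), a sketch like the following could list the covered stretches as candidate exon coordinates. The coverage numbers, threshold and starting coordinate here are invented.

```python
# Sketch: find contiguous regions with coverage at or above a threshold.
# The coverage values, threshold and offset are invented for illustration.

def covered_regions(coverage, threshold=1, offset=1):
    """Return 1-based inclusive (start, end) runs where depth >= threshold.

    coverage: per-base depth values.
    offset: genomic coordinate of coverage[0].
    """
    regions = []
    run_start = None
    for i, depth in enumerate(coverage):
        if depth >= threshold and run_start is None:
            run_start = i  # a covered run begins here
        elif depth < threshold and run_start is not None:
            regions.append((run_start + offset, i - 1 + offset))
            run_start = None
    if run_start is not None:  # run extends to the end of the window
        regions.append((run_start + offset, len(coverage) - 1 + offset))
    return regions


toy_coverage = [0, 0, 5, 7, 6, 0, 0, 3, 4, 0]
print(covered_regions(toy_coverage, threshold=3, offset=101))
# → [(103, 105), (108, 109)]
```

Treat such intervals only as rough starting points; coverage edges rarely coincide with true splice sites.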

A word of warning: please keep reviewing the resulting gene models at every step by retrieving the amino acid sequences and comparing them to sequences available in public databases. Also, be conservative when adjusting exon/intron boundaries to find the canonical splice sites that are best supported by the RNAseq data. Always check that the integrity and accuracy of the protein have been preserved after each edit. 
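
A quick offline sanity check along these lines can complement the database searches. This is only a sketch: it assumes a + strand gene whose exons are entirely coding (no UTRs), uses the standard genetic code, and runs on a made-up toy sequence. The function names are hypothetical.

```python
# Sketch: splice exon sequences, translate, and flag problems after an edit.
# Assumes a + strand gene with fully coding exons; toy data is invented.

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order.
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def spliced_cds(seq, exons):
    """Join exons (1-based inclusive coordinates) into one coding string."""
    return "".join(seq[s - 1:e] for s, e in sorted(exons)).upper()


def translate(cds):
    """Translate complete codons; unknown codons (e.g. with N) become X."""
    usable = len(cds) - len(cds) % 3
    return "".join(CODON_TABLE.get(cds[i:i + 3], "X")
                   for i in range(0, usable, 3))


def check_model(seq, exons):
    """Return (protein, list of problems found) for a draft gene model."""
    cds = spliced_cds(seq, exons)
    protein = translate(cds)
    problems = []
    if len(cds) % 3:
        problems.append("length is not a multiple of 3")
    if "*" in protein[:-1]:
        problems.append("internal stop codon")
    return protein, problems


toy_seq = "ATGGCC" + "GTAAGTTTTCAG" + "GGATAA"
print(check_model(toy_seq, [(1, 6), (19, 24)]))  # clean model
print(check_model(toy_seq, [(1, 6), (17, 24)]))  # boundary moved by 2 bases
```

Running the check before and after each boundary edit makes it obvious when a change has shifted the reading frame.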

I am leaving the fragment of the gene model I created in the region so you can use it as a starting point. Now you are aware of how to artificially "create" a gene model when none has been predicted in a region of interest, provided that both sequence similarity searches and experimental data support your decision. 

I hope this answer is helpful. Please do not hesitate to contact us should you have any additional questions.

cheers, 
~moni. 



--
Mentorship Matters!
--
Monica Munoz-Torres, PhD.
Berkeley Bioinformatics Open-source Projects (BBOP)
Environmental Genomics and Functional Biology Division
Lawrence Berkeley National Laboratory

Mailing Address:
Lawrence Berkeley National Laboratory
1 Cyclotron Road Mailstop 977
Berkeley, CA 94720




On Fri, Aug 21, 2015 at 3:02 AM, Niederwanger, Michael <[hidden email]> wrote:

Hello,

 

Some time ago I had (and still have) issues annotating a gene in Apollo, but I have to finish it soon. It’s all there, but I am simply not able to annotate it.

 

[...]To remember what I am talking about

Part of my email:

 

In the mails from the Apollo list there was the suggestion to simply drag the BLAT results into the User-created Annotations area within Apollo. However, this does not work, or I’m too stupid. There was also a suggestion that every alignment file can be used as a template. I used Clustal W for alignments, but how can I get a file format out of this which can be opened in Apollo? Or do I need a different alignment program? I couldn’t drag or use the alignment files from VectorBase either.

I’m at a loss what to do.

If possible I would like to send you a partial sequence (or full sequence) of the gene in question (Word file or FASTA), from which I would be able to continue annotating. When using this for a BLAT search in Apollo it gives you a 100% match and nicely shows the regions within Apollo (LGUN_random_Scaffold…). And this is where it ends (where I end). I am not able to continue. Can you help me convert this sequence into a format that can be recognized by Apollo? Or how would you do it?

 

It would be awesome if you or someone could help me with that,

 

Kind  Regards

Michael Niederwanger

 

 

 






This list is for the Apollo Annotation Editing Tool. Info at http://genomearchitect.org/
If you wish to unsubscribe from the Apollo List: 1. From the address with which you subscribed to the list, send a message to [hidden email] | 2. In the subject line of your email type: unsubscribe apollo | 3. Leave the message body blank.

Monica Munoz-Torres
aha! the file... 

On Fri, Aug 21, 2015 at 5:41 PM, Monica Munoz-Torres <[hidden email]> wrote:
[...]

LGUN_random_Scaffold2198-LGUN_random_Scaffold2198-Niederwanger_MMT-rev.gff (1K) Download Attachment