evidence for MAKER vs evidence to train gene finders


evidence for MAKER vs evidence to train gene finders

Steven Sullivan
I'm confused about the use(s) of gene sequence evidence in the MAKER de novo annotation pipeline.

As I understand it, MAKER combines 1) its own BLAST alignments of user-supplied RNA ('EST evidence') and protein ('protein homology evidence') sequences to the genome assembly, with 2) models suggested by trained ab initio gene finders that run in parallel. 

The gene finders require a prior training step, and the training sub-protocol in Campbell et al 2014 (Curr. Prot. Bioinf.) assumes that no 'gold standard' gene annotation exists for a newly sequenced genome. It therefore describes an iterative (bootstrap) process whereby initial MAKER output becomes the gene finder training input for e.g. SNAP, whose output is then used in the next MAKER round.

But in my case, even before the genome was sequenced, a few hundred individual high-quality DNA/protein gene sequences for my species had already been deposited in public databases (GenBank, Swiss-Prot) by various labs over the years, to accompany various publications.

Should these be used to train gene finders prior to a MAKER run, and *also* as user-supplied 'protein homology evidence' to MAKER itself? 

Or am I misunderstanding the workflow?






_______________________________________________
maker-devel mailing list
[hidden email]
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

Re: evidence for MAKER vs evidence to train gene finders

Carson Holt-2
Training depends less on the sequence itself than on the gene structure (i.e. introns, exons, start and stop codons, etc.). You could use the deposited evidence as input to the iterative process described, but not directly, because you have the sequence but not the structure.

What MAKER does with the est2genome/protein2genome options is align the evidence to the reference, polish the alignments for correct splicing (because BLAST alignments are not splice-aware), then identify correct open reading frames with start and stop codons. The result is an intron/exon structure. The HMM for the predictor then builds probability models for moving from intron to exon states (which includes information such as the leading sequence before start codons, average intron lengths, etc.). None of that is directly available from the protein or transcript data alone. But once the evidence has been polished against the reference, the structure can be discovered.
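
To make the sequence-versus-structure distinction concrete, here is a minimal Python sketch (not MAKER's or SNAP's actual code). It assumes you already have exon coordinates for one gene from a polished alignment; the coordinates and the function name introns_from_exons are made up for illustration. It derives intron intervals and an average intron length, which is the kind of structural information a predictor is trained on and which cannot be read off a protein or transcript sequence alone.

```python
from statistics import mean


def introns_from_exons(exons):
    """Return intron intervals (1-based, inclusive) lying between exons.

    `exons` is a list of (start, end) tuples on the genome for one gene.
    """
    ordered = sorted(exons)
    introns = []
    # Walk over consecutive exon pairs; the gap between them is an intron.
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        if next_start > prev_end + 1:
            introns.append((prev_end + 1, next_start - 1))
    return introns


# Hypothetical exon coordinates from one polished evidence alignment.
exons = [(100, 250), (400, 520), (900, 1000)]

introns = introns_from_exons(exons)
lengths = [end - start + 1 for start, end in introns]

print("introns:", introns)
print("mean intron length:", mean(lengths))
```

A real training set would collect statistics like these over many genes, together with the sequence around start codons and splice sites.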

After initial training (i.e. the bootstrap run), MAKER provides hints in the form of probability bonuses when evidence alignments suggest UTR, CDS, intron, or exon. Then when the predictors run, they perform better than they would without the hints. As a result, the second round of predictions is better than the first, and can be used as training data to improve the HMM.

—Carson



> On Sep 19, 2016, at 10:21 PM, Steven Sullivan <[hidden email]> wrote:
> [quoted message trimmed]



Re: evidence for MAKER vs evidence to train gene finders

Daniel Ence
Just chiming in with my own perspective on the question. The gold-standard genes can be used as input for training the gene predictors and also as evidence for the genome annotation. Presumably you’ll have much more evidence than the gold-standard genes for the annotation, so it won’t be circular. As Carson said, the gene predictors use the structure of the aligned input rather than the sequence itself. In a true bootstrap, where you have no gold standard, another source of training input would be alignments generated by a program such as BUSCO or CEGMA that identifies conserved orthologs in the genome.

~Daniel




Daniel Ence
Graduate Student
Eccles Institute of Human Genetics
University of Utah
15 North 2030 East, Room 2100
Salt Lake City, UT 84112-5330

> On Sep 19, 2016, at 10:34 PM, Carson Holt <[hidden email]> wrote:
> [quoted message trimmed]


Re: evidence for MAKER vs evidence to train gene finders

Fields, Christopher J
I can add that BUSCO did work well as a first-pass bootstrap (with the added convenience of running Augustus for generating an initial model).  

chris

> On Sep 19, 2016, at 11:45 PM, Daniel Ence <[hidden email]> wrote:
> [quoted message trimmed]


Re: evidence for MAKER vs evidence to train gene finders

Steven Sullivan
In reply to this post by Daniel Ence
Thanks! So, for training the gene predictors, I think I'll try to identify any sequences in my gold-standard set that have structural information (i.e. genes for which the genomic sequence was cloned) and use those. But I doubt there are enough of those to train e.g. Augustus, so I'll probably have to use the bootstrap method as well. Is there a way to combine both?

For the BLAST-based annotation, if I use the entire UniProt/Swiss-Prot or GenBank FASTA sets as protein homology evidence, my gold standards are already included in those. I gather from these replies that that's not a problem.

However, there *are* public database sequences (predicted genes from an older annotation of this species) that I *do* want to exclude from evidence, because we want to run MAKER as if this genome were 'new', never before annotated. Can I use something like the -negative_gilist option in blastp to omit previous genome project predictions from consideration? (I think that option only works with GenBank sequences.) Or do I have to create a custom version of the large public database?







Re: evidence for MAKER vs evidence to train gene finders

Carson Holt-2
You would need to create a custom database without the sequences you wish to exclude.
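
One way to build such a custom database is to filter the FASTA file before giving it to MAKER. The following is only a sketch: the function name filter_fasta, the record IDs, and the rule that the ID is the first whitespace-delimited token of the header line are all assumptions, so adapt the ID matching to however your database formats its headers.

```python
def filter_fasta(lines, excluded_ids):
    """Yield FASTA lines, skipping every record whose ID is in excluded_ids.

    Assumption: the record ID is the first whitespace-delimited token
    after '>' on the header line.
    """
    keep = True
    for line in lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            record_id = tokens[0] if tokens else ""
            keep = record_id not in excluded_ids
        if keep:
            yield line


# Hypothetical records: two to keep, one old prediction to drop.
records = [
    ">gold_0001 curated protein\n",
    "MKTAYIAKQR\n",
    ">oldpred_0042 predicted protein from earlier annotation\n",
    "MAAGLLPPQW\n",
    "TTSKLV\n",
    ">gold_0002 curated protein\n",
    "MSSPEELK\n",
]
excluded = {"oldpred_0042"}

kept = list(filter_fasta(records, excluded))
print("".join(kept), end="")
```

For real files, pass an open file handle as `lines` and write the kept lines to a new FASTA file, which can then be supplied as the protein evidence.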

—Carson

> On Sep 20, 2016, at 1:28 PM, Steven Sullivan <[hidden email]> wrote:
> [quoted message trimmed]

