guidance for first and subsequent annotation parameters

Devon O'Rourke
With so many posts on the forum, it's been challenging to determine what the best practices are for performing multiple rounds of annotation with Maker.
My first round used EST, alt-EST, and protein FASTA files with a custom repeat-masking GFF file. The resulting annotation of this vertebrate genome contained 21,970 gene models with a mean length of about 9,016 bp; the BUSCO score was C:66.0%[S:64.2%,D:1.8%],F:4.2%,M:29.8%,n:9226 (mammalia_odb10 set). Things seemed to be on the right track, so I set up the next Maker round using both SNAP- and Augustus-trained models in the round2 maker_opts.ctl file. At the end of that second round, I noticed a marked decrease in BUSCO score (C:53.3%[S:51.0%,D:2.3%],F:11.6%,M:35.1%,n:9226), yet an increase in both the number of gene models (28,646) and their mean length (16,266 bp).
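
To make the change between rounds easier to read, the two BUSCO summary strings quoted above can be compared category by category. The following is a minimal sketch in Python; the function name `parse_busco` and the regular expression are illustrative and not part of BUSCO or MAKER.

```python
import re

# Matches the one-line BUSCO summary notation quoted in the post, e.g.
# C:66.0%[S:64.2%,D:1.8%],F:4.2%,M:29.8%,n:9226
BUSCO_RE = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_busco(summary):
    """Return the percentages (floats) and n (int) from a BUSCO summary string."""
    match = BUSCO_RE.search(summary)
    if match is None:
        raise ValueError(f"not a BUSCO summary line: {summary!r}")
    scores = {k: float(v) for k, v in match.groupdict().items() if k != "n"}
    scores["n"] = int(match.group("n"))
    return scores


# Values copied from the post (mammalia_odb10, n = 9226).
round1 = parse_busco("C:66.0%[S:64.2%,D:1.8%],F:4.2%,M:29.8%,n:9226")
round2 = parse_busco("C:53.3%[S:51.0%,D:2.3%],F:11.6%,M:35.1%,n:9226")

for key in ("C", "S", "D", "F", "M"):
    delta = round2[key] - round1[key]
    print(f"{key}: {round1[key]:5.1f}% -> {round2[key]:5.1f}% ({delta:+.1f} points)")
```

Complete BUSCOs fall by 12.7 percentage points between rounds, while fragmented BUSCOs rise by 7.4 points and missing BUSCOs by 5.3 points, so most of the loss shows up as fragmented models rather than as absent ones.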

This got me wondering whether I was setting up the maker_opts.ctl file incorrectly. I'm concerned about a few things (and maybe missing even more that I should be concerned about!?):
  • I specified the evidence to come from EST/Protein instead of using the section available under "#-----Re-annotation Using MAKER Derived GFF3". Maybe that was a fundamental mistake? What is the expected change in behavior if I moved my round1 Maker output into that category instead of using the EST/Protein Homology evidence sections as I did below?
  • I wasn't sure what to do with the RepeatMasking GFF files in Round2. The RepeatMasker GFF I included in Round1 consisted of just complex repeats (setting model_org=simple and softmask=1 to effectively only hard mask those complex areas for the initial alignments). But what should be used in Round2 - the output GFF of Round1, or the input GFF from Round1?
Here's what I did for the Round2 maker_opts.ctl file:

#-----Genome (these are always required)
genome=/scratch/dro49/myluwork/annotation/input_files/mylu_hic_rails_noMasks.fa
organism_type=eukaryotic
#-----EST Evidence (for best results provide a file for at least one)
est_gff=/scratch/dro49/myluwork/annotation/maker_rd2/mylu_rnd1.all.maker.est2genome.gff
altest_gff=/scratch/dro49/myluwork/annotation/maker_rd2/mylu_rnd1.all.maker.cdna2genome.gff
#-----Protein Homology Evidence (for best results provide a file for at least one)
protein_gff=/scratch/dro49/myluwork/annotation/maker_rd2/mylu_rnd1.all.maker.protein2genome.gff
#-----Repeat Masking (leave values blank to skip repeat masking)
rm_gff=/scratch/dro49/myluwork/annotation/maker_rd2/mylu_rnd1.all.maker.repeats.gff
prok_rm=0 #forces MAKER to repeatmask prokaryotes (no reason to change this), 1 = yes, 0 = no
softmask=1 #use soft-masking rather than hard-masking in BLAST (i.e. seg and dust filtering)
#-----Gene Prediction
snaphmm=/scratch/dro49/myluwork/annotation/maker_rd2/snap_rd1/lu_rnd1.zff.length50_aed0.25.hmm #SNAP HMM file
augustus_species=mylu #Augustus gene prediction species model
run_evm=0 #run EvidenceModeler, 1 = yes, 0 = no
est2genome=0 #infer gene predictions directly from ESTs, 1 = yes, 0 = no
protein2genome=0 #infer predictions from protein homology, 1 = yes, 0 = no
trna=0 #find tRNAs with tRNAscan, 1 = yes, 0 = no
unmask=0 #also run ab-initio prediction programs on unmasked sequence, 1 = yes, 0 = no
allow_overlap= #allowed gene overlap fraction (value from 0 to 1, blank for default)
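
When comparing control files between rounds, it can help to list which options actually carry a value and which are left blank. The sketch below is a small Python parser for the `key=value #comment` layout shown above. The function name `parse_ctl` is hypothetical, the excerpt is copied from the Round2 settings in this post, and the parser assumes that values never contain a `#` character.

```python
def parse_ctl(text):
    """Parse MAKER control-file text into {option: value}; blank values become ''."""
    options = {}
    for line in text.splitlines():
        # Drop trailing comments; section headers ("#-----...") become empty.
        line = line.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        options[key.strip()] = value.strip()
    return options


# Excerpt of the Round2 maker_opts.ctl shown in the post.
ROUND2_EXCERPT = """\
#-----EST Evidence (for best results provide a file for at least one)
est_gff=/scratch/dro49/myluwork/annotation/maker_rd2/mylu_rnd1.all.maker.est2genome.gff
#-----Protein Homology Evidence (for best results provide a file for at least one)
protein_gff=/scratch/dro49/myluwork/annotation/maker_rd2/mylu_rnd1.all.maker.protein2genome.gff
#-----Gene Prediction
augustus_species=mylu #Augustus gene prediction species model
est2genome=0 #infer gene predictions directly from ESTs, 1 = yes, 0 = no
protein2genome=0 #infer predictions from protein homology, 1 = yes, 0 = no
allow_overlap= #allowed gene overlap fraction (value from 0 to 1, blank for default)
"""

opts = parse_ctl(ROUND2_EXCERPT)
set_options = sorted(k for k, v in opts.items() if v)
blank_options = sorted(k for k, v in opts.items() if not v)

print("set:  ", ", ".join(set_options))
print("blank:", ", ".join(blank_options))
```

Running the same function on the Round1 and Round2 files and comparing the two dictionaries gives a quick view of what changed between rounds.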


Thank you for your insights and support,

Devon

--
Devon O'Rourke
Postdoctoral researcher, Northern Arizona University
Lab of Jeffrey T. Foster - https://fozlab.weebly.com/
twitter: @thesciencedork

_______________________________________________
maker-devel mailing list
[hidden email]
http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org
Re: guidance for first and subsequent annotation parameters

Carson Holt-2
You may need to select a subset of gene models to drive training. I get the best results when I generate the training set only from protein2genome models built from UniProt/Swiss-Prot alignments, with always_complete=1 set. UniProt/Swiss-Prot is manually curated, so it is very high quality. Then I select the models with the highest end-to-end completion (low AED). Also, if you add est_forward=1, the score column in the GFF3 will be the % match to the original model. It's an easy way to select only models with a very high percent match. Remove models that lack a start codon or a stop codon. You can relax these parameters if you don't have many models, but in general you want 100-300 models to train with. Only one round of training is needed with this type of training set. The EST method requires 2 rounds, and I don't like it as much.
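
The selection described above could be sketched roughly as follows in Python. This is an illustration under stated assumptions, not a MAKER tool: it assumes mRNA features carry an `_AED` attribute in column 9, that the score column holds the percent match as described for est_forward=1, and the thresholds (`max_aed`, `min_score`) and the function name `select_training_models` are hypothetical. It does not check for start and stop codons, which needs the sequence and has to be done separately.

```python
def parse_attributes(field):
    """Split a GFF3 attribute column ("k1=v1;k2=v2") into a dict."""
    return dict(item.split("=", 1) for item in field.strip().split(";") if "=" in item)


def select_training_models(gff_lines, max_aed=0.1, min_score=95.0, limit=300):
    """Return up to `limit` mRNA IDs with low AED and a high percent match.

    Thresholds are illustrative; relax them if too few models pass.
    A score of "." is treated as unknown and is not filtered on.
    """
    candidates = []
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "mRNA":
            continue
        attrs = parse_attributes(cols[8])
        if "_AED" not in attrs:
            continue
        aed = float(attrs["_AED"])
        score = None if cols[5] == "." else float(cols[5])
        if aed > max_aed:
            continue
        if score is not None and score < min_score:
            continue
        candidates.append((aed, attrs.get("ID", "")))
    candidates.sort()  # lowest AED first
    return [model_id for _, model_id in candidates[:limit]]


# Made-up example records (not from a real annotation).
example = [
    "chr1\tmaker\tmRNA\t100\t900\t99.5\t+\t.\tID=m1;Parent=g1;_AED=0.02;_eAED=0.02",
    "chr1\tmaker\tmRNA\t2000\t4000\t80.0\t-\t.\tID=m2;Parent=g2;_AED=0.01;_eAED=0.01",
    "chr1\tmaker\tmRNA\t5000\t7000\t98.0\t+\t.\tID=m3;Parent=g3;_AED=0.40;_eAED=0.40",
    "chr1\tmaker\tmRNA\t8000\t9500\t100\t+\t.\tID=m4;Parent=g4;_AED=0.00;_eAED=0.00",
    "chr1\tmaker\texon\t8000\t8500\t.\t+\t.\tID=m4:exon1;Parent=m4",
]

selected = select_training_models(example)
print(selected)
```

In the example, m2 is dropped for its low percent match and m3 for its high AED, leaving m4 and m1 in order of increasing AED.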

In some cases, model selection for training will be a mostly manual task. You can use editors like Apollo to identify models that match the evidence well and to delete odd models. Then train on that result.


What you are seeing is likely the result of over-training. It usually happens if you use more than 2 rounds of training, but it can happen with just two rounds.

—Carson

 

On Mar 20, 2020, at 5:30 AM, Devon O'Rourke <[hidden email]> wrote:
