With so many posts on the forum it's been challenging to determine what the best practices are for performing multiple rounds of annotation with Maker.
My first round used est, altest, and protein fasta files with a custom GFF repeat masked file. The resulting vertebrate genome produced 21,970 gene models with a mean length of about 9016 bp; the BUSCO score was C:66.0%[S:64.2%,D:1.8%],F:4.2%,M:29.8%,n:9226 (mammalia_odb10 set). Things seemed to be on the right track, so I set up the next Maker round using both SNAP and Augustus-trained information in the round2 maker_opts.ctl file. At the end of that second round, I noticed a marked decrease in BUSCO score (C:53.3%[S:51.0%,D:2.3%],F:11.6%,M:35.1%,n:9226), yet an increase in the number of gene models (28,646) and mean length (16266 bp).
This got me wondering whether I had set up the _opts.ctl file incorrectly. I'm concerned about a few things (and am probably missing even more that I should be concerned about!):
Here's what I did for the Round2 maker_opts.ctl file:
#-----Genome (these are always required)
#-----EST Evidence (for best results provide a file for at least one)
#-----Protein Homology Evidence (for best results provide a file for at least one)
#-----Repeat Masking (leave values blank to skip repeat masking)
prok_rm=0 #forces MAKER to repeatmask prokaryotes (no reason to change this), 1 = yes, 0 = no
softmask=1 #use soft-masking rather than hard-masking in BLAST (i.e. seg and dust filtering)
snaphmm=/scratch/dro49/myluwork/annotation/maker_rd2/snap_rd1/lu_rnd1.zff.length50_aed0.25.hmm #SNAP HMM file
augustus_species=mylu #Augustus gene prediction species model
run_evm=0 #run EvidenceModeler, 1 = yes, 0 = no
est2genome=0 #infer gene predictions directly from ESTs, 1 = yes, 0 = no
protein2genome=0 #infer predictions from protein homology, 1 = yes, 0 = no
trna=0 #find tRNAs with tRNAscan, 1 = yes, 0 = no
unmask=0 #also run ab-initio prediction programs on unmasked sequence, 1 = yes, 0 = no
allow_overlap= #allowed gene overlap fraction (value from 0 to 1, blank for default)
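One thing worth double-checking in a round-2 control file (a sketch, not taken from the poster's setup; the path below is hypothetical): MAKER can reuse the round-1 alignments instead of recomputing them by passing the round-1 master GFF3 back in with the *_pass options:

```
maker_gff=/path/to/round1.all.gff #hypothetical path to the round-1 MAKER GFF3
est_pass=1 #reuse EST alignments from maker_gff, 1 = yes, 0 = no
protein_pass=1 #reuse protein alignments from maker_gff, 1 = yes, 0 = no
rm_pass=1 #reuse repeat masking from maker_gff, 1 = yes, 0 = no
```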
Thank you for your insights and support,
maker-devel mailing list
You may need to select a subset of gene models to drive training. I find that I get the best results when I use protein2genome models only from UniProt/Swiss-Prot alignments to generate a training set, with always_complete=1 set. UniProt/Swiss-Prot is manually curated, so it is very high quality. I then select the models with the highest end-to-end completion (lowest AED). Also, if you add est_forward=1, the score column in the GFF3 will be the percent match to the original model; it's an easy way to select only models with a very high percent match. Remove models without start and stop codons. You can relax these criteria if you don't have many models, but in general you want 100-300 models to train with. Only one round of training is needed with this type of training set. The EST method requires two rounds, and I don't like it as much.
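The selection step above can be sketched as a small filter over the MAKER GFF3 (a hypothetical helper, not the author's script; it assumes mRNA rows carry an _AED attribute in column 9 and that, with est_forward=1 set, column 6 holds the percent match to the source model):

```python
# Sketch: pick mRNA IDs passing an AED and percent-match threshold.
def select_training_models(gff_lines, max_aed=0.05, min_match=95.0):
    """Return IDs of mRNA records with _AED <= max_aed and score >= min_match."""
    keep = []
    for line in gff_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "mRNA":
            continue
        # Column 9 is "key=value;key=value"; MAKER stores AED as _AED.
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        aed = float(attrs.get("_AED", "1.0"))
        score = float(cols[5]) if cols[5] != "." else 0.0
        if aed <= max_aed and score >= min_match:
            keep.append(attrs.get("ID", ""))
    return keep

gff = [
    "scf1\tmaker\tmRNA\t100\t900\t98.2\t+\t.\tID=m1;_AED=0.02",
    "scf1\tmaker\tmRNA\t2000\t2900\t80.0\t+\t.\tID=m2;_AED=0.30",
]
print(select_training_models(gff))  # -> ['m1']
```

Checking for start and stop codons would still need the transcript sequences; this only handles the AED/percent-match part of the selection.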
In some cases, model selection for training will be a mostly manual task. You can use an editor like Apollo to identify models that match the evidence well and to delete odd ones, then train on the result.
What you are seeing is likely the result of over-training. This usually happens when you use more than two rounds of training, but it can happen with just two.