Genemark Self-Training in MAKER

katebush-2
Greetings,

I'm annotating a small eukaryotic genome and struggling a bit with how
to run GeneMark-ES in MAKER.  I've installed
genemark_es_bp_linux64_v2.3a.  It seems to be working OK, but it is
predicting far fewer gene models than AUGUSTUS and SNAP.  I'm
supplying it with a model I trained on a draft assembly of my organism
using the gm_es.pl script, and I specify that model in the gmhmm:
option under #-----Gene Prediction Options in the maker_opts.ctl
control file below.  The tutorial says that GeneMark can be
self-training, so you would not supply a model.  How would I configure
the maker_opts.ctl file to do this?

thanks!


Kathryn


#-----Genome (Required for De-Novo Annotation)
genome:/raid4/spatafora/bushleyk/Velvet/Assembly5_HybridVelvet/BESTHYBRIDS_0510/BH2Assembly5_AllF500Backbone/BH2_Filter300.fa

#-----Re-annotation Options (Only Maker derived GFF3)
genome_gff: #re-annotate genome based on this gff3 file
est_pass:0 #use ests in genome_gff: 1 = yes, 0 = no
altest_pass:0 #use alternate organism ests in genome_gff: 1 = yes, 0 = no
protein_pass:0 #use proteins in genome_gff: 1 = yes, 0 = no
rm_pass:0 #use repeats in genome_gff: 1 = yes, 0 = no
model_pass:1 #use gene models in genome_gff: 1 = yes, 0 = no
pred_pass:0 #use ab-initio predictions in genome_gff: 1 = yes, 0 = no
other_pass:0 #passthrough everything else in genome_gff: 1 = yes, 0 = no

#-----EST Evidence (you should provide a value for at least one)
est:#/raid4/spatafora/bushleyk/Velvet/Assembly4_Transcriptome/Hash31_40mer/MocklerFilteredReads/contigs_Filter300.fa#none yet
est_reads: #unassembled nextgen mRNASeq in fasta format (not fully implemented)
altest:/raid4/spatafora/bushleyk/Sequences/ESTs/Cmilitaris_ESTs.fasta #ests from an alternate species
est_gff: #EST evidence from an external gff3 file
altest_gff: #Alternate organism EST evidence from a separate gff3 file

#-----Protein Homology Evidence (you should provide a value for at least one)
protein:/raid4/spatafora/yoderr/hal_contigs_run/genomes/Fusarium_graminearum.fasta
protein_gff:  #protein homology evidence from an external gff3 file

#-----Repeat Masking (leave values blank to skip)
model_org:all #model organism for RepBase masking in RepeatMasker-default all
repeat_protein:/raid4/spatafora/bushleyk/Sequences/Repeats/Ti_TeProteins.txt #a database of transposable element proteins in fasta format
rmlib:/raid4/spatafora/bushleyk/Sequences/Repeats/Ti_RepeatLib.txt #an organism specific repeat library in fasta format
rm_gff: #repeat elements from an external gff3 file

#-----Gene Prediction Options
organism_type:eukaryotic #eukaryotic or prokaryotic. Default is eukaryotic
predictor:est2genome,snap,augustus,genemark #prediction methods for annotations (separate multiple values by ',')
unmask:1 #Also run ab-initio methods on unmasked sequence, 1 = yes, 0 = no
snaphmm:/local/cluster/spatafora/SNAP/snap/HMM/fusarium_graminearum_1.length.hmm #SNAP HMM model
gmhmm:/local/cluster/spatafora/genemark_hmm_euk.linux/Tinflatum_BH2_52310.mod #GeneMark HMM model
augustus_species:fusarium_graminearum #Augustus gene prediction model
fgenesh_par_file: #Fgenesh parameter file
model_gff: #gene models from an external gff3 file (annotation pass-through)
pred_gff: #ab-initio predictions from an external gff3 file

#-----Other Annotation Type Options (features maker doesn't recognize)
other_gff: #features to pass-through to final output from an external gff3 file

#-----External Application Specific Options
alt_peptide:C #amino acid used to replace non standard amino acids in blast databases
cpus:1 #max number of cpus to use in BLAST and RepeatMasker

#-----Maker Specific Options
evaluate:0 #run Evaluator on all annotations, 1 = yes, 0 = no
max_dna_len:100000 #length for dividing up contigs into chunks (larger values increase memory usage)
min_contig:1 #all contigs from the input genome file below this size will be skipped
min_protein:0 #all gene annotations must produce a protein of at least this many amino acids in length
softmask:1 #use soft-masked rather than hard-masked seg filtering for wublast
split_hit:10000 #length for the splitting of hits (expected max intron size for evidence alignments)
pred_flank:200 #length of sequence surrounding EST and protein evidence used to extend gene predictions
single_exon:0 #consider single exon EST evidence when generating annotations, 1 = yes, 0 = no
single_length:250 #min length required for single exon ESTs if 'single_exon' is enabled
keep_preds:0 #add non-overlapping ab-initio gene predictions to final annotation set, 1 = yes, 0 = no
map_forward:0 #try to map names and attributes forward from gff3 annotations, 1 = yes, 0 = no
retry:1 #number of times to retry a contig if there is a failure for some reason
clean_try:0 #remove all data from previous run before retrying, 1 = yes, 0 = no
clean_up:0 #removes theVoid directory with individual analysis files, 1 = yes, 0 = no
TMP: #specify a directory other than the system default temporary directory for temporary files

#-----EVALUATOR Control Options
side_thre:5
eva_window_size:70
eva_split_hit:1
eva_hspmax:100
eva_gspmax:100
enable_fathom:0
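
A control file like the one above can be sanity-checked before a run. The following is a minimal Python sketch, not part of MAKER: it assumes each option sits on a single line in the form key:value #comment, and the helper names parse_maker_ctl and missing_models are hypothetical. It reports which model options (snaphmm, gmhmm) are empty or point at a missing file while their predictor is listed in predictor:.

```python
from pathlib import Path


def parse_maker_ctl(text):
    """Parse control-file text into a dict of option -> value.

    Assumes one option per line, written as `key:value #comment`.
    """
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and section headers
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # not a key:value line
        # Everything after the first '#' is a comment.
        options[key.strip()] = rest.split("#", 1)[0].strip()
    return options


def missing_models(options):
    """List model options that are unset or not a file for enabled predictors."""
    # Option names taken from the Gene Prediction Options section.
    needs = {"snap": "snaphmm", "genemark": "gmhmm"}
    enabled = [p.strip() for p in options.get("predictor", "").split(",") if p.strip()]
    problems = []
    for predictor, opt in needs.items():
        if predictor in enabled:
            value = options.get(opt, "")
            if not value or not Path(value).is_file():
                problems.append(opt)
    return problems


# Illustrative input only; the paths are placeholders.
example = """\
#-----Gene Prediction Options
predictor:est2genome,snap,augustus,genemark#prediction methods
snaphmm: #SNAP HMM model
gmhmm:/no/such/dir/MYORG.mod#GeneMark HMM model
est:#/some/path.fa#none yet
"""
opts = parse_maker_ctl(example)
print(opts["predictor"])
print(missing_models(opts))
```

Note that a value commented out in place, as in the est: line, parses as empty, which matches how that line reads in the file above.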



_______________________________________________
maker-devel mailing list
[hidden email]
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
Re: Genemark Self-Training in MAKER

Jason Stajich-2
Hi Kathryn -

I run gm_es.pl on the whole genome or a large-ish test set.
Then I copy the es.mod file from the mod directory (created by gm_es.pl).

I usually rename it Myorganism_GMES.mod.

This file then just needs to be pointed to in your maker_opts.ctl like this:
gmhmm:/data/gene_prediction/GeneMark/MYORG.mod
snaphmm:/data/gene_prediction/SNAP/MYORG.hmm
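
The copy-and-point steps above could be scripted roughly as follows. This is only a sketch under assumptions: install_genemark_model is a hypothetical helper, the file name MYORG_GMES.mod is a placeholder, gm_es.pl is assumed to have already been run so that mod/es.mod exists, and the control file is assumed to keep gmhmm: on a single line. The demo uses a temporary directory with a dummy model file rather than real GeneMark output.

```python
import shutil
import tempfile
from pathlib import Path


def install_genemark_model(gm_es_dir, dest_mod, ctl_path):
    """Copy mod/es.mod to dest_mod and point gmhmm: in the control file at it."""
    src = Path(gm_es_dir) / "mod" / "es.mod"
    dest = Path(dest_mod)
    shutil.copyfile(src, dest)

    ctl = Path(ctl_path)
    new_lines = []
    replaced = False
    for line in ctl.read_text().splitlines():
        if line.startswith("gmhmm:"):
            # Keep any trailing '#comment', replace only the value.
            comment = line.partition(":")[2].partition("#")[2]
            line = f"gmhmm:{dest} #{comment}" if comment else f"gmhmm:{dest}"
            replaced = True
        new_lines.append(line)
    if not replaced:
        raise ValueError("no gmhmm: line found in control file")
    ctl.write_text("\n".join(new_lines) + "\n")
    return dest


# Demo with dummy files; nothing here calls GeneMark or MAKER.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "mod").mkdir()
    (tmp / "mod" / "es.mod").write_text("placeholder model\n")
    ctl = tmp / "maker_opts.ctl"
    ctl.write_text("snaphmm: #SNAP HMM model\ngmhmm: #GeneMark HMM model\n")
    dest = install_genemark_model(tmp, tmp / "MYORG_GMES.mod", ctl)
    updated = ctl.read_text()
    print(updated)
```

Only the gmhmm: line is rewritten; every other line of the control file is written back unchanged.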

-jason
katebush wrote, On 8/25/10 5:18 PM:

> Greetings,
>
> I'm annotating a small eukaryotic genome and struggling a bit with how
> to run GeneMark-ES in MAKER.  I've installed
> genemark_es_bp_linux64_v2.3a.  It seems to be working OK, but it is
> predicting far fewer gene models than AUGUSTUS and SNAP.  I'm
> supplying it with a model I trained on a draft assembly of my organism
> using the gm_es.pl script, and I specify that model in the gmhmm:
> option under #-----Gene Prediction Options in the maker_opts.ctl
> control file below.  The tutorial says that GeneMark can be
> self-training, so you would not supply a model.  How would I configure
> the maker_opts.ctl file to do this?
