Re-annotation, fewer gene predictions

morgan sobol
Hello, 

I previously used Maker to annotate two different fungal genomes that were created using Illumina sequences only. For these genomes, I had over 11,000 genes predicted. 
I recently obtained PacBio sequences for the same genomes, so I created two hybrid assemblies. Both assemblies were very similar to the Illumina-only assemblies in length and in the number of complete orthologs, but had far fewer, longer contigs. 

I re-ran Maker using the settings below. For one of my genomes, I got around 11,000 genes predicted again, as expected. However, for the other genome, I consistently get only ~4,400 predicted genes. 

How can I determine why I keep getting fewer predicted genes for only one of my genomes, even though I ran both the same way?

Thanks,
Morgan S. 

maker_opts.log
#-----Genome (these are always required)
genome=/work/Geomicrobiology/msobol/IODP_329_SPG/1368D2H1/repeatmasker/unicycler/1368D_unicycler_contigs.fasta.masked #genome sequence (fasta file or$
organism_type=eukaryotic #eukaryotic or prokaryotic. Default is eukaryotic

#-----Re-annotation Using MAKER Derived GFF3
maker_gff=/work/Geomicrobiology/msobol/IODP_329_SPG/1368D2H1/maker/1368D_2H1_contigs.fasta.maker.output/1368D_2H1_contigs.fasta.all.gff #MAKER derive$
est_pass=0 #use ESTs in maker_gff: 1 = yes, 0 = no
altest_pass=1 #use alternate organism ESTs in maker_gff: 1 = yes, 0 = no
protein_pass=1 #use protein alignments in maker_gff: 1 = yes, 0 = no
rm_pass=0 #use repeats in maker_gff: 1 = yes, 0 = no
model_pass=0 #use gene models in maker_gff: 1 = yes, 0 = no
pred_pass=0 #use ab-initio predictions in maker_gff: 1 = yes, 0 = no
other_pass=0 #passthrough anyything else in maker_gff: 1 = yes, 0 = no

#-----EST Evidence (for best results provide a file for at least one)
est= #set of ESTs or assembled mRNA-seq in fasta format
altest= #EST/cDNA sequence file in fasta format from an alternate organism
est_gff= #aligned ESTs or mRNA-seq from an external GFF3 file
altest_gff= #aligned ESTs from a closly relate species in GFF3 format

#-----Protein Homology Evidence (for best results provide a file for at least one)
protein=/work/Geomicrobiology/msobol/IODP_329_SPG/uniprot_sprot.fasta  #protein sequence file in fasta format (i.e. from mutiple oransisms)
protein_gff=  #aligned protein homology evidence from an external GFF3 file

#-----Repeat Masking (leave values blank to skip repeat masking)
model_org= #select a model organism for RepBase masking in RepeatMasker
rmlib= #provide an organism specific repeat library in fasta format for RepeatMasker
repeat_protein= #provide a fasta file of transposable element proteins for RepeatRunner
rm_gff= #pre-identified repeat elements from an external GFF3 file
prok_rm=0 #forces MAKER to repeatmask prokaryotes (no reason to change this), 1 = yes, 0 = no
softmask=0 #use soft-masking rather than hard-masking in BLAST (i.e. seg and dust filtering)

#-----Gene Prediction
snaphmm= #SNAP HMM file
gmhmm=/home/msobol/genemark/68D_2/output/gmhmm.mod #GeneMark HMM file
augustus_species=1368D_uni #Augustus gene prediction species model
fgenesh_par_file= #FGENESH parameter file
pred_gff= #ab-initio predictions from an external GFF3 file
model_gff= #annotated gene models from an external GFF3 file (annotation pass-through)
est2genome=0 #infer gene predictions directly from ESTs, 1 = yes, 0 = no
protein2genome=1 #infer predictions from protein homology, 1 = yes, 0 = no
trna=0 #find tRNAs with tRNAscan, 1 = yes, 0 = no
snoscan_rrna= #rRNA file to have Snoscan find snoRNAs
unmask=0 #also run ab-initio prediction programs on unmasked sequence, 1 = yes, 0 = no

#-----Other Annotation Feature Types (features MAKER doesn't recognize)
other_gff= #extra features to pass-through to final MAKER generated GFF3 file

#-----External Application Behavior Options
alt_peptide=C #amino acid used to replace non-standard amino acids in BLAST databases
cpus=1 #max number of cpus to use in BLAST and RepeatMasker (not for MPI, leave 1 when using MPI)

#-----MAKER Behavior Options
max_dna_len=100000 #length for dividing up contigs into chunks (increases/decreases memory usage)
min_contig=1 #skip genome contigs below this length (under 10kb are often useless)

pred_flank=200 #flank for extending evidence clusters sent to gene predictors
pred_stats=1 #report AED and QI statistics for all predictions as well as models
AED_threshold=1 #Maximum Annotation Edit Distance allowed (bound by 0 and 1)
min_protein=0 #require at least this many amino acids in predicted proteins
alt_splice=0 #Take extra steps to try and find alternative splicing, 1 = yes, 0 = no
always_complete=0 #extra steps to force start and stop codons, 1 = yes, 0 = no
map_forward=0 #map names and attributes forward from old GFF3 genes, 1 = yes, 0 = no
keep_preds=1 #Concordance threshold to add unsupported gene prediction (bound by 0 and 1)

split_hit=10000 #length for the splitting of hits (expected max intron size for evidence alignments)
single_exon=1 #consider single exon EST evidence when generating annotations, 1 = yes, 0 = no
single_length=250 #min length required for single exon ESTs if 'single_exon is enabled'
correct_est_fusion=0 #limits use of ESTs in annotation to avoid fusion genes

tries=2 #number of times to try a contig if there is a failure for some reason
clean_try=0 #remove all data from previous run before retrying, 1 = yes, 0 = no
clean_up=0 #removes theVoid directory with individual analysis files, 1 = yes, 0 = no
TMP= #specify a directory other than the system default temporary directory for temporary files


_______________________________________________
maker-devel mailing list
[hidden email]
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
Re: Re-annotation, fewer gene predictions

Xabier Vázquez-Campos
Hi Morgan,

We had a similar issue with AUGUSTUS underpredicting when using a BUSCO-derived gene model.

Also, check the number of proteins predicted by each individual predictor. If the numbers from one of them are off, you may have found a possible source of the problem.
We didn't have a very good experience with GeneMark (GM), as it used to overpredict an absurd number of proteins.
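
One way to get those per-predictor numbers is to count features by the source column of the merged MAKER GFF3. This is a minimal sketch, assuming the usual MAKER layout in which ab initio predictions appear as `match` features (with sources such as `augustus_masked` or `genemark`) and final models as `gene` features with source `maker`. The function name is made up for illustration.

```shell
# count_gff_features FILE TYPE
# Print "<source><TAB><count>" for every GFF3 line whose type column
# (column 3) equals TYPE, sorted by source name.
count_gff_features() {
    awk -F'\t' -v type="$2" '
        /^##FASTA/ { exit }          # stop at the embedded sequence section
        /^#/       { next }          # skip comments and directives
        $3 == type { n[$2]++ }       # tally by source (column 2)
        END        { for (s in n) printf "%s\t%d\n", s, n[s] }
    ' "$1" | sort
}
```

For example, `count_gff_features genome.all.gff match` would list the raw predictions and evidence alignments per source, and `count_gff_features genome.all.gff gene` the final gene models (the file name is a placeholder).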

Xabi


--
Xabier Vázquez-Campos, PhD
Research Associate
NSW Systems Biology Initiative
School of Biotechnology and Biomolecular Sciences
The University of New South Wales
Sydney NSW 2052 AUSTRALIA

Re: Re-annotation, fewer gene predictions

Xabier Vázquez-Campos
Don't you use SNAP? It usually produces quite decent results, and it is easier to train than any of the other predictors.

In any case, the Augustus gene model is way off in both cases.
GeneMark doesn't look bad in the first run, assuming your fungus has a fairly typical genome. In the second, it looks bad.

I'm not too familiar with re-annotation, but I would create the gene models from scratch rather than reuse the ones from the Illumina-only genomes.
Note that assemblies built with long reads have a higher proportion of repetitive elements that need masking, and RepeatMasker alone may not be enough. In theory, this shouldn't affect the Augustus model if it was trained through BUSCO, as BUSCO uses defined conserved markers to create the gene model, but I'm not so sure about GeneMark.

If you trained Augustus with BUSCO and this is the result, I'd discard the gene model and train it again the "traditional way", i.e. as we used to when we only had CEGMA. I had good results just by changing the training method.

Hope it helps,
Xabi




On Wed, 6 Feb 2019 at 02:19, morgan sobol <[hidden email]> wrote:
Thank you, Xabi for the response. 
The number of proteins from each source is much lower than before. 
Previous numbers were 325, 10,899, and 11,243 for augustus, genemark, and maker, respectively. 
The more recent numbers are 25, 857, and 4,418, respectively. 

So do you think this hints that something is wrong with genemark? 
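
To put those counts side by side, the share of predictions retained in the second run can be worked out per source. This is a small sketch using only the numbers quoted in this message; the helper name is illustrative.

```shell
# pct_retained OLD NEW
# Print NEW as a percentage of OLD, rounded to one decimal place.
pct_retained() {
    awk -v old="$1" -v new="$2" 'BEGIN { printf "%.1f\n", 100 * new / old }'
}

pct_retained 325   25      # augustus: prints 7.7
pct_retained 10899 857     # genemark: prints 7.9
pct_retained 11243 4418    # maker:    prints 39.3
```

Both ab initio predictors keep under 8% of their earlier predictions, while the final maker set keeps about 39%.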

Morgan



Re: Re-annotation, fewer gene predictions

Xabier Vázquez-Campos
Oh, sorry, I didn't explain myself well. What I was trying to say is that before BUSCO, when we only had CEGMA, we trained Augustus in a different way, as CEGMA wouldn't produce Augustus gene models automatically. I don't mean that you should use CEGMA.

This is what I have in my own documentation about how to train Augustus "the old way":
AUGUSTUS… the old way

Alternatively, you can train AUGUSTUS in a more “manual” way, like when we were using CEGMA. The training starts with the output from the second instance of fathom in the SNAP training section.

cd ${MYGENOME_DIR}/maker/snap1
perl ~/bin/zff2augustus_gbk.pl > ${MYGENOME}.train1.gb

zff2augustus_gbk.pl generates a GenBank file from export.dna.
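
For context, the export files come out of the SNAP training sequence. The steps below follow the standard MAKER tutorial, which is an assumption here since the SNAP section itself is not part of this thread. The function only prints the commands so they can be reviewed first; its name is illustrative, and paths are assumed to contain no whitespace.

```shell
# snap_training_cmds GFF NAME
# Print the SNAP training commands for a merged MAKER GFF3 file.
# Review the output, then pipe it to `sh` inside an empty working directory.
# maker2zff writes genome.ann/genome.dna; the second fathom call writes
# export.ann/export.dna, which zff2augustus_gbk.pl then picks up.
snap_training_cmds() {
    gff=$1 name=$2
    cat <<EOF
maker2zff $gff
fathom genome.ann genome.dna -categorize 1000
fathom uni.ann uni.dna -export 1000 -plus
forge export.ann export.dna
hmm-assembler.pl $name . > $name.hmm
EOF
}
```

For example, `snap_training_cmds ../genome.all.gff mygenome_snap1 | sh` (both arguments are placeholders) would leave `export.ann` and `export.dna` in the current directory.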

The actual training of AUGUSTUS is done through the webAUGUSTUS server.

Before proceeding, it is recommended to rename the FASTA headers, especially if they are very long or contain special characters. This is the main cause of failure for jobs submitted to webAUGUSTUS. You can use the simplifyFastaHeaders.pl script for that:

perl ~/bin/simplifyFastaHeaders.pl ${MYGENOME}_assembly.fasta nameStem ${MYGENOME}_contigs_rename.fasta ${MYGENOME}_contigs.map

perl ~/bin/simplifyFastaHeaders.pl ${MYGENOME}_transcripts_assembled.fasta nameStem ${MYGENOME}_rna_rename.fasta ${MYGENOME}_rna.map

nameStem is the base name used to name each sequence in the multi-FASTA files. Pick something appropriate, such as contig and rna for the assembly and RNA-seq files, respectively, or something based on those. For example, ‘pgcontig’ and ‘pgrna’ for contigs and RNA from Puccinia graminis.
DO NOT give the same nameStem to both FASTA files, and don’t use any special characters.
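
A quick way to see whether a FASTA file needs renaming at all is to list the headers that are long or contain anything other than letters, digits and underscores. This is a minimal sketch: the 30-character default is an arbitrary assumption, not a documented webAUGUSTUS limit, and the function name is illustrative.

```shell
# risky_fasta_headers FILE [MAXLEN]
# Print header lines whose name contains characters other than letters,
# digits or underscores, or is longer than MAXLEN characters (default 30).
risky_fasta_headers() {
    awk -v max="${2:-30}" '
        /^>/ {
            name = substr($0, 2)     # header text without the leading ">"
            if (name !~ /^[A-Za-z0-9_]+$/ || length(name) > max) print
        }
    ' "$1"
}
```

If `risky_fasta_headers ${MYGENOME}_assembly.fasta` prints nothing, the headers are probably already simple enough.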

We need the following files (minimum):

  • ${MYGENOME}_assembly.fasta as Genome file
  • ${MYGENOME}.train1.gb as Training gene structure file

If we also have RNA-seq data:

  • ${MYGENOME}_transcripts_assembled.fasta as cDNA file

Use ${MYGENOME}_v1 as Species name. We will need a different species name in the retraining step. Otherwise, when Maker2 is rerun, it will see the same name and will not rerun AUGUSTUS, even though the species profile is different. So ${MYGENOME}_v1 does the job and tracks the version.

Once the job is finished, the Species parameter archive (parameters.tar.gz) will contain a folder with the model files for your species. Copy it to the species folder of your AUGUSTUS installation.
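
That last step might look like the following. It is a sketch only: it assumes `AUGUSTUS_CONFIG_PATH` points at the `config` directory of your AUGUSTUS installation and that the archive contains a folder named exactly like the species, so check the archive layout before relying on it. The function name is illustrative.

```shell
# install_augustus_species ARCHIVE SPECIES
# Unpack ARCHIVE in a scratch directory and copy the folder named SPECIES
# into $AUGUSTUS_CONFIG_PATH/species/. Returns non-zero on any failure.
install_augustus_species() {
    archive=$1 species=$2
    [ -n "${AUGUSTUS_CONFIG_PATH:-}" ] || { echo "AUGUSTUS_CONFIG_PATH is not set" >&2; return 1; }
    dest="$AUGUSTUS_CONFIG_PATH/species/$species"
    # Refuse to overwrite an existing model of the same name.
    [ ! -e "$dest" ] || { echo "$dest already exists" >&2; return 1; }
    work=$(mktemp -d) || return 1
    status=1
    if tar -xzf "$archive" -C "$work"; then
        # Locate the unpacked species folder wherever it sits in the archive.
        src=$(find "$work" -type d -name "$species" | head -n 1)
        if [ -n "$src" ] && cp -r "$src" "$dest"; then status=0; fi
    fi
    rm -rf "$work"
    return "$status"
}
```

After something like `install_augustus_species parameters.tar.gz "${MYGENOME}_v1"`, the same name goes into `augustus_species=` in the MAKER options file.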

Hope this helps

PS: hit reply all so this is logged in Maker's mail list in case anybody else experiences similar issues

On Thu, 7 Feb 2019 at 06:36, morgan sobol <[hidden email]> wrote:
I have not used SNAP or CEGMA; however, I see that CEGMA was discontinued in 2015. 
Do you think that will be a problem, or is it still worth using the old version?



From: Xabier Vázquez-Campos <[hidden email]>
Sent: Tuesday, February 5, 2019 4:42 PM
To: morgan sobol; Maker Mailing List
Subject: Re: [maker-devel] Re-annotation, fewer gene predictions
 
Don't you use SNAP? It usually produces quite decent results. And easier to train than any of the other predictors

In any case, the Augustus gene model is way off in both cases
GM doesn't seem bad if your fungus has a rather usual genome... in the first. For the second, it looks bad

I'm not too familiar with the reannotation but I'd rather create the gene models from scratch rather than reuse the ones from the Illumina-only genomes.
Note that assemblies with long-reads, have a higher proportion of repetitive elements that need masking and RepeatMasker only may not be enough. In theory, this shouldn't affect Augustus model if trained through BUSCO as it uses defined conserved markers to create the gene model, but I'm not so sure about GM.

If you trained Augustus with BUSCO, and this is the result, I'd discard the gene model and train it again by the "traditional way", i.e. as it used to be when we only had CEGMA. I had good results just by changing the training method.

Hope it helps,
Xabi




On Wed, 6 Feb 2019 at 02:19, morgan sobol <[hidden email]> wrote:
Thank you, Xabi for the response. 
The number of proteins from each source is greatly lower than before. 
Previous numbers were 325, 10,899, and 11,243 for augustus, genemark, and maker respectively. 
The more recent numbers are 25, 857, 4418 respectively. 

So do you think maybe this hints that something is wrong from genemark? 

Morgan



From: Xabier Vázquez-Campos <[hidden email]>
Sent: Sunday, February 3, 2019 4:43 PM
To: morgan sobol
Cc: [hidden email]
Subject: Re: [maker-devel] Re-annotation, fewer gene predictions
 
Hi Morgan,

We had a similar issue with AUGUSTUS underpredicting when using a BUSCO-derived gene model

Also, check the number of proteins by each individual predictor. If the numbers from one of them are off, you may find a possible source of issues.
We didn't have a very good experience with GM, as it used to overpredict an absurd number of proteins.

Xabi

On Mon, 4 Feb 2019 at 06:15, morgan sobol <[hidden email]> wrote:
Hello, 

I previously used Maker to annotate two different fungal genomes that were created using Illumina sequences only. For these genomes, I had over 11,000 genes predicted. 
I recently obtained PacBio sequences for the same genomes, so I created two hybrid assemblies. Both assemblies were very familiar in length and completed number of orthologs to the Illumina only assembly, but had much fewer, but longer contigs. 

I re-ran Maker using the settings below. For one of my genomes, I got around 11,000 genes predicted again, as expected. However, for the other genome, I am continuously getting ~4,400 predicted genes. 

I am asking for help as to how I can determine why I keep getting fewer predicted genes for only one of my genomes, even though I ran them the same?

Thanks,
Morgan S. 

maker_opts.log
#-----Genome (these are always required)
genome=/work/Geomicrobiology/msobol/IODP_329_SPG/1368D2H1/repeatmasker/unicycler/1368D_unicycler_contigs.fasta.masked #genome sequence (fasta file or$
organism_type=eukaryotic #eukaryotic or prokaryotic. Default is eukaryotic

#-----Re-annotation Using MAKER Derived GFF3
maker_gff=/work/Geomicrobiology/msobol/IODP_329_SPG/1368D2H1/maker/1368D_2H1_contigs.fasta.maker.output/1368D_2H1_contigs.fasta.all.gff #MAKER derive$
est_pass=0 #use ESTs in maker_gff: 1 = yes, 0 = no
altest_pass=1 #use alternate organism ESTs in maker_gff: 1 = yes, 0 = no
protein_pass=1 #use protein alignments in maker_gff: 1 = yes, 0 = no
rm_pass=0 #use repeats in maker_gff: 1 = yes, 0 = no
model_pass=0 #use gene models in maker_gff: 1 = yes, 0 = no
pred_pass=0 #use ab-initio predictions in maker_gff: 1 = yes, 0 = no
other_pass=0 #passthrough anyything else in maker_gff: 1 = yes, 0 = no

#-----EST Evidence (for best results provide a file for at least one)
est= #set of ESTs or assembled mRNA-seq in fasta format
altest= #EST/cDNA sequence file in fasta format from an alternate organism
est_gff= #aligned ESTs or mRNA-seq from an external GFF3 file
altest_gff= #aligned ESTs from a closly relate species in GFF3 format

#-----Protein Homology Evidence (for best results provide a file for at least one)
protein=/work/Geomicrobiology/msobol/IODP_329_SPG/uniprot_sprot.fasta  #protein sequence file in fasta format (i.e. from mutiple oransisms)
protein_gff=  #aligned protein homology evidence from an external GFF3 file

#-----Repeat Masking (leave values blank to skip repeat masking)
model_org= #select a model organism for RepBase masking in RepeatMasker
rmlib= #provide an organism specific repeat library in fasta format for RepeatMasker
repeat_protein= #provide a fasta file of transposable element proteins for RepeatRunner
rm_gff= #pre-identified repeat elements from an external GFF3 file
prok_rm=0 #forces MAKER to repeatmask prokaryotes (no reason to change this), 1 = yes, 0 = no
softmask=0 #use soft-masking rather than hard-masking in BLAST (i.e. seg and dust filtering)

#-----Gene Prediction
snaphmm= #SNAP HMM file
gmhmm=/home/msobol/genemark/68D_2/output/gmhmm.mod #GeneMark HMM file
augustus_species=1368D_uni #Augustus gene prediction species model
fgenesh_par_file= #FGENESH parameter file
pred_gff= #ab-initio predictions from an external GFF3 file
model_gff= #annotated gene models from an external GFF3 file (annotation pass-through)
est2genome=0 #infer gene predictions directly from ESTs, 1 = yes, 0 = no
protein2genome=1 #infer predictions from protein homology, 1 = yes, 0 = no
trna=0 #find tRNAs with tRNAscan, 1 = yes, 0 = no
snoscan_rrna= #rRNA file to have Snoscan find snoRNAs
unmask=0 #also run ab-initio prediction programs on unmasked sequence, 1 = yes, 0 = no

#-----Other Annotation Feature Types (features MAKER doesn't recognize)
other_gff= #extra features to pass-through to final MAKER generated GFF3 file

#-----External Application Behavior Options
alt_peptide=C #amino acid used to replace non-standard amino acids in BLAST databases
cpus=1 #max number of cpus to use in BLAST and RepeatMasker (not for MPI, leave 1 when using MPI)

#-----MAKER Behavior Options
max_dna_len=100000 #length for dividing up contigs into chunks (increases/decreases memory usage)
min_contig=1 #skip genome contigs below this length (under 10kb are often useless)

pred_flank=200 #flank for extending evidence clusters sent to gene predictors
pred_stats=1 #report AED and QI statistics for all predictions as well as models
AED_threshold=1 #Maximum Annotation Edit Distance allowed (bound by 0 and 1)
min_protein=0 #require at least this many amino acids in predicted proteins
alt_splice=0 #Take extra steps to try and find alternative splicing, 1 = yes, 0 = no
always_complete=0 #extra steps to force start and stop codons, 1 = yes, 0 = no
map_forward=0 #map names and attributes forward from old GFF3 genes, 1 = yes, 0 = no
keep_preds=1 #Concordance threshold to add unsupported gene prediction (bound by 0 and 1)

split_hit=10000 #length for the splitting of hits (expected max intron size for evidence alignments)
single_exon=1 #consider single exon EST evidence when generating annotations, 1 = yes, 0 = no
single_length=250 #min length required for single exon ESTs if 'single_exon is enabled'
correct_est_fusion=0 #limits use of ESTs in annotation to avoid fusion genes

tries=2 #number of times to try a contig if there is a failure for some reason
clean_try=0 #remove all data from previous run before retrying, 1 = yes, 0 = no
clean_up=0 #removes theVoid directory with individual analysis files, 1 = yes, 0 = no
TMP= #specify a directory other than the system default temporary directory for temporary files

_______________________________________________
maker-devel mailing list
[hidden email]
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org


--
Xabier Vázquez-Campos, PhD
Research Associate
NSW Systems Biology Initiative
School of Biotechnology and Biomolecular Sciences
The University of New South Wales
Sydney NSW 2052 AUSTRALIA



Re: Re-annotation, fewer gene predictions

Carson Holt-2
In reply to this post by morgan sobol
One thing you can also do is use the old models as protein= input and run with the protein2genome option, just to see where things align. You may find that not all old models are recoverable in the new assembly. Fewer genes in the new assembly may mean that redundant/duplicate contigs were collapsed and that split contigs were joined, so that multiple gene fragments became a single unified model. Make sure to always review contigs in a browser to see how models and evidence correlate.
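
A rough way to do the "see where things align" check outside a browser, as a Python sketch rather than anything MAKER ships: collect the names of the old proteins that show up as protein2genome hits in the new GFF3, then list the ones that never appear. It assumes the hits are written with `protein2genome` in column 2, `protein_match` in column 3 and a `Name=` attribute carrying the FASTA ID, so check a few lines of your own GFF3 first. The function names and the sample records are made up for illustration.

```python
from urllib.parse import unquote


def fasta_ids(fasta_text):
    """Return the set of sequence IDs (first word of each '>' header)."""
    ids = set()
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                ids.add(fields[0])
    return ids


def aligned_names(gff_text, source="protein2genome", feature="protein_match"):
    """Collect Name= values of alignment features from GFF3 text.

    ASSUMPTION: hits carry `source` in column 2, `feature` in column 3,
    and the protein ID in the Name attribute.
    """
    names = set()
    for line in gff_text.splitlines():
        if line.startswith("##FASTA"):
            break  # embedded sequence section, no more features
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9 or cols[1] != source or cols[2] != feature:
            continue
        for attr in cols[8].split(";"):
            key, _, value = attr.partition("=")
            if key == "Name":
                names.add(unquote(value))
    return names


def unrecovered(fasta_text, gff_text):
    """Old protein IDs that have no alignment in the new annotation."""
    return sorted(fasta_ids(fasta_text) - aligned_names(gff_text))


if __name__ == "__main__":
    # Hypothetical toy data: two old models, only one aligns.
    old_proteins = ">gene1-mRNA-1 old model\nMKT\n>gene2-mRNA-1\nMSS\n"
    new_gff = (
        "##gff-version 3\n"
        "contig1\tprotein2genome\tprotein_match\t100\t400\t.\t+\t.\t"
        "ID=x;Name=gene1-mRNA-1\n"
    )
    print(unrecovered(old_proteins, new_gff))
```

The IDs it reports are the ones worth looking up in a browser first, since they are either genuinely gone from the new assembly or merged into another model.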

—Carson




Re: Re-annotation, fewer gene predictions

morgan sobol
In reply to this post by Xabier Vázquez-Campos
Thank you, Xabi and Carson. 
With your help, I was able to improve the annotation with a more appropriate number of predictions. 

Best,
Morgan 


From: Xabier Vázquez-Campos <[hidden email]>
Sent: Wednesday, February 6, 2019 11:33 PM
To: morgan sobol; Maker Mailing List
Subject: Re: [maker-devel] Re-annotation, fewer gene predictions
 
Oh, sorry, I didn't explain myself well. What I was trying to say is that before BUSCO, when we only had CEGMA, we trained Augustus in a different way, because CEGMA didn't produce Augustus gene models automatically. I don't mean that you should use CEGMA.

This is what I have in my own documentation about how to train Augustus "the old way":
AUGUSTUS… the old way

Alternatively, you can train AUGUSTUS in a more “manual” way, like when we were using CEGMA. The training starts with the output from the second instance of fathom in the SNAP training section.

cd ${MYGENOME_DIR}/maker/snap1
perl ~/bin/zff2augustus_gbk.pl > ${MYGENOME}.train1.gb

zff2augustus_gbk.pl generates a GenBank file from export.dna.

The actual training of AUGUSTUS is done through the webAUGUSTUS server.

Before proceeding, it is recommended to rename the fasta headers, especially if they are very long or contain special characters. This is the main reason jobs submitted to webAUGUSTUS fail. You can use the simplifyFastaHeaders.pl script for that:

perl ~/bin/simplifyFastaHeaders.pl ${MYGENOME}_assembly.fasta nameStem ${MYGENOME}_contigs_rename.fasta ${MYGENOME}_contigs.map

perl ~/bin/simplifyFastaHeaders.pl ${MYGENOME}_transcripts_assembled.fasta nameStem ${MYGENOME}_rna_rename.fasta ${MYGENOME}_rna.map

nameStem is the base name used to name each of the sequences in the multifasta files. Pick something appropriate: contig and rna for the assembly and RNA-seq files, respectively, or something based on those. For example, ‘pgcontig’ and ‘pgrna’ for contigs and RNA from Puccinia graminis.
DO NOT give the same nameStem to both fasta files, and don’t use any special characters.
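
For anyone without the script at hand, the renaming idea can be sketched in a few lines of Python. This is not simplifyFastaHeaders.pl and makes no attempt to reproduce its exact output or map format; `rename_headers` and the (new name, old header) mapping are purely illustrative. It numbers the records as nameStem1, nameStem2, ... and rejects a stem containing anything other than letters and digits.

```python
import re


def rename_headers(fasta_text, name_stem):
    """Rename FASTA records to name_stem1, name_stem2, ...

    Returns (renamed_fasta_text, mapping), where mapping is a list of
    (new_name, original_header) pairs in input order.
    """
    if not re.fullmatch(r"[A-Za-z0-9]+", name_stem):
        raise ValueError("name_stem should contain only letters and digits")
    out_lines = []
    mapping = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            new_name = f"{name_stem}{len(mapping) + 1}"
            mapping.append((new_name, line[1:].strip()))
            out_lines.append(">" + new_name)
        else:
            out_lines.append(line)  # sequence lines pass through unchanged
    return "".join(l + "\n" for l in out_lines), mapping


if __name__ == "__main__":
    # Hypothetical assembler-style headers.
    fasta = ">NODE_1_length=5|cov 3.2\nACGT\nAC\n>NODE_2\nGG\n"
    renamed, mapping = rename_headers(fasta, "pgcontig")
    print(renamed, end="")
    for new_name, old_header in mapping:
        print(new_name, old_header, sep="\t")
```

Keeping the mapping around matters because the trained parameters and any later GFF3 will refer to the short names, and you will want to translate them back.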

We need the following files (minimum):

  • ${MYGENOME}_assembly.fasta as Genome file
  • ${MYGENOME}.train1.gb as Training gene structure file

If we also have RNA-seq data:

  • ${MYGENOME}_assembled_transcripts.fasta as cDNA file

Use ${MYGENOME}_v1 as Species name. We will need a different species name in the retraining step; otherwise, when Maker2 is rerun, it will see the same name and will not rerun AUGUSTUS, even though the species profile is different. So ${MYGENOME}_v1 does the job and tracks the version.

Once the job is finished, the Species parameter archive (parameters.tar.gz) will contain a folder with the model files for your species. Copy it to the species folder of your AUGUSTUS installation.

Hope this helps

PS: hit reply all so this is logged in Maker's mail list in case anybody else experiences similar issues

On Thu, 7 Feb 2019 at 06:36, morgan sobol <[hidden email]> wrote:
I have not used SNAP or CEGMA, however, I see that CEGMA was discontinued in 2015. 
Do you think that will be a problem, or is it still worth using the old version?



From: Xabier Vázquez-Campos <[hidden email]>
Sent: Tuesday, February 5, 2019 4:42 PM
To: morgan sobol; Maker Mailing List
Subject: Re: [maker-devel] Re-annotation, fewer gene predictions
 
Don't you use SNAP? It usually produces quite decent results, and it is easier to train than any of the other predictors.

In any case, the Augustus gene model is way off in both cases.
GM doesn't seem bad in the first, if your fungus has a rather typical genome. For the second, it looks bad.

I'm not too familiar with re-annotation, but I'd rather create the gene models from scratch than reuse the ones from the Illumina-only genomes.
Note that assemblies with long reads have a higher proportion of repetitive elements that need masking, and RepeatMasker alone may not be enough. In theory, this shouldn't affect the Augustus model if it was trained through BUSCO, as BUSCO uses defined conserved markers to create the gene model, but I'm not so sure about GM.

If you trained Augustus with BUSCO and this is the result, I'd discard the gene model and train it again the "traditional way", i.e. as it used to be done when we only had CEGMA. I had good results just by changing the training method.

Hope it helps,
Xabi




On Wed, 6 Feb 2019 at 02:19, morgan sobol <[hidden email]> wrote:
Thank you, Xabi for the response. 
The number of proteins from each source is much lower than before.
Previous numbers were 325, 10,899, and 11,243 for augustus, genemark, and maker, respectively.
The more recent numbers are 25, 857, and 4,418, respectively.

So do you think this hints that something is wrong with genemark?

Morgan



From: Xabier Vázquez-Campos <[hidden email]>
Sent: Sunday, February 3, 2019 4:43 PM
To: morgan sobol
Cc: [hidden email]
Subject: Re: [maker-devel] Re-annotation, fewer gene predictions
 
Hi Morgan,

We had a similar issue with AUGUSTUS underpredicting when using a BUSCO-derived gene model.

Also, check the number of proteins predicted by each individual predictor. If the numbers from one of them are off, you may have found a possible source of the issue.
We didn't have a very good experience with GM, as it used to overpredict an absurd number of proteins.
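
One quick way to get those per-predictor numbers is to tally features by the source column of the MAKER GFF3. The Python sketch below is only an illustration (`count_by_source` is not part of MAKER). It assumes column 2 names the program and that ab initio predictions are written as `match` features while final models are `gene` features with source `maker`; adjust `feature_types` if your file looks different.

```python
from collections import Counter


def count_by_source(gff_text, feature_types=("gene", "match")):
    """Count GFF3 features per (source, type) pair.

    ASSUMPTION: top-level predictions are 'match' features and final
    annotations are 'gene' features; other types are ignored.
    """
    counts = Counter()
    for line in gff_text.splitlines():
        if line.startswith("##FASTA"):
            break  # embedded sequence section, no more features
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) == 9 and cols[2] in feature_types:
            counts[(cols[1], cols[2])] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical toy GFF3 with one final model and three predictions.
    gff = (
        "##gff-version 3\n"
        "c1\tmaker\tgene\t1\t90\t.\t+\t.\tID=g1\n"
        "c1\tgenemark\tmatch\t1\t90\t.\t+\t.\tID=p1\n"
        "c1\tgenemark\tmatch\t200\t300\t.\t-\t.\tID=p2\n"
        "c1\taugustus_masked\tmatch\t5\t80\t.\t+\t.\tID=p3\n"
    )
    for (source, ftype), n in sorted(count_by_source(gff).items()):
        print(source, ftype, n, sep="\t")
```

Running it on the old and the new annotation side by side shows directly which predictor's counts dropped.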

Xabi
