Dear makers of MAKER,
first of all - thank you for this awesome program! In the context of my
project, I have been running MAKER on a set of novel genomes and it
worked very well :)
During the last days, I realized that the reference sequences of species
A that I have been using as starting point for gene model prediction for
species B, C and D are a bit flawed. How do I know that? Through
visualization of novel RNA-Seq data that was mapped to the "old" gene
models of species A, I came to the conclusion that the "old" gene models
are not really accurate. In fact, between 52–62% of the reads map within
annotated exonic regions of the genome and up to 47% map within
intergenic regions. Intron/exon borders are pretty messed up and there
are a lot more transcribed sequences in species A than previously annotated.
Hence, I would love to update the gene models of species A including the
new RNA-Seq evidence and hope to get more accurate gene models out of
it. The more accurate gene models of species A would be then used to
predict genes for species B, C and D (which are hopefully going to
benefit from the more accurate input). However, I am not sure I
understand how to set the parameters in maker_opts.ctl.
My plan is to produce new RNA-Seq based gene models for species A using
Cufflinks, StringTie, BRAKER, Trinity and Velvet. I would pass the
output to maker_opts.ctl as:
model_gff=Cufflinks.gff,Stringtie.gff,Breaker.gff,Trinity.gff,Velvet.gff #the new gene models
est=Species_A_reference_ests.fasta #the old/flawed gene models
Question 1: is this correct so far?
But what sequences do I use for training the ab-initio predictors?
Question 2: Do I use the "old/flawed" sequences that I know are not
really good? I am not sure what sequences to use for training.
Any help on this issue would be amazing!
Alessandro W. Rossoni, M.Sc.
Institute for Plant Biochemistry
maker-devel mailing list
If the old models are poor, then I suggest you do new training using BUSCO, CEGMA, or the est2genome or protein2genome options within MAKER.
See also this thread: https://groups.google.com/forum/#!topic/maker-devel/FWMSTdqWQqI
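For the est2genome/protein2genome route, a first training pass in maker_opts.ctl might look roughly like the sketch below. The file names are placeholders I made up, not files from this thread; the option names are the standard ones in maker_opts.ctl. The models from this pass would then be used to train the ab-initio predictors before a second run.

```
#-----sketch of a first training pass (all file names are hypothetical)
genome=Species_A_genome.fasta #placeholder genome file
est=Trinity.fasta #assembled transcripts, given as fasta
protein=related_proteins.fasta #placeholder protein evidence
est2genome=1 #infer gene predictions directly from ESTs, 1 = yes, 0 = no
protein2genome=1 #infer predictions from protein homology, 1 = yes, 0 = no
```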
model_gff is for existing gene models you want to keep. So none of these should go there: Cufflinks.gff,Stringtie.gff,Breaker.gff,Trinity.gff,Velvet.gff
model_gff will always make it into the final annotation set even without any evidence support. By putting those files there, you are basically turning every feature in each of those files into a final gene model no matter how bad it is.
Also, if the original models are poor, don’t put them there either. You can do reciprocal best BLAST hits between the final models and the old models to see how well they match each other in the end. It will take a little data processing to make it work, though.
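The data processing for the reciprocal best hit comparison could be sketched in Python as follows. This assumes BLAST was run in both directions (new models against old, and old against new) with tabular output (-outfmt 6, where the 12th column is the bit score). The function names are made up for illustration, and picking the best hit by bit score is just one reasonable choice.

```python
import csv


def best_hits(blast_tab):
    """Map each query to its highest-bit-score subject.

    `blast_tab` is an open text handle of BLAST tabular output
    (-outfmt 6 with the 12 default columns; bit score is the last one).
    """
    best = {}
    for row in csv.reader(blast_tab, delimiter="\t"):
        # Skip blank lines and comment lines (e.g. from -outfmt 7).
        if not row or row[0].startswith("#"):
            continue
        query, subject, bitscore = row[0], row[1], float(row[11])
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(new_vs_old, old_vs_new):
    """Return sorted (new_id, old_id) pairs that are each other's best hit."""
    forward = best_hits(new_vs_old)
    reverse = best_hits(old_vs_new)
    return sorted(
        (new_id, old_id)
        for new_id, old_id in forward.items()
        if reverse.get(old_id) == new_id
    )
```

Calling reciprocal_best_hits(open("new_vs_old.tsv"), open("old_vs_new.tsv")) with your own two result files (hypothetical names here) gives the list of model pairs that agree in both directions; new models missing from that list are the ones to inspect by hand.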
All transcript-based files should go to est_gff, since they are evidence alignments and not model predictions. Breaker.gff should go to pred_gff, since it is a set of gene predictions.
With Trinity, I suggest you provide the fasta file and let MAKER align and filter things, rather than providing a GFF3. The problem with using GFF3 is that you are basically short-circuiting the upstream prioritization and filtering, saying “take this evidence as is.” Also, providing the same evidence from multiple sources is a bad idea. By purposely making the evidence dataset noisier, you are forcing lower accuracy.
My suggestion would be not to use Cufflinks (it will introduce a very high false positive rate). Provide the Trinity input as fasta (and make sure the jaccard_clip option was used when assembling). You will also have to manually review the models with and without the Stringtie data to see whether it hurts more than it helps.
Provide Breaker.gff to pred_gff, but still allow MAKER to run Augustus itself internally (otherwise you won’t be able to use protein evidence as hints).
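Putting the suggestions above together, the relevant part of maker_opts.ctl might look something like this. It is only a sketch: the protein file name and the Augustus species name are placeholders, Cufflinks is left out as suggested, and whether to keep the Stringtie line depends on your manual review.

```
#-----sketch of the annotation run (placeholder names are marked)
est=Trinity.fasta #Trinity transcripts as fasta, aligned and filtered by MAKER
est_gff=Stringtie.gff #transcript alignments as evidence; drop if it hurts more than it helps
protein=related_proteins.fasta #placeholder protein evidence file
model_gff= #left empty: nothing here that you do not want kept as is
pred_gff=Breaker.gff #external ab-initio predictions
augustus_species=species_A_retrained #placeholder name of your retrained Augustus model
```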