Updating Reference Gene Models - Legacy - Strategy

Alessandro Rossoni-3
Dear makers of MAKER,
first of all - thank you for this awesome program! In the context of my
project, I have been running MAKER on a set of novel genomes and it
worked very well :)

Over the last few days, I realized that the reference sequences of species
A that I have been using as the starting point for gene model prediction for
species B, C and D are a bit flawed. How do I know that? Through
visualization of novel RNA-Seq data that was mapped to the "old" gene
models of species A, I came to the conclusion that the "old" gene models
are not really accurate. In fact, between 52–62% of the reads map within
annotated exonic regions of the genome and up to 47% map within
intergenic regions. Intron/exon borders are pretty messed up and there
are a lot more transcribed sequences in species A than previously
thought.

Hence, I would love to update the gene models of species A including the
new RNA-Seq evidence and hope to get more accurate gene models out of
it. The more accurate gene models of species A would be then used to
predict genes for species B, C and D (which are hopefully going to
benefit from the more accurate input). However, I am having a little
trouble understanding how to set the parameters in maker_opts.ctl.

My plan is to produce new RNA-Seq based gene models for species A using
Cufflinks, Stringtie, Breaker, Trinity and Velvet. I would pass the
output to the maker_opts.ctl as:

model_gff=Cufflinks.gff,Stringtie.gff,Breaker.gff,Trinity.gff,Velvet.gff  #the new gene models
est=Species_A_reference_ests.fasta  #the old/flawed gene models
protein=swissprot.fasta

Question 1: is this correct so far?

But what sequences do I use for training the ab-initio predictors?
snaphmm=
augustus_species=

Question 2: Do I use the "old/flawed" sequences that I know are not
really good? I am not sure what sequences to use for training.

Any help on this issue would be amazing!
Best,
Ale


--
Alessandro W. Rossoni, M.Sc.
Institute for Plant Biochemistry
Heinrich-Heine-University

--
http:///www.plant-biochemistry.hhu.de/
E-Mail:  [hidden email]

_______________________________________________
maker-devel mailing list
[hidden email]
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
Re: Updating Reference Gene Models - Legacy - Strategy

Carson Holt-2
If the old models are poor, then I suggest you do new training using BUSCO, CEGMA, or the est2genome or protein2genome options within MAKER.



model_gff is for existing gene models you want to keep. So none of these should go there: Cufflinks.gff, Stringtie.gff, Breaker.gff, Trinity.gff, Velvet.gff

model_gff will always make it into the final annotation set even without any evidence support. By putting those files there, you are basically turning every feature in each of those files into a final gene model no matter how bad it is.

Also, if the original models are poor, don’t put them there either. You can do reciprocal best BLAST hits between the final models and the old models to see how they match each other in the end. It will take a little data processing to make it work though.
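
As a rough illustration of that data processing step, here is a minimal Python sketch of finding reciprocal best hits. It assumes you have already run BLAST in both directions (new models vs. old, and old vs. new) with tabular output (-outfmt 6), where column 1 is the query ID, column 2 the subject ID and column 12 the bitscore. The function names and the sequence IDs in the demo are made up for illustration, and "best" is taken here to mean highest bitscore, which is only one possible choice.

```python
def best_hits(blast_lines):
    """Map each query ID to the subject ID of its highest-bitscore hit.

    Expects lines in BLAST tabular format (-outfmt 6): tab-separated,
    query ID in column 1, subject ID in column 2, bitscore in column 12.
    """
    best = {}
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject, bitscore = fields[0], fields[1], float(fields[11])
        # Keep the first hit seen, replace it only on a strictly better score.
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(new_vs_old, old_vs_new):
    """Return sorted (new_id, old_id) pairs that are each other's best hit."""
    forward = best_hits(new_vs_old)
    reverse = best_hits(old_vs_new)
    return sorted(
        (new_id, old_id)
        for new_id, old_id in forward.items()
        if reverse.get(old_id) == new_id
    )


# Tiny made-up demo: new2's best hit is old2, but old2's best hit is new1,
# so only the new1/old1 pair is reciprocal.
demo_new_vs_old = [
    "new1\told1\t99.0\t300\t1\t0\t1\t300\t1\t300\t1e-100\t550",
    "new1\told2\t80.0\t250\t50\t0\t1\t250\t1\t250\t1e-40\t200",
    "new2\told2\t90.0\t280\t28\t0\t1\t280\t1\t280\t1e-80\t400",
]
demo_old_vs_new = [
    "old1\tnew1\t99.0\t300\t1\t0\t1\t300\t1\t300\t1e-100\t550",
    "old2\tnew1\t95.0\t290\t14\t0\t1\t290\t1\t290\t1e-90\t450",
    "old2\tnew2\t90.0\t280\t28\t0\t1\t280\t1\t280\t1e-80\t400",
]
print(reciprocal_best_hits(demo_new_vs_old, demo_old_vs_new))
```

For real data you would pass the lines of the two BLAST output files instead of the demo lists. Old models with no reciprocal partner are the ones worth looking at by hand.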


For all transcript based files, you should provide those to est_gff since they are evidence alignments and not model predictions. For Breaker.gff, that should be pred_gff since it is a prediction model.

With Trinity, I suggest you provide the fasta file rather than a GFF3 and allow MAKER to align and filter things. The problem with using GFF3 is you are basically short circuiting upstream prioritization and filtering, saying “take this evidence as is.” Also, providing the same evidence from multiple sources is a bad idea. By purposely making the evidence dataset more noisy, you are forcing lower accuracy.

My suggestion would be not to use Cufflinks (it will introduce a very high false positive rate). Provide Trinity input as fasta (also make sure the jaccard_clip option was used when assembling). And you will have to manually review models with and without Stringtie data to see if it hurts more than it helps.

Provide Breaker.gff to pred_gff, but still allow MAKER to run Augustus itself internally (otherwise you won’t be able to use protein evidence as hints).
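
Putting the points above together, the relevant maker_opts.ctl entries might look roughly like the fragment below. This is only a sketch, not a tested configuration: Trinity.fasta, species_A.hmm and the augustus_species value are placeholder names for files and training results you would produce yourself, and whether Stringtie.gff stays in est_gff depends on the manual review.

```
#-----sketch of maker_opts.ctl entries (file and species names are placeholders)
est=Trinity.fasta            #Trinity assembly as fasta, so MAKER aligns and filters it itself
est_gff=Stringtie.gff        #pre-aligned transcript evidence; compare runs with and without it
protein=swissprot.fasta      #protein evidence
pred_gff=Breaker.gff         #external gene predictions
model_gff=                   #left empty: anything listed here is kept as a final model
snaphmm=species_A.hmm        #placeholder for a newly trained SNAP HMM
augustus_species=species_A   #placeholder for a newly trained Augustus species, run by MAKER internally
est2genome=0                 #set to 1 only for a run that builds training models from transcripts
protein2genome=0             #set to 1 only for a run that builds training models from proteins
```

Cufflinks.gff is left out on purpose, and the old species A models are not listed under model_gff.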

Thanks,
Carson


On Feb 10, 2017, at 8:50 AM, Alessandro Rossoni <[hidden email]> wrote:
