Re: Training SNAP in absence of ESTs?


Carson Hinton Holt
Sorry for the slow reply.  I finished moving to Toronto and have yet to have my internet hooked up at home.

Based on the GFF3 you sent me, nothing should have changed with the alteration of those parameters for this region of sequence.  You will notice that, along the entire length of the contig, there are no evidence clusters without a corresponding gene.  The aligning altESTs always have a corresponding prediction on either the plus or minus strand.  Remember that for ESTs belonging to a different species there is currently no software capable of always properly handling alignment around splice sites (for regular same-species ESTs there is).  What this means is that they sometimes align to the opposite strand (amplification steps in preparing EST libraries make strandedness impossible to determine).  So when you see a lonely altEST, the gene model may actually be on the opposite strand.  Since all models were already predicted, there was nothing to add and the results stayed the same.  The only way this would have changed with the single exon parameter is if you had a barren area (both strands) and there was a nice single-exon altEST overlapping that position.

Thanks,
Carson
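
The condition described above (a single-exon altEST only matters where neither strand already carries a gene model) can be sketched in a few lines of Python. This is only an illustration of the reasoning, not MAKER's actual implementation; the `Feature` type, the coordinates, and the function names are all hypothetical.

```python
from typing import List, NamedTuple


class Feature(NamedTuple):
    start: int   # 1-based inclusive start on the contig
    end: int     # 1-based inclusive end on the contig
    strand: str  # "+" or "-"


def overlaps(a: Feature, b: Feature) -> bool:
    """True if the two features share at least one base, ignoring strand."""
    return a.start <= b.end and b.start <= a.end


def unexplained_alignments(alignments: List[Feature],
                           models: List[Feature]) -> List[Feature]:
    """Return alignments that overlap no gene model on either strand."""
    return [a for a in alignments
            if not any(overlaps(a, m) for m in models)]


# Made-up coordinates, purely for illustration.
models = [Feature(100, 900, "+"), Feature(1500, 2400, "-")]
altests = [
    Feature(200, 600, "-"),    # looks lonely on minus, but a plus-strand model covers it
    Feature(1600, 2000, "-"),  # same strand as an existing model
    Feature(3000, 3400, "+"),  # barren on both strands
]
candidates = unexplained_alignments(altests, models)
print(candidates)
```

Only the third alignment sits in a region that is empty on both strands, so it is the only one for which a single-exon setting could plausibly add a new model.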


On 7/4/11 10:43 PM, "Khan, Anar" <Anar.Khan@...> wrote:

Hi Carson
 
Here is a gff3 file in case you’re interested. (Please keep this file to yourself if you don’t mind.) There’s no need for you to manually check the file because, as you can see there are some single exon predictions...
 
Thanks for your help :-)
 
Cheers
Anar
 
 
 

From: Carson Holt [[hidden email]]
Sent: Tuesday, 5 July 2011 2:45 p.m.
To: Khan, Anar; maker-devel@...
Subject: Re: [maker-devel] Training SNAP in absense of ESTs?

I would have to physically look at the evidence alignments in Apollo.  It may just be that the single-exon altESTs are not aligning well to an open reading frame.  I can usually review the alignments manually and reconstruct the logic of how MAKER arrived at a given conclusion.  So send me an example GFF3 file :-)

Thanks,
Carson


On 6/29/11 3:45 PM, "Khan, Anar" <Anar.Khan@...> wrote:
Hi
 
I reran MAKER in my original analysis directory, this time changing 2 parameters in the options control file:
 
single_exon=1
single_length=250
 
and to my surprise I obtained the same results as for the previous run (single_exon=0), checked via counting the total number of maker predictions in gff files and checking output from fathom -gene-stats:
 
fathom genome.ann genome.dna -gene-stats
318 sequences
0.521223 avg GC fraction (min=0.463702 max=0.581287)
393 genes (plus=194 minus=199)
0 (0.000000) single-exon
393 (1.000000) multi-exon
341.366943 mean exon (min=3 max=4233)
91.340637 mean intron (min=4 max=1152)
 
As you can see, fathom tells me no single exon predictions were identified.
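
When comparing runs, the counts can be pulled out of the report programmatically rather than by eye. The sketch below parses text in the layout shown above; it assumes the report keeps exactly that layout, and the function name `parse_gene_stats` is made up for this example.

```python
import re
from typing import Dict

# The report text quoted above, copied verbatim.
SAMPLE = """\
318 sequences
0.521223 avg GC fraction (min=0.463702 max=0.581287)
393 genes (plus=194 minus=199)
0 (0.000000) single-exon
393 (1.000000) multi-exon
341.366943 mean exon (min=3 max=4233)
91.340637 mean intron (min=4 max=1152)
"""

# An integer count, an optional "(fraction)", then one of the labels we want.
COUNT_LINE = re.compile(
    r"^(\d+) (?:\([\d.]+\) )?(sequences|genes|single-exon|multi-exon)\b"
)


def parse_gene_stats(text: str) -> Dict[str, int]:
    """Extract the integer counts from a gene-stats style report."""
    counts: Dict[str, int] = {}
    for line in text.splitlines():
        match = COUNT_LINE.match(line.strip())
        if match:
            counts[match.group(2)] = int(match.group(1))
    return counts


stats = parse_gene_stats(SAMPLE)
print(stats)
```

Running the same function on the report from each MAKER run makes it easy to confirm whether the single-exon count really is unchanged.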
 
I’ve checked my maker_opts file to confirm I truly changed the single_* parameters!
 
Are there other parameters which interact with single_exon? Might I have done something silly?
 
I’ve attached my options control file.
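
A quick way to double-check which values a run would pick up is to parse the control file directly. This is a minimal sketch that assumes the file consists of `key=value` lines with optional trailing `#` comments; the function name `read_ctl` and the comment text in the example are hypothetical.

```python
from typing import Dict


def read_ctl(text: str) -> Dict[str, str]:
    """Parse key=value lines, dropping '#' comments and blank lines."""
    opts: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # discard any trailing comment
        if "=" in line:
            key, value = line.split("=", 1)
            opts[key.strip()] = value.strip()
    return opts


# Stand-in for the relevant part of the options control file.
example = """\
single_exon=1 #illustrative comment
single_length=250 #illustrative comment
"""
opts = read_ctl(example)
print(opts["single_exon"], opts["single_length"])
```

In practice the text would come from reading the options file in the analysis directory, for example with `open(path).read()`.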
 
Cheers!
Anar
 
 

From: maker-devel-bounces@... [[hidden email]] On Behalf Of Khan, Anar
Sent: Wednesday, 22 June 2011 10:03 a.m.
To: Carson Holt; maker-devel@...
Subject: Re: [maker-devel] Training SNAP in absense of ESTs?

Hi Carson
 
Thanks very much for your advice, on both this post and my last one re: single_exon. I will rerun MAKER using (1) single_exon and (2) retraining SNAP and hope for a positive effect on the predictions (:
 
Cheers
Anar
 

From: Carson Holt [[hidden email]]
Sent: Wednesday, 22 June 2011 3:44 a.m.
To: Khan, Anar; maker-devel@...
Subject: Re: [maker-devel] Training SNAP in absense of ESTs?

393 is OK for a fungal species. More is always better, but this is comparable to the number you get when training SNAP with CEGMA.  I have data for an upcoming MAKER2 paper showing that SNAP, Augustus, and GeneMark perform as well inside of MAKER2 using completely incorrect species parameter files as they do alone using highly optimized parameter files for C. elegans, D. melanogaster, and A. thaliana (they perform horribly when run alone using the incorrect file).  This means that even with the wrong file they will perform very well inside of MAKER2 as a result of “hints” from the evidence alignments from ESTs and proteins.  Of course, they perform even better when using correct parameter files inside of MAKER2.

Thanks,
Carson

On 6/8/11 6:14 PM, "Khan, Anar" <Anar.Khan@...> wrote:
Hi
 
I don’t have EST data for my fungal species of interest, and I’m currently using EST contigs from a closely related species (86% identity in aligned transcript regions) as alternative EST evidence (altest) and SwissProt as protein evidence. I’m also using parameter files from a (different/less) related species for SNAP, Augustus and FGENESH. I’d like to use the bootstrapping procedure described in the SNAP paper (Korf, BMC Bioinformatics, 2004). I thought the best approach would be to run MAKER using all of the inputs listed above i.e. generate the best predictions possible or at least use all info available, then use the results to retrain SNAP. On running maker2zff on the output, fathom -gene-stats gives me:
 
<a few models with errors detected>
318 sequences
0.521223 avg GC fraction (min=0.463702 max=0.581287)
393 genes (plus=194 minus=199)
0 (0.000000) single-exon
393 (1.000000) multi-exon
341.366943 mean exon (min=3 max=4233)
91.340637 mean intron (min=4 max=1152)
 
 Is this sufficient training data (n=393)? Would you recommend a different bootstrapping approach?
 
Any advice would be appreciated!
 
Cheers
Anar
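
For reference, the retraining round described above can be written down as an ordered list of commands. The sketch below only builds the command lines and does not run anything. It follows the commonly documented SNAP training sequence (`maker2zff`, `fathom`, `forge`, `hmm-assembler.pl`); the 1000 bp flank, the intermediate file names, and the function name are assumptions to check against the installed versions.

```python
from typing import List


def snap_training_commands(gff3: str, label: str,
                           flank: int = 1000) -> List[List[str]]:
    """Build, in order, the commands for one SNAP retraining round."""
    return [
        # Convert MAKER GFF3 to ZFF; assumed to write genome.ann and genome.dna.
        ["maker2zff", gff3],
        # Summarise the candidate training set.
        ["fathom", "genome.ann", "genome.dna", "-gene-stats"],
        # Split genes into categories, keeping `flank` bp on each side.
        ["fathom", "genome.ann", "genome.dna", "-categorize", str(flank)],
        # Export the unique genes, all converted to the plus strand.
        ["fathom", "uni.ann", "uni.dna", "-export", str(flank), "-plus"],
        # Estimate model parameters from the exported genes.
        ["forge", "export.ann", "export.dna"],
        # Assemble the HMM; the result is written to standard output.
        ["hmm-assembler.pl", label, "."],
    ]


# "maker_output.all.gff" and "myfungus" are placeholder names.
cmds = snap_training_commands("maker_output.all.gff", "myfungus")
for cmd in cmds:
    print(" ".join(cmd))
```

The output of the last command would be redirected to a `.hmm` file and supplied to MAKER for the next round.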
 
 
  




Attention: The information contained in this message and/or attachments from AgResearch Limited is intended only for the persons or entities to which it is addressed and may contain confidential and/or privileged material. Any review, retransmission, dissemination or other use of, or taking of any action in reliance upon, this information by persons or entities other than the intended recipients is prohibited by AgResearch Limited. If you have received this message in error, please notify the sender immediately.





Carson Holt
Graduate Student
Yandell Lab
http://www.yandell-lab.org/
Eccles Institute of Human Genetics
University of Utah


_______________________________________________
maker-devel mailing list
[hidden email]
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

Khan, Anar
Hi Carson

 

Thanks for taking the time to look at my example and for explaining what is going on. I had misunderstood the way that single_exon works.

 

Cheers

Anar

 

From: Carson Holt [mailto:[hidden email]]
Sent: Tuesday, 26 July 2011 1:54 a.m.
To: Khan, Anar
Cc: MAKER
Subject: Re: [maker-devel] Training SNAP in absense of ESTs?

 
