training of gene finders using whole assembly or longest contigs?

training of gene finders using whole assembly or longest contigs?

Quanwei Zhang
Hello:

I am training the gene finders on the whole assembly, but it is very time-consuming, and I have to repeat the training process several times. Even running on 25 nodes of a server, training may still take 3 weeks or more. I wonder how you train SNAP: do you use the whole assembly, or only the longest contigs? If I use only the longest contigs (say, the top 20% by length), will the result be as good as what I would get from the whole assembly? Or should I randomly select 20% of the contigs, so that the training set has a length distribution similar to that of the whole assembly?

Thanks

Best
Quanwei
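
For reference, the two subsetting strategies asked about above (the top 20% longest contigs versus a random 20% of contigs) can be sketched in a few lines of Python. This is only a minimal illustration of how the two contig lists could be built from a FASTA file; the function names and the toy input are hypothetical, it is not part of MAKER or SNAP, and it says nothing about which subset trains better.

```python
import random


def read_fasta_lengths(lines):
    """Return {contig_id: sequence_length} from an iterable of FASTA lines."""
    lengths = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the ID.
            current = line[1:].split()[0]
            lengths[current] = 0
        elif current is not None:
            lengths[current] += len(line)
    return lengths


def subset_size(n_contigs, fraction):
    """Number of contigs to keep; always at least one."""
    return max(1, round(n_contigs * fraction))


def longest_fraction(lengths, fraction=0.2):
    """IDs of the longest contigs, longest first (ties broken by ID)."""
    ranked = sorted(lengths, key=lambda c: (-lengths[c], c))
    return ranked[:subset_size(len(lengths), fraction)]


def random_fraction(lengths, fraction=0.2, seed=0):
    """IDs of a uniformly random sample of contigs; reproducible via seed."""
    rng = random.Random(seed)
    ids = sorted(lengths)  # fixed order so the seed fully determines the sample
    return sorted(rng.sample(ids, subset_size(len(ids), fraction)))


if __name__ == "__main__":
    # Toy input; in practice, pass an open file handle for the assembly FASTA.
    toy = [">c1 first", "ACGT", "ACGT", ">c2", "AC", ">c3", "ACGTACGTACGT",
           ">c4", "A", ">c5", "ACGTAC"]
    contig_lengths = read_fasta_lengths(toy)
    print("longest:", longest_fraction(contig_lengths, 0.4))
    print("random: ", random_fraction(contig_lengths, 0.4, seed=1))
```

Either ID list could then be used to extract the matching sequences into a smaller FASTA file for a training run. A random sample keeps roughly the same length distribution as the whole assembly, while the longest-contig subset is biased toward long sequences by construction.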


_______________________________________________
maker-devel mailing list
[hidden email]
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

Re: training of gene finders using whole assembly or longest contigs?

Carson Holt-2
Example of training here —> http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014#Training_ab_initio_Gene_Predictors

You can also search the devel mailing list archives here —> https://groups.google.com/forum/#!forum/maker-devel

There are many threads that go into detail on training. Note that more than 2 rounds of training are not beneficial and can actually make performance worse (there is an overtraining paradox).

—Carson


On Feb 10, 2017, at 9:03 AM, Quanwei Zhang <[hidden email]> wrote:



_______________________________________________
maker-devel mailing list
[hidden email]
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
Loading...