In addition to training, the evidence alignments can also affect the gene models. If the evidence tends to be sparse or fragmented, the resulting gene models can be fragmented too. To compensate for fragmented evidence, increase the pred_flank value in the maker_opt.ctl file. This makes evidence cluster together better even when there are larger gaps between alignments.
RepeatMasker handles repetitive elements for MAKER and helps keep repetitive regions from falsely becoming part of a gene, but MAKER can override the masking if there is very strong spliced EST or protein homology evidence suggesting that the repetitive region is in fact part of the gene. If you are worried about losing real genes that contain repeats, you can set unmask to 1 in the maker_opt.ctl file. This allows SNAP and Augustus gene models based on completely unmasked sequence to be considered on equal footing as alternatives to the models from the masked genome. I almost always set unmask to 1, at least to see if it improves the gene models.
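If you end up changing pred_flank and unmask between rounds, a small script can save some hand editing. The following is only a rough sketch, not part of MAKER: set_ctl_options is a hypothetical helper, it assumes the control file uses simple "key=value #comment" lines, and the value 500 for pred_flank is just an illustrative number, not a recommendation.

```python
import re

def set_ctl_options(ctl_text, updates):
    """Return ctl_text with matching `key=value` lines updated.

    Trailing `#comment` text on each line is preserved. Assumes the
    simple `key=value #comment` layout (an assumption about the file).
    """
    remaining = dict(updates)
    out = []
    for line in ctl_text.splitlines():
        m = re.match(r"^(\w+)=([^#]*?)(\s*#.*)?$", line)
        if m and m.group(1) in remaining:
            key = m.group(1)
            comment = m.group(3) or ""
            line = f"{key}={remaining.pop(key)}{comment}"
        out.append(line)
    if remaining:
        # Fail loudly rather than silently ignore a misspelled option.
        raise KeyError(f"options not found in control file: {sorted(remaining)}")
    return "\n".join(out) + "\n"

# Illustrative excerpt only; a real maker_opt.ctl has many more options.
example = (
    "#-----MAKER Behavior Options\n"
    "pred_flank=200 #flank for extending evidence clusters\n"
    "unmask=0 #also run ab-initio prediction on unmasked sequence\n"
)
updated = set_ctl_options(example, {"pred_flank": 500, "unmask": 1})
print(updated)
```

To use it on a real file, read maker_opt.ctl into a string, pass it through the function, and write the result back (keeping a copy of the original is a good idea).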
What I usually do when annotating a genome is manually view several contigs after MAKER finishes, and then decide which parameters to tweak or whether I need to retrain the prediction algorithms. I usually go through 3-4 rounds of annotation before I am satisfied with the set. You can do this on a subset of the genome as well (e.g., about 10-20 megabases). Repeat annotation is relatively fast because MAKER is able to reuse data from previous runs rather than rerunning everything.
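For the subset approach, one way to pull out a test set of roughly that size is sketched below. This is my own illustration and not a MAKER tool: read_fasta and pick_subset are hypothetical helpers, and taking the longest contigs first is just one possible choice (it biases the subset toward large contigs, so a random pick may suit some assemblies better).

```python
import io

def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)

def pick_subset(records, target_bp):
    """Take contigs, longest first, until at least target_bp bases are collected."""
    chosen, total = [], 0
    for header, seq in sorted(records, key=lambda r: len(r[1]), reverse=True):
        if total >= target_bp:
            break
        chosen.append((header, seq))
        total += len(seq)
    return chosen, total

# Tiny demo; for a real genome, open the FASTA file and use something
# like target_bp=15_000_000 to land in the 10-20 megabase range.
demo = io.StringIO(">c1\nACGT\nAC\n>c2\nACGTACGTAC\n>c3\nAC\n")
subset, total = pick_subset(read_fasta(demo), target_bp=12)
for header, seq in subset:
    print(f">{header}\n{seq}")
```

The selected records can then be written to a new FASTA file and used as the genome input for the test rounds.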
On 8/25/10 3:36 AM, "Martin Kapun" <capoony@...> wrote:
maker-devel mailing list