
Re: non-M gene models


Carson Holt
Maybe. Those two options can produce a lot of partial models. Setting always_complete=1 will also help somewhat.

Models without M at the start are generally partial models. There is often something about the contig that keeps it from being a whole model (a single-base-pair error that breaks the ORF or a splice site, or a run of Ns that overlaps part of an exon). You can also try identifying InterPro domains and dropping any model without a defined domain (i.e., if it's going to be partial, at least make sure it's useful in its partial form).
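
The domain-based filtering suggested above could be sketched roughly as follows. This is only an illustration, not part of MAKER: it assumes you already have a tab-separated domain report whose first column is the protein ID (as in InterProScan TSV output), and the function names `read_fasta`, `ids_with_domains` and `keep_model` are hypothetical.

```python
def read_fasta(lines):
    """Yield (id, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # The ID is the first whitespace-delimited token of the header.
            name, chunks = (line[1:].split() or [""])[0], []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def ids_with_domains(tsv_lines):
    """Collect protein IDs from column 1 of a tab-separated domain report.

    Assumption: every non-empty row describes one domain hit and its
    first column is the protein ID.
    """
    return {line.rstrip("\n").split("\t")[0] for line in tsv_lines if line.strip()}


def keep_model(model_id, seq, domain_ids):
    """Keep M-initial models; keep non-M (likely partial) ones only with a domain hit."""
    return seq.startswith("M") or model_id in domain_ids
```

With real files, you would pass open file handles to `read_fasta` and `ids_with_domains`, then write out only the records for which `keep_model` returns True.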

—Carson



On Mar 29, 2017, at 4:23 AM, Dario Copetti <[hidden email]> wrote:

Looking at the config file again I notice this:
est2genome=1 #infer gene predictions directly from ESTs, 1 = yes, 0 = no
protein2genome=1 #infer predictions from protein homology, 1 = yes, 0 = no

I usually turn them on only to get models from ESTs to train Augustus and SNAP. Do you think that leaving these parameters on during the final annotation produces the non-M models?
If so, do you think that re-running MAKER with them turned off, using the MAKER-derived gff3, will clean out these models?

Can you elaborate a bit more on the usage of these two parameters?
Thanks,

Dario


On 3/29/2017 12:07 PM, Dario Copetti wrote:

Hi Carson,

We are ready to submit several different sets of annotations, but we are now stuck on models whose protein sequences do not start with Met, and NCBI is picky about that.
Below I paste an example from a genome we are working on: as you can see, most (95%) of the models start with M, but a significant number (almost 1500 models!) do not.

We used MAKER 2.31.8, specifying the option to produce only models that start with M. We realize that this issue may not be easy to fix, and also that there are indeed isoforms that do not start with M, but how would you fix this? Any help, within or outside MAKER, will be appreciated.

Some time ago, Josh and Sharon (cc'd) fixed the models by having the CDS start at the first M that was in frame with the exon, and wrote a script for that.
Is this issue maybe fixed in a newer version of MAKER? How else would you fix it, or deal with the NCBI genomes people?
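
For illustration, the "start the CDS at the first in-frame M" idea can be sketched at the protein level as below. This is not the script written by Josh and Sharon, which is not shown in this thread; `trim_to_first_met` is a hypothetical helper, and it assumes the protein was translated from the CDS starting in phase 0, so that each residue skipped corresponds to three nucleotides.

```python
def trim_to_first_met(protein):
    """Trim a protein to its first methionine.

    Returns (trimmed_protein, cds_offset_nt), where cds_offset_nt is the
    number of nucleotides to drop from the 5' end of the CDS, assuming the
    translation began in phase 0. Returns (None, None) if there is no M.
    """
    idx = protein.find("M")
    if idx == -1:
        return None, None
    return protein[idx:], 3 * idx
```

The offset would then have to be applied to the CDS coordinates in the gff3, taking strand and exon boundaries into account, which this sketch does not attempt. Note also that the first internal M is not necessarily the true start codon, so the result may still be an N-terminally truncated model.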
Thanks,

Dario


grep -A1 ">" maker_proteins_161026.fasta | grep -v ">" | grep -v "\-\-" | cut -c1 | sort | uniq -c
    106 A
     33 C
     69 D
     88 E
     53 F
     94 G
     34 H
     86 I
     77 K
    144 L
  28245 M
     58 N
     72 P
     44 Q
     95 R
    142 S
     80 T
    114 V
     29 W
      6 X
     53 Y
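
A rough Python equivalent of the one-liner above is sketched here for cases where the FASTA sequences are wrapped over several lines or contain blank lines. The function name `first_residue_counts` is made up for this example.

```python
from collections import Counter


def first_residue_counts(lines):
    """Count the first residue of every record in FASTA-formatted lines."""
    counts = Counter()
    awaiting_first = False
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # New record: the next non-empty line holds its first residue.
            awaiting_first = True
        elif line and awaiting_first:
            counts[line[0]] += 1
            awaiting_first = False
    return counts


def non_met_total(counts):
    """Number of records whose first residue is not M."""
    return sum(counts.values()) - counts["M"]
```

Applied to the tallies above, the non-M counts add up to 1477 of 29722 models, which matches the "almost 1500" and roughly 95% figures in the message.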




-- 
Dario Copetti, PhD
Research Associate | Arizona Genomics Institute
University of Arizona | BIO5

1657 E. Helen St.
Tucson, AZ  85721, USA
www.genome.arizona.edu


_______________________________________________
maker-devel mailing list
[hidden email]
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org