maker decisions


maker decisions

Reith, Michael

Hi Carson,

 

I’ve been using Maker for genome annotation of a small eukaryotic protist.  I’ve run into some cases where Maker seems to fuse together 2 genes that snap and augustus call as separate.  Here are the relevant parts of the gff for one example (I’ve only included the snap results – augustus lines are essentially the same; ID parts of the lines are stripped out for clarity):

 

c1830-c780_1  maker  gene        7298  9722  .        -  .
c1830-c780_1  maker  mRNA        7298  9722  .        -  .
c1830-c780_1  maker  exon        9623  9722  16.291   -  .
c1830-c780_1  maker  exon        9314  9523  88.349   -  .
c1830-c780_1  maker  exon        8937  9206  131.566  -  .
c1830-c780_1  maker  exon        8086  8177  32.877   -  .
c1830-c780_1  maker  exon        7940  7993  23.355   -  .
c1830-c780_1  maker  exon        7298  7822  255.601  -  .
c1830-c780_1  maker  CDS         7298  7822  .        -  0
c1830-c780_1  maker  CDS         7940  7993  .        -  0
c1830-c780_1  maker  CDS         8086  8177  .        -  2
c1830-c780_1  maker  CDS         8937  9206  .        -  2
c1830-c780_1  maker  CDS         9314  9523  .        -  2
c1830-c780_1  maker  CDS         9623  9644  .        -  0

c1830-c780_1  snap   match       7298  8175  182.682  -  .
c1830-c780_1  snap   match_part  8086  8175  21.497   -  .
c1830-c780_1  snap   match_part  7940  7993  13.236   -  .
c1830-c780_1  snap   match_part  7298  7822  147.949  -  .
c1830-c780_1  snap   match       8652  9644  156.573  -  .
c1830-c780_1  snap   match_part  9623  9644  16.080   -  .
c1830-c780_1  snap   match_part  9314  9523  45.529   -  .
c1830-c780_1  snap   match_part  8937  9206  76.168   -  .
c1830-c780_1  snap   match_part  8808  8826  5.400    -  .
c1830-c780_1  snap   match_part  8652  8709  13.396   -  .

 

As you can see, maker calls 1 gene with 6 exons, while snap and augustus call 2 genes with 3 and 5 exons.  By Blast, it’s clear that the snap/augustus calls are more accurate.  Maker has been run multiple times on this set of data, with various parameter tweaks, so one thought I had was that this may be a carryover from an earlier run where the gene models weren’t so good.  This run used est2genome, snap & pred_gff (to input the augustus data, since we’ve got that set up on a separate machine) as predictors.  The other thought was that est2genome was causing these to be fused because there are a bunch of ESTs in this region, though I don’t see one that overlaps both genes.  I’m wondering whether rerunning maker without est2genome might correct things.  Any thoughts?
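For anyone wanting to reproduce the exon counts quoted above, here is a minimal Python sketch. It assumes the stripped-down eight-column lines shown in this post (no ID/Parent attributes), so match_part lines are assigned to a match purely by coordinate containment; the function names parse_gff and parts_per_match are made up for this illustration and are not part of MAKER.

```python
# Coordinates copied from the excerpt in this post (exon and match lines only).
GFF_TEXT = """\
c1830-c780_1 maker exon 9623 9722 16.291 - .
c1830-c780_1 maker exon 9314 9523 88.349 - .
c1830-c780_1 maker exon 8937 9206 131.566 - .
c1830-c780_1 maker exon 8086 8177 32.877 - .
c1830-c780_1 maker exon 7940 7993 23.355 - .
c1830-c780_1 maker exon 7298 7822 255.601 - .
c1830-c780_1 snap match 7298 8175 182.682 - .
c1830-c780_1 snap match_part 8086 8175 21.497 - .
c1830-c780_1 snap match_part 7940 7993 13.236 - .
c1830-c780_1 snap match_part 7298 7822 147.949 - .
c1830-c780_1 snap match 8652 9644 156.573 - .
c1830-c780_1 snap match_part 9623 9644 16.080 - .
c1830-c780_1 snap match_part 9314 9523 45.529 - .
c1830-c780_1 snap match_part 8937 9206 76.168 - .
c1830-c780_1 snap match_part 8808 8826 5.400 - .
c1830-c780_1 snap match_part 8652 8709 13.396 - .
"""


def parse_gff(text):
    """Return (source, type, start, end) tuples from whitespace-separated lines."""
    features = []
    for line in text.strip().splitlines():
        fields = line.split()
        source, ftype = fields[1], fields[2]
        start, end = int(fields[3]), int(fields[4])
        features.append((source, ftype, start, end))
    return features


def parts_per_match(features, source):
    """Count match_part features lying inside each match span of one source."""
    matches = [(s, e) for src, t, s, e in features
               if src == source and t == "match"]
    counts = {m: 0 for m in matches}
    for src, t, s, e in features:
        if src == source and t == "match_part":
            for ms, me in matches:
                # Containment stands in for the Parent attribute stripped above.
                if ms <= s and e <= me:
                    counts[(ms, me)] += 1
    return counts


features = parse_gff(GFF_TEXT)
maker_exons = sum(1 for src, t, _, _ in features
                  if src == "maker" and t == "exon")
snap_counts = parts_per_match(features, "snap")
print("maker exons:", maker_exons)
print("snap parts per match:", snap_counts)
```

With the full GFF3 (ninth column intact) the Parent attribute would be the more reliable way to group features; containment is only a workaround for the trimmed lines.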

 

Thanks,

Mike

---------------------------------------------------------

Michael Reith

Principal Research Officer

Functional Genomics Group Leader

NRC Institute for Marine Biosciences

1411 Oxford St.

Halifax, N.S.    B3H 3Z1

Canada

 

phone:  (902) 426-8276

fax:       (902) 426-9413

email:   [hidden email]

-----------------------------------------------------------

 


_______________________________________________
maker-devel mailing list
[hidden email]
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

Re: maker decisions

Carson Hinton Holt
I could tell you more with the full GFF3 file (all of the most relevant detail is in the 9th column), but basically what happens is that MAKER takes the protein and EST alignments and shares that information with SNAP and Augustus, which then make new predictions.  So SNAP or Augustus thinks the region corresponds to a single gene after being made aware of the blast alignments.  MAKER can override their updated models and revert back to the ab initio model based on evidence overlap (AED value in GFF3).  So there must be something causing these two regions to merge or cluster together during evidence comparison, so that MAKER decides the updated SNAP/Augustus model better matches the evidence.  If you are using model_gff, that could be the culprit: if a previous model bridges across the two genes, MAKER will take that model as a form of evidence that the two regions may be merged and will promote the extended model over the smaller models.  Alternatively, there is a pred_flank option in the control options that affects the edges of evidence clustering.
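To illustrate how a flank setting could join two neighbouring loci into one cluster, here is a rough Python sketch. It is not MAKER's actual clustering code: it simply pads each interval by a flank on both sides and merges whatever then overlaps. The two intervals are the SNAP match spans from the post, used as stand-ins for evidence, and the flank values 100 and 300 are arbitrary numbers chosen for the example.

```python
def cluster_with_flank(intervals, flank):
    """Merge intervals whose flank-padded spans overlap (simplified model)."""
    padded = sorted((max(1, start - flank), end + flank)
                    for start, end in intervals)
    clusters = []
    for start, end in padded:
        if clusters and start <= clusters[-1][1]:
            # Padded span reaches the previous cluster, so extend it.
            clusters[-1][1] = max(clusters[-1][1], end)
        else:
            clusters.append([start, end])
    return [tuple(c) for c in clusters]


# Stand-in evidence: the two SNAP match spans from the original post.
loci = [(7298, 8175), (8652, 9644)]

small = cluster_with_flank(loci, 100)   # hypothetical small flank
large = cluster_with_flank(loci, 300)   # hypothetical large flank
print(len(small), "clusters with flank 100:", small)
print(len(large), "clusters with flank 300:", large)
```

In this toy model the 477 bp gap between the two spans is bridged once the flank exceeds roughly half the gap, which is the kind of effect a smaller pred_flank is meant to avoid.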

Basically it’s not that the models built will change, but that the final model selected will change.  SNAP will produce 2 models, 1 ab initio and 1 blast aware model.  Augustus can also produce an ab initio and a blast aware model, but because of the way you have this set up with pred_gff, it looks like you will only get the ab initio models.  These 3 models will then be compared against the evidence, and a statistic called annotation edit distance (AED) will be calculated.  This is calculated against evidence clustered around a locus.  If the evidence groups into a single cluster, then the larger models will be selected; if there is no bridge and the evidence produces 2 clusters, then the shorter models will be selected for each cluster.
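As a rough illustration of why the cluster boundaries decide which model wins, here is a base-level AED sketch in Python. It assumes the commonly cited form AED = 1 - (SN + SP) / 2, with sensitivity and specificity measured as shared bases over evidence bases and over predicted bases; MAKER's own calculation may differ in detail. The exon coordinates come from the post, while the evidence cluster (taken here to cover only the first SNAP gene) is a hypothetical stand-in.

```python
def covered_bases(intervals):
    """Set of 1-based positions covered by inclusive (start, end) intervals."""
    bases = set()
    for start, end in intervals:
        bases.update(range(start, end + 1))
    return bases


def aed(prediction, evidence):
    """Base-level AED sketch: 1 - (sensitivity + specificity) / 2."""
    pred = covered_bases(prediction)
    evid = covered_bases(evidence)
    if not pred or not evid:
        return 1.0  # nothing to compare, treat as maximally distant
    shared = len(pred & evid)
    sensitivity = shared / len(evid)   # fraction of evidence recovered
    specificity = shared / len(pred)   # fraction of prediction supported
    return 1.0 - (sensitivity + specificity) / 2.0


# Exons of the fused MAKER model and the first SNAP gene, from the post.
fused_model = [(7298, 7822), (7940, 7993), (8086, 8177),
               (8937, 9206), (9314, 9523), (9623, 9722)]
snap_gene_a = [(7298, 7822), (7940, 7993), (8086, 8175)]

# Hypothetical evidence cluster covering only the first locus.
evidence_a = snap_gene_a

aed_short = aed(snap_gene_a, evidence_a)
aed_fused = aed(fused_model, evidence_a)
print("short model AED:", round(aed_short, 3))
print("fused model AED:", round(aed_fused, 3))
```

Against a cluster limited to one locus the short model scores better (lower AED); if the evidence instead spanned both loci as one cluster, the fused model would be the closer match.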

My first suggestion would be to decrease the value of pred_flank.  Also, est2genome only produces a model in the absence of SNAP and Augustus predictions, so I would just turn that off.  It’s best used to assist in training SNAP and Augustus, and once they are trained I turn it off; otherwise weird EST alignments can sometimes produce models when SNAP and Augustus refuse to make predictions.  This can cause weird RNA genes to be annotated as protein coding.
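If it helps, the two changes suggested above could be scripted along these lines. This is only a sketch: it assumes maker_opts.ctl uses plain key=value lines with optional trailing # comments, the helper name set_ctl_options is invented for this example, and the starting values and the new pred_flank value are placeholders rather than recommendations.

```python
def set_ctl_options(ctl_text, updates):
    """Return ctl_text with the listed key=value settings replaced.

    Trailing '#' comments are preserved; lines for other keys are untouched.
    """
    out = []
    seen = set()
    for line in ctl_text.splitlines():
        setting, hash_mark, comment = line.partition("#")
        key, equals, _old_value = setting.partition("=")
        key = key.strip()
        if equals and key in updates:
            new_line = f"{key}={updates[key]}"
            if hash_mark:
                new_line += f" #{comment}"
            out.append(new_line)
            seen.add(key)
        else:
            out.append(line)
    missing = set(updates) - seen
    if missing:
        raise KeyError(f"options not found in control file: {sorted(missing)}")
    return "\n".join(out) + "\n"


# Placeholder excerpt; the values here are illustrative, not from a real run.
example_ctl = """\
#-----Gene Prediction
est2genome=1 #placeholder current value
pred_flank=200 #placeholder current value
"""

updated = set_ctl_options(example_ctl, {"est2genome": 0, "pred_flank": 100})
print(updated)
```

For a real run, read the existing maker_opts.ctl into a string, apply the function, and write the result back before rerunning.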

If you want you can also send me the entire contig GFF3 and maker_opts.ctl file, and I might be able to suggest further optimizations.

Thanks,
Carson

On 10/6/10 7:03 AM, "Reith, Michael" <Michael.Reith@...> wrote:


Re: maker decisions

Carson Hinton Holt
In reply to this post by Reith, Michael
AED is defined in a joint paper with the Sequence Ontology project here http://www.biomedcentral.com/1471-2105/10/67 .  My use of this measurement varies slightly from the paper.  Rather than comparing one transcript to another, I treat the evidence clusters as if they were individual transcripts.  Then I calculate the distance from the evidence to the Augustus and SNAP predictions.

QI is explained in the MAKER paper here http://www.ncbi.nlm.nih.gov/pubmed/18025269 .  Each column is a statistic for different aspects of a gene model.

There is also a MAKER wiki here http://gmod.org/wiki/MAKER_Tutorial .

Thanks,
Carson


On 10/6/10 10:05 AM, "Reith, Michael" <Michael.Reith@...> wrote:

Thanks Carson. Here are the files.  Sorry about trimming off the important part, which leads to another question – are the AED and QI and other stuff at the end of the line defined in the documentation somewhere?
 
Greatly appreciate your help,
Mike
 


From: Carson Holt [[hidden email]]
Sent: Wednesday, October 06, 2010 12:11 PM
To: Reith, Michael; maker-devel@...
Subject: Re: [maker-devel] maker decisions
