Curious pattern in AED distributions

Lior Glick
Hi MAKER users,
Lately I've been performing annotations for multiple genomes from the same species.
When plotting the histogram of AED scores over all genes, I repeatedly see a very specific pattern that looks something like this:
AED_hist.png
This pattern is a bit surprising to me in two respects:
1) Why is there a surge towards 0.5?
2) Why is there a sudden drop right after that surge?

Has anyone else seen this, or is this a specific outcome of my data/configuration?
Any ideas of what may cause such a distribution?

While this is not necessarily an indication of a problem or bug, it does seem a bit odd and might imply some bias or artifact.
Would appreciate your comments.
Thank you!

_______________________________________________
maker-devel mailing list
[hidden email]
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

Re: Curious pattern in AED distributions

Mark Yandell

Hi Lior,

Fun! The short answer is I don’t know. Obviously, the good stuff is on the right side of 0.5.

That said, I can think of a couple of things to look into to explain the left side of the graph. Are you allowing single-exon genes? Are you using RNA-seq data, protein, or both? What about repeat masking? Are you doing it? Do you have your own library?

My first guess would be low-complexity/repeat sequences generating more or less random blastx hits across the genome… Carson, what do you think?

And finally, what does the AED look like for the genes included in the final build?

Sorry for all the questions, Lior. That’s your punishment for asking an interesting one. 😉

--mark

From: maker-devel <[hidden email]> on behalf of Lior Glick <[hidden email]>
Date: Sunday, April 7, 2019 at 7:26 AM
To: "[hidden email]" <[hidden email]>
Subject: [maker-devel] Curious pattern in AED distributions

Re: Curious pattern in AED distributions

Lior Glick-2
Dear Mark,
Thank you for the quick reply. I'm happy to see this ignites your interest and am willing to endure your punishing questions (;
Before I answer them, I just want to make sure we're on the same page - as far as I understand, lower AED scores indicate higher agreement with the evidence, so the "good stuff" is actually left of the 0.5 surge. Am I correct? Otherwise, this is a very poor annotation...
Now for the questions:
1) I have not done any filtering so far, so single-exon genes are included as well. In fact, I'm exploring the results in order to develop some criteria for filtering the genes. Would you suggest discarding single-exon genes?

2) My evidence consists of assembled transcripts, proteins and predicted gene models (pred_gff).

3) As for repeats, I'm masking based on a repeat library obtained from a previous publication, specific to my organism of interest.

Unfortunately, I didn't understand your final question. Could you please explain what you mean by "final build"?

Hope these answers are helpful, and waiting to hear more thoughts.

Thanks again.


Re: Curious pattern in AED distributions

Mark Yandell

Sorry. I’m dyslexic, especially early in the morning. Yes, the good stuff is on the left. As regards single-exon genes, that’s always a hard call, as these have a higher false-positive rate. One thing to consider is how prevalent introns are in your organism. Carson can give more advice on this point, I’m sure.

By "final build", I meant: is this using the ‘Standard Build’ or ‘Max Build’ protocol from PMC4286374?


Re: Curious pattern in AED distributions

Carson Holt-2
In reply to this post by Lior Glick
That’s interesting. It could be a handful of internal filters that help with spurious results.

I use a 0.5 sensitivity/specificity to identify shared edges for a Jaccardian split on overlapping evidence clusters, for example. There are also a couple of places where, if the only thing supporting a model is a single-exon blastx hit (i.e. no exonerate, ab initio model, or EST splice support, just a chunk of single-exon blastx), MAKER will use a reading-frame-aware AED value of 0.5 as a filter (as in, it checks whether the reading frame matches and not just raw overlap). If that’s the case, the spike near 0.5 may indicate I needed to be a little stricter than my empirical cutoff estimate. Perhaps 0.4 or 0.45 would be a better cutoff for these spurious blastx-induced models.

—Carson
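
For readers less familiar with the metric behind the 0.5 threshold: AED is commonly defined as 1 - (SN + SP)/2, where SN is the fraction of evidence bases covered by the model and SP is the fraction of model bases covered by the evidence. The sketch below computes this plain base-overlap version. The function names and the half-open interval representation are illustrative assumptions, not MAKER's internal code, and the reading-frame check mentioned above is not modelled.

```python
def merge(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def covered(intervals):
    """Total number of bases covered by a set of intervals."""
    return sum(end - start for start, end in merge(intervals))


def overlap(a, b):
    """Number of bases shared by two interval sets."""
    a, b = merge(a), merge(b)
    i = j = total = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            total += hi - lo
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total


def aed(model_exons, evidence_exons):
    """Base-level AED = 1 - (SN + SP) / 2 (no reading-frame awareness)."""
    model_len = covered(model_exons)
    evidence_len = covered(evidence_exons)
    if model_len == 0 or evidence_len == 0:
        return 1.0  # assumption: no model or no evidence means no agreement
    shared = overlap(model_exons, evidence_exons)
    sn = shared / evidence_len  # evidence bases recovered by the model
    sp = shared / model_len     # model bases backed by evidence
    return 1.0 - (sn + sp) / 2.0
```

As an illustration, a 100 bp model that half-overlaps a 100 bp hit has SN = SP = 0.5 and therefore an AED of 0.5, while a model that fully contains a hit half its length has SN = 1.0, SP = 0.5 and an AED of 0.25.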


Re: Curious pattern in AED distributions

Xabier Vázquez-Campos
If you train SNAP, the maker2zff script has internal quality cutoffs based on the existence of evidence; e.g., by default it requires some EST evidence.


--
Xabier Vázquez-Campos, PhD
Research Associate
NSW Systems Biology Initiative
School of Biotechnology and Biomolecular Sciences
The University of New South Wales
Sydney NSW 2052 AUSTRALIA

Re: Curious pattern in AED distributions

Carson Holt-2
Yes. maker2zff tries to further select a subset of the best supported models by requiring multiple forms of evidence support.

—Carson


Re: Curious pattern in AED distributions

Lior Glick
Hello again and thank you all for your interesting answers.
I mistakenly answered Mark yesterday from an unsubscribed address, so only he received it. For documentation's sake, I'm posting my answer here again, along with Mark's reply:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Dear Mark,
Thank you for the quick reply. I'm happy to see this ignites your interest and am willing to endure your punishing questions (;
Before I answer them, I just want to make sure we're on the same page - as far as I understand, lower AED scores indicate higher agreement with the evidence, so the "good stuff" is actually left of the 0.5 surge. Am I correct? Otherwise, this is a very poor annotation...
Now for the questions:
1) I have not done any filtering so far, so single-exon genes are included as well. In fact, I'm exploring the results in order to develop some criteria for filtering the genes. Would you suggest discarding single-exon genes?

2) My evidence consists of assembled transcripts, proteins and predicted gene models (pred_gff).

3) As for repeats, I'm masking based on a repeat library obtained from a previous publication, specific to my organism of interest.

Unfortunately, I didn't understand your final question. Could you please explain what you mean by "final build"?

Hope these answers are helpful, and waiting to hear more thoughts.

Thanks again.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 
To which Mark replied:

Sorry. I’m dyslexic, especially early in the morning. Yes, the good stuff is on the left. As regards single-exon genes, that’s always a hard call, as these have a higher false-positive rate. One thing to consider is how prevalent introns are in your organism. Carson can give more advice on this point, I’m sure.

By "final build", I meant: is this using the ‘Standard Build’ or ‘Max Build’ protocol from PMC4286374?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~  

Mark - well, as I said, I haven't done any filtering yet, so I guess my annotation currently includes genes that would be discarded even with the "max build". I'll give this a try and look at the resulting distribution.

Xabier - thanks, but I'm not using SNAP (just Augustus).

Carson - I see a few fingers pointing in the direction of single-exon models, so maybe I should see what happens to the distribution of AED when these genes are removed.

I'll get back to you with some more results.

Re: Curious pattern in AED distributions

Lior Glick
Hi again - quick update:
I made a plot comparing the histograms of single-exon genes to multi-exon genes:
newplot (5).png
It definitely looks like single-exon genes are enriched for the 0.5 score, but it does not account for the entire surge, as there also seem to be lots of multi-exon genes involved. This may suggest that the 0.5 peak is a result of multiple effects buried within the software.
Any other thoughts/suggestions?

Thanks again,
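
A comparison like the one in the plot can be reproduced roughly as follows. This is a minimal sketch that assumes a MAKER GFF3 in which each mRNA line carries an `_AED=` attribute (taken here to be reported to two decimals) and each exon line points to its transcript via `Parent=`. The function names and the default of 20 bins are illustrative choices.

```python
from collections import Counter, defaultdict


def parse_attributes(field):
    """Turn a GFF3 attribute column into a dict."""
    pairs = (item.split("=", 1) for item in field.strip().split(";") if "=" in item)
    return {key: value for key, value in pairs}


def aed_by_exon_class(gff_lines):
    """Return {'single': [AED, ...], 'multi': [AED, ...]}, one value per mRNA."""
    aed = {}
    exon_count = defaultdict(int)
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # e.g. sequence lines in an embedded FASTA section
        attrs = parse_attributes(cols[8])
        if cols[2] == "mRNA" and "_AED" in attrs and "ID" in attrs:
            aed[attrs["ID"]] = float(attrs["_AED"])
        elif cols[2] == "exon":
            for parent in attrs.get("Parent", "").split(","):
                exon_count[parent] += 1
    classes = {"single": [], "multi": []}
    for mrna_id, score in aed.items():
        # Transcripts with no exon lines are counted as single-exon here.
        key = "single" if exon_count[mrna_id] <= 1 else "multi"
        classes[key].append(score)
    return classes


def histogram(scores, n_bins=20):
    """Count scores per equal-width bin on [0, 1]; AED = 1.0 goes in the last bin."""
    # Work in integer hundredths to avoid floating-point bin-edge surprises.
    counts = Counter(
        min(round(score * 100) * n_bins // 100, n_bins - 1) for score in scores
    )
    return [counts.get(i, 0) for i in range(n_bins)]
```

Calling `histogram` on the `single` and `multi` lists separately gives two count vectors that can be plotted side by side with any plotting tool.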


Re: Curious pattern in AED distributions

Carson Holt-2
One note: when I say single-exon blastx hit, I mean that the evidence is single-exon, not that the gene model is single-exon. What I think you are seeing is an effect that seems to be partially related to under-masking, i.e. a spurious partial blastx alignment to a low-complexity repeat (which is why the blastx protein alignment refuses to polish with exonerate). That is why the filter was added.

So if a model (single- or multi-exon) has no additional ab initio prediction support, no EST support, and no exonerate-polished protein support, but does have a single-exon/single-HSP blastx overlap, it gets filtered out at 0.5. That threshold is based on trial and error on a couple of genomes where we saw this occur, but your graph suggests the filter might be too loose and 0.4 or 0.45 might be a better value.

So the spike is caused by poor blastx hits and under-masking (this may be explained if you are using pred_gff models that were generated on an unmasked assembly outside of MAKER), and the drop around 0.5 is caused by MAKER filtering out models supported only by what appear to be spurious blastx alignments.

—Carson
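
The filter described above can be paraphrased as a small decision rule. This is only a reading of the description in code form: the `Support` fields, the function name, and the choice to drop models whose frame-aware AED is strictly above the threshold (rather than at or above it) are all assumptions, not MAKER's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class Support:
    """Hypothetical summary of the evidence behind one gene model."""
    ab_initio: bool          # overlapping ab initio prediction
    est: bool                # EST / transcript support
    exonerate_protein: bool  # exonerate-polished protein alignment
    single_hsp_blastx: bool  # overlap with a single-exon/single-HSP blastx hit
    frame_aware_aed: float   # AED computed with the reading-frame check


def is_filtered(support, threshold=0.5):
    """True if the model would be dropped under the rule as described."""
    only_blastx = support.single_hsp_blastx and not (
        support.ab_initio or support.est or support.exonerate_protein
    )
    # Assumption: models are dropped when AED is strictly above the threshold.
    return only_blastx and support.frame_aware_aed > threshold
```

Under this reading, a blastx-only model with a frame-aware AED of 0.48 survives the 0.5 threshold but would be removed at 0.45, which is the kind of model that would pile up just below 0.5 in the histogram.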


Re: Curious pattern in AED distributions

Carson Holt-2
In reply to this post by Lior Glick
Also try adding two-exon models to the graph. It would be interesting to see if these are attempted single-exon models where the predictor added a micro-intron to keep the open reading frame going against a single-exon blastx hint.

—Carson
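
One way to check for this is to measure the intron of every two-exon model and flag those where it is very short. The sketch below works on exon coordinates already grouped per transcript. The 30 bp cutoff for a "micro-intron" and the function names are assumptions chosen for illustration, not values taken from MAKER or from any predictor.

```python
def intron_lengths(exons):
    """Intron lengths for one transcript, given 1-based inclusive exon coordinates."""
    ordered = sorted(exons)
    return [nxt[0] - cur[1] - 1 for cur, nxt in zip(ordered, ordered[1:])]


def micro_intron_models(exons_by_mrna, max_intron=30):
    """IDs of two-exon models whose single intron is at most max_intron bp."""
    flagged = []
    for mrna_id, exons in exons_by_mrna.items():
        if len(exons) != 2:
            continue
        (length,) = intron_lengths(exons)
        if length <= max_intron:
            flagged.append(mrna_id)
    return flagged
```

Plotting the AED values of the flagged models next to those of the single-exon models would show whether they contribute to the same peak.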

