Help debugging a MAKER result

Lior Glick
Hi MAKER users,
I am new to Maker and have just finished running my first annotations. Although the results make sense in general, I have reasons to suspect some gene models are wrong and would like your help in understanding and optimizing the results.
My research project involves the annotation of multiple tomato varieties (individuals) which are a bit different from the published reference genome. To this end, I created de novo assemblies of these genomes and also generated an evidence set to be used as input for Maker. The evidence consists of a large set of transcripts from various tomato varieties and conditions, as well as full protein sets from 6 plant species, including the proteins derived from the annotation of the reference, called ITAG.
For an initial QA, I tried annotating the reference genome using my evidence data and Augustus as gene predictor. This should allow me to compare my result to the ITAG annotation, which I assume to be the "correct" answer, and see how well I'm doing. I should mention that ITAG annotation was also created using Maker, followed by manual curation.
I started by comparing the protein sets from my result and the ITAG set. Specifically, I ran an all-vs-all BLAST and took the top hits. I discovered that only about 70% of the ITAG proteins are covered by a protein from my result with a high-quality alignment (e-value < 10e-5, coverage > 90%). I further investigated by running BUSCO on both protein sets and looking at BUSCOs found in ITAG but missing in my result. Attached is a screenshot from a genome browser where you can see such a case. The top track is the ITAG gene model; below it is my result. The third track shows the protein evidence alignments (i.e., blastx and protein2genome features), and the bottom track shows masked repeats.
As you can see, there seem to be two issues with my result:
1. The two genes in ITAG were fused into one. I guess this is a difficult case as the genes are really close together.
2. The last (3') CDS of the ITAG gene was predicted to be the 3' UTR in my result. This is in fact the reason I ended up with a truncated protein and a missing BUSCO.
This is a bit surprising to me, since there seems to be quite a lot of protein evidence supporting this region as a CDS. Can you help me figure out why the result came out this way? Could it be due to the small repeats detected in this region?
Any ideas on how my result can be improved without manual curation?
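For readers who want to reproduce the coverage check described above (best BLAST hit per ITAG protein, filtered by e-value and coverage), here is a minimal Python sketch. It is not the script used for this post. It assumes a tab-separated BLAST table with the columns qseqid, sseqid, evalue and qcovs, with ITAG proteins as queries; the function name and the toy IDs are hypothetical.

```python
import csv
import io


def fraction_covered(blast_tsv, n_queries, max_evalue=10e-5, min_cov=90.0):
    """Fraction of query proteins whose best hit passes both thresholds.

    blast_tsv: file-like object with tab-separated columns
               qseqid, sseqid, evalue, qcovs (an assumed column layout).
    n_queries: total number of query proteins, including those with no hit.
    max_evalue defaults to 10e-5, the value written in the post.
    """
    best = {}  # query id -> (evalue, coverage) of its best hit so far
    for row in csv.reader(blast_tsv, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        qseqid, _sseqid, evalue, qcovs = row
        hit = (float(evalue), float(qcovs))
        prev = best.get(qseqid)
        # Best hit = lowest e-value; ties are broken by higher coverage.
        if prev is None or (hit[0], -hit[1]) < (prev[0], -prev[1]):
            best[qseqid] = hit
    passing = sum(
        1 for ev, cov in best.values() if ev < max_evalue and cov > min_cov
    )
    return passing / n_queries


# Toy input: four ITAG proteins, one of which (itag4) has no hit at all.
example = io.StringIO(
    "itag1\tmine1\t1e-50\t98\n"
    "itag1\tmine7\t1e-10\t40\n"
    "itag2\tmine2\t1e-30\t60\n"  # good e-value, coverage too low
    "itag3\tmine3\t0.5\t95\n"    # good coverage, weak e-value
)
print(fraction_covered(example, n_queries=4))  # prints 0.25
```

Passing the total number of ITAG proteins explicitly matters: proteins with no hit never appear in the BLAST table, so they would otherwise be left out of the denominator.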

Many thanks!

_______________________________________________
maker-devel mailing list
[hidden email]
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

Attachment: maker.png (40K)

Re: Help debugging a MAKER result

Xabier Vázquez-Campos
Hi Lior,

Without getting into a lot of detail: a good model covering the repeats in your genome is extremely important, especially in genomes with a lot of repeats. If the repeat library does not have appropriate coverage, anything based on the masked genome will be affected.

The evidence you pass into Augustus to generate the gene model can have a huge impact. Aside from the repeats, BUSCO-generated gene models can under-predict.
And we have seen in our lab that the gene models generated by Augustus can be very different if you provide a haploid assembly vs haploid + alternate contigs vs diploid. In general, a purely haploid assembly generates a less biased model, as it contains fewer duplicated conserved genes, which would otherwise skew the gene model towards them (at least in BUSCO-based models, but this should extend to any Augustus model).

Note that in the end the generated annotation is just a model/hypothesis and may require more than a bit of curation... usually increasing with more complex genomes.

Cheers,
Xabi

--
Xabier Vázquez-Campos, PhD
Research Associate
NSW Systems Biology Initiative
School of Biotechnology and Biomolecular Sciences
The University of New South Wales
Sydney NSW 2052 AUSTRALIA

Re: Help debugging a MAKER result

Lior Glick
Hi Xabier, and thanks for your reply.
I forgot to mention it, but I used the annotated repeats derived from the ITAG annotation as the repeat library, so I expect these to be quite appropriate. I guess my question is about the way Maker makes decisions: is the fact that some repeats (simple repeats in this case) were predicted enough to change a CDS into a UTR, despite sufficient protein evidence?
I did not train Augustus myself; rather, I used the species (tomato) profile that comes with the Augustus release. Does that make sense?
As for the haploid/diploid issue - fortunately I don't have to deal with that since cultivated tomato varieties are repeatedly selfed, so they are (almost) completely homozygous.

Re: Help debugging a MAKER result

Xabier Vázquez-Campos
Yeah, tomato should be rather well annotated.

I would double-check how good the tomato genome was at the time the gene model was created. Also, creating a new Augustus model based on the first prediction run might improve things.

You have tomato in Repbase. To be sure you are not missing anything, I would still run the advanced repeat library protocol, if it isn't computationally prohibitive.

I don't know how good SNAP is for plant genomes, but it could be worth trying on top of the Augustus predictions.

On top of this, I'd take a look at reference-based annotation tools like RATT. This would annotate all the regions shared with the reference, so that you would only need to curate the regions that cannot be annotated from the reference, using your Maker annotation.


Re: Help debugging a MAKER result

Carson Holt-2
I’d just like to add info on how MAKER builds predictions. MAKER itself does not generate models. In your case, Augustus produces the models. Augustus will run twice: once on its own (this will be on a repeat-masked version of the assembly), and once again where MAKER provides it with a hints file as part of the command line used to run Augustus. The hints file is generated from the evidence alignments you provided to MAKER. The hints usually get Augustus to perform a little better than it does with training alone on a masked assembly.

Under-masking or overmasking the assembly can both confound Augustus. MAKER hard masks complex repeats in the assembly (turns them from ATCG into N’s), and soft-masks simple repeats (turns ATCG into lower case actg). The lower case “soft-masking” affects BLAST alignment but not Augustus predictions (Augustus ignores it). MAKER also removes the hard-masking when it runs Augustus with the hints file. This is done because we’ve constrained Augustus to a smaller padded evidence cluster at the locus, and Augustus can no longer see the whole assembly.
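The two masking styles described above can be illustrated with a small Python sketch. This is not MAKER's code; the function name and the 0-based, half-open interval convention are assumptions made for the example.

```python
def mask_sequence(seq, complex_repeats, simple_repeats):
    """Return seq with complex repeats hard-masked and simple repeats soft-masked.

    Intervals are 0-based, half-open (start, end) tuples; this coordinate
    convention is an assumption for the example.
    """
    bases = list(seq.upper())
    for start, end in simple_repeats:
        for i in range(start, end):
            bases[i] = bases[i].lower()  # soft-mask: keep the base, lower-case it
    for start, end in complex_repeats:
        for i in range(start, end):
            bases[i] = "N"               # hard-mask: the base itself is replaced
    return "".join(bases)


masked = mask_sequence(
    "ATCGATCGATATATATGGCC",
    complex_repeats=[(0, 4)],
    simple_repeats=[(8, 16)],
)
print(masked)  # prints NNNNATCGatatatatGGCC
```

The sketch makes the asymmetry visible: a soft-masked stretch still carries its sequence, so a tool that ignores case sees it unchanged, whereas a hard-masked stretch carries no sequence information and can only be recovered from the original assembly.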

If you want to explore how masking affects the models, you can set unmask=0. Then Augustus will run 3 times (one extra run on the unmasked assembly). You can then look at contigs in a browser to see how the masked vs unmasked models compare to each other.

—Carson


Re: Help debugging a MAKER result

Mark Yandell

Cheers!

 

From: maker-devel <[hidden email]> on behalf of Carson Holt <[hidden email]>
Date: Thursday, October 4, 2018 at 5:52 PM
To: Lior Glick <[hidden email]>
Cc: Maker Mailing List <[hidden email]>
Subject: Re: [maker-devel] Help debugging a MAKER result

 

I’d just like to add info on how MAKER builds predictions. MAKER itself does not generate models. In your case, Augustus produces the models. Augustus will run twice. Once on it’s own (this will be on a repeat masked version of the assembly), and once again where MAKER provides it with a hints file as part of the command line used to run Augustus. The hints file is generated from the evidence alignments you provided to MAKER. The hints usually get Augustus to perform a little better than it does with training alone on a masked assembly.

 

Under-masking or overmasking the assembly can both confound Augustus. MAKER hard masks complex repeats in the assembly (turns them from ATCG into N’s), and soft-masks simple repeats (turns ATCG into lower case actg). The lower case “soft-masking” affects BLAST alignment but not Augustus predictions (Augustus ignores it). MAKER also removes the hard-masking when it runs Augustus with the hints file. This is done because we’ve constrained Augustus to a smaller padded evidence cluster at the locus, and Augustus can no longer see the whole assembly.

 

If you want to explore how masking affects the models, you can set unmask=0. Then Augustus will run 3 times (one extra run on the unmasked assembly). You can then look at contigs in a browser to see how the masked vs unmasked models compare to each other.

 

—Carson

 



On Oct 2, 2018, at 10:39 PM, Xabier Vázquez-Campos <[hidden email]> wrote:

 

Yeah, tomato should be rather well annotated.

 

I would double check how good was the tomato genome at the time of the creation of the gene model. Also, creating a new Augustus model based on the first prediction run might improve things

 

You have tomato on repbase. To be sure you are not missing anything, I would still run the advanced repeat library protocol, if it isn't computationally prohibitive.

 

I don't know how good is SNAP for plant genomes, so it could be worth to try on top of the Augustus predictions.

 

On top of this, I'd take a look into reference-based annotation tools like RATT. This would annotate all the common regions with the reference and then curate only on the regions that cannot be annotated from the reference using your Maker annotation

 

 

On Tue, 2 Oct 2018 at 16:50, Lior Glick <[hidden email]> wrote:

Hi Xabier, and thanks for your reply.

I forgot to mention it, but I used the annotated repeats derived from the ITAG annotation as repeats library, so I expect these to be quite appropriate. I guess my question is regarding the way Maker makes decisions: Is the fact that some repeats (simple repeats in this case) were predicted is enough to change a CDS into a UTR, despite sufficient protein evidence?

I did not train Augustus myself, rather I used the species (tomato) profile that comes with the Augustus release. Does that make sense?

As for the haploid/diploid issue - fortunately I don't have to deal with that since cultivated tomato varieties are repeatedly selfed, so they are (almost) completely homozygous.

 

בתאריך יום ג׳, 2 באוק׳ 2018 ב-3:01 מאת ‪Xabier Vázquez-Campos <‪[hidden email]‏>:

Hi Lior,

 

without getting in a lot of detail a good model covering the repeats in your genome is extremely important, specially in genomes with a lot of repeats. If the repeat library does not have an appropriate coverage, anything based on the masked genome will be affected

 

The evidence you pass into Augustus to generate the gene model can have a huge impact. Aside of the repeats, BUSCO-generated gene models can under-predict

And we have seen in our lab that the gene models generated by Augustus can be very different if you provide an haploid assembly vs haploid + alternate contigs vs diploid. In general, a purely haploid assembly generates a less biased model as it has lower number of duplicated conserved genes present, that will unbalance the gene model towards them. (at least in BUSCO-based models, but it should be extensible to any Augustus model)

 

Note that in the end the generated annotation is just a model/hypothesis and may require more than a bit of curation... usually increasing with more complex genomes.

 

Cheers,

Xabi

 

On Tue, 2 Oct 2018 at 05:23, Lior Glick <[hidden email]> wrote:

Hi MAKER users,

I am new to Maker and had just finished running my first annotations. Although the results make sense in general, I have reasons to suspect some gene models are wrong and would like your help in understanding and optimizing the results.

My research project involves the annotation of multiple tomato varieties (individuals) which are a bit different from the published reference genome. To this end, I created de-novo assemblies of these genomes and also generated an evidence set to be used as input for Maker. Evidence consist of a large set of transcripts from various tomato varieties and conditions, as well as full protein sets from 6 plant species, including the proteins derived from the annotation of the reference - called ITAG.

For an initial QA, I tried annotating the reference genome using my evidence data and Augustus as gene predictor. This should allow me to compare my result to the ITAG annotation, which I assume to be the "correct" answer, and see how well I'm doing. I should mention that ITAG annotation was also created using Maker, followed by manual curation.

I started by comparing the protein sets from my result and the ITAT set. Specifically, I ran an all-vs-all blast and took the top hits. I discovered that only about 70% of the ITAG proteins are covered by a protein from my result with a high quality alignment (evalue > 10e-5, coverage > 90%). I further investigated by running BUSCO on both protein sets and looking at BUSCOs found in ITAG but missing in my result. Attached is a screenshot from a genome browser where you can see such a case. Top track is the ITAG gene model, below is my result. Third track is the protein evidence alignments (i.e blastx and protein2genome features), and bottom track are masked repeats.

As you can see, there seems to be two issues with my result:

1. The two genes in ITAG were fused into one. I guess this is a difficult case as the genes are really close together.

2. The last (3') CDS of the ITAG gene was predicted to be the 3' UTR in my result. This is in fact the reason I ended up with a truncated protein and a missing BUSCO.

This is a bit surprising to me, since there seems to be quite a lot of protein evidence supporting this region as a CDS. Can you help me figure out why I got this result? Could it be due to the small repeats detected in this region?

Any ideas on how my result can be improved without manual curation?

 

Many thanks!

_______________________________________________
maker-devel mailing list
[hidden email]
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org



--

Xabier Vázquez-Campos, PhD
Research Associate
NSW Systems Biology Initiative
School of Biotechnology and Biomolecular Sciences
The University of New South Wales
Sydney NSW 2052 AUSTRALIA




Re: Help debugging a MAKER result

Carson Holt-2
In reply to this post by Carson Holt-2
One correction. I meant to say set unmask=1.

—Carson
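For anyone who wants to script that change, here is a minimal sketch of flipping an option in the text of a MAKER control file. It assumes the usual `key=value #comment` line layout of `maker_opts.ctl`; the helper name `set_ctl_option` is hypothetical and the comments in the sample text are abbreviated for illustration, so check them against your own file.

```python
import re


def set_ctl_option(ctl_text, key, value):
    """Set key=value in control-file text, keeping any trailing #comment."""
    pattern = re.compile(
        rf"^({re.escape(key)}=)[^#\n]*?(\s*#.*)?$", re.MULTILINE
    )
    new_text, count = pattern.subn(
        lambda m: f"{m.group(1)}{value}{m.group(2) or ''}", ctl_text
    )
    if count == 0:
        raise KeyError(f"option {key!r} not found")
    return new_text


# Illustrative excerpt only; a real maker_opts.ctl has many more options.
sample = (
    "augustus_species=tomato #Augustus species model\n"
    "unmask=0 #also run ab-initio predictors on unmasked sequence\n"
)

updated = set_ctl_option(sample, "unmask", 1)
print(updated)
```

Only the value is replaced; the rest of the line, including its comment, is left as it was.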


On Oct 4, 2018, at 5:52 PM, Carson Holt <[hidden email]> wrote:

I’d just like to add info on how MAKER builds predictions. MAKER itself does not generate models. In your case, Augustus produces the models. Augustus will run twice: once on its own (on a repeat-masked version of the assembly), and once more where MAKER provides it with a hints file as part of the command line used to run Augustus. The hints file is generated from the evidence alignments you provided to MAKER. The hints usually get Augustus to perform a little better than it does with training alone on a masked assembly.

Both under-masking and over-masking the assembly can confound Augustus. MAKER hard-masks complex repeats in the assembly (turns them from ATCG into N’s) and soft-masks simple repeats (turns ATCG into lower-case atcg). The lower-case “soft-masking” affects BLAST alignment but not Augustus predictions (Augustus ignores it). MAKER also removes the hard-masking when it runs Augustus with the hints file. This is done because we’ve constrained Augustus to a smaller padded evidence cluster at the locus, and Augustus can no longer see the whole assembly.

If you want to explore how masking affects the models, you can set unmask=0. Then Augustus will run 3 times (one extra run on the unmasked assembly). You can then look at contigs in a browser to see how the masked vs unmasked models compare to each other.

—Carson
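To make the two kinds of masking concrete, here is a toy sketch on a short string. It is not MAKER's implementation; the sequence, the repeat coordinates and the function name `mask` are invented for the example. Upper-casing the masked sequence stands in for a tool that ignores soft-masking, as described above for Augustus.

```python
def mask(seq, intervals, soft):
    """Mask 0-based, end-exclusive intervals: lower case if soft, N if hard."""
    chars = list(seq)
    for start, end in intervals:
        for i in range(start, min(end, len(chars))):
            chars[i] = chars[i].lower() if soft else "N"
    return "".join(chars)


seq = "ATGGCCATATATATGGTTTGACCAGTAA"
simple_repeats = [(6, 14)]    # hypothetical simple repeat, soft-masked
complex_repeats = [(18, 24)]  # hypothetical complex repeat, hard-masked

masked = mask(mask(seq, simple_repeats, soft=True), complex_repeats, soft=False)
print(masked)          # soft-masked bases are lower case, hard-masked are N

# A tool that ignores case sees only the hard-masking.
case_blind_view = masked.upper()
print(case_blind_view)
```

Note that the soft-masked bases can be recovered by upper-casing, while the hard-masked bases cannot be recovered from the masked string itself, so removing hard-masking requires the original sequence.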


On Oct 2, 2018, at 10:39 PM, Xabier Vázquez-Campos <[hidden email]> wrote:

Yeah, tomato should be rather well annotated.

I would double-check how good the tomato genome was at the time the gene model was created. Also, creating a new Augustus model based on the first prediction run might improve things.

Tomato is available in Repbase. To be sure you are not missing anything, I would still run the advanced repeat library protocol, if it isn't computationally prohibitive.

I don't know how good SNAP is for plant genomes, so it could be worth trying on top of the Augustus predictions.

On top of this, I'd take a look at reference-based annotation tools like RATT. This would annotate all the regions shared with the reference, leaving you to curate only the regions that cannot be annotated from the reference, using your MAKER annotation.


On Tue, 2 Oct 2018 at 16:50, Lior Glick <[hidden email]> wrote:
Hi Xabier, and thanks for your reply.
I forgot to mention it, but I used the annotated repeats derived from the ITAG annotation as the repeat library, so I expect these to be quite appropriate. I guess my question is about the way MAKER makes decisions: is the fact that some repeats (simple repeats in this case) were predicted enough to change a CDS into a UTR, despite sufficient protein evidence?
I did not train Augustus myself; rather, I used the species (tomato) profile that comes with the Augustus release. Does that make sense?
As for the haploid/diploid issue - fortunately I don't have to deal with that since cultivated tomato varieties are repeatedly selfed, so they are (almost) completely homozygous.

On Tue, 2 Oct 2018 at 3:01, Xabier Vázquez-Campos <[hidden email]> wrote:
Hi Lior,

Without getting into a lot of detail, a good model covering the repeats in your genome is extremely important, especially in genomes with a lot of repeats. If the repeat library does not have appropriate coverage, anything based on the masked genome will be affected.

The evidence you pass into Augustus to generate the gene model can have a huge impact. Aside from the repeats, BUSCO-generated gene models can under-predict.
We have also seen in our lab that the gene models generated by Augustus can be very different if you provide a haploid assembly vs. haploid + alternate contigs vs. diploid. In general, a purely haploid assembly generates a less biased model, as it contains fewer duplicated conserved genes that would otherwise skew the gene model towards them (at least in BUSCO-based models, but this should extend to any Augustus model).

Note that in the end the generated annotation is just a model/hypothesis and may require more than a bit of curation... usually increasing with more complex genomes.

Cheers,
Xabi


Re: Help debugging a MAKER result

Lior Glick
Thank you both for your helpful ideas. I'm going to give them a try and see how this affects my results. I will update when I have them.
Cheers indeed.
