augustus underpredicting


Xabier Vázquez Campos
Hi,
I have been annotating a fungal genome as usual, using BUSCO-trained Augustus (in addition to GeneMark and SNAP), but for some reason Augustus is predicting a mere 207 genes, compared to 15-20k from the other two.
I've never had this problem before. The genome has an unusual repeat content, close to 50%; I'm not sure whether that might pose a problem.
Has anybody come across a similar issue?
I also asked the BUSCO developers whether they have any idea: https://gitlab.com/ezlab/busco/issues/49
Cheers,
Xabi

--
Xabier Vázquez-Campos, PhD
Research Associate
NSW Systems Biology Initiative
School of Biotechnology and Biomolecular Sciences
The University of New South Wales
Sydney NSW 2052 AUSTRALIA

_______________________________________________
maker-devel mailing list
[hidden email]
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

Re: augustus underpredicting

Carson Holt-2
BUSCO may be generating too few models. BUSCO also identifies classes of conserved short genes that may not represent enough training diversity for your organism. Try running MAKER in protein2genome or est2genome mode, and then train with those results.
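
As a rough illustration of the suggestion above, the sketch below switches on the `est2genome` and `protein2genome` options in the text of a MAKER control file. The helper name `set_ctl_options` is hypothetical, and the two sample lines are only modeled on a `maker_opts.ctl` file (their comment text is illustrative); check the option names against the control file generated by your own MAKER version.

```python
import re


def set_ctl_options(ctl_text, updates):
    """Return ctl_text with the `key=value` lines named in `updates` rewritten.

    Any trailing `#comment` on a rewritten line is preserved. Keys that are
    not present in the text raise KeyError instead of being silently added.
    (Hypothetical helper, not part of MAKER.)
    """
    lines = ctl_text.splitlines()
    seen = set()
    for i, line in enumerate(lines):
        m = re.match(r"^(\w+)=([^#]*)(#.*)?$", line)
        if m and m.group(1) in updates:
            key, comment = m.group(1), m.group(3) or ""
            lines[i] = f"{key}={updates[key]} {comment}".rstrip()
            seen.add(key)
    missing = set(updates) - seen
    if missing:
        raise KeyError(f"options not found: {sorted(missing)}")
    return "\n".join(lines) + "\n"


# Two lines modeled on a MAKER control file (comment text is illustrative).
example = (
    "est2genome=0 #infer gene predictions directly from ESTs, 1 = yes, 0 = no\n"
    "protein2genome=0 #infer predictions from protein homology, 1 = yes, 0 = no\n"
)
updated = set_ctl_options(example, {"est2genome": 1, "protein2genome": 1})
print(updated)
```

The models produced by such an evidence-only run would then serve as the training set for the gene predictors, rather than as a final annotation.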

—Carson


On Sep 10, 2017, at 7:03 PM, Xabier Vázquez-Campos <[hidden email]> wrote:

Hi,
I have been annotating a fungal genome as usual, using BUSCO-trained Augustus (in addition to GeneMark and SNAP), but for some reason Augustus is predicting a mere 207 genes, compared to 15-20k from the other two.
I've never had this problem before. The genome has an unusual repeat content, close to 50%; I'm not sure whether that might pose a problem.
Has anybody come across a similar issue?
I also asked the BUSCO developers whether they have any idea: https://gitlab.com/ezlab/busco/issues/49
Cheers,
Xabi


Re: augustus underpredicting

Xabier Vázquez Campos
I did it that way and AUGUSTUS is predicting a more reasonable number of genes, about 12500 in Maker, but about 19000 in the model assessment step.
In comparison, SNAP gives 16000 and GeneMark 19000.
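
One way to put such per-predictor counts side by side is to tally features by the source column of a GFF3 file. The sketch below is only a minimal example under assumptions: the function name `count_features` is hypothetical, the source labels in the toy input are made up, and which feature type represents "one gene model" depends on how each tool writes its output.

```python
from collections import Counter


def count_features(gff3_text, feature_type="gene"):
    """Count GFF3 features of one type, grouped by the source column."""
    counts = Counter()
    for line in gff3_text.splitlines():
        if line.startswith("##FASTA"):
            break  # embedded sequence section follows; no more features
        if not line or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.split("\t")
        if len(cols) == 9 and cols[2] == feature_type:
            counts[cols[1]] += 1
    return counts


# Toy input; the source labels here are invented for the example.
toy = "\n".join([
    "##gff-version 3",
    "contig1\taugustus\tgene\t100\t900\t.\t+\t.\tID=a1",
    "contig1\tsnap\tgene\t120\t880\t.\t+\t.\tID=s1",
    "contig1\tsnap\tgene\t2000\t2600\t.\t-\t.\tID=s2",
    "contig1\tsnap\tmRNA\t2000\t2600\t.\t-\t.\tID=s2.t1;Parent=s2",
])
per_source = count_features(toy)
print(dict(per_source))
```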

I haven't found any reference about this, but would it be a good idea to train Augustus on the masked genome instead?
Thanks,



On 12 September 2017 at 02:50, Carson Holt <[hidden email]> wrote:
BUSCO may be generating too few models. BUSCO also identifies classes of conserved short genes that may not represent enough training diversity for your organism. Try running MAKER in protein2genome or est2genome mode, and then train with those results.

—Carson



Re: augustus underpredicting

Carson Holt-2
Gene predictors tend to overpredict, so I would not take the high numbers given by SNAP and GeneMark as true counts. You will probably end up with something like 7-10k in the final results. But now that Augustus is giving a higher count, you should be good to start running MAKER.

—Carson




On Sep 17, 2017, at 7:12 PM, Xabier Vázquez-Campos <[hidden email]> wrote:

I did it that way and AUGUSTUS is predicting a more reasonable number of genes, about 12500 in Maker, but about 19000 in the model assessment step.
In comparison, SNAP gives 16000 and GeneMark 19000.

I haven't found any reference about this, but would it be a good idea to train Augustus on the masked genome instead?
Thanks,




Re: augustus underpredicting

Xabier Vázquez Campos
Thanks Carson.

Last quick question. After the first run (before using the gene predictors), I ran fasta_merge to get an idea of the numbers I should be looking for.
In summary, I got 14000 genes using only Swiss-Prot and a closely related, highly curated reference genome (to avoid any "fake" or partial proteins from draft annotations), plus assembled RNA-seq from my genome.
How should I use this as a guide, if I can at all? Is this a number I should be aiming for as a minimum number of genes, a maximum, or something around that?

PS: my genome (fungus) is 80+ Mbp in just 150 contigs, so I expect very few possible fragments due to assembly (sequencing errors aside).

On 20 September 2017 at 07:34, Carson Holt <[hidden email]> wrote:
Gene predictors tend to overpredict, so I would not take the high numbers given by SNAP and GeneMark as true counts. You will probably end up with something like 7-10k in the final results. But now that Augustus is giving a higher count, you should be good to start running MAKER.

—Carson





Re: augustus underpredicting

Carson Holt-2
I don’t think you can use the protein2genome option to estimate gene count. It will turn any alignment that matches at least 50% into a gene model, so you can get a lot of partial models, which will inflate the gene count. It’s good enough for training, but not so much for annotation.
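
To get a feel for how many evidence-derived models are partial, one simple heuristic is to check each coding sequence for a start codon, a terminal stop codon, and an intact reading frame. The sketch below is not MAKER's own logic and does not reproduce the 50% alignment threshold; the function names are hypothetical and it assumes the standard stop codons.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code


def is_complete_cds(cds):
    """True if cds starts with ATG, ends with a stop codon, has a length
    that is a multiple of three, and contains no in-frame internal stop."""
    cds = cds.upper()
    if len(cds) < 6 or len(cds) % 3 != 0:
        return False
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    return (
        codons[0] == "ATG"
        and codons[-1] in STOP_CODONS
        and not any(c in STOP_CODONS for c in codons[1:-1])
    )


def split_by_completeness(cds_by_id):
    """Split model IDs into (complete, partial) lists, keeping input order."""
    complete, partial = [], []
    for model_id, cds in cds_by_id.items():
        (complete if is_complete_cds(cds) else partial).append(model_id)
    return complete, partial


# Toy sequences, invented for the example.
toy_models = {
    "m1": "ATGGCCTAA",   # start codon, one codon, stop codon
    "m2": "GCCGCCGCC",   # no start and no stop: partial
    "m3": "ATGGCCGC",    # length not a multiple of three: partial
}
complete, partial = split_by_completeness(toy_models)
print(len(complete), "complete;", len(partial), "partial")
```

Counting only the complete models would give a more conservative figure than the raw total, though it is still not a substitute for a full annotation run.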

—Carson



On Sep 19, 2017, at 6:02 PM, Xabier Vázquez-Campos <[hidden email]> wrote:

Thanks Carson.

Last quick question. After the first run (before using the gene predictors), I ran fasta_merge to get an idea of the numbers I should be looking for.
In summary, I got 14000 genes using only Swiss-Prot and a closely related, highly curated reference genome (to avoid any "fake" or partial proteins from draft annotations), plus assembled RNA-seq from my genome.
How should I use this as a guide, if I can at all? Is this a number I should be aiming for as a minimum number of genes, a maximum, or something around that?

PS: my genome (fungus) is 80+ Mbp in just 150 contigs, so I expect very few possible fragments due to assembly (sequencing errors aside).
