Re: diff. numbers of geneson contigs vs. scaffolded genome

classic Classic list List threaded Threaded
6 messages Options
Reply | Threaded
Open this post in threaded view
|

Re: diff. numbers of geneson contigs vs. scaffolded genome

Carson Hinton Holt
The contiged assembly is more likely to give spurious hits and alignments.
 They also can be harder to repeat mask.  Also gene predictors can behave
slightly different on small sequences than on longer ones.  If you have
fewer gene models than you expect, your first step should be to process
the scaffolds with CEGMA.  It will give you an estimate of the genomes
"completeness".  If CEGMA gives a 60% completeness value for example then
you can expect to only recover 60% of the expected number of genes. Next
you should run RepeatModeler of similar software to help generate a
species specific repeat library.  Under masked repeats can make predicting
genes on longer scaffolds far more difficult for ab initio predictors.

--Carson


On 9/19/14, 12:32 AM, "Stefan Zoller" <[hidden email]> wrote:

>Hi,
>
>I am working on the annotation of a plant genome (about 600MB) and we
>have a reasonable draft assembly, a fairly good transcriptome and quite
>a few proteins from related species. We have also extensively trained
>augustus and are also feeding genmark and snap predictions.
>
>Recently I noticed a behavior of Maker that seems fairly odd and which I
>cannot explain at all. When I take the scaffolded genome (about 23000
>scaffolds) I get roughly 9'000 maker approved gene models. Which is
>admittedly a bit on the low side and we have to work on this. However,
>when I break up the scaffolds into contigs at stretches of N longer
>500bp (about 60'000 contigs) I get about 17'000 maker gene models. Now
>obviously 17'000 is more in the range what I would expect, so I am
>inclined to go with these. I have looked at both annotations and the
>evidence in WebApollo and the evidence alignments are identical for both
>runs. The approved genes seem to be the same, except for the additional
>ones in the "contiged" genome version. The additional gene models are
>not necessarily at the ends of the contigs, so I think it has nothing to
>do with having the stretches of Ns nearby in the scaffolded genome. Do
>you have any idea why maker comes up with the additional numbers of gene
>models and how I could "convince" maker to give me the same gene models
>for the scaffolded assembly?
>
>Cheers,
>Stefan
>
>
>
>--
>Stefan Zoller, PhD
>Bioinformatics
>Genetic Diversity Centre
>ETH Zurich CHN E55.1
>Universitätsstrasse 16
>8092 Zurich
>Switzerland
>
>Phone: +41 44 632 66 85
>E-Mail:  [hidden email]
>Web: www.gdc.ethz.ch
>
>

_______________________________________________
maker-devel mailing list
[hidden email]
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
Reply | Threaded
Open this post in threaded view
|

Re: diff. numbers of geneson contigs vs. scaffolded genome

Mark Yandell
Also are you numbers including the ab-inito predictions without evidence that have pfamm domains?

cheers,


--mark



Mark Yandell
Professor of Human Genetics
H.A. & Edna Benning Presidential Endowed Chair
Co-director USTAR Center for Genetic Discovery
Eccles Institute of Human Genetics
University of Utah
15 North 2030 East, Room 2100
Salt Lake City, UT 84112-5330
ph:801-587-7707

________________________________________
From: maker-devel [[hidden email]] on behalf of Carson Holt [[hidden email]]
Sent: Monday, September 22, 2014 2:17 PM
To: [hidden email]; [hidden email]
Subject: Re: [maker-devel] diff. numbers of geneson contigs vs. scaffolded      genome

The contiged assembly is more likely to give spurious hits and alignments.
 They also can be harder to repeat mask.  Also gene predictors can behave
slightly different on small sequences than on longer ones.  If you have
fewer gene models than you expect, your first step should be to process
the scaffolds with CEGMA.  It will give you an estimate of the genomes
"completeness".  If CEGMA gives a 60% completeness value for example then
you can expect to only recover 60% of the expected number of genes. Next
you should run RepeatModeler of similar software to help generate a
species specific repeat library.  Under masked repeats can make predicting
genes on longer scaffolds far more difficult for ab initio predictors.

--Carson


On 9/19/14, 12:32 AM, "Stefan Zoller" <[hidden email]> wrote:

>Hi,
>
>I am working on the annotation of a plant genome (about 600MB) and we
>have a reasonable draft assembly, a fairly good transcriptome and quite
>a few proteins from related species. We have also extensively trained
>augustus and are also feeding genmark and snap predictions.
>
>Recently I noticed a behavior of Maker that seems fairly odd and which I
>cannot explain at all. When I take the scaffolded genome (about 23000
>scaffolds) I get roughly 9'000 maker approved gene models. Which is
>admittedly a bit on the low side and we have to work on this. However,
>when I break up the scaffolds into contigs at stretches of N longer
>500bp (about 60'000 contigs) I get about 17'000 maker gene models. Now
>obviously 17'000 is more in the range what I would expect, so I am
>inclined to go with these. I have looked at both annotations and the
>evidence in WebApollo and the evidence alignments are identical for both
>runs. The approved genes seem to be the same, except for the additional
>ones in the "contiged" genome version. The additional gene models are
>not necessarily at the ends of the contigs, so I think it has nothing to
>do with having the stretches of Ns nearby in the scaffolded genome. Do
>you have any idea why maker comes up with the additional numbers of gene
>models and how I could "convince" maker to give me the same gene models
>for the scaffolded assembly?
>
>Cheers,
>Stefan
>
>
>
>--
>Stefan Zoller, PhD
>Bioinformatics
>Genetic Diversity Centre
>ETH Zurich CHN E55.1
>Universitätsstrasse 16
>8092 Zurich
>Switzerland
>
>Phone: +41 44 632 66 85
>E-Mail:  [hidden email]
>Web: www.gdc.ethz.ch
>
>

_______________________________________________
maker-devel mailing list
[hidden email]
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
_______________________________________________
maker-devel mailing list
[hidden email]
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
Reply | Threaded
Open this post in threaded view
|

Re: diff. numbers of geneson contigs vs. scaffolded genome

Carson Hinton Holt
Sorry for the slow reply.  I was trying to locate a script that might be
useful for you.

I think a species specific repeat libary will be of most benefit here
(it's surprising just how influential this step is).  Also note that you
should train SNAP and Augustus on your species and are not just using
another related species as a stand in.

With respect to PFAM domains, on some organisms you may not get a lot of
cross species protein alignments because of divergence or assembly issues.
This of course makes it harder to support these models with direct protein
alignments.  However you can run InterProscan over the
non-overlapping.proteins.fasta file produced by MAKER (contains
non-redundant rejected models).  Because an HMM is used for domain
identification, it can pick up protein domains that would not produce a
significant BLAST alignment because of divergence. You can then add models
with positive hits for protein domains back into your gene set.

This ad hoc procedure usually can only increase gene counts by about 10%
though for organisms where it's required. I've attached a script that
makes generating results for these genes easier.

1. First you run InterProScan with just PFAM.
2. Then you take the IDs of all models that have a domain in the report
and create a list (1 ID per line).
3. Next use the fasta_tool script that comes with MAKER together with the
--select flag to separate just the positive hits (ID's in your list) from
the non-overlapping.proteins.fasta and non-overlapping.transscripts.fasta
files.
4. Use the attached script to separate just the positive hits (your ID
list) from the GFF3. The script will upgrade match/match_part results to
gene/mRNA/exon/CDS results and print them out for you.
5. Use the fasta_maerge and gff3_merge scripts that come with MAKER to
merge the selected/upgraded GFF3 entries and selected FASTA entries back
into the original MAKER results.

--Carson



On 9/23/14, 6:36 AM, "Stefan Zoller" <[hidden email]> wrote:

>Please forgive my ignorance, I am not entirely sure if I understand your
>question correctly, but I will try to answer.
>As evidence we use:
>1) our own transcriptome (trinity assembled RNAseq, filtering out the
>very low expression transcripts).
>2) all swissprot plant proteins, and several protein sets from closely
>related plant species downloaded from NCBI.
>I am not sure if the ab-initio predictions without evidence have pfamm
>domains. Honestly, I would not know how to tell and how to
>include/exclude.
>I was assuming that we should not have too many Maker approved
>predictions without evidence anyway, because we use "keeps_preds=0".
>The numbers of gene predictions I mentioned in my email are the
>predictions reported by the fasta_merge/gff3_merge scripts in the
>"*maker.proteins.fasta". There are of course many more predictions in
>e.g., "*maker.augustus_masked.proteins.fasta" (about 68'000 in this file).
>
>I hope I am not totally off with my answer.
>Cheers, Stefan
>
>
>
>On 23.09.14 02:10, Mark Yandell wrote:
>> Also are you numbers including the ab-inito predictions without
>>evidence that have pfamm domains?
>>
>> cheers,
>>
>>
>> --mark
>>
>>
>>
>> Mark Yandell
>> Professor of Human Genetics
>> H.A. & Edna Benning Presidential Endowed Chair
>> Co-director USTAR Center for Genetic Discovery
>> Eccles Institute of Human Genetics
>> University of Utah
>> 15 North 2030 East, Room 2100
>> Salt Lake City, UT 84112-5330
>> ph:801-587-7707
>>
>> ________________________________________
>> From: maker-devel [[hidden email]] on behalf of
>>Carson Holt [[hidden email]]
>> Sent: Monday, September 22, 2014 2:17 PM
>> To: [hidden email]; [hidden email]
>> Subject: Re: [maker-devel] diff. numbers of geneson contigs vs.
>>scaffolded      genome
>>
>> The contiged assembly is more likely to give spurious hits and
>>alignments.
>>   They also can be harder to repeat mask.  Also gene predictors can
>>behave
>> slightly different on small sequences than on longer ones.  If you have
>> fewer gene models than you expect, your first step should be to process
>> the scaffolds with CEGMA.  It will give you an estimate of the genomes
>> "completeness".  If CEGMA gives a 60% completeness value for example
>>then
>> you can expect to only recover 60% of the expected number of genes. Next
>> you should run RepeatModeler of similar software to help generate a
>> species specific repeat library.  Under masked repeats can make
>>predicting
>> genes on longer scaffolds far more difficult for ab initio predictors.
>>
>> --Carson
>>
>>
>> On 9/19/14, 12:32 AM, "Stefan Zoller" <[hidden email]> wrote:
>>
>>> Hi,
>>>
>>> I am working on the annotation of a plant genome (about 600MB) and we
>>> have a reasonable draft assembly, a fairly good transcriptome and quite
>>> a few proteins from related species. We have also extensively trained
>>> augustus and are also feeding genmark and snap predictions.
>>>
>>> Recently I noticed a behavior of Maker that seems fairly odd and which
>>>I
>>> cannot explain at all. When I take the scaffolded genome (about 23000
>>> scaffolds) I get roughly 9'000 maker approved gene models. Which is
>>> admittedly a bit on the low side and we have to work on this. However,
>>> when I break up the scaffolds into contigs at stretches of N longer
>>> 500bp (about 60'000 contigs) I get about 17'000 maker gene models. Now
>>> obviously 17'000 is more in the range what I would expect, so I am
>>> inclined to go with these. I have looked at both annotations and the
>>> evidence in WebApollo and the evidence alignments are identical for
>>>both
>>> runs. The approved genes seem to be the same, except for the additional
>>> ones in the "contiged" genome version. The additional gene models are
>>> not necessarily at the ends of the contigs, so I think it has nothing
>>>to
>>> do with having the stretches of Ns nearby in the scaffolded genome. Do
>>> you have any idea why maker comes up with the additional numbers of
>>>gene
>>> models and how I could "convince" maker to give me the same gene models
>>> for the scaffolded assembly?
>>>
>>> Cheers,
>>> Stefan
>>>
>>>
>>>
>>> --
>>> Stefan Zoller, PhD
>>> Bioinformatics
>>> Genetic Diversity Centre
>>> ETH Zurich CHN E55.1
>>> Universitätsstrasse 16
>>> 8092 Zurich
>>> Switzerland
>>>
>>> Phone: +41 44 632 66 85
>>> E-Mail:  [hidden email]
>>> Web: www.gdc.ethz.ch
>>>
>>>
>> _______________________________________________
>> maker-devel mailing list
>> [hidden email]
>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

_______________________________________________
maker-devel mailing list
[hidden email]
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

gff3_preds2models (7K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: diff. numbers of geneson contigs vs. scaffolded genome

stef
Hi Carson
Thanks again for your help and suggestions. They are very helpful indeed!

I have now:
1) created a species specific repeat library, or actually several
versions (e.g., filtered for hits on known plant transposable elements
etc., or filtering out hits on proper plant proteins), and ran Maker
with it on a subset of the genome. Whatever version of repeat library I
use, I get +/- 5% the same number of Maker approved proteins. I get
slightly more proteins with the "best" species specific repeat library,
so I think it does make a difference, however not a big one.
Interestingly, if I turn off the repeat masking totally, I get about 20%
more Maker approved protein models. So either I am doing something
totally wrong here or the repeat masking is working quite well with the
specific repeat libraries.

2) filtered the non-overlapping ab-initio proteins with PFAM domains
according to your how-to. This works very nicely, thanks. However, I get
quite a lot of models with PFAM hits, even when stringently filtering
for e-value. For example, in the subset of the contiged genome I usually
get around 300 Maker models. And now I have an additional 180 from the
"non-overlapping-with-PFAM-domain" when filtering for e-value <1e-20.
For e-value < 1e-10 it would be 280, almost twice the number of
proteins. Extrapolating this to the full genome, this would be more than
32'000 proteins. This seems a bit excessive and I am not sure if I am
even supposed to use such a stringent e-value filtering. One reason of
having so many additional proteins I can think of, is that augustus and
snap are predicting similar non-overlapping models for the same location
and of course they then both have a PFAM domain. I can actually see this
for some locations when I load the data in WebApollo. I can think of a
crude way to select only the "best" model for a location (while
preferably also considering the already Maker approved protein) but I
wonder if maybe there is already a solution for this in Maker?

In short, I think the repeat masking seems not to be the problem (And I
think I have put quite some effort in the repeat library creation). On
the other hand, there are a lot of "good" models in the non-overlapping
proteins that could be filtered and promoted to proper models, if I only
could make the right selection.

Maybe, based on these additional informations you could point out
additional tests, filtering approaches or analyses I could do to home-in
to the "good" gene models in the non-overlapping gene models (or Maker
approved gene models in general).

Thanks again for your help!
Stefan



On 25.09.14 20:17, Carson Holt wrote:

> Sorry for the slow reply.  I was trying to locate a script that might be
> useful for you.
>
> I think a species specific repeat libary will be of most benefit here
> (it's surprising just how influential this step is).  Also note that you
> should train SNAP and Augustus on your species and are not just using
> another related species as a stand in.
>
> With respect to PFAM domains, on some organisms you may not get a lot of
> cross species protein alignments because of divergence or assembly issues.
> This of course makes it harder to support these models with direct protein
> alignments.  However you can run InterProscan over the
> non-overlapping.proteins.fasta file produced by MAKER (contains
> non-redundant rejected models).  Because an HMM is used for domain
> identification, it can pick up protein domains that would not produce a
> significant BLAST alignment because of divergence. You can then add models
> with positive hits for protein domains back into your gene set.
>
> This ad hoc procedure usually can only increase gene counts by about 10%
> though for organisms where it's required. I've attached a script that
> makes generating results for these genes easier.
>
> 1. First you run InterProScan with just PFAM.
> 2. Then you take the IDs of all models that have a domain in the report
> and create a list (1 ID per line).
> 3. Next use the fasta_tool script that comes with MAKER together with the
> --select flag to separate just the positive hits (ID's in your list) from
> the non-overlapping.proteins.fasta and non-overlapping.transscripts.fasta
> files.
> 4. Use the attached script to separate just the positive hits (your ID
> list) from the GFF3. The script will upgrade match/match_part results to
> gene/mRNA/exon/CDS results and print them out for you.
> 5. Use the fasta_maerge and gff3_merge scripts that come with MAKER to
> merge the selected/upgraded GFF3 entries and selected FASTA entries back
> into the original MAKER results.
>
> --Carson
>
>
>
> On 9/23/14, 6:36 AM, "Stefan Zoller" <[hidden email]> wrote:
>
>> Please forgive my ignorance, I am not entirely sure if I understand your
>> question correctly, but I will try to answer.
>> As evidence we use:
>> 1) our own transcriptome (trinity assembled RNAseq, filtering out the
>> very low expression transcripts).
>> 2) all swissprot plant proteins, and several protein sets from closely
>> related plant species downloaded from NCBI.
>> I am not sure if the ab-initio predictions without evidence have pfamm
>> domains. Honestly, I would not know how to tell and how to
>> include/exclude.
>> I was assuming that we should not have too many Maker approved
>> predictions without evidence anyway, because we use "keeps_preds=0".
>> The numbers of gene predictions I mentioned in my email are the
>> predictions reported by the fasta_merge/gff3_merge scripts in the
>> "*maker.proteins.fasta". There are of course many more predictions in
>> e.g., "*maker.augustus_masked.proteins.fasta" (about 68'000 in this file).
>>
>> I hope I am not totally off with my answer.
>> Cheers, Stefan
>>
>>
>>
>> On 23.09.14 02:10, Mark Yandell wrote:
>>> Also are you numbers including the ab-inito predictions without
>>> evidence that have pfamm domains?
>>>
>>> cheers,
>>>
>>>
>>> --mark
>>>
>>>
>>>
>>> Mark Yandell
>>> Professor of Human Genetics
>>> H.A. & Edna Benning Presidential Endowed Chair
>>> Co-director USTAR Center for Genetic Discovery
>>> Eccles Institute of Human Genetics
>>> University of Utah
>>> 15 North 2030 East, Room 2100
>>> Salt Lake City, UT 84112-5330
>>> ph:801-587-7707
>>>
>>> ________________________________________
>>> From: maker-devel [[hidden email]] on behalf of
>>> Carson Holt [[hidden email]]
>>> Sent: Monday, September 22, 2014 2:17 PM
>>> To: [hidden email]; [hidden email]
>>> Subject: Re: [maker-devel] diff. numbers of geneson contigs vs.
>>> scaffolded      genome
>>>
>>> The contiged assembly is more likely to give spurious hits and
>>> alignments.
>>>    They also can be harder to repeat mask.  Also gene predictors can
>>> behave
>>> slightly different on small sequences than on longer ones.  If you have
>>> fewer gene models than you expect, your first step should be to process
>>> the scaffolds with CEGMA.  It will give you an estimate of the genomes
>>> "completeness".  If CEGMA gives a 60% completeness value for example
>>> then
>>> you can expect to only recover 60% of the expected number of genes. Next
>>> you should run RepeatModeler of similar software to help generate a
>>> species specific repeat library.  Under masked repeats can make
>>> predicting
>>> genes on longer scaffolds far more difficult for ab initio predictors.
>>>
>>> --Carson
>>>
>>>
>>> On 9/19/14, 12:32 AM, "Stefan Zoller" <[hidden email]> wrote:
>>>
>>>> Hi,
>>>>
>>>> I am working on the annotation of a plant genome (about 600MB) and we
>>>> have a reasonable draft assembly, a fairly good transcriptome and quite
>>>> a few proteins from related species. We have also extensively trained
>>>> augustus and are also feeding genmark and snap predictions.
>>>>
>>>> Recently I noticed a behavior of Maker that seems fairly odd and which
>>>> I
>>>> cannot explain at all. When I take the scaffolded genome (about 23000
>>>> scaffolds) I get roughly 9'000 maker approved gene models. Which is
>>>> admittedly a bit on the low side and we have to work on this. However,
>>>> when I break up the scaffolds into contigs at stretches of N longer
>>>> 500bp (about 60'000 contigs) I get about 17'000 maker gene models. Now
>>>> obviously 17'000 is more in the range what I would expect, so I am
>>>> inclined to go with these. I have looked at both annotations and the
>>>> evidence in WebApollo and the evidence alignments are identical for
>>>> both
>>>> runs. The approved genes seem to be the same, except for the additional
>>>> ones in the "contiged" genome version. The additional gene models are
>>>> not necessarily at the ends of the contigs, so I think it has nothing
>>>> to
>>>> do with having the stretches of Ns nearby in the scaffolded genome. Do
>>>> you have any idea why maker comes up with the additional numbers of
>>>> gene
>>>> models and how I could "convince" maker to give me the same gene models
>>>> for the scaffolded assembly?
>>>>
>>>> Cheers,
>>>> Stefan
>>>>
>>>>
>>>>
>>>> --
>>>> Stefan Zoller, PhD
>>>> Bioinformatics
>>>> Genetic Diversity Centre
>>>> ETH Zurich CHN E55.1
>>>> Universitätsstrasse 16
>>>> 8092 Zurich
>>>> Switzerland
>>>>
>>>> Phone: +41 44 632 66 85
>>>> E-Mail:  [hidden email]
>>>> Web: www.gdc.ethz.ch
>>>>
>>>>
>>> _______________________________________________
>>> maker-devel mailing list
>>> [hidden email]
>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

--
Stefan Zoller, PhD
Bioinformatics
Genetic Diversity Centre
ETH Zurich CHN E55.1
Universitätsstrasse 16
8092 Zurich
Switzerland

Phone: +41 44 632 66 85
E-Mail:  [hidden email]
Web: www.gdc.ethz.ch


_______________________________________________
maker-devel mailing list
[hidden email]
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
Reply | Threaded
Open this post in threaded view
|

Re: diff. numbers of geneson contigs vs. scaffolded genome

Carson Hinton Holt
>1) created a species specific repeat library, or actually several
>versions (e.g., filtered for hits on known plant transposable elements
>etc., or filtering out hits on proper plant proteins), and ran Maker
>with it on a subset of the genome. Whatever version of repeat library I
>use, I get +/- 5% the same number of Maker approved proteins. I get
>slightly more proteins with the "best" species specific repeat library,
>so I think it does make a difference, however not a big one.
>Interestingly, if I turn off the repeat masking totally, I get about 20%
>more Maker approved protein models. So either I am doing something
>totally wrong here or the repeat masking is working quite well with the
>specific repeat libraries.

You expect more proteins if you turn all repeat masking off because
transposons encode real proteins and there will be a lot of them. Some
plant species for example have inflated gene counts because they failed to
properly remove transposons during annotation, and removing these false
models is actually a major goal of many reannotation projects.   Also
because transposons can occur in the middle of a gene or in an intron, not
masking them can actually cause the predictor to not call the surrounding
genes (what you are really interested in), but rather you just a series of
transposons.  Try using RepeatModeler to build the repeat dataset.  It is
not so much that you only want repeats from your species in the dataset so
much as it is adding any novel repeats that will not be in any dataset.
For example, I normally run will all of RepBase together with the novel
repeats identified by RepeatModeler. You want to find everything you can.




>
>2) filtered the non-overlapping ab-initio proteins with PFAM domains
>according to your how-to. This works very nicely, thanks. However, I get
>quite a lot of models with PFAM hits, even when stringently filtering
>for e-value. For example, in the subset of the contiged genome I usually
>get around 300 Maker models. And now I have an additional 180 from the
>"non-overlapping-with-PFAM-domain" when filtering for e-value <1e-20.
>For e-value < 1e-10 it would be 280, almost twice the number of
>proteins. Extrapolating this to the full genome, this would be more than
>32'000 proteins. This seems a bit excessive and I am not sure if I am
>even supposed to use such a stringent e-value filtering. One reason of
>having so many additional proteins I can think of, is that augustus and
>snap are predicting similar non-overlapping models for the same location
>and of course they then both have a PFAM domain. I can actually see this
>for some locations when I load the data in WebApollo. I can think of a
>crude way to select only the "best" model for a location (while
>preferably also considering the already Maker approved protein) but I
>wonder if maybe there is already a solution for this in Maker?

The non-overlapping ab-initio proteins are already non-redundant. They
will not overlap each other or any of the genes already called by MAKER.
Also make sure you have identified novel repeats for your species or these
models will be full of transposons which WILL have PFAM domains. Just
reading the names of identified domains lets you know if it's a repeat
related protein.  Also you must have your gene predictors trained on your
species.  You cannot use a related species as your model if trying to add
genes via PFAM domain content.  This is because you will get fragmented
gene models from the predictors if you are using a related species, and
since there is no overlapping evidence alignment to help correct for this
(these are the unsupported models after all), then you will be adding very
poor models back in.


Thanks,
Carson










>
>In short, I think the repeat masking seems not to be the problem (And I
>think I have put quite some effort in the repeat library creation). On
>the other hand, there are a lot of "good" models in the non-overlapping
>proteins that could be filtered and promoted to proper models, if I only
>could make the right selection.
>
>Maybe, based on these additional informations you could point out
>additional tests, filtering approaches or analyses I could do to home-in
>to the "good" gene models in the non-overlapping gene models (or Maker
>approved gene models in general).
>
>Thanks again for your help!
>Stefan
>
>
>
>On 25.09.14 20:17, Carson Holt wrote:
>> Sorry for the slow reply.  I was trying to locate a script that might be
>> useful for you.
>>
>> I think a species specific repeat libary will be of most benefit here
>> (it's surprising just how influential this step is).  Also note that you
>> should train SNAP and Augustus on your species and are not just using
>> another related species as a stand in.
>>
>> With respect to PFAM domains, on some organisms you may not get a lot of
>> cross species protein alignments because of divergence or assembly
>>issues.
>> This of course makes it harder to support these models with direct
>>protein
>> alignments.  However you can run InterProscan over the
>> non-overlapping.proteins.fasta file produced by MAKER (contains
>> non-redundant rejected models).  Because an HMM is used for domain
>> identification, it can pick up protein domains that would not produce a
>> significant BLAST alignment because of divergence. You can then add
>>models
>> with positive hits for protein domains back into your gene set.
>>
>> This ad hoc procedure usually can only increase gene counts by about 10%
>> though for organisms where it's required. I've attached a script that
>> makes generating results for these genes easier.
>>
>> 1. First you run InterProScan with just PFAM.
>> 2. Then you take the IDs of all models that have a domain in the report
>> and create a list (1 ID per line).
>> 3. Next use the fasta_tool script that comes with MAKER together with
>>the
>> --select flag to separate just the positive hits (ID's in your list)
>>from
>> the non-overlapping.proteins.fasta and
>>non-overlapping.transscripts.fasta
>> files.
>> 4. Use the attached script to separate just the positive hits (your ID
>> list) from the GFF3. The script will upgrade match/match_part results to
>> gene/mRNA/exon/CDS results and print them out for you.
>> 5. Use the fasta_maerge and gff3_merge scripts that come with MAKER to
>> merge the selected/upgraded GFF3 entries and selected FASTA entries back
>> into the original MAKER results.
>>
>> --Carson
>>
>>
>>
>> On 9/23/14, 6:36 AM, "Stefan Zoller" <[hidden email]> wrote:
>>
>>> Please forgive my ignorance, I am not entirely sure if I understand
>>>your
>>> question correctly, but I will try to answer.
>>> As evidence we use:
>>> 1) our own transcriptome (trinity assembled RNAseq, filtering out the
>>> very low expression transcripts).
>>> 2) all swissprot plant proteins, and several protein sets from closely
>>> related plant species downloaded from NCBI.
>>> I am not sure if the ab-initio predictions without evidence have pfamm
>>> domains. Honestly, I would not know how to tell and how to
>>> include/exclude.
>>> I was assuming that we should not have too many Maker approved
>>> predictions without evidence anyway, because we use "keeps_preds=0".
>>> The numbers of gene predictions I mentioned in my email are the
>>> predictions reported by the fasta_merge/gff3_merge scripts in the
>>> "*maker.proteins.fasta". There are of course many more predictions in
>>> e.g., "*maker.augustus_masked.proteins.fasta" (about 68'000 in this
>>>file).
>>>
>>> I hope I am not totally off with my answer.
>>> Cheers, Stefan
>>>
>>>
>>>
>>> On 23.09.14 02:10, Mark Yandell wrote:
>>>> Also are you numbers including the ab-inito predictions without
>>>> evidence that have pfamm domains?
>>>>
>>>> cheers,
>>>>
>>>>
>>>> --mark
>>>>
>>>>
>>>>
>>>> Mark Yandell
>>>> Professor of Human Genetics
>>>> H.A. & Edna Benning Presidential Endowed Chair
>>>> Co-director USTAR Center for Genetic Discovery
>>>> Eccles Institute of Human Genetics
>>>> University of Utah
>>>> 15 North 2030 East, Room 2100
>>>> Salt Lake City, UT 84112-5330
>>>> ph:801-587-7707
>>>>
>>>> ________________________________________
>>>> From: maker-devel [[hidden email]] on behalf of
>>>> Carson Holt [[hidden email]]
>>>> Sent: Monday, September 22, 2014 2:17 PM
>>>> To: [hidden email]; [hidden email]
>>>> Subject: Re: [maker-devel] diff. numbers of geneson contigs vs.
>>>> scaffolded      genome
>>>>
>>>> The contiged assembly is more likely to give spurious hits and
>>>> alignments.
>>>>    They also can be harder to repeat mask.  Also gene predictors can
>>>> behave
>>>> slightly different on small sequences than on longer ones.  If you
>>>>have
>>>> fewer gene models than you expect, your first step should be to
>>>>process
>>>> the scaffolds with CEGMA.  It will give you an estimate of the genomes
>>>> "completeness".  If CEGMA gives a 60% completeness value for example
>>>> then
>>>> you can expect to only recover 60% of the expected number of genes.
>>>>Next
>>>> you should run RepeatModeler of similar software to help generate a
>>>> species specific repeat library.  Under masked repeats can make
>>>> predicting
>>>> genes on longer scaffolds far more difficult for ab initio predictors.
>>>>
>>>> --Carson
>>>>
>>>>
>>>> On 9/19/14, 12:32 AM, "Stefan Zoller" <[hidden email]>
>>>>wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I am working on the annotation of a plant genome (about 600MB) and we
>>>>> have a reasonable draft assembly, a fairly good transcriptome and
>>>>>quite
>>>>> a few proteins from related species. We have also extensively trained
>>>>> augustus and are also feeding genmark and snap predictions.
>>>>>
>>>>> Recently I noticed a behavior of Maker that seems fairly odd and
>>>>>which
>>>>> I
>>>>> cannot explain at all. When I take the scaffolded genome (about 23000
>>>>> scaffolds) I get roughly 9'000 maker approved gene models. Which is
>>>>> admittedly a bit on the low side and we have to work on this.
>>>>>However,
>>>>> when I break up the scaffolds into contigs at stretches of N longer
>>>>> 500bp (about 60'000 contigs) I get about 17'000 maker gene models.
>>>>>Now
>>>>> obviously 17'000 is more in the range what I would expect, so I am
>>>>> inclined to go with these. I have looked at both annotations and the
>>>>> evidence in WebApollo and the evidence alignments are identical for
>>>>> both
>>>>> runs. The approved genes seem to be the same, except for the
>>>>>additional
>>>>> ones in the "contiged" genome version. The additional gene models are
>>>>> not necessarily at the ends of the contigs, so I think it has nothing
>>>>> to
>>>>> do with having the stretches of Ns nearby in the scaffolded genome.
>>>>>Do
>>>>> you have any idea why maker comes up with the additional numbers of
>>>>> gene
>>>>> models and how I could "convince" maker to give me the same gene
>>>>>models
>>>>> for the scaffolded assembly?
>>>>>
>>>>> Cheers,
>>>>> Stefan
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Stefan Zoller, PhD
>>>>> Bioinformatics
>>>>> Genetic Diversity Centre
>>>>> ETH Zurich CHN E55.1
>>>>> Universitätsstrasse 16
>>>>> 8092 Zurich
>>>>> Switzerland
>>>>>
>>>>> Phone: +41 44 632 66 85
>>>>> E-Mail:  [hidden email]
>>>>> Web: www.gdc.ethz.ch
>>>>>
>>>>>
>>>> _______________________________________________
>>>> maker-devel mailing list
>>>> [hidden email]
>>>>
>>>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>
>--
>Stefan Zoller, PhD
>Bioinformatics
>Genetic Diversity Centre
>ETH Zurich CHN E55.1
>Universitätsstrasse 16
>8092 Zurich
>Switzerland
>
>Phone: +41 44 632 66 85
>E-Mail:  [hidden email]
>Web: www.gdc.ethz.ch
>

_______________________________________________
maker-devel mailing list
[hidden email]
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
Reply | Threaded
Open this post in threaded view
|

Re: diff. numbers of geneson contigs vs. scaffolded genome

Carson Hinton Holt
--> Should I filter them by e-value or some other parameter before
promoting them to an "approved" status? If it's the e-value, what
threshold would be preferable?



Given the lack of evidence from aligned proteins or ESTs (and the fact
that ab initio predictors over predict so much), I don't put much stock in
the e-values.  Without some form of evidence supporting them, they are all
pretty much just as likely as any other. The PFAM domain at least provides
an independent form of evidence support.

One thing to note is that some genomes have low gene counts because of
assembly errors.  You can get a good CEGMA score because the conserved
genes CEGMA looks at are very very short compared to most genes, but then
because of assembly issues long genes don't appear well.  In cases like
these you are more likely to end up with fragmented gene models relative
to true gene model.

The honeybee genome is an example.  They went from ~10,000 genes to
~15,000 on the reannotation after improving both their repeat database and
fixing certain assembly issues.

Thanks,
Carson



On 10/1/14, 11:58 AM, "Stefan Zoller" <[hidden email]> wrote:

>Thanks for the swift answer. I just add a few clarifications below,
>because I might have omitted some information.
>On 01.10.14 18:20, Carson Holt wrote:
>>> 1) created a species specific repeat library, or actually several
>>> versions (e.g., filtered for hits on known plant transposable elements
>>> etc., or filtering out hits on proper plant proteins), and ran Maker
>>> with it on a subset of the genome. Whatever version of repeat library I
>>> use, I get +/- 5% the same number of Maker approved proteins. I get
>>> slightly more proteins with the "best" species specific repeat library,
>>> so I think it does make a difference, however not a big one.
>>> Interestingly, if I turn off the repeat masking totally, I get about
>>>20%
>>> more Maker approved protein models. So either I am doing something
>>> totally wrong here or the repeat masking is working quite well with the
>>> specific repeat libraries.
>> You expect more proteins if you turn all repeat masking off because
>> transposons encode real proteins and there will be a lot of them. Some
>> plant species for example have inflated gene counts because they failed
>>to
>> properly remove transposons during annotation, and removing these false
>> models is actually a major goal of many reannotation projects.   Also
>> because transposons can occur in the middle of a gene or in an intron,
>>not
>> masking them can actually cause the predictor to not call the
>>surrounding
>> genes (what you are really interested in), but rather you just a series
>>of
>> transposons.  Try using RepeatModeler to build the repeat dataset.  It
>>is
>> not so much that you only want repeats from your species in the dataset
>>so
>> much as it is adding any novel repeats that will not be in any dataset.
>> For example, I normally run will all of RepBase together with the novel
>> repeats identified by RepeatModeler. You want to find everything you
>>can.
>I have used RepeatModeler and LTRharvester and MITE and have then
>filtered the combined dataset to remove "real" plant proteins that got
>in there accidentally. I am quite happy with the result. And in Maker I
>am also using including the repeat libraries of other plant species. So
>I am pretty much following your advice.
>>> 2) filtered the non-overlapping ab-initio proteins with PFAM domains
>>> according to your how-to. This works very nicely, thanks. However, I
>>>get
>>> quite a lot of models with PFAM hits, even when stringently filtering
>>> for e-value. For example, in the subset of the contiged genome I
>>>usually
>>> get around 300 Maker models. And now I have an additional 180 from the
>>> "non-overlapping-with-PFAM-domain" when filtering for e-value <1e-20.
>>> For e-value < 1e-10 it would be 280, almost twice the number of
>>> proteins. Extrapolating this to the full genome, this would be more
>>>than
>>> 32'000 proteins. This seems a bit excessive and I am not sure if I am
>>> even supposed to use such a stringent e-value filtering. One reason of
>>> having so many additional proteins I can think of, is that augustus and
>>> snap are predicting similar non-overlapping models for the same
>>>location
>>> and of course they then both have a PFAM domain. I can actually see
>>>this
>>> for some locations when I load the data in WebApollo. I can think of a
>>> crude way to select only the "best" model for a location (while
>>> preferably also considering the already Maker approved protein) but I
>>> wonder if maybe there is already a solution for this in Maker?
>> The non-overlapping ab-initio proteins are already non-redundant. They
>> will not overlap each other or any of the genes already called by MAKER.
>> Also make sure you have identified novel repeats for your species or
>>these
>> models will be full of transposons which WILL have PFAM domains. Just
>> reading the names of identified domains lets you know if it's a repeat
>> related protein.  Also you must have your gene predictors trained on
>>your
>> species.  You cannot use a related species as your model if trying to
>>add
>> genes via PFAM domain content.  This is because you will get fragmented
>> gene models from the predictors if you are using a related species, and
>> since there is no overlapping evidence alignment to help correct for
>>this
>> (these are the unsupported models after all), then you will be adding
>>very
>> poor models back in.
>OK, I was not aware of these models not overlapping each other. I must
>have looked at the wrong models in WebApollo then. The old Apollo was so
>much easier to set up...
>I had a look at the names in the interproscan output and less than 5% of
>all the models with domains have a name which is clearly repeat-related
>(e.g., PPR repeat, or G-beta repeat).
>I have also spent a lot of time on training Augustus and SNAP on our
>species. Especially the Augustus predictions look rather good I think.
>So also here I am following rather closely your advice. And I must say I
>am VERY grateful for the extensive help and advice you offer, because,
>being almost a one-man-show, it would not be possible for me to do all
>this work without it.
>
>In the end the "mystery" of having different numbers of models in the
>scaffolded vs. contiged genome is partially solved or at least explained.
>One thing that you could maybe give a quick answer: I will go ahead and
>select some of the non-overlapping ab-initio proteins with PFAM domains.
>Should I filter them by e-value or some other parameter before promoting
>them to an "approved" status? If it's the e-value, what threshold would
>be preferable?
>
>Thanks again!
>Stefan
>
>>
>> Thanks,
>> Carson
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>> In short, I think the repeat masking seems not to be the problem (And I
>>> think I have put quite some effort in the repeat library creation). On
>>> the other hand, there are a lot of "good" models in the non-overlapping
>>> proteins that could be filtered and promoted to proper models, if I
>>>only
>>> could make the right selection.
>>>
>>> Maybe, based on these additional informations you could point out
>>> additional tests, filtering approaches or analyses I could do to
>>>home-in
>>> to the "good" gene models in the non-overlapping gene models (or Maker
>>> approved gene models in general).
>>>
>>> Thanks again for your help!
>>> Stefan
>>>
>>>
>>>
>>> On 25.09.14 20:17, Carson Holt wrote:
>>>> Sorry for the slow reply.  I was trying to locate a script that might
>>>>be
>>>> useful for you.
>>>>
>>>> I think a species specific repeat libary will be of most benefit here
>>>> (it's surprising just how influential this step is).  Also note that
>>>>you
>>>> should train SNAP and Augustus on your species and are not just using
>>>> another related species as a stand in.
>>>>
>>>> With respect to PFAM domains, on some organisms you may not get a lot
>>>>of
>>>> cross species protein alignments because of divergence or assembly
>>>> issues.
>>>> This of course makes it harder to support these models with direct
>>>> protein
>>>> alignments.  However you can run InterProscan over the
>>>> non-overlapping.proteins.fasta file produced by MAKER (contains
>>>> non-redundant rejected models).  Because an HMM is used for domain
>>>> identification, it can pick up protein domains that would not produce
>>>>a
>>>> significant BLAST alignment because of divergence. You can then add
>>>> models
>>>> with positive hits for protein domains back into your gene set.
>>>>
>>>> This ad hoc procedure usually can only increase gene counts by about
>>>>10%
>>>> though for organisms where it's required. I've attached a script that
>>>> makes generating results for these genes easier.
>>>>
>>>> 1. First you run InterProScan with just PFAM.
>>>> 2. Then you take the IDs of all models that have a domain in the
>>>>report
>>>> and create a list (1 ID per line).
>>>> 3. Next use the fasta_tool script that comes with MAKER together with
>>>> the
>>>> --select flag to separate just the positive hits (ID's in your list)
>>>> from
>>>> the non-overlapping.proteins.fasta and
>>>> non-overlapping.transscripts.fasta
>>>> files.
>>>> 4. Use the attached script to separate just the positive hits (your ID
>>>> list) from the GFF3. The script will upgrade match/match_part results
>>>>to
>>>> gene/mRNA/exon/CDS results and print them out for you.
>>>> 5. Use the fasta_maerge and gff3_merge scripts that come with MAKER to
>>>> merge the selected/upgraded GFF3 entries and selected FASTA entries
>>>>back
>>>> into the original MAKER results.
>>>>
>>>> --Carson
>>>>
>>>>
>>>>
>>>> On 9/23/14, 6:36 AM, "Stefan Zoller" <[hidden email]>
>>>>wrote:
>>>>
>>>>> Please forgive my ignorance, I am not entirely sure if I understand
>>>>> your
>>>>> question correctly, but I will try to answer.
>>>>> As evidence we use:
>>>>> 1) our own transcriptome (trinity assembled RNAseq, filtering out the
>>>>> very low expression transcripts).
>>>>> 2) all swissprot plant proteins, and several protein sets from
>>>>>closely
>>>>> related plant species downloaded from NCBI.
>>>>> I am not sure if the ab-initio predictions without evidence have
>>>>>pfamm
>>>>> domains. Honestly, I would not know how to tell and how to
>>>>> include/exclude.
>>>>> I was assuming that we should not have too many Maker approved
>>>>> predictions without evidence anyway, because we use "keeps_preds=0".
>>>>> The numbers of gene predictions I mentioned in my email are the
>>>>> predictions reported by the fasta_merge/gff3_merge scripts in the
>>>>> "*maker.proteins.fasta". There are of course many more predictions in
>>>>> e.g., "*maker.augustus_masked.proteins.fasta" (about 68'000 in this
>>>>> file).
>>>>>
>>>>> I hope I am not totally off with my answer.
>>>>> Cheers, Stefan
>>>>>
>>>>>
>>>>>
>>>>> On 23.09.14 02:10, Mark Yandell wrote:
>>>>>> Also are you numbers including the ab-inito predictions without
>>>>>> evidence that have pfamm domains?
>>>>>>
>>>>>> cheers,
>>>>>>
>>>>>>
>>>>>> --mark
>>>>>>
>>>>>>
>>>>>>
>>>>>> Mark Yandell
>>>>>> Professor of Human Genetics
>>>>>> H.A. & Edna Benning Presidential Endowed Chair
>>>>>> Co-director USTAR Center for Genetic Discovery
>>>>>> Eccles Institute of Human Genetics
>>>>>> University of Utah
>>>>>> 15 North 2030 East, Room 2100
>>>>>> Salt Lake City, UT 84112-5330
>>>>>> ph:801-587-7707
>>>>>>
>>>>>> ________________________________________
>>>>>> From: maker-devel [[hidden email]] on behalf of
>>>>>> Carson Holt [[hidden email]]
>>>>>> Sent: Monday, September 22, 2014 2:17 PM
>>>>>> To: [hidden email]; [hidden email]
>>>>>> Subject: Re: [maker-devel] diff. numbers of geneson contigs vs.
>>>>>> scaffolded      genome
>>>>>>
>>>>>> The contiged assembly is more likely to give spurious hits and
>>>>>> alignments.
>>>>>>     They also can be harder to repeat mask.  Also gene predictors
>>>>>>can
>>>>>> behave
>>>>>> slightly different on small sequences than on longer ones.  If you
>>>>>> have
>>>>>> fewer gene models than you expect, your first step should be to
>>>>>> process
>>>>>> the scaffolds with CEGMA.  It will give you an estimate of the
>>>>>>genomes
>>>>>> "completeness".  If CEGMA gives a 60% completeness value for example
>>>>>> then
>>>>>> you can expect to only recover 60% of the expected number of genes.
>>>>>> Next
>>>>>> you should run RepeatModeler of similar software to help generate a
>>>>>> species specific repeat library.  Under masked repeats can make
>>>>>> predicting
>>>>>> genes on longer scaffolds far more difficult for ab initio
>>>>>>predictors.
>>>>>>
>>>>>> --Carson
>>>>>>
>>>>>>
>>>>>> On 9/19/14, 12:32 AM, "Stefan Zoller" <[hidden email]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I am working on the annotation of a plant genome (about 600MB) and
>>>>>>>we
>>>>>>> have a reasonable draft assembly, a fairly good transcriptome and
>>>>>>> quite
>>>>>>> a few proteins from related species. We have also extensively
>>>>>>>trained
>>>>>>> augustus and are also feeding genmark and snap predictions.
>>>>>>>
>>>>>>> Recently I noticed a behavior of Maker that seems fairly odd and
>>>>>>> which
>>>>>>> I
>>>>>>> cannot explain at all. When I take the scaffolded genome (about
>>>>>>>23000
>>>>>>> scaffolds) I get roughly 9'000 maker approved gene models. Which is
>>>>>>> admittedly a bit on the low side and we have to work on this.
>>>>>>> However,
>>>>>>> when I break up the scaffolds into contigs at stretches of N longer
>>>>>>> 500bp (about 60'000 contigs) I get about 17'000 maker gene models.
>>>>>>> Now
>>>>>>> obviously 17'000 is more in the range what I would expect, so I am
>>>>>>> inclined to go with these. I have looked at both annotations and
>>>>>>>the
>>>>>>> evidence in WebApollo and the evidence alignments are identical for
>>>>>>> both
>>>>>>> runs. The approved genes seem to be the same, except for the
>>>>>>> additional
>>>>>>> ones in the "contiged" genome version. The additional gene models
>>>>>>>are
>>>>>>> not necessarily at the ends of the contigs, so I think it has
>>>>>>>nothing
>>>>>>> to
>>>>>>> do with having the stretches of Ns nearby in the scaffolded genome.
>>>>>>> Do
>>>>>>> you have any idea why maker comes up with the additional numbers of
>>>>>>> gene
>>>>>>> models and how I could "convince" maker to give me the same gene
>>>>>>> models
>>>>>>> for the scaffolded assembly?
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Stefan
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Stefan Zoller, PhD
>>>>>>> Bioinformatics
>>>>>>> Genetic Diversity Centre
>>>>>>> ETH Zurich CHN E55.1
>>>>>>> Universitätsstrasse 16
>>>>>>> 8092 Zurich
>>>>>>> Switzerland
>>>>>>>
>>>>>>> Phone: +41 44 632 66 85
>>>>>>> E-Mail:  [hidden email]
>>>>>>> Web: www.gdc.ethz.ch
>>>>>>>
>>>>>>>
>>>>>> _______________________________________________
>>>>>> maker-devel mailing list
>>>>>> [hidden email]
>>>>>>
>>>>>>
>>>>>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.o
>>>>>>rg
>>> --
>>> Stefan Zoller, PhD
>>> Bioinformatics
>>> Genetic Diversity Centre
>>> ETH Zurich CHN E55.1
>>> Universitätsstrasse 16
>>> 8092 Zurich
>>> Switzerland
>>>
>>> Phone: +41 44 632 66 85
>>> E-Mail:  [hidden email]
>>> Web: www.gdc.ethz.ch
>>>
>
>--
>Stefan Zoller, PhD
>Bioinformatics
>Genetic Diversity Centre
>ETH Zurich CHN E55.1
>Universitätsstrasse 16
>8092 Zurich
>Switzerland
>
>Phone: +41 44 632 66 85
>E-Mail:  [hidden email]
>Web: www.gdc.ethz.ch
>

_______________________________________________
maker-devel mailing list
[hidden email]
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org