Suggestions if too many predicted genes


Quanwei Zhang
Hello:

Thank you for all your previous comments and suggestions. We annotated a new rodent species using the MAKER2 pipeline. The assembly is about 3.2 Gb with an N50 of 24.3 Mb. I included all scaffolds longer than 300 bp in the gene annotation (about 250k scaffolds).

For repeat masking, we also built a species-specific library. We used both transcriptome and protein sequences as evidence (including 10k reviewed mammalian and 340k predicted rodent protein sequences from UniProt). We predicted 28,800 genes with AED < 1 (the "default" gene set).

Of the 28,800 predicted proteins, about 90% have an AED below 0.5, and 74% have domains identified by InterProScan. The genome seems well annotated, but I still feel that 28,800 protein-coding genes are too many for a rodent species. Do you think this gene set is good for downstream analysis (e.g., gene family expansion analysis, positive selection analysis)? Or can I do further filtering to bring the number of genes closer to the estimated number (e.g., 22,000)?

Thanks

Best
Quanwei


_______________________________________________
maker-devel mailing list
[hidden email]
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
Re: Suggestions if too many predicted genes

Daniel Ence-2
Hi Quanwei, I think that your genome assembly probably contains many contigs that are too small to contain full gene sequences. Rather than 300 bp, a minimum scaffold length of 5 kbp or 10 kbp is a more useful threshold. This is mentioned in the maker_opts.ctl file under the min_contig parameter: “skip genome contigs below this length (under 10kbp are often useless)”.

I don’t know how many genes are annotated on small (<10 kbp) scaffolds and contigs, but excluding those contigs would probably reduce your gene count. The genes on these sequences may be fragments or duplicates of genes that weren’t assembled properly.
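One rough way to check this before rerunning anything is to count the gene features that sit on short scaffolds. The sketch below assumes a genome FASTA and a MAKER GFF3 in which genes appear as features of type "gene"; the function names and the 10 kbp default are illustrative choices, not part of MAKER.

```python
from collections import Counter


def scaffold_lengths(fasta_lines):
    """Return {scaffold_id: length} from an iterable of FASTA lines."""
    lengths = {}
    name = None
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token as the scaffold ID.
            name = line[1:].split()[0]
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def genes_on_short_scaffolds(gff_lines, lengths, min_len=10_000):
    """Count 'gene' features per scaffold shorter than min_len (hypothetical cut-off)."""
    counts = Counter()
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # an embedded sequence section follows; no more features
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "gene":
            continue
        # Scaffolds missing from the FASTA are treated as "not short".
        if lengths.get(cols[0], min_len) < min_len:
            counts[cols[0]] += 1
    return counts


if __name__ == "__main__":
    # Tiny made-up inputs; in practice pass open file handles instead.
    fasta = [">scafA some description", "ACGT" * 5, ">scafB", "ACGT" * 3000]
    gff = [
        "##gff-version 3",
        "scafA\tmaker\tgene\t1\t18\t.\t+\t.\tID=g1",
        "scafB\tmaker\tgene\t10\t900\t.\t+\t.\tID=g2",
    ]
    short = genes_on_short_scaffolds(gff, scaffold_lengths(fasta))
    print(sum(short.values()), "gene(s) on scaffolds shorter than 10 kbp")
```

If the count turns out to be small, raising min_contig will not change the total much and the excess is more likely to come from elsewhere.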

Also, using predicted protein sequences from UniProt as evidence in your annotation is probably not advisable, since those sequences are not from genes with experimental evidence. This is the TrEMBL vs. Swiss-Prot issue that you asked about earlier.

Additionally, requiring a minimum protein length, as you asked about earlier, could also reduce the gene count.

Ultimately, you may do whatever filtering you find necessary and justifiable for your annotation, depending on the biology of your organism and the methods that generated your assembly and your annotation.
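A minimum-length filter on the predicted protein FASTA could look roughly like the sketch below. The function names and the 50-residue default are placeholders chosen for illustration; a sensible cut-off depends on the organism, since genuinely short proteins exist.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records, min_aa=50):
    """Keep proteins with at least min_aa residues.

    min_aa is a hypothetical threshold; a trailing '*' (stop symbol)
    is not counted towards the length.
    """
    for header, seq in records:
        if len(seq.rstrip("*")) >= min_aa:
            yield header, seq


if __name__ == "__main__":
    # Made-up records; in practice iterate over an open protein FASTA file.
    demo = [">short_one", "MKTLLV*", ">long_one", "M" * 60]
    for header, seq in filter_by_length(read_fasta(demo)):
        print(header, len(seq))
```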

Hope this helps, 
Daniel

In reply to Quanwei Zhang's message of Sep 27, 2017, 10:30 AM.
Re: Suggestions if too many predicted genes

Michael Campbell
Hi Quanwei,

The first thing that comes to mind with too many genes is undermasked repeats. You could check the Pfam domains for things like integrase, GAG proteins, and other transposon-related domains. I would also look a bit closer at the genes with AEDs greater than 0.5. Looking at things like the average number of exons per transcript and the average gene and transcript lengths can help pick out dodgy genes. You could also do some filtering on the QI values output by MAKER. It is defensible to create a “higher quality” set by limiting it to genes with AEDs less than 0.5 and putting some requirement on the fraction of splice sites confirmed by EST/mRNA-seq alignments.
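As a minimal sketch of that kind of filter, the code below reads mRNA lines from a MAKER GFF3 and keeps those passing an AED cut-off and a splice-site support cut-off. It assumes the mRNA attributes carry `_AED` and `_QI`, and that in the pipe-separated `_QI` string the second field is the fraction of splice sites confirmed by EST alignments and the seventh is the exon count; please verify that layout against your own MAKER output. The function names, the 0.5 thresholds, and the choice to keep single-exon transcripts (which have no splice sites to confirm) are illustrative assumptions.

```python
def parse_attributes(field):
    """Turn a GFF3 attribute column into a dict."""
    attrs = {}
    for item in field.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key] = value
    return attrs


def high_quality_mrnas(gff_lines, max_aed=0.5, min_splice_support=0.5):
    """Yield (mRNA ID, AED, splice support, exon count) for mRNAs passing both cut-offs."""
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # embedded sequences follow; no more features
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "mRNA":
            continue
        attrs = parse_attributes(cols[8])
        if "_AED" not in attrs or "_QI" not in attrs:
            continue
        aed = float(attrs["_AED"])
        qi = attrs["_QI"].split("|")
        splice_support = float(qi[1])  # assumed: fraction of splice sites confirmed by ESTs
        n_exons = int(qi[6])           # assumed: number of exons in the mRNA
        # Assumption: single-exon transcripts are kept on AED alone.
        supported = n_exons == 1 or splice_support >= min_splice_support
        if aed < max_aed and supported:
            yield attrs.get("ID", ""), aed, splice_support, n_exons


if __name__ == "__main__":
    # Made-up example lines in the expected layout.
    demo = [
        "s1\tmaker\tmRNA\t1\t900\t.\t+\t.\tID=m1;Parent=g1;_AED=0.10;_eAED=0.10;_QI=12|1|1|1|0.5|0.66|3|80|210",
        "s1\tmaker\tmRNA\t1\t900\t.\t+\t.\tID=m2;Parent=g2;_AED=0.70;_eAED=0.70;_QI=0|1|1|1|1|1|2|0|100",
    ]
    for record in high_quality_mrnas(demo):
        print(record)
```

The exon count returned for each transcript can also feed the exons-per-transcript summary mentioned above.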

Take care,
Mike

In reply to Daniel Ence's message of Sep 27, 2017, 10:54 AM.
Re: Suggestions if too many predicted genes

Xabier Vázquez Campos
Hi Quanwei,
Following up on Michael's comment, even if you use Swiss-Prot, there are over 2,700 transposases in it. If there is some undermasking, they will show up as evidence.
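A quick way to see whether such hits have leaked into the gene set is to scan the InterProScan results for transposon-like domain descriptions. The sketch below assumes the standard tab-separated InterProScan output, where column 1 is the protein ID, column 4 the analysis name (e.g. Pfam), and column 6 the signature description. The keyword list is only an illustrative starting point, not a curated set, and the function name is made up.

```python
import re

# Illustrative keywords only; extend or tighten for your own data.
TE_PATTERN = re.compile(
    r"integrase|transposase|reverse transcriptase|retrotransposon|\bgag\b",
    re.IGNORECASE,
)


def te_like_proteins(tsv_lines, pattern=TE_PATTERN, analysis="Pfam"):
    """Map protein ID -> set of signature descriptions matching the pattern.

    Set analysis=None to scan every member database, not just Pfam.
    """
    hits = {}
    for line in tsv_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 6:
            continue
        if analysis is not None and cols[3] != analysis:
            continue
        protein_id, description = cols[0], cols[5]
        if pattern.search(description):
            hits.setdefault(protein_id, set()).add(description)
    return hits


if __name__ == "__main__":
    # Made-up rows in the assumed column layout.
    demo = [
        "p1\tmd5\t300\tPfam\tPF00665\tIntegrase core domain\t10\t120\t1e-20\tT\t01-01-2017",
        "p2\tmd5\t250\tPfam\tPF00069\tProtein kinase domain\t5\t240\t1e-50\tT\t01-01-2017",
    ]
    for protein_id, descriptions in te_like_proteins(demo).items():
        print(protein_id, sorted(descriptions))
```

Flagged proteins are candidates for manual review rather than automatic removal, since some host genes carry transposon-derived domains.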
Cheers,
Xabi

In reply to Michael Campbell's message of 28 September 2017, 01:34.

--
Xabier Vázquez-Campos, PhD
Research Associate
NSW Systems Biology Initiative
School of Biotechnology and Biomolecular Sciences
The University of New South Wales
Sydney NSW 2052 AUSTRALIA

Re: Suggestions if too many predicted genes

Quanwei Zhang
Thank you all for your comments and suggestions. Yes, even when I use only Swiss-Prot, I still get 26.5k protein-coding genes. As you mentioned, one reason may be related to repeat masking, and another may be the inclusion of short scaffolds, which leads to protein fragments.

For repeat masking, I used the latest RepeatMasker and Repbase (with mammalian repeats selected), and I also built species-specific repeat libraries following http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic. As for transposases, I know the MAKER pipeline already provides "transposable element proteins". I do not know what else I can do.

As for the short scaffolds, in fact only about 400 of the 26.5k genes are predicted on scaffolds shorter than 10 kb. Besides, I know there are some very short proteins (e.g., the mouse protein RL41, a 60S ribosomal protein, has length 25). I think short scaffolds may also contain some short proteins.

Now I plan to start from the 26.5k protein-coding genes. I think the less reliable ones will be filtered out in downstream analysis. For example, when we construct the gene families, fragments or falsely predicted proteins are more likely to be excluded from the gene families.

Thank you all for your suggestions.

Best
Quanwei

In reply to Xabier Vázquez-Campos's message of 2017-09-27 20:32 GMT-04:00.