question on gene numbers with quality_filter.pl

Willett, Christopher S
Hello-

We are getting to the final stages (hopefully) of a reannotation of a new assembly of a copepod genome using MAKER, and we have some questions about which set of genes to use. Our latest runs used Pfam domains to define the default vs. standard sets with the quality_filter.pl script, and I have a question about the stringency of this script's filters. It appears that the default set is more stringent than the output we get from MAKER without the script (all with the AED max set to 1). Are there additional filters in this script beyond AED that would cause this?

Here is what we are seeing, in case more details are helpful. Our final MAKER run gives ~21500 predicted genes with keep_preds turned on and ~15200 with it turned off. What I was wondering is why this 15200 is higher than the default set (~14500 genes) that we get after filtering the gff with the -d setting in quality_filter.pl. For completeness, the standard set (-s setting) retains ~14800 genes, and filtering the 15200-gene gff file with the default parameters yields ~14100 genes. So I was curious what else is going on in the filter script beyond AED that would trim out genes.

The gene sets look pretty good overall and the numbers seem reasonable, so we are debating which set to use as our final set. I am also trying a few other analyses in InterProScan to see whether they identify additional genes beyond Pfam for retention, but that seems fairly independent of the question above.

Thanks for your help,

Best,

Chris Willett

  


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Research Associate Professor
Department of Biology
CB#3280 Coker Hall
University of North Carolina, Chapel Hill
Chapel Hill, NC, 27599-3280 

Office: 2252 Genome Science Building
phone: 919-843-8663
fax: 919-962-1625



_______________________________________________
maker-devel mailing list
[hidden email]
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

Re: question on gene numbers with quality_filter.pl

Michael Campbell
Hi Chris,

This is interesting. The -d option in quality_filter.pl should only filter out genes based on AED. Is there a chance that you counted transcripts instead of genes? If a transcript has an AED of 1, quality_filter.pl should remove it but leave the gene and any transcripts with AEDs less than 1. I can have a look if you send me one of the genes (in GFF3 format) that was filtered out by quality_filter.pl even though it had an AED less than 1.
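The filtering behavior described above can be illustrated with a minimal Python sketch. This is not the actual quality_filter.pl code: the function name `filter_by_aed`, the dictionary layout, and the example IDs are all hypothetical, and dropping a gene once none of its transcripts pass is an assumption made for the illustration.

```python
# Hypothetical illustration only; NOT the real quality_filter.pl logic.
def filter_by_aed(genes, max_aed=1.0):
    """Keep transcripts with AED strictly below max_aed.

    `genes` maps a gene ID to a dict of {transcript ID: AED}.
    Assumption: a gene is dropped only when no transcript passes.
    """
    kept = {}
    for gene_id, transcripts in genes.items():
        passing = {tid: aed for tid, aed in transcripts.items() if aed < max_aed}
        if passing:
            kept[gene_id] = passing
    return kept


# Made-up example data: geneA has one good and one AED=1 transcript,
# geneB has only an AED=1 transcript.
example = {
    "geneA": {"geneA-mRNA-1": 0.12, "geneA-mRNA-2": 1.00},
    "geneB": {"geneB-mRNA-1": 1.00},
}
filtered = filter_by_aed(example)
print(filtered)
```

Under these assumptions, geneA survives with a single transcript while geneB disappears entirely, which is one way gene counts could drop after filtering even when only AED is considered.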

Thanks,
Mike


In reply to the Sep 29, 2017, 1:20 PM post by Willett, Christopher S

Re: question on gene numbers with quality_filter.pl

Willett, Christopher S
Hi Mike-

Thanks for getting back to me. I was using grep -cP '\tgene\t' to count the genes, and it gives me the same numbers I got before when I counted either the transcripts or the genes in the fasta output files from our original run. I will have to look at the files a bit more to see if I can find examples of genes that fit what you are describing.
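As a cross-check on the grep count, gene and mRNA features can be tallied separately in one pass. The sketch below assumes a tab-delimited GFF3 file with the feature type in the third column; the function name `count_feature_types` and the sample records are hypothetical.

```python
from collections import Counter


def count_feature_types(gff_lines):
    """Tally the feature type (column 3) of every GFF3 record."""
    counts = Counter()
    for line in gff_lines:
        # Skip blank lines, directives, and comments.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3:
            counts[fields[2]] += 1
    return counts


# Made-up records: one gene with two transcripts.
sample = [
    "##gff-version 3",
    "chr1\tmaker\tgene\t100\t900\t.\t+\t.\tID=g1",
    "chr1\tmaker\tmRNA\t100\t900\t.\t+\t.\tID=g1-mRNA-1;Parent=g1",
    "chr1\tmaker\tmRNA\t100\t800\t.\t+\t.\tID=g1-mRNA-2;Parent=g1",
]
counts = count_feature_types(sample)
print(counts["gene"], counts["mRNA"])
```

For a real file, the same function could be called as `count_feature_types(open(path))`, where `path` is the GFF3 file in question. If the gene and mRNA counts differ, that would point toward transcripts and genes being mixed up in the comparison.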

Best,

Chris


In reply to the Oct 2, 2017, 9:30 AM post by Michael Campbell

Re: question on gene numbers with quality_filter.pl

Willett, Christopher S
In reply to this post by Michael Campbell
Hi Mike-

Here is the control file for the last run of MAKER with keep_preds=0, and here is an example of one mRNA retained in the gff file:

Chromosome_6 maker mRNA 556000 557215 . + . ID=maker-Chromosome_6-exonerate_est2genome-gene-5.3-mRNA-1;Parent=maker-Chromosome_6-exonerate_est2genome-gene-5.3;Name=TCALIF_02833-PA;_AED=1.00;_eAED=1.00;_QI=15|0|0|0|1|1|2|75|338;score=100;Alias=TCALIF_02833-PA
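The `_AED` value in a record like the one above can be pulled out of the attributes column programmatically. This is a minimal sketch, assuming the usual GFF3 `key=value;key=value` attribute layout; the helper name `parse_attributes` is hypothetical, and the attribute string is copied from the example record.

```python
def parse_attributes(column9):
    """Split a GFF3 attributes column into a {key: value} dict."""
    attrs = {}
    for pair in column9.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key] = value
    return attrs


# Attributes column of the mRNA record shown above.
attributes = (
    "ID=maker-Chromosome_6-exonerate_est2genome-gene-5.3-mRNA-1;"
    "Parent=maker-Chromosome_6-exonerate_est2genome-gene-5.3;"
    "Name=TCALIF_02833-PA;_AED=1.00;_eAED=1.00;"
    "_QI=15|0|0|0|1|1|2|75|338;score=100;Alias=TCALIF_02833-PA"
)
attrs = parse_attributes(attributes)
aed = float(attrs["_AED"])
# True when the transcript sits at the AED=1 boundary or above.
print(attrs["Name"], aed, aed >= 1.0)
```

Running the same check over every mRNA line would list the transcripts that sit exactly at AED=1.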

Thanks,

Chris




 

On Oct 2, 2017, at 3:19 PM, Michael Campbell <[hidden email]> wrote:

Hi Chris,

Yeah, by default MAKER shouldn’t keep any annotation with an AED of 1. I’ve cc’d the dev list on this to see if anyone else has any idea why you might get AED 1 genes with keep_preds=0. Could you send me the maker_opts.ctl file for the run? There may be something informative in there.

Thanks,
Mike

On Oct 2, 2017, at 2:32 PM, Willett, Christopher S <[hidden email]> wrote:

Hi Mike-

I was looking at the lists of mRNAs, and I think what is happening is that genes with an AED=1 are still retained in our initial output from MAKER and are then getting trimmed out of the filtered file. If I set the AED threshold equal to 1 in the control file for the MAKER run, does retention require an AED less than one, or less than or equal to one? Should these AED=1 genes be making it into the gene and mRNA pools if we have the keep_preds parameter set to 0?

Thanks for your help,

Best,

Chris




Attachment: maker_opts.ctl_full8 (7K)