[BioMart Users] Patch for Coding/Peptide Sequence retrieval problems in Biomart 0.6 and 0.7

Richard Hayes
Hi all,

Our group previously wrote to the list about a bug in BioMart 0.6 affecting sequence queries that involve exons, such as CDS FASTA retrieval. Particularly when requesting an entire transcriptome, proteome, or similar, a handful of sequences were returned either with exons missing entirely or split across multiple FASTA entries.

With the help of Junjun Zhang, we traced this to the way BioMart versions 0.6 and 0.7 combine the results of batched SQL queries. The problem does not occur at all if batching is disabled completely, but that carries a significant performance hit. I have found a fix for the DatasetI.pm module. Essentially, hash keys were not sorted during the step that merges dataset attributes from the previous and current batches, so transcript data was handled improperly whenever a transcript's exons happened to be split between SQL query batches.
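To make the symptom concrete, here is a minimal Python sketch. It is not the patch itself and not BioMart code (DatasetI.pm is Perl); the row layout, the function names, and the sample sequences are all made up for illustration. It assumes each batch is a list of (transcript_id, exon_rank, exon_sequence) rows and shows how a transcript whose exons straddle a batch boundary ends up as two FASTA entries if each batch is processed in isolation, and as one entry if the unfinished transcript is carried over into the next batch.

```python
from itertools import groupby

# Hypothetical rows: (transcript_id, exon_rank, exon_sequence).
# Transcript "T2" is deliberately split across the two batches.
batches = [
    [("T1", 1, "ATG"), ("T1", 2, "GCC"), ("T2", 1, "ATG")],
    [("T2", 2, "TTT"), ("T2", 3, "TAA"), ("T3", 1, "ATGTGA")],
]


def naive_entries(batches):
    """Build one entry per transcript *per batch* (reproduces the symptom)."""
    entries = []
    for batch in batches:
        for tid, rows in groupby(batch, key=lambda r: r[0]):
            entries.append((tid, "".join(seq for _, _, seq in rows)))
    return entries


def join_exons(rows):
    """Concatenate exon sequences in exon_rank order."""
    return "".join(seq for _, seq in sorted(rows))


def merged_entries(batches):
    """Carry the unfinished transcript across batch boundaries."""
    entries = []
    pending_id = None
    pending = []  # (exon_rank, sequence) rows of the transcript in progress
    for batch in batches:
        for tid, rank, seq in batch:
            if tid != pending_id:
                # A new transcript starts, so the previous one is complete.
                if pending:
                    entries.append((pending_id, join_exons(pending)))
                pending_id, pending = tid, []
            pending.append((rank, seq))
    if pending:
        entries.append((pending_id, join_exons(pending)))
    return entries


print(naive_entries(batches))   # "T2" appears twice, once per batch
print(merged_entries(batches))  # "T2" appears once, with all three exons
```

The carry-over approach keeps memory use bounded by one transcript at a time, which is why batching can stay enabled; it does rely on rows for the same transcript arriving contiguously.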

The attached patch file has been tested successfully on both version 0.6 and version 0.7. With it applied, I have been able to retrieve correct, complete FASTA files without any special filter/attribute orderBy constraints. This may be a quirk of our database, in which the exons for each transcript are batch loaded in the correct exon_rank order. If that is not the case for your data, then in addition to applying this patch you should use orderBy constraints of "transcript_id_key, exon_rank" (similar to the Ensembl gene dataset configuration) on the "coding" and "peptide" structure exportables in your configuration.
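As a rough illustration of why the ordering matters, the following Python sketch emulates in memory what an ordering on (transcript_id_key, exon_rank) achieves. In BioMart the ordering would be applied by the orderBy setting at the SQL level, not by code like this; the function name, row layout, and sample data are assumptions made up for the example.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical rows as they might come back from a table whose exons were
# NOT loaded in rank order: (transcript_id_key, exon_rank, exon_sequence).
rows = [
    (2, 1, "ATG"),
    (1, 2, "TGA"),
    (2, 2, "TAA"),
    (1, 1, "ATGCCC"),
]


def assemble_cds(rows):
    """Order rows by (transcript_id_key, exon_rank), then join exons per transcript."""
    ordered = sorted(rows, key=itemgetter(0, 1))
    return {
        tid: "".join(seq for _, _, seq in group)
        for tid, group in groupby(ordered, key=itemgetter(0))
    }


print(assemble_cds(rows))
```

Without the sort, the rows for transcript 1 and transcript 2 are interleaved and exon 2 of transcript 1 arrives before exon 1, so any code that concatenates rows in arrival order would produce scrambled or fragmented sequences.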

Best regards,

--
Richard D. Hayes, Ph.D.
Joint Genome Institute / Lawrence Berkeley National Lab
http://www.phytozome.net

_______________________________________________
Users mailing list
[hidden email]
https://lists.biomart.org/mailman/listinfo/users

Attachment: DatasetI.pm.0.7.sqlBatchingFix.patch (6K)

Re: [BioMart Users] Patch for Coding/Peptide Sequence retrieval problems in Biomart 0.6 and 0.7

Arek Kasprzyk
Hi Richard,
This is great. Thank you for the patch.

a

On Tue, Sep 27, 2011 at 8:04 PM, Richard Hayes <[hidden email]> wrote: