[Gmod-ajax] VCF hitting chunkSizeLimit after post-processing?


[Gmod-ajax] VCF hitting chunkSizeLimit after post-processing?

Richard Hayes
Hi,

I have encountered a strange bit of behavior. We have a VCF file that underwent a bit of post-processing to remove some data (essentially, SNPs from embargoed data in a larger joint call).

The unfiltered file contains ~6 million SNPs, and is about 449 Mb when compressed by bgzip. This displays fine in JBrowse v1.11.4.

The filtered file contains ~3.2 million SNPs, and is about 126 Mb when compressed by bgzip. This will not display in JBrowse; I get only the error "Too much data. Chunk size 7,329,488 bytes exceeds chunkSizeLimit of 1,000,000. zoom in to see detail." This error persists even when zoomed in to a 105 bp region.

Both files were compressed by the same version of bgzip and indexed by the same tabix.

Is this a bug? Why would the more feature-dense file render fine? I'm a bit flummoxed. Screen shot attached...

Richard D. Hayes, Ph.D.
Joint Genome Institute / Lawrence Berkeley National Lab
http://phytozome.jgi.doe.gov

_______________________________________________
Gmod-ajax mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/gmod-ajax

Attachment: Screen Shot 2014-05-29 at 3.41.02 PM.png (99K)
Re: VCF hitting chunkSizeLimit after post-processing?

Richard Hayes
After some more testing, I am beginning to think this is a bug in VCFTabix.js or in tabix, the software used to index these files (I have v0.2.6).

Switching to other reference sequences displays properly; it is only the first chromosome that is affected. This chromosome is ~75 Mb, but its feature density is actually lower than that of other species with SNP data, and those files all display fine.

Increasing the chunkSizeLimit parameter for this track does get the track to display on all chromosomes, but the console warns, e.g.:

TabixIndexedFileChunkedCache cannot fit 7373922:34375..14189313:53297 (bin 5827) (159,437 > 100,000)

That would seem to implicate the tabix indexing procedure. But the original, unparsed output file displays fine.
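For reference, the chunkSizeLimit override mentioned earlier is a per-track setting in trackList.json. A sketch (the track label and file paths are placeholders; the store and track classes are the usual ones for tabix-indexed VCFs in JBrowse 1.x):

```json
{
  "label": "filtered_snps",
  "type": "JBrowse/View/Track/HTMLVariants",
  "storeClass": "JBrowse/Store/SeqFeature/VCFTabix",
  "urlTemplate": "filtered.vcf.gz",
  "tbiUrlTemplate": "filtered.vcf.gz.tbi",
  "chunkSizeLimit": 20000000
}
```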

Does anyone have a good method for testing/viewing a binary .tbi index file?

Richard D. Hayes, Ph.D.
Joint Genome Institute / Lawrence Berkeley National Lab
http://phytozome.jgi.doe.gov


Re: VCF hitting chunkSizeLimit after post-processing?

Richard Hayes
We now have files with minimal header:

##fileformat=VCFv4.1
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  <sample list>

and various slices of the data. 18,000 lines of SNPs will display, but 19,000 lines hit just over the chunk size limit. Very confusing.

Could there be some sort of collision between tabix index files going on? I do specify the path to the .tbi files in all cases, but all file sets are in one directory.

Does anyone on this list understand the details of JBrowse's tabix and VCF binary data parsing, and how chunk size is determined?
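On the chunk-size question: a tabix index stores, per bin, chunks as pairs of 64-bit BGZF virtual offsets, where the upper 48 bits are the offset into the compressed file and the lower 16 bits are the offset within the decompressed block. The data a reader must fetch for a chunk is roughly the difference of the two compressed offsets, which is presumably what gets compared against chunkSizeLimit. A sketch of the decoding (function names are mine, not from VCFTabix.js):

```python
def split_virtual_offset(voffset):
    """Split a 64-bit BGZF virtual offset into (compressed file offset,
    offset within the decompressed block), per the BGZF/tabix spec."""
    return voffset >> 16, voffset & 0xFFFF

def chunk_size_bytes(cnk_beg, cnk_end):
    """Approximate compressed bytes spanned by one index chunk: the
    distance between the compressed offsets of its two endpoints."""
    return (cnk_end >> 16) - (cnk_beg >> 16)

# The console warning earlier in the thread prints chunk endpoints in
# this decoded coffset:uoffset form, e.g. 7373922:34375..14189313:53297.
print(split_virtual_offset((7373922 << 16) | 34375))  # (7373922, 34375)
```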


Richard D. Hayes, Ph.D.
Joint Genome Institute / Lawrence Berkeley National Lab
http://phytozome.jgi.doe.gov


Re: VCF hitting chunkSizeLimit after post-processing?

Colin
This bug was confirmed and has been fixed (by Richard). The exact manifestation of the bug is unclear (it seems to happen only on large chromosomes, on the first chromosome in a file, or under other strange conditions), but apply this patch if you see errors about chunkSizeLimit being exceeded or too much data being downloaded for your VCF files.

https://github.com/GMOD/jbrowse/issues/486



-Colin




