disk space and file formats


disk space and file formats

Patrick Page-McCaw
I'm not a bioinformaticist or programmer so apologies if this is a silly question. I've been occasionally running galaxy on my laptop and on the public server and I love it. The issue that I have is that my workflow requires many steps (what I do is probably very unusual). Each step creates a new large fastq file as the sequences are iteratively trimmed of junk. This fills my laptop and fills the public server with lots of unnecessary very large files.

I've been thinking about the structure of the files and my workflow and it seems to me that a more space efficient system would be to have a single file (or a sql database) on which each tool can work. Most of what I do is remove adapter sequences, extract barcodes, trim by quality, map to the genome and then process my hits by type (exon, intron etc). Since the clean up tools in FASTX aren't written with my problem in mind, it takes several passes to get the sequences trimmed up before mapping.

If I had a file that had a format something like (here as tab delimited):
Header Seq Phred Start Len Barcode etc
Each tool could read the Seq and Phred starting at Start and running Len nucleotides and work on that. The tool could then write a new Start and Len to reflect the trimming it has done[1]. For convenience let me call this an HSPh format.

So it would be a real pain, no doubt, to rewrite all the tools. From the little that I can read of the tools, the way the input is handled internally seems to vary quite a bit. But it seems to me (naively?) that it would be relatively easy to write a conversion tool that would take the HSPh format and turn it into fastq or fasta on the fly for the tools. Since most tools take fastq or fasta, it should be a write-once, use-many-times plugin. The harder (and slower) part would be mapping the fastq output back onto the HSPh format, but again, this should be a write-once, use-for-many-tools plugin. Both of the intermediate files would be deleted when done. Just as a quick test I timed running sed on a 1.35 GB fastq file, and it was so fast on my laptop, under 2 minutes, that it was done before I noticed.
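As a rough Python sketch of how the on-the-fly conversion could work (the HSPh field layout and these function names are hypothetical, invented just for this example):

```python
# Hypothetical HSPh record: Header, Seq, Phred, Start, Len, tab-delimited.
# Trimming only moves the Start/Len window; the sequence is never rewritten.

def trim_hsph(line, left=0, right=0):
    """Record a trim by shrinking the window; Seq and Phred stay untouched."""
    fields = line.rstrip("\n").split("\t")
    start, length = int(fields[3]), int(fields[4])
    fields[3] = str(start + left)
    fields[4] = str(length - left - right)
    return "\t".join(fields) + "\n"

def hsph_to_fastq(line):
    """Emit a FASTQ record for the current Start/Len window of an HSPh line."""
    header, seq, phred, start, length = line.rstrip("\n").split("\t")[:5]
    window = slice(int(start), int(start) + int(length))
    return "@%s\n%s\n+\n%s\n" % (header, seq[window], phred[window])

record = "read1\tACGTACGTAC\tIIIIIIIIII\t0\t10\n"
trimmed = trim_hsph(record, left=2, right=1)    # window is now bases 2..8
print(hsph_to_fastq(trimmed), end="")
```

Going the other way (mapping a tool's fastq output back onto the HSPh line) would be the slower direction, as noted above, since it needs a lookup by Header.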

Then as people are interested, the tools could be converted to take as input the new format.

It may well be that in these days of $100 terabyte drives this is not useful, and that cycles, not drive space, are limiting. But I think that if the tools were rewritten to read and write an HSPh format, processing would be faster too. It seems like some effort has been made to create the tab-delimited format, and maybe someone is already working on something like this (no doubt better designed).

I may have a comp sci undergrad working in the lab this fall. With help we (well, he) might manage some parts of this. He is apparently quite a talented and hard-working C++ programmer. Is it worthwhile?

thanks

[1] It could even do something like:
Header Seq Phred Start Len Tool Parameter Start Len Tool Parameter Start Len etc
Tool is the tool name, Parameter a list of parameters used, Start and Len would be the latest trim positions. And the last Start Len pair would be the one to use by default for the next tool, but this would keep an edit history without doubling the space needs with each processing cycle. I wouldn't need this but it might be more friendly for users, an "undo" means removing 4 columns. A format like this would probably be better as a sql database.
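A tiny Python sketch of that edit-history idea (purely illustrative; the tool names and parameters are made up):

```python
# Each processing step appends a (Tool, Parameter, Start, Len) group of four
# columns; the last Start/Len pair is the current window, and an "undo" is
# just dropping the last four columns.

def apply_trim(record, tool, params, start, length):
    return record + [tool, params, str(start), str(length)]

def undo(record):
    return record[:-4]

def current_window(record):
    return int(record[-2]), int(record[-1])

rec = ["read1", "ACGTACGTAC", "IIIIIIIIII", "0", "10"]
rec = apply_trim(rec, "clip_adapter", "-a AGATCG", 0, 6)
rec = apply_trim(rec, "qual_trim", "-q 20", 1, 4)
print(current_window(rec))   # the window the next tool would use: (1, 4)
rec = undo(rec)
print(current_window(rec))   # back to the previous trim: (0, 6)
```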
___________________________________________________________
Please keep all replies on the list by using "reply all"
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:

  http://lists.bx.psu.edu/

Re: disk space and file formats

Jelle Scholtalbers
Hi Patrick,

The issue you are having is partly a consequence of Galaxy's aim of
ensuring reproducible science, which means saving each intermediate
step and its output files. For example, in your current workflow in
Galaxy you can easily do something else with each intermediate file -
feed it to a different tool just to check what the average read length
is after filtering - even two months after your run.
If you nevertheless insist on keeping disk usage low, don't want to
start programming (as your proposed solutions would require), and
aren't too afraid of the command line, you might want to start there.

The thing is, a lot of tools accept either an input file or an input
stream, and these same tools can likewise write either an output file
or an output stream. This way you can "pipe" the tools together, e.g.:

trimMyFq -i rawinput.fq | removebarcode -i - -n optionN | filterJunk -i - -o finalOutput.fq

I don't know which programs you actually use, but the principle is
probably the same (as long as the tools actually accept streams).
This example saves you disk space because, of the three tools run,
only one actually writes to the disk. On the downside, this also means
you don't have an output file from removebarcode which you can look at
to see if everything went OK.

If you do want to program, or someone else wants to do it, I could
imagine a tool that combines your iterative steps and runs as one
tool - you could even wrap your 'pipeline' in a script and add that
as a tool in your Galaxy instance and/or in the Tool Shed.
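A combined tool like that could be sketched in Python with generators, so that each step streams records to the next and nothing intermediate ever touches the disk (the step names and trim rules below are invented for illustration):

```python
import io

# Each step is a generator over (header, seq, qual) tuples; chaining them
# behaves like the shell pipe above, but inside one process.

def read_fastq(handle):
    while True:
        header = handle.readline().rstrip()
        if not header:
            break
        seq = handle.readline().rstrip()
        handle.readline()                       # the '+' separator line
        qual = handle.readline().rstrip()
        yield header, seq, qual

def trim_adapter(records, adapter):
    for header, seq, qual in records:
        cut = seq.find(adapter)
        if cut != -1:
            seq, qual = seq[:cut], qual[:cut]
        yield header, seq, qual

def drop_short(records, min_len):
    for rec in records:
        if len(rec[1]) >= min_len:
            yield rec

raw = io.StringIO("@r1\nACGTAGATCG\n+\nIIIIIIIIII\n@r2\nAGATCGGA\n+\nIIIIIIII\n")
pipeline = drop_short(trim_adapter(read_fastq(raw), "AGATCG"), 4)
for header, seq, qual in pipeline:
    print(header, seq)    # only the surviving, trimmed reads
```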

Cheers,
Jelle





Re: disk space and file formats

Edward Kirton
Read QC intermediate files account for most of the storage used on our galaxy site. And it's a real problem that I must solve soon.

My first attempt at taming the beast was to create a single read-QC tool that handles such things as quality-encoding conversion, quality-based end trimming, etc. (very basic functions).  Such a tool could simply be a wrapper around your favorite existing tools, but it doesn't keep the intermediate files.  The added benefit is that it runs faster, because it only has to queue onto the cluster once.

Sure, one might argue that it's nice to have all the intermediate files just in case you wish to review them, but in practice, I have found this happens relatively infrequently and is too expensive.  If you're a small lab maybe that's fine, but if you generate a lot of sequence, a more production-line approach is reasonable.

I've been toying with the idea of replacing all the fastq datatypes with a single fastq datatype that is sanger-encoded and gzipped.  I think gzipped reads files are about 1/4 the size of the unpacked version.  Of course, many tools will require a wrapper if they don't accept gzipped input, but that's trivial (and many already support compressed reads).
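For tools that only read plain FASTQ, the wrapper shim could be as simple as sniffing the two gzip magic bytes (a sketch, not an existing Galaxy helper):

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"

def open_maybe_gzip(path):
    """Open a reads file in text mode whether or not it is gzipped."""
    with open(path, "rb") as handle:
        magic = handle.read(2)
    if magic == GZIP_MAGIC:
        return gzip.open(path, "rt")
    return open(path, "r")
```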

However, the import tool automatically uncompresses uploaded files, so I'd need to do some hacking there to prevent this.

Heck, what we really need is a nice compact binary format for reads, perhaps one which doesn't even store IDs (although pairing would need to be recorded).

Thoughts?


Re: disk space and file formats

Peter Cock
On Thu, Sep 1, 2011 at 11:02 PM, Edward Kirton <[hidden email]> wrote:

> Read QC intermediate files account for most of the storage used on our
> galaxy site. And it's a real problem that I must solve soon.
> My first attempt at taming the beast was to try to create a single read QC
> tool that did such things as convert qual encoding, qual-end trimming, etc.
> (very basic functions).  Such a tool could simply be a wrapper around your
> favorite existing tools, but doesn't keep the intermediate files.  The added
> benefit is that it runs faster because it only has to queue onto the cluster
> once.
> Sure, one might argue that it's nice to have all the intermediate files just
> in case you wish to review them, but in practice, I have found this happens
> relatively infrequently and is too expensive.  If you're a small lab maybe
> that's fine, but if you generate a lot of sequence, a more production-line
> approach is reasonable.

Sounds very sensible if you have some frequently repeated multistep
analyses.

> I've been toying with the idea of replacing all the fastq datatypes with a
> single fastq datatype that is sanger-encoded and gzipped.  I think gzipped
> reads files are about 1/4 of the unpacked version.  Of course, many tools
> will require a wrapper if they don't accept gzipped input, but that's
> trivial (and many already support compressed reads).
> However the import tool automatically uncompressed uploaded files so I'd
> need to do some hacking there to prevent this.

Hmm. There are probably some tasks where a gzipped FASTQ isn't
ideal, but for the fairly typical case of iterating over the records
it should be fine.

> Heck, what we really need is a nice compact binary format for reads, perhaps
> which doesn't even store ids (although pairing would need to be recorded).
> Thoughts?

What, like a BAM file of unaligned reads? Uses gzip compression, and
tracks the pairing information explicitly :) Some tools will already take
this as an input format, but not all.

Peter


Re: disk space and file formats

Edward Kirton
> What, like a BAM file of unaligned reads? Uses gzip compression, and
> tracks the pairing information explicitly :) Some tools will already take
> this as an input format, but not all.

ah, yes, precisely.  i actually think illumina's pipeline produces
files in this format now.
wrappers which create a temporary fastq file would need to be created
but that's easy enough.

Re: disk space and file formats

Fields, Christopher J
On Sep 2, 2011, at 3:02 PM, Edward Kirton wrote:

>> What, like a BAM file of unaligned reads? Uses gzip compression, and
>> tracks the pairing information explicitly :) Some tools will already take
>> this as an input format, but not all.
>
> ah, yes, precisely.  i actually think illumina's pipeline produces
> files in this format now.
> wrappers which create a temporary fastq file would need to be created
> but that's easy enough.

My argument against that is the cost of going from BAM -> temp fastq may be prohibitive, e.g. the need to generate very large temp fastq files on the fly as input for various applications may lead one back to just keeping a permanent FASTQ around anyway.  One could probably get better performance out of a simpler format that removes most of the 'AM' parts of BAM.  Or is the idea that the file itself is modified, like a database?  And how would indexing work (BAM uses binning on the match to the reference seq), or does it matter?

I recall hdf5 was planned as an alternate format (PacBio uses it, IIRC), and of course there is NCBI's .sra format.  Anyone using the latter two?

chris



Re: disk space and file formats

Peter Cock
On Fri, Sep 2, 2011 at 9:27 PM, Fields, Christopher J
<[hidden email]> wrote:
> On Sep 2, 2011, at 3:02 PM, Edward Kirton wrote:
>
>>> What, like a BAM file of unaligned reads? Uses gzip compression, and
>>> tracks the pairing information explicitly :) Some tools will already take
>>> this as an input format, but not all.
>>
>> ah, yes, precisely.  i actually think illumina's pipeline produces
>> files in this format now.

Oh do they? - that's interesting. Do you have a reference/link?

>> wrappers which create a temporary fastq file would need to be created
>> but that's easy enough.
>
> My argument against that is the cost of going from BAM -> temp
> fastq may be prohibitive, e.g. the need to generate very large
> temp fastq files on the fly as input for various applications may
> lead one back to just keeping a permanent FASTQ around anyway.

True - if you can't update the tools, you need to take BAM.
In some cases at least you can pipe the gzipped FASTQ
into alignment tools which accept FASTQ on stdin, so
there is no temp file per se.

> One could probably get better performance out of a simpler
> format that removes most of the 'AM' parts of BAM.

Yes, but that means inventing yet another file format. At least
gzipped FASTQ is quite straightforward.

>
> Or is the idea that the file itself is modified, like a database?

That would be quite a dramatic change from the current
Galaxy workflow system - I doubt that would be acceptable
in general.

> And how would indexing work (BAM uses binning on the
> match to the reference seq), or does it matter?

BAM indexing as done in samtools/picard is only for the aligned
reads - so no help for a BAM file of unaligned reads. You could
use a different indexing system (e.g. by read name) and the
same BAM BGZF block offset system (I've tried this as an
experiment with Biopython's SQLite indexing of sequence files).

However, for tasks taking unaligned reads as input, you
generally just iterate over the reads in the order on disk.

> I recall hdf5 was planned as an alternate format (PacBio uses
> it, IIRC), and of course there is NCBI's .sra format.  Anyone
> using the latter two?

Moving from the custom BGZF modified-gzip format used in
BAM to HDF5 has been proposed on the samtools mailing list
(as Chris knows), and there is a proof-of-principle implementation
too in BioHDF, http://www.hdfgroup.org/projects/biohdf/
The SAM/BAM group didn't seem overly enthusiastic though.

For the NCBI's .sra format, there is no open specification, just
their public domain source code:
http://seqanswers.com/forums/showthread.php?t=12054

Regards,

Peter


Re: disk space and file formats

Edward Kirton
>>> i actually think illumina's pipeline produces files in this format (unaligned-bam) now.

> Oh do they? - that's interesting. Do you have a reference/link?

i caught wind of this at the recent illumina users' conference, but i
asked someone in our sequencing team to confirm and he hadn't heard of
it.  it must be limited to the forthcoming miseq sequencer for the
time being, but may make its way to the big sequencers later.
apparently illumina is thinking about storage as well.  i seem to
recall the speaker saying they won't produce srf files anymore, but
again, this was a talk about the miseq so it may not apply to the
other sequencers.

>>> wrappers which create a temporary fastq file would need to be created
>>> but that's easy enough.

>> My argument against that is the cost of going from BAM -> temp
>> fastq may be prohibitive, e.g. the need to generate very large
>> temp fastq files on the fly as input for various applications may
>> lead one back to just keeping a permanent FASTQ around anyway.

> True - if you can't update the tools you need to take BAM.
> In some cases at least you can pipe the gzipped FASTQ
> into alignment tools which accepts FASTQ on stdin, so
> there is no temp file per se.

the tools really do need to support the format; the tmpfile was simply
a workaround.  some tools already support bam; more currently support
fastq.gz.  (someone here made the wrong bet years ago and adopted a
site-wide fastq.bz2 standard, which only recently changed to
fastq.gz.)  but if illumina does start producing bam files in the
future, then we can expect more tools to support that format.  until
they do, fastq.gz is probably a safe bet.

of course there is a computational cost to compressing/uncompressing
files but that's probably better than storing unnecessarily huge
files.  it's a trade-off.

similarly, there's a trade-off involved in limiting read qc to a
single big tool (or a few) that wraps several tools, with many
options.  users can't play around with read qc, but that playing
around may be too expensive (computationally and storage-wise).  for
the most part, a standard qc will do.  one can spend a lot of time and
effort squeezing a bit more useful data out of a bad library, for
example, when one probably should have just sequenced another library.
i favor leaving the playing around to the r&d/development/qc team and
just offering a canned/vetted qc solution to the average user.

>> I recall hdf5 was planned as an alternate format (PacBio uses
>> it, IIRC), and of course there is NCBI's .sra format.  Anyone
>> using the latter two?
> Moving from the custom BGZF modified gzip format used in
> BAM to HD5 has been proposed on the samtools mailing list
> (as Chris knows), and there is a proof of principle implementation
> too in BioHDF, http://www.hdfgroup.org/projects/biohdf/
> The SAM/BAM group didn't seem overly enthusiastic though.
> For the NCBI's .sra format, there is no open specification, just
> their public domain source code:
> http://seqanswers.com/forums/showthread.php?t=12054

i believe hdf5 is an indexed data structure which, as you mentioned,
isn't required for unprocessed reads.

since i'm rapidly running out of storage, i think the best immediate
solution for me is to deprecate all the fastq datatypes in favor of a
new fastqsangergz and to bundle the read qc tools to eliminate
intermediate files.  sure, users won't be able to play around with
their data as much, but my disk is 88% full and my cluster has been
100% occupied for 2 months straight, so less choice is probably
better.
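a bundled qc pass along those lines might look something like this python sketch - gzipped sanger fastq in, gzipped sanger fastq out, one pass over the data, no intermediate datasets (the trim rule and thresholds are placeholders, not any particular tool):

```python
import gzip

def qual_trim_3prime(seq, qual, min_q=20):
    """trim low-quality bases off the 3' end (sanger encoding, offset 33)."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - 33 < min_q:
        end -= 1
    return seq[:end], qual[:end]

def qc_file(in_path, out_path, min_q=20, min_len=30):
    """one pass: gzipped fastq in, trimmed + length-filtered gzipped fastq out."""
    with gzip.open(in_path, "rt") as src, gzip.open(out_path, "wt") as dst:
        while True:
            header = src.readline().rstrip("\n")
            if not header:
                break
            seq = src.readline().rstrip("\n")
            src.readline()                       # '+' separator line
            qual = src.readline().rstrip("\n")
            seq, qual = qual_trim_3prime(seq, qual, min_q)
            if len(seq) >= min_len:
                dst.write("%s\n%s\n+\n%s\n" % (header, seq, qual))
```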


Re: disk space and file formats

Peter Cock


On Saturday, September 3, 2011, Edward Kirton <[hidden email]> wrote:
> of course there is a computational cost to compressing/uncompressing
> files but that's probably better than storing unnecessarily huge
> files.  it's a trade-off.

It may even be faster due to less IO; it probably depends on your hardware.

> since i'm rapidly running out of storage, i think the best immediate
> solution for me is to deprecate all the fastq datatypes in favor of a
> new fastqsangergz and to bundle the read qc tools to eliminate
> intermediate files.  sure, users won't be able to play around with
> their data as much, but my disk is 88% full and my cluster has been
> 100% occupied for 2-months straight, so less choice is probably
> better.

In your position I agree that is a pragmatic choice. You might be able to modify the file upload code to gzip any FASTQ files... that would prevent uncompressed FASTQ getting into new histories.
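A sketch of that upload-time idea (the function name and the in-place replacement are assumptions, not the actual Galaxy upload code):

```python
import gzip
import os
import shutil

def ensure_gzipped(path):
    """Compress a plain file in place; leave it alone if already gzipped."""
    with open(path, "rb") as handle:
        if handle.read(2) == b"\x1f\x8b":      # gzip magic bytes
            return path
    tmp = path + ".gz"
    with open(path, "rb") as src, gzip.open(tmp, "wb") as dst:
        shutil.copyfileobj(src, dst)
    os.replace(tmp, path)                      # dataset path stays stable
    return path
```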

I wonder if Galaxy would benefit from a new fastqsanger-gzip (etc.) datatype? However, this seems generally useful (not just for FASTQ), so perhaps a more general mechanism would be better, where tool XML files can say which file types they accept and which of those can/must be compressed (possibly not just gzip format?).

Peter


Re: disk space and file formats

Fields, Christopher J
On Sep 2, 2011, at 8:02 PM, Peter Cock wrote:

> On Fri, Sep 2, 2011 at 9:27 PM, Fields, Christopher J
> <[hidden email]> wrote:
>> On Sep 2, 2011, at 3:02 PM, Edward Kirton wrote:
>>
>>>> What, like a BAM file of unaligned reads? Uses gzip compression, and
>>>> tracks the pairing information explicitly :) Some tools will already take
>>>> this as an input format, but not all.
>>>
>>> ah, yes, precisely.  i actually think illumina's pipeline produces
>>> files in this format now.
>
> Oh do they? - that's interesting. Do you have a reference/link?
>
>>> wrappers which create a temporary fastq file would need to be created
>>> but that's easy enough.
>>
>> My argument against that is the cost of going from BAM -> temp
>> fastq may be prohibitive, e.g. the need to generate very large
>> temp fastq files on the fly as input for various applications may
>> lead one back to just keeping a permanent FASTQ around anyway.
>
> True - if you can't update the tools you need to take BAM.
> In some cases at least you can pipe the gzipped FASTQ
> into alignment tools which accept FASTQ on stdin, so
> there is no temp file per se.

Some applications (Velvet for instance) accept gzipped FASTQ, though they may turn around and dump the data out uncompressed.

>>  One could probably get better performance out of a simpler
>> format that removes most of the 'AM' parts of BAM.
>
> Yes, but that means inventing yet another file format. At least
> gzipped FASTQ is quite straightforward.

Yes.

>> Or is the idea that the file itself is modified, like a database?
>
> That would be quite a dramatic change from the current
> Galaxy workflow system - I doubt that would be acceptable
> in general.

My thought as well.

>> And how would indexing work (BAM uses binning on the
>> match to the reference seq), or does it matter?
>
> BAM indexing as done in samtools/picard is only for the aligned
> reads - so no help for a BAM file of unaligned reads. You could
> use a different indexing system (e.g. by read name) and the
> same BAM BGZF block offset system (I've tried this as an
> experiment with Biopython's SQLite indexing of sequence files).
>
> However, for tasks taking unaligned reads as input, you
> generally just iterate over the reads in the order on disk.

I think, unless there is a demonstrable advantage to using unaligned BAM, fastq.gz is the easiest.
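
[Editor's note: the read-name indexing Peter describes above can be sketched with nothing but Python's standard-library sqlite3. This toy version assumes plain, uncompressed, four-line FASTQ records; it only illustrates the idea and is not Biopython's actual implementation. For a BGZF-compressed file you would store block offsets instead of raw byte offsets.]

```python
import sqlite3

def index_fastq_by_name(fastq_path, db_path):
    """Record the byte offset of each FASTQ record, keyed by read name."""
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS offsets (name TEXT PRIMARY KEY, offset INTEGER)"
    )
    with open(fastq_path, "rb") as handle:
        while True:
            offset = handle.tell()
            header = handle.readline()
            if not header:
                break
            # consume the sequence, '+', and quality lines
            handle.readline(); handle.readline(); handle.readline()
            name = header[1:].split()[0].decode()
            con.execute("INSERT OR REPLACE INTO offsets VALUES (?, ?)", (name, offset))
    con.commit()
    return con

def fetch_record(con, fastq_path, name):
    """Pull one four-line record back out by read name."""
    (offset,) = con.execute(
        "SELECT offset FROM offsets WHERE name=?", (name,)
    ).fetchone()
    with open(fastq_path, "rb") as handle:
        handle.seek(offset)
        return [handle.readline().decode().rstrip() for _ in range(4)]
```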

>> I recall hdf5 was planned as an alternate format (PacBio uses
>> it, IIRC), and of course there is NCBI's .sra format.  Anyone
>> using the latter two?
>
> Moving from the custom BGZF modified gzip format used in
> BAM to HDF5 has been proposed on the samtools mailing list
> (as Chris knows), and there is a proof of principle implementation
> too in BioHDF, http://www.hdfgroup.org/projects/biohdf/
> The SAM/BAM group didn't seem overly enthusiastic though.

Probably not, as it is somewhat a competitor of SAM/BAM (a bit broader in scope, beyond just alignments).  As Peter indicated, I know the BioHDF folks (they are here in town); however, my actual question was whether anyone is actually using HDF5 or SRA in production.  I haven't seen adoption beyond PacBio, but I have seen some things popping up in Galaxy.

> For the NCBI's .sra format, there is no open specification, just
> their public domain source code:
> http://seqanswers.com/forums/showthread.php?t=12054
>
> Regards,
>
> Peter

Simply gzipping FASTQ seems to give better compression than a .lite.sra file (and I'm not a happy user of their SRA toolset).  And of course there is parallel gzip...

chris



Re: disk space and file formats

Edward Kirton
In reply to this post by Peter Cock
> In your position I agree that is a pragmatic choice.

Thanks for helping me muddle through my options.

> You might be able to
> modify the file upload code to gzip any FASTQ files... that would prevent
> uncompressed FASTQ getting into new histories.

Right!

> I wonder if Galaxy would benefit from a new fastqsanger-gzip (etc) datatype?
> However this seems generally useful (not just for FASTQ) so perhaps a more
> general mechanism would be better where tool XML files can say which file
> types they accept and which of those can/must be compressed (possibly not
> just gzip format?).

Perhaps we can flesh out what a more general solution would look like...

Imagine the fastq datatypes were left alone and instead there's a
mechanism by which files that haven't been used as input for x days
get compressed by a cron job.  The file server knows how to uncompress
such files on the fly when needed.  For the most part, files are
uncompressed during analysis and compressed when they exist as an
archive within Galaxy.

An even simpler solution would be an archive/compress button which
users could use when they're done with a history.  Users could still
copy (uncompressed) datasets into a new history for further analysis.

Of course there's also the solution mentioned in the 2010 galaxy
developer's conference about automatic compression at the system
level.  Not a possibility for me, but is attractive.
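
[Editor's note: the cron-job approach sketched above is simple enough to outline. This version checks only file modification times; a real deployment would also need to consult Galaxy's own record of when each dataset was last used as an input. The function name and suffix convention are invented for illustration.]

```python
import gzip
import os
import shutil
import time

def compress_stale_files(directory, max_idle_days, suffix=".fastq"):
    """Gzip files whose modification time is older than max_idle_days,
    removing the uncompressed original afterwards."""
    cutoff = time.time() - max_idle_days * 86400
    compressed = []
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if not name.endswith(suffix) or not os.path.isfile(path):
            continue
        if os.path.getmtime(path) > cutoff:
            continue  # still in recent use; leave it alone
        with open(path, "rb") as src, gzip.open(path + ".gz", "wb") as dst:
            shutil.copyfileobj(src, dst)
        os.remove(path)
        compressed.append(path + ".gz")
    return compressed
```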

Re: disk space and file formats

Scott Smith
In reply to this post by Peter Cock

On Sep 2, 2011, at 8:02 PM, Peter Cock wrote:

> On Fri, Sep 2, 2011 at 9:27 PM, Fields, Christopher J
> <[hidden email]> wrote:
>> On Sep 2, 2011, at 3:02 PM, Edward Kirton wrote:
>>
>>>> What, like a BAM file of unaligned reads? Uses gzip compression, and
>>>> tracks the pairing information explicitly :) Some tools will already take
>>>> this as an input format, but not all.
>>>
>>> Ah, yes, precisely.  I actually think Illumina's pipeline produces
>>> files in this format now.
>
> Oh do they? - that's interesting. Do you have a reference/link?

Yeah, here at The Genome Institute at Wash-U, we get Illumina data directly in BAM format and try to avoid FASTQ conversion.  The latest BWA supports a BAM of reads as input, as well as producing BAM output.

Hopefully most tools will go that way.  You can always engineer something with named pipes in the meantime to avoid reading/writing to real disk, but that requires some care.
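
[Editor's note: the streaming idea is easy to wrap in Python. This sketch feeds a gzipped FASTQ into any command that reads FASTQ on stdin without writing an uncompressed temporary file; `wc -l` stands in below for a real aligner, purely so the example is self-contained.]

```python
import gzip
import shutil
import subprocess

def stream_fastq_to_tool(fastq_gz, command):
    """Decompress a gzipped FASTQ on the fly and feed it to a command's
    stdin, so no uncompressed temporary file ever touches the disk."""
    proc = subprocess.Popen(command, stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE)
    with gzip.open(fastq_gz, "rb") as reads:
        shutil.copyfileobj(reads, proc.stdin)  # streams block by block
    proc.stdin.close()                         # signal EOF to the tool
    output = proc.stdout.read()
    proc.wait()
    return output
```

Note this reads all of the tool's output only after stdin is closed, which is fine for tools that consume their whole input first; a tool that emits large output while still reading would need a thread or `communicate()` to avoid pipe deadlock.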

>
>>> wrappers which create a temporary fastq file would need to be created
>>> but that's easy enough.
>>
>> My argument against that is the cost of going from BAM -> temp
>> fastq may be prohibitive, e.g. the need to generate very large
>> temp fastq files on the fly as input for various applications may
>> lead one back to just keeping a permanent FASTQ around anyway.
>
> True - if you can't update the tools you need to take BAM.
> In some cases at least you can pipe the gzipped FASTQ
> into alignment tools which accepts FASTQ on stdin, so
> there is no temp file per se.
>
>>  One could probably get better performance out of a simpler
>> format that removes most of the 'AM' parts of BAM.
>
> Yes, but that means inventing yet another file format. At least
> gzipped FASTQ is quite straightforward.
>
>>
>> Or is the idea that the file itself is modified, like a database?
>
> That would be quite a dramatic change from the current
> Galaxy workflow system - I doubt that would be acceptable
> in general.

And mutable data structures like that are harder to manage in a high-throughput environment.

>
>> And how would indexing work (BAM uses binning on the
>> match to the reference seq), or does it matter?
>
> BAM indexing as done in samtools/picard is only for the aligned
> reads - so no help for a BAM file of unaligned reads. You could
> use a different indexing system (e.g. by read name) and the
> same BAM BGZF block offset system (I've tried this as an
> experiment with Biopython's SQLite indexing of sequence files).
>
> However, for tasks taking unaligned reads as input, you
> generally just iterate over the reads in the order on disk.
>
>> I recall hdf5 was planned as an alternate format (PacBio uses
>> it, IIRC), and of course there is NCBI's .sra format.  Anyone
>> using the latter two?
>
> Moving from the custom BGZF modified gzip format used in
> BAM to HDF5 has been proposed on the samtools mailing list
> (as Chris knows), and there is a proof of principle implementation
> too in BioHDF, http://www.hdfgroup.org/projects/biohdf/
> The SAM/BAM group didn't seem overly enthusiastic though.
>

HDF5 sounds really great, though I don't think PacBio has the data volume to tax it the way Illumina does.  There was some speculation that HDF5 would be underneath a new BAM standard, but I don't know the status of that.  

We did a few experiments in house with BioHDF in its infancy to see how it compared to BAM, and it didn't capture all of the data (it was missing the somewhat critical CIGAR strings at the time); we haven't revisited it since.  I'm sure it would be effective for storing reads, but starting your own standard when Illumina produces BAMs will probably not ultimately be as useful as going with the BAM format.

> For the NCBI's .sra format, there is no open specification, just
> their public domain source code:
> http://seqanswers.com/forums/showthread.php?t=12054

This standard is ...complex, with the associated down-sides.  We only convert things into that format if explicitly required to do so.

>
> Regards,
>
> Peter
>

Best of luck,
Scott

--
Scott Smith
Manager, Application Programming and Development
Analysis Pipeline
The Genome Institute
Washington University School of Medicine



Re: disk space and file formats

Paul Gordon
In reply to this post by Fields, Christopher J


Probably not, as it is somewhat a competitor of SAM/BAM (a bit broader in scope, beyond just alignments).  As Peter indicated, I know the BioHDF folks (they are here in town); however, my actual question was whether anyone is actually using HDF5 or SRA in production?  I haven't seen adoption beyond PacBio, but I have seen some things popping up in Galaxy.
  
FWIW, the XSQ format files created by the 5500 Series ABI SOLiD are HDF5, but not BioHDF:

http://solidsoftwaretools.com/gf/download/docmanfileversion/309/1079/XSQ_Webinar_20110215.pdf

-- 
______________
Paul Gordon
Bioinformatics Support Specialist
Alberta Children's Hospital Research Institute
http://www.ucalgary.ca/~gordonp


Re: disk space and file formats

Nate Coraor (nate@bx.psu.edu)
In reply to this post by Edward Kirton
Edward Kirton wrote:

> > In your position I agree that is a pragmatic choice.
>
> Thanks for helping me muddle through my options.
>
> > You might be able to
> > modify the file upload code to gzip any FASTQ files... that would prevent
> > uncompressed FASTQ getting into new histories.
>
> Right!
>
> > I wonder if Galaxy would benefit from a new fastqsanger-gzip (etc) datatype?
> > However this seems generally useful (not just for FASTQ) so perhaps a more
> > general mechanism would be better where tool XML files can say which file
> > types they accept and which of those can/must be compressed (possibly not
> > just gzip format?).
>
> Perhaps we can flesh-out what more general solutions would look like...
>
> Imagine the fastq datatypes were left alone and instead there's a
> mechanism by which files which haven't been used as input for x days
> get compressed by a cron job.  the file server knows how to uncompress
> such files on the fly when needed.  For the most part, files are
> uncompressed during analysis and are compressed when the files exist
> as an archive within galaxy.

Ideally, there'd just be a column on the dataset table indicating
whether the dataset is compressed or not, and then tools get a new
way to indicate whether they can directly read compressed inputs, or
whether the input needs to be decompressed first.

--nate
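
[Editor's note: Nate's scheme might reduce to something like the following sketch. The dataset fields and the return convention are invented here for illustration; they are not Galaxy's actual data model.]

```python
import gzip
import os
import tempfile

def path_for_tool(dataset, tool_reads_gzip):
    """Return (path, is_temp): the stored file directly when the tool can
    read it as-is, otherwise a decompressed temporary copy that the
    caller must delete after the tool finishes."""
    if dataset["compression"] is None or tool_reads_gzip:
        return dataset["file_name"], False
    fd, tmp = tempfile.mkstemp(suffix=".uncompressed")
    with gzip.open(dataset["file_name"], "rb") as src, os.fdopen(fd, "wb") as dst:
        dst.write(src.read())
    return tmp, True
```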

>
> An even simpler solution would be an archive/compress button which
> users could use when they're done with a history.  Users could still
> copy (uncompressed) datasets into a new history for further analysis.
>
> Of course there's also the solution mentioned in the 2010 galaxy
> developer's conference about automatic compression at the system
> level.  Not a possibility for me, but is attractive.

Re: disk space and file formats

Peter Cock
On Tue, Sep 6, 2011 at 3:24 PM, Nate Coraor <[hidden email]> wrote:

> Edward Kirton wrote:
>> Peter wrote:
>> > I wonder if Galaxy would benefit from a new fastqsanger-gzip (etc) datatype?
>> > However this seems generally useful (not just for FASTQ) so perhaps a more
>> > general mechanism would be better where tool XML files can say which file
>> > types they accept and which of those can/must be compressed (possibly not
>> > just gzip format?).
>>
>> Perhaps we can flesh-out what more general solutions would look like...
>>
>> Imagine the fastq datatypes were left alone and instead there's a
>> mechanism by which files which haven't been used as input for x days
>> get compressed by a cron job.  the file server knows how to uncompress
>> such files on the fly when needed.  For the most part, files are
>> uncompressed during analysis and are compressed when the files exist
>> as an archive within galaxy.
>
> Ideally, there'd just be a column on the dataset table indicating
> whether the dataset is compressed or not, and then tools get a new
> way to indicate whether they can directly read compressed inputs, or
> whether the input needs to be decompressed first.
>
> --nate

Yes, that's what I was envisioning, Nate.

Are there any schemes other than gzip which would make sense?
Perhaps rather than a boolean column (compressed or not), it
should specify the kind of compression if any (e.g. gzip).

We need something which balances compression efficiency (size)
with decompression speed, while also being widely supported in
libraries for maximum tool uptake.

Peter
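
[Editor's note: measuring that size/speed trade-off is straightforward with the codecs already in Python's standard library. A quick sketch for comparing candidate schemes on a sample of real data:]

```python
import bz2
import gzip
import lzma
import time

def compare_schemes(data):
    """Compress the same bytes with three stdlib codecs, returning the
    compressed size and decompression time for each."""
    results = {}
    for name, codec in (("gzip", gzip), ("bz2", bz2), ("lzma", lzma)):
        blob = codec.compress(data)
        start = time.perf_counter()
        restored = codec.decompress(blob)
        elapsed = time.perf_counter() - start
        assert restored == data  # sanity-check the round trip
        results[name] = {"size": len(blob), "decode_seconds": elapsed}
    return results
```

Run on a representative chunk of FASTQ, this gives concrete numbers for the balance Peter describes; library support (for maximum tool uptake) still has to be weighed separately.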


Re: disk space and file formats

Nate Coraor (nate@bx.psu.edu)
Peter Cock wrote:

> On Tue, Sep 6, 2011 at 3:24 PM, Nate Coraor <[hidden email]> wrote:
> > Edward Kirton wrote:
> >> Peter wrote:
> >> > I wonder if Galaxy would benefit from a new fastqsanger-gzip (etc) datatype?
> >> > However this seems generally useful (not just for FASTQ) so perhaps a more
> >> > general mechanism would be better where tool XML files can say which file
> >> > types they accept and which of those can/must be compressed (possibly not
> >> > just gzip format?).
> >>
> >> Perhaps we can flesh-out what more general solutions would look like...
> >>
> >> Imagine the fastq datatypes were left alone and instead there's a
> >> mechanism by which files which haven't been used as input for x days
> >> get compressed by a cron job.  the file server knows how to uncompress
> >> such files on the fly when needed.  For the most part, files are
> >> uncompressed during analysis and are compressed when the files exist
> >> as an archive within galaxy.
> >
> > Ideally, there'd just be a column on the dataset table indicating
> > whether the dataset is compressed or not, and then tools get a new
> > way to indicate whether they can directly read compressed inputs, or
> > whether the input needs to be decompressed first.
> >
> > --nate
>
> Yes, that's what I was envisioning Nate.
>
> Are there any schemes other than gzip which would make sense?
> Perhaps rather than a boolean column (compressed or not), it
> should specify the kind of compression if any (e.g. gzip).

Makes sense.

> We need something which balances compression efficiency (size)
> with decompression speed, while also being widely supported in
> libraries for maximum tool uptake.

Yes, and there's a side effect of allowing this: you may decrease
efficiency if the tools used downstream all require decompression, and
you waste a bunch of time decompressing the dataset multiple times.

--nate

>
> Peter
>

Re: disk space and file formats

Peter Cock
On Tue, Sep 6, 2011 at 5:00 PM, Nate Coraor <[hidden email]> wrote:

> Peter Cock wrote:
>> On Tue, Sep 6, 2011 at 3:24 PM, Nate Coraor <[hidden email]> wrote:
>> > Ideally, there'd just be a column on the dataset table indicating
>> > whether the dataset is compressed or not, and then tools get a new
>> > way to indicate whether they can directly read compressed inputs, or
>> > whether the input needs to be decompressed first.
>> >
>> > --nate
>>
>> Yes, that's what I was envisioning Nate.
>>
>> Are there any schemes other than gzip which would make sense?
>> Perhaps rather than a boolean column (compressed or not), it
>> should specify the kind of compression if any (e.g. gzip).
>
> Makes sense.
>
>> We need something which balances compression efficiency (size)
>> with decompression speed, while also being widely supported in
>> libraries for maximum tool uptake.
>
> Yes, and there's a side effect of allowing this: you may decrease
> efficiency if the tools used downstream all require decompression,
> and you waste a bunch of time decompressing the dataset multiple
> times.

While decompression wastes CPU time and makes things slower,
there is less data IO from disk (which may be network mounted)
which makes things faster. So overall, depending on the setup
and the task at hand, it could be faster.

Is it time to file an issue on bitbucket to track this potential
enhancement?

Peter

Re: disk space and file formats

Nate Coraor (nate@bx.psu.edu)
Peter Cock wrote:

> On Tue, Sep 6, 2011 at 5:00 PM, Nate Coraor <[hidden email]> wrote:
> > Peter Cock wrote:
> >> On Tue, Sep 6, 2011 at 3:24 PM, Nate Coraor <[hidden email]> wrote:
> >> > Ideally, there'd just be a column on the dataset table indicating
> >> > whether the dataset is compressed or not, and then tools get a new
> >> > way to indicate whether they can directly read compressed inputs, or
> >> > whether the input needs to be decompressed first.
> >> >
> >> > --nate
> >>
> >> Yes, that's what I was envisioning Nate.
> >>
> >> Are there any schemes other than gzip which would make sense?
> >> Perhaps rather than a boolean column (compressed or not), it
> >> should specify the kind of compression if any (e.g. gzip).
> >
> > Makes sense.
> >
> >> We need something which balances compression efficiency (size)
> >> with decompression speed, while also being widely supported in
> >> libraries for maximum tool uptake.
> >
> > Yes, and there's a side effect of allowing this: you may decrease
> > efficiency if the tools used downstream all require decompression,
> > and you waste a bunch of time decompressing the dataset multiple
> > times.
>
> While decompression wastes CPU time and makes things slower,
> there is less data IO from disk (which may be network mounted)
> which makes things faster. So overall, depending on the setup
> and the task at hand, it could be faster.
>
> Is it time to file an issue on bitbucket to track this potential
> enhancement?

Sure.

>
> Peter

Re: disk space and file formats

Edward Kirton
In reply to this post by Peter Cock
copied from another thread:

On Thu, Sep 8, 2011 at 7:30 AM, Anton Nekrutenko <[hidden email]> wrote:
What we are thinking of lately is switching to unaligned BAM for everything. One of the benefits here is the ability to add read groups from day 1, simplifying multisample analyses down the road.

This seems to be the simplest solution; I like it a lot.  Really, only the reads need to be compressed; most other output files are tiny by comparison, so a more general solution may be overkill.  And if compression of everything is desired, ZFS works well; another of our sites (LANL) uses this and recommended it to me too.  I just haven't been able to convince my own IT people to go this route, for technical reasons beyond my attention span.

On Tue, Sep 6, 2011 at 9:05 AM, Peter Cock <[hidden email]> wrote:
On Tue, Sep 6, 2011 at 5:00 PM, Nate Coraor <[hidden email]> wrote:
> Peter Cock wrote:
>> On Tue, Sep 6, 2011 at 3:24 PM, Nate Coraor <[hidden email]> wrote:
>> > Ideally, there'd just be a column on the dataset table indicating
>> > whether the dataset is compressed or not, and then tools get a new
>> > way to indicate whether they can directly read compressed inputs, or
>> > whether the input needs to be decompressed first.
>> >
>> > --nate
>>
>> Yes, that's what I was envisioning Nate.
>>
>> Are there any schemes other than gzip which would make sense?
>> Perhaps rather than a boolean column (compressed or not), it
>> should specify the kind of compression if any (e.g. gzip).
>
> Makes sense.
>
>> We need something which balances compression efficiency (size)
>> with decompression speed, while also being widely supported in
>> libraries for maximum tool uptake.
>
> Yes, and there's a side effect of allowing this: you may decrease
> efficiency if the tools used downstream all require decompression,
> and you waste a bunch of time decompressing the dataset multiple
> times.

While decompression wastes CPU time and makes things slower,
there is less data IO from disk (which may be network mounted)
which makes things faster. So overall, depending on the setup
and the task at hand, it could be faster.

Is it time to file an issue on bitbucket to track this potential
enhancement?

Peter



Re: disk space and file formats

Fields, Christopher J
The use of (unaligned) BAM for read groups seems like a good idea.  At the very least it prevents inconsistently hacking this information into the FASTQ descriptor (a common problem with any simple format).

chris

On Sep 8, 2011, at 1:35 PM, Edward Kirton wrote:

> copied from another thread:
>
> On Thu, Sep 8, 2011 at 7:30 AM, Anton Nekrutenko <[hidden email]> wrote:
> What we are thinking of lately is switching to unaligned BAM for everything. One of the benefits here is the ability to add read groups from day 1, simplifying multisample analyses down the road.
>
> This seems to be the simplest solution; I like it a lot.  Really, only the reads need to be compressed; most other output files are tiny by comparison, so a more general solution may be overkill.  And if compression of everything is desired, ZFS works well; another of our sites (LANL) uses this and recommended it to me too.  I just haven't been able to convince my own IT people to go this route, for technical reasons beyond my attention span.
>
> On Tue, Sep 6, 2011 at 9:05 AM, Peter Cock <[hidden email]> wrote:
> On Tue, Sep 6, 2011 at 5:00 PM, Nate Coraor <[hidden email]> wrote:
> > Peter Cock wrote:
> >> On Tue, Sep 6, 2011 at 3:24 PM, Nate Coraor <[hidden email]> wrote:
> >> > Ideally, there'd just be a column on the dataset table indicating
> >> > whether the dataset is compressed or not, and then tools get a new
> >> > way to indicate whether they can directly read compressed inputs, or
> >> > whether the input needs to be decompressed first.
> >> >
> >> > --nate
> >>
> >> Yes, that's what I was envisioning Nate.
> >>
> >> Are there any schemes other than gzip which would make sense?
> >> Perhaps rather than a boolean column (compressed or not), it
> >> should specify the kind of compression if any (e.g. gzip).
> >
> > Makes sense.
> >
> >> We need something which balances compression efficiency (size)
> >> with decompression speed, while also being widely supported in
> >> libraries for maximum tool uptake.
> >
> > Yes, and there's a side effect of allowing this: you may decrease
> > efficiency if the tools used downstream all require decompression,
> > and you waste a bunch of time decompressing the dataset multiple
> > times.
>
> While decompression wastes CPU time and makes things slower,
> there is less data IO from disk (which may be network mounted)
> which makes things faster. So overall, depending on the setup
> and the task at hand, it could be faster.
>
> Is it time to file an issue on bitbucket to track this potential
> enhancement?
>
> Peter
>

