Galaxy on the Cloud/RNA-Seq

Galaxy on the Cloud/RNA-Seq

dmarti
Hello,

We are about to receive about 200 GB of Illumina reads (43 bp) from 20 samples, two groups of 10 animals. We are hoping to use Galaxy on the Cloud to compare gene expression between the two groups. First of all, do you think this is possible with the current state of Galaxy Cloud development?

Secondly, we are currently practicing with small Drosophila datasets (4 sets of 2 GB each), and over the course of a few days of doing relatively little besides grooming and filtering the data, we had already been charged $60 by Amazon, which seemed inefficient. What is the best way to carry work over from one day to the next? Should one terminate the cluster at the Cloud Console, then stop (pause) the cluster at the AWS console, and restart the instance the next day? Does one have to reattach all of the EBS volumes before restarting the cluster? We were simply terminating the instance and then bringing it back up, and all the data was still there, i.e. it worked fine; but when we looked after a couple of days there were 45 EBS volumes, much of that surely redundant, as our data wasn't very large. Perhaps we need to take a snapshot and reboot the instance from that?

Thank you for any hints regarding this matter; this is all very new to me. Let me know if you need clarification or more information.

David Martin
[hidden email]

_______________________________________________
galaxy-user mailing list
[hidden email]
http://lists.bx.psu.edu/listinfo/galaxy-user

Re: Galaxy on the Cloud/RNA-Seq

Maximilian Haussler
I'd be interested in why AWS is so expensive for these datasets. Is it mostly
a) the data transfer between nodes,
b) the data storage on EBS, or
c) the CPU time?
In short: why is next-gen analysis expensive on the cloud?

Can anyone who is actively using AWS break down their total cost across these individual categories?

I'd guess there is a lot of room for improvement on each of these costs, depending on the type of algorithm you're using.

thanks in advance
Max



Re: Galaxy on the Cloud/RNA-Seq

Anton Nekrutenko
In reply to this post by dmarti
David:

For a pilot, I would just use our public instance at http://usegalaxy.org to polish up the exact workflow and settings that give you satisfactory results on a subset of the data. This way it will be much easier to figure out where you can "cut corners" for performance. You will then have a "best-practice" workflow that you can rerun on the cloud.

Use the new ftp-based upload to get datasets into Galaxy.

Thanks!

anton




Re: Galaxy on the Cloud/RNA-Seq

Enis Afgan-2
In reply to this post by dmarti
Your approach for terminating a cluster and starting it back up when it's needed should continue to be fine for your purposes. That's the best and pretty much the only way to minimize the cost. 
The reason there are 45 EBS volumes is that each time you start an instance, a root EBS volume is created from snapshot 'snap-f3a64f99' to serve as the root file system. When you terminate that particular instance, that EBS volume is no longer needed and can be deleted (in the next AMI we build, we will enable automatic deletion of that volume upon instance termination). In other words, feel free to delete all EBS volumes that were created from a snapshot; they are recreated when needed. The only volume that should not be deleted is your data volume. The ID of this volume can be found in your cluster's bucket (cm-<HASH>) in your S3 account, in a file named persistent_data.txt.
As a note, don't attach/detach EBS volumes manually on running Galaxy Cloud instances, because the application will lose track of them and not be able to recover. In addition, always click 'Terminate cluster' in the Galaxy Cloud main UI and wait for it to shut down all of the services; then *terminate* the master instance from the AWS console (don't *stop* the instance).
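That cleanup rule (delete detached volumes cloned from the root snapshot, never the data volume) can be sketched in code. This is a hypothetical helper, not part of Galaxy Cloud: it assumes volumes described as DescribeVolumes-style dicts (today, boto3's `ec2.describe_volumes()` returns this shape); only the snapshot ID comes from this thread, and the volume IDs are made up.

```python
# Sketch: pick out leftover root-filesystem EBS volumes that are safe to
# delete, while protecting the data volume. Volume dicts mimic the shape of
# EC2's DescribeVolumes output; IDs below are invented for illustration.

ROOT_SNAPSHOT_ID = "snap-f3a64f99"  # root snapshot mentioned in this thread

def deletable_volumes(volumes, data_volume_id, root_snapshot_id=ROOT_SNAPSHOT_ID):
    """Return IDs of detached volumes that were cloned from the root snapshot."""
    return [
        v["VolumeId"]
        for v in volumes
        if v.get("SnapshotId") == root_snapshot_id  # created from root snapshot
        and v.get("State") == "available"           # detached, i.e. not in use
        and v["VolumeId"] != data_volume_id         # never touch the data volume
    ]

# Example: one stale root volume, one still attached, and a 100 GB data volume.
volumes = [
    {"VolumeId": "vol-1", "SnapshotId": "snap-f3a64f99", "State": "available", "Size": 15},
    {"VolumeId": "vol-2", "SnapshotId": "snap-f3a64f99", "State": "in-use",    "Size": 15},
    {"VolumeId": "vol-3", "SnapshotId": "",              "State": "available", "Size": 100},
]
print(deletable_volumes(volumes, data_volume_id="vol-3"))  # -> ['vol-1']
```

The "State == available" check is the safety net: an attached root volume belongs to a still-running instance and is skipped.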

As for uploading 200 GB of data to a cloud instance and processing it there: in principle, it should work. However, Amazon imposes a 1 TB limit on EBS volumes. Considering the multiple transformation steps your data will go through within Galaxy, I am concerned that you will reach that 1 TB limit. We will be working on going beyond it by composing a file system from multiple EBS volumes, but that is not available yet.

Hope this helps; let us know if you have any more questions,
Enis


Re: Galaxy on the Cloud/RNA-Seq

Enis Afgan-2
In reply to this post by Maximilian Haussler
It's just that computing, and cloud computing with it, is expensive. Depending on usage, either the EBS volumes or the CPU time (i.e., the instances) will represent the majority of the cost. Most likely it will be the instances, unless you use very few instances for a short period alongside a lot of storage.

There are a couple of papers I can recall analyzing the cost of science in the cloud, if you want to take a look:
- Deelman E, Singh G, Livny M, Berriman B, Good J: The cost of doing science on the cloud: the Montage example
- Wilkening J, Wilke A, Desai N, Meyer F: Using Clouds for Metagenomics: A Case Study


Enis


Re: Galaxy on the Cloud/RNA-Seq

Maximilian Haussler
OK OK, cloud computing is expensive.

But I also know from my own experience that you can cut I/O by a factor of 10-20, and CPU by a factor of ten as well:
- use bowtie for mapping (though the index is quite big): saves a lot of CPU
- compress input fastq files (reduces size to about 1/5) and read only the compressed files
- extreme solution: strip all quality values from the fastq (reduces size to about 1/4)
- remove all file-concatenation steps
- pipe into samtools to convert to BAM immediately after mapping, and always save in BAM format
- strip all unmapped reads directly with samtools view -F 4
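Two of the steps above, compressing fastq and stripping quality values (i.e., converting FASTQ to FASTA), are easy to sketch. This is an illustrative sketch, not anything Galaxy does internally; the reads and file name are invented.

```python
# Sketch of two space-saving steps: FASTQ -> FASTA conversion (drops quality
# values) and reading/writing the reads gzip-compressed so the uncompressed
# FASTQ never has to sit on an EBS volume. Read data is made up.
import gzip
import os
import tempfile

def fastq_to_fasta(fastq_lines):
    """Keep only header and sequence lines; drop '+' separators and qualities."""
    out = []
    for i, line in enumerate(fastq_lines):
        if i % 4 == 0:            # @header becomes >header
            out.append(">" + line[1:])
        elif i % 4 == 1:          # sequence line, kept as-is
            out.append(line)
    return out

reads = ["@read1", "ACGTACGT", "+", "IIIIIIII",
         "@read2", "TTGGCCAA", "+", "IIIIIIII"]
print(fastq_to_fasta(reads))  # -> ['>read1', 'ACGTACGT', '>read2', 'TTGGCCAA']

# Round-trip through a gzip-compressed file in text mode:
path = os.path.join(tempfile.mkdtemp(), "reads.fastq.gz")
with gzip.open(path, "wt") as fh:
    fh.write("\n".join(reads) + "\n")
with gzip.open(path, "rt") as fh:
    assert fh.read().splitlines() == reads
```

Each FASTQ record is four lines, which is why the record layout can be recovered from the line index modulo 4.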

but I wonder how much that would save in the end...?

cheers
Max
--
Maximilian Haussler
Tel: +447574246789
http://www.manchester.ac.uk/research/maximilian.haussler/



Re: Galaxy on the Cloud/RNA-Seq

Hiram Clawson
In reply to this post by Enis Afgan-2
For reference: it costs about $500 per month to keep a single AMI instance with several CPUs running.

--Hiram
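Hiram's figure is easy to sanity-check with simple arithmetic. In this sketch the hourly rate is an assumption (roughly a 2010-era large on-demand instance price), not a number from this thread or from AWS.

```python
# Back-of-the-envelope instance cost behind a figure like Hiram's.
# The $0.68/hour rate is an assumption, not an official AWS quote.
hourly_rate = 0.68                   # USD per instance-hour (assumed)
monthly = hourly_rate * 24 * 30      # running around the clock
print(f"running 24/7: ${monthly:.0f}/month")   # -> about $490, close to $500

# Terminating the cluster outside working hours changes the picture:
part_time = hourly_rate * 8 * 22     # 8 h/day, 22 weekdays (assumed schedule)
print(f"working hours only: ${part_time:.0f}/month")
```

This is why the terminate-and-restart routine discussed earlier in the thread matters: instance-hours dominate the bill when the cluster runs continuously.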

Re: Galaxy on the Cloud/RNA-Seq

Enis Afgan-2
In reply to this post by Enis Afgan-2
Martin,

On Tue, Nov 23, 2010 at 4:52 PM, Martin, David A. <[hidden email]> wrote:

Enis,

Thank you for the help. So, assume I start up a new instance and specify 1 master with 100 GB and 5 worker nodes; several EBS volumes will be created. Now, when I finish working in Galaxy, I should first terminate the cluster from the Cloud Console, and then terminate the master and workers from the AWS console, right? At that point several EBS volumes are still up and can be deleted, except for one. To identify which volume has the data, I should look in my S3 bucket at the file persistent_data.txt, no? Is this file in a snapshot that is automatically created when I terminate an instance through the Cloud Console?

Once you've finished your work with Galaxy for the time being, yes, click 'Terminate cluster' on the Galaxy Cloud console. That will stop all of the services running on the cluster and also terminate all of the worker nodes. Once you see 'Cluster shut down...' at the bottom of the cluster status log on the Galaxy Cloud console, terminate the master instance from the AWS console. That is the only instance that should still be running at that point.
Then you can delete all of the EBS volumes that were created from a snapshot. These should all be 15 GB in size and created from snapshot 'snap-f3a64f99' (there should be 6 of them in your example: 1 from the master and 5 from the workers). That should be it. You don't really need to go digging through the persistent_data.txt file in the S3 bucket, because your data volume should be the only one still available at that point; you can also always pick it out from the rest by its size (100 GB in your example).


Regarding the 1 TB limit, I am thinking that intermediate files can be moved out of the persistent storage as they are no longer needed and saved somewhere else, so that Galaxy is never working with more than 1 TB... I am unsure how tricky this will be, but I suppose it is possible in principle? My conception of the workflow is limited, but I think we need to convert (groom) to Sanger quality scores, map with TopHat/Bowtie, and then use Cufflinks to compare expression. I am trying to practice ahead of time to see what kinds and sizes of files these steps generate and figure out how this can work on the cloud. Thanks again.

I guess that could work but realize that you'll have to ssh to the instance and clean up the datasets by hand.

 
Enis

