Recommended Specs for Production System


Recommended Specs for Production System

Ryan Golhar
Hi all - So, I've been asked to provide specs for a production Galaxy
system to support approximately 20-30 users.  Most of these users are
new to bioinformatics and very new to NGS.  I'm targeting a user base
that will use a light to moderate amount of NGS data.

I've looked at the Production Server wiki page, but I'm curious
what everyone else is using or recommends: how big a compute cluster,
how much storage, what proxy/web server configuration, etc.

If you had to deploy a production system, based on what you know, what
would you choose?

Ryan

___________________________________________________________
Please keep all replies on the list by using "reply all"
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:

  http://lists.bx.psu.edu/


Re: Recommended Specs for Production System

Hans-Rudolf Hotz


On 04/07/2011 11:40 PM, Ryan Golhar wrote:

> Hi all - So, I've been asked to provide specs for a production Galaxy
> system to support approximately 20-30 users. [...]
>
> If you had to deploy a production system, based on what you know, what
> would you choose?
>

Hi Ryan


I would go for a single (multicore) box. With just 20-30 users who are
'new to bioinformatics' you will hardly ever have more than 3 users
using Galaxy at the same time - and you can always limit the number of
concurrent Galaxy jobs in the universe_wsgi.ini file
('local_job_queue_workers').
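For reference, a minimal sketch of that setting in universe_wsgi.ini (the section name and the value of 5 are illustrative; check the sample config that ships with your Galaxy release):

```ini
# universe_wsgi.ini -- cap how many jobs the local runner executes at once.
# The section name and the value 5 are illustrative; match it to your cores.
[app:main]
local_job_queue_workers = 5
```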

Since you are expecting NGS data, having the right amount of RAM would
be my biggest concern. What do you mean by "light to moderate amount of
NGS data"? Are you talking about the number of samples to process, or
about the size of the individual samples? The latter will have an impact
on the required amount of RAM, while both will have an impact on the
amount of storage required.

You have to make the calculations for required storage and RAM first,
but this is independent of whether you use Galaxy or not. The only risk
when offering NGS tools via Galaxy is that it might be too easy to run
them, resulting in a lot of 'garbage' or redundant NGS processing. That's
why it is important to disable anonymous access, so you can track who is
doing what.

Using external authentication is very handy. However, it does restrict
you to users already in your 'network'. We are using it, and it is
sometimes annoying, as I can't create temporary guest accounts - our IT
guys would have to create a new 'member' of our institute for every
guest....


Hope this helps, Hans


> Ryan

Re: Recommended Specs for Production System

Assaf Gordon-2
Hello Ryan,

Ryan Golhar wrote, On 04/07/2011 05:40 PM:
> Hi all - So, I've been asked to provide specs for a production Galaxy
> system to support approximately 20-30 users.  Most of these users
> are new to bioinformatics and very new to NGS.  I'm targeting a user
> base that will use a light to moderate amount of NGS data.

I've found that those "new to NGS" cause the most damage :) by running
bowtie/tophat/bwa over and over again, and even "moderate" amounts of NGS data can become taxing very quickly (grooming, for one :) ).

> I've looked at the Production Server wiki page, but I'm
> curious what everyone else is using or recommends? How big of a
> compute cluster, how much storage, proxy/web server configurations,
> etc, etc.

We're using a 16-core, 32GB RAM, ~15TB storage server and it's good for most "regular" Galaxy operations, but severely lacking for NGS mapping.
We are moving to a 48-core, 128GB RAM, 34TB storage server and hope it'll be somewhat better (still not enough for heavy NGS usage).
We're also running some jobs on an SGE cluster.

----
Storage: NGS data sizes grow way faster than storage sizes, so planning is hard.

What would you call a "moderate amount" of NGS data ?

Let's say a single Illumina lane is our unit of choice.
A paired-end run with 72 cycles yielding 35M read pairs (reasonable in today's terms) gives ~15GB per lane.
A paired-end run with 100 cycles on a HiSeq machine will hopefully yield 200M read pairs (in the near future?) - each lane will be ~100GB.
Those numbers are for uncompressed FASTQ files (Galaxy can't handle compressed data at the moment).
Of course your users could be doing just single-end 36-cycle runs - but don't plan for the best-case scenario.
With sequencing costs dropping rapidly, your users will do more sequencing than you expect.
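Those lane sizes can be sanity-checked with a quick back-of-the-envelope calculation. A sketch in Python; the ~40-byte average read header is an assumption, not part of any FASTQ spec:

```python
def lane_fastq_gb(read_pairs, cycles, header_len=40):
    """Rough uncompressed FASTQ size of one paired-end lane, in GB.

    A FASTQ record is four lines: @header, sequence, '+', qualities.
    header_len is an assumed average header-line length.
    """
    reads = 2 * read_pairs  # paired-end: two reads per pair
    # bytes per record: header line + sequence line + '+' line + quality line
    bytes_per_read = (header_len + 1) + (cycles + 1) + 2 + (cycles + 1)
    return reads * bytes_per_read / 1e9

# 72-cycle paired-end lane, 35M read pairs -> close to the ~15GB quoted above
print(round(lane_fastq_gb(35e6, 72)))    # -> 13
# 100-cycle paired-end HiSeq lane, 200M read pairs -> close to ~100GB
print(round(lane_fastq_gb(200e6, 100)))  # -> 98
```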

Now, take the size of one lane of data (15GB, 100GB, whatever), and look at your expected galaxy pipeline:
1. upload the files (size*1)
2. groom the files (argggg. size*1)
3. map with something, get an unsorted SAM file (size*3 to size*5, depending on mapping parameters)
4. Convert to BAM (size*1, luckily it's compressed)
5. use those aligned reads for annotation or similar (size*1)
etc. etc.
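Summing the per-step multipliers above gives the peak disk footprint of one lane moving through the pipeline. A sketch, taking the SAM step at 4x (the middle of the 3x-5x range):

```python
# Disk footprint of each pipeline step, as a multiple of the input FASTQ size
PIPELINE_MULTIPLIERS = {
    "upload": 1,
    "groom": 1,
    "sam": 4,         # unsorted SAM: 3x to 5x depending on mapping parameters
    "bam": 1,         # compressed, luckily
    "annotation": 1,  # downstream use of the aligned reads
}

def lane_footprint_gb(lane_gb):
    """Total disk used by one lane after a typical run, before any cleanup."""
    return lane_gb * sum(PIPELINE_MULTIPLIERS.values())

print(lane_footprint_gb(15))   # -> 120 (GB, for a 15GB lane)
print(lane_footprint_gb(100))  # -> 800 (GB, for a 100GB HiSeq lane)
```

With the SAM step anywhere in its 3x-5x range this lands at 7x-9x per lane, which is why planning for "lane size"*10 of temporary storage is a sensible floor.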

If you use Galaxy's library management, then you'll want to keep all your FASTQ files somehow available for Galaxy at all times - meaning more storage.

The only way we're able to keep storage at 15TB is by aggressively deleting users' datasets (as Hans wrote).
But plan for "lane size"*10 at least for temporary storage, and decide how many lanes you're going to handle at once before you start deleting files.
And probably even that wouldn't be enough :(

-----
Processes:

The server processes that you should plan for are:
1 Galaxy process for the job-runner
2 or 3 Galaxy processes for web front-ends
1 PostgreSQL process
1 Apache process
optionally, 1 galaxy-reports process
You'll also want to leave some free CPUs for SSH access, cron jobs and other peripherals.
PostgreSQL and Apache are multithreaded, but with Galaxy the load on the web/DB front is usually light (even with 30 users), so it balances out.
So all in all, I'd recommend reserving 5 to 8 CPU cores for just Galaxy and the daemons (reserving means never using those cores for Galaxy jobs).
You can make do with fewer cores, but then response times might suffer (and it's annoying when you click "show saved histories" and the page takes 20 seconds to load...).
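The process list above translates into a simple core-reservation budget; a sketch, taking the generous end of each count:

```python
# Cores reserved for Galaxy itself and supporting daemons (generous counts,
# taken from the process list above)
RESERVED_CORES = {
    "job_runner": 1,     # 1 Galaxy job-runner process
    "web_fronts": 3,     # 2 or 3 Galaxy web front-end processes
    "postgres": 1,
    "apache": 1,
    "reports": 1,        # optional galaxy-reports process
    "ssh_cron_misc": 1,  # headroom for SSH, cron jobs, other peripherals
}

total_cores = 48
reserved = sum(RESERVED_CORES.values())
worker_cores = total_cores - reserved
print(reserved, worker_cores)  # -> 8 40
```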

If other people are using different calculations, I'm more than happy to hear.

------
Galaxy jobs:

Compared with NGS-related jobs (i.e. mapping), most Galaxy jobs are very simple and light (even if they take a long time to complete).

Plan by estimating how much time a common pipeline takes for a single lane.
Let's say I have a 48-core server, with 8 cores reserved for Galaxy/daemons;
that leaves me with 40 cores.
I can plan for 1 Galaxy job with 40 threads (bowtie/tophat/bwa/etc.), or 2 jobs with 20 threads, or 4 jobs with 10 threads, etc.
How long do you want your users to wait for their jobs to complete?
With 10 threads per job, 4 users can run jobs at the same time (but each job takes longer).
With 20 threads, only 2 users can run jobs at the same time (but hopefully each job finishes faster).
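The jobs-versus-threads trade-off is just a question of how you divide the worker cores; a sketch:

```python
def concurrency_options(worker_cores):
    """All even (concurrent_jobs, threads_per_job) splits of the worker cores."""
    return [(jobs, worker_cores // jobs)
            for jobs in range(1, worker_cores + 1)
            if worker_cores % jobs == 0]

# A 48-core box minus 8 reserved cores leaves 40 cores for mapping jobs
for jobs, threads in concurrency_options(40):
    print(f"{jobs} concurrent job(s) x {threads} threads each")
```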

As Hans wrote, most of the time just a few users are actively using your Galaxy server at any single point in time.
But if two users started mapping jobs that will take 4 hours to complete, and three hours later another user wants to start a mapping job - he'll have to wait...

So it's a balancing act between providing users fast results, and keeping the costs down (with fewer cores/nodes).

I don't know of a good textbook answer here; it depends on what your users are willing to accept.
If you get a whole flowcell (7 or 8 lanes, one for each user), the first two or four users can run jobs immediately and the rest will have to wait 5 hours - is that acceptable? If not, get more nodes.

If you buy more than one node (i.e. not just one machine with 48 cores), I'd recommend going for fewer nodes with many cores (as opposed to many nodes with just 4 cores). It seems most common tools today make better use of threads (requiring SMP) than of MPI or similar non-shared-memory processing.


Comments are always welcomed,
 -gordon

Re: Recommended Specs for Production System

Assaf Gordon-2
Assaf Gordon wrote, On 04/08/2011 10:07 AM:

> Processes:
> [...]
> So all in all, I'd recommend reserving 5 to 8 CPU cores to just galaxy and daemons (reserving means: never using those cores for galaxy jobs).
>
I forgot to mention SGE/PBS: you definitely want to use one of them (even if you're running on a single machine),
because the local job runner doesn't take multi-threaded programs into account when scheduling jobs.
So another core is needed for the SGE scheduler daemons (sge_qmaster and sge_execd).
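To make the scheduler thread-aware, each tool's jobs need to request a slot count matching their thread usage. A hypothetical SGE submission; the parallel environment name `smp` is site-specific and assumed here:

```shell
# Request 8 slots in a shared-memory parallel environment so SGE accounts
# for all 8 bowtie threads. 'smp' is a common PE name, but yours may
# differ -- list the configured PEs with `qconf -spl`.
qsub -pe smp 8 -cwd -b y bowtie -p 8 e_coli reads.fastq output.map
```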



Re: Recommended Specs for Production System

Nate Coraor (nate@bx.psu.edu)
Assaf Gordon wrote:

> Forgot to mention SGE/PBS: you definitely want to use them (even if you're using a single machine),
> because the local job runner doesn't take into account multi-threaded programs when scheduling jobs.
> So another core is needed for the SGE scheduler daemons (sge_qmaster and sge_execd).

I haven't tested, but it's entirely possible that the SGE daemons could
happily share cores with other processes.  I'd be surprised if they
spent a whole lot of time on-CPU.

A cluster runner is recommended for other reasons, too - restartability
of the Galaxy process is one of the big ones.

--nate

Re: Recommended Specs for Production System

Sean Davis
On Fri, Apr 8, 2011 at 10:26 AM, Nate Coraor <[hidden email]> wrote:
> Assaf Gordon wrote:
>
>> Forgot to mention SGE/PBS: you definitely want to use them (even if you're using a single machine),
>> because the local job runner doesn't take into account multi-threaded programs when scheduling jobs.
>> So another core is needed for the SGE scheduler daemons (sge_qmaster and sge_execd).
>
> I haven't tested, but it's entirely possible that the SGE daemons could
> happily share cores with other processes.  I'd be surprised if they
> spent a whole lot of time on-CPU.

We run SGE for NGS and do not find a need to set aside cores for the
daemons.  That said, if you do have an active cluster (more than a
couple of machines), the SGE master node does benefit from having a
core set aside.

Sean

> A cluster runner is recommended for other reasons, too - restartability
> of the Galaxy process is one of the big ones.
>
> --nate

Re: Recommended Specs for Production System

Dave Walton
Assaf's setup is very close to our config, except:
We run all of this on a 4-core virtual machine running SUSE Linux Enterprise
Server 11 (x86_64) with 16 GB of memory.

Instead of SGE, our HPC cluster uses Torque/Moab for scheduling.

Also, we've set up a separate IO node for upload of data files from the file
system and via FTP (correct me if I misspoke, Glen).

Also, instead of Apache we run nginx as our httpd server, as it was easy to
get off-loading of file upload and download working with that server.

We're not seeing a heavy load from users at this point, but this has worked
pretty well for us so far.

Hope this helps,

Dave




Re: Recommended Specs for Production System

Glen Beane

On Apr 8, 2011, at 11:01 AM, Dave Walton wrote:

> [...]
> Also, we've set up a separate IO Node for upload of data files from the file
> system and FTP (correct me if I mis-spoke Glen).
> [...]


The only reason we offload the upload jobs somewhere other than our HPC cluster is that our cluster nodes do not see the outside world.  Our IT folks did not really want to change the network configuration, so we installed TORQUE on a spare Linux server, mounted our Galaxy network storage on it, and set up some upload-specific job runners that send those jobs to that node.  If you have NAT set up on your cluster, you probably don't need to worry about this.

We have pretty "fat" cluster nodes (128GB RAM and 32 cores) since we run a lot of multi-threaded jobs on the cluster but not a lot of MPI jobs.  Our NGS tools are typically configured to use 16-32 threads.





--
Glen L. Beane
Senior Software Engineer
The Jackson Laboratory
(207) 288-6153




