collections with more than 25,000 items

Jochen Bick
Hi,

Is there any limit on running BLAST jobs from a collection of single FASTA
files? I started a job but it does not get executed... it's just been sending
for about an hour.

Cheers Jochen

--
ETH Zurich
*Jochen Bick*
Animal Physiology
Institute of Agricultural Sciences
Postal address: Universitätstrasse 2 / LFW B 58.1
Office: Tannenstrasse 1 / TAN D 6.2
8092 Zurich, Switzerland

Phone +41 44 632 28 25
[hidden email] <mailto:[hidden email]>
www.ap.ethz.ch
___________________________________________________________
Please keep all replies on the list by using "reply all"
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:
  https://lists.galaxyproject.org/

To search Galaxy mailing lists use the unified search at:
  http://galaxyproject.org/search/

Re: collections with more than 25,000 items

Peter Cock
If there are any limits, they would come from the Galaxy admin's job
settings, so something generic to collections.

Personally I've not done this - I tend to concatenate FASTA files
to make large files with multiple sequences instead.

(And then we have the optional task splitting enabled so that Galaxy
breaks up the multiple-sequence FASTA file into chunks which
get shared out on our cluster for better throughput before
concatenating the output back into a single file.)
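
For anyone who wants to try the concatenation approach outside Galaxy, here is a minimal sketch. The function name `concatenate_fasta` and the file names are hypothetical, and it assumes the input files are plain, uncompressed FASTA.

```python
from pathlib import Path
from typing import Iterable


def concatenate_fasta(paths: Iterable[Path], output: Path) -> int:
    """Concatenate FASTA files into one multi-sequence file.

    Returns the number of records (header lines starting with '>') written.
    """
    records = 0
    with output.open("w") as out:
        for path in paths:
            text = path.read_text()
            if not text.strip():
                continue  # skip empty files
            if not text.endswith("\n"):
                text += "\n"  # keep the next header on its own line
            records += sum(1 for line in text.splitlines() if line.startswith(">"))
            out.write(text)
    return records
```

The explicit newline check matters because a file that ends without a line break would otherwise glue its last sequence line to the next file's header.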

Peter
On Thu, Aug 30, 2018 at 3:37 PM Jochen Bick <[hidden email]> wrote:

>
> Hi,
>
> Is there any limit on running BLAST jobs from a collection of single FASTA
> files? I started a job but it does not get executed... it's just been sending
> for about an hour.
>
> Cheers Jochen

Re: collections with more than 25,000 items

Peter Cock
There is a sweet spot for splitting your BLAST query FASTA file
by sequence - one big file with 25000 sequences is not great,
but one sequence per file is the worst possible option.

This is due to all the extra overheads: you would have 25000
jobs submitted to the cluster, each of which would load the
BLAST binary and database off disk, etc. There are also
going to be Galaxy overheads with a large collection.

I would suggest somewhere around 500 to 1000 gene sequences
per FASTA query file is likely a safe choice. If you have very
long sequences (e.g. chromosomes or contigs), then use fewer.
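
As a rough illustration of that chunking, here is a minimal sketch that splits one multi-sequence FASTA file into files of a fixed number of records. The names `read_fasta_records` and `split_fasta`, the `chunk_0001.fasta` naming scheme, and the default of 500 are assumptions for the example, not anything Galaxy does internally.

```python
from pathlib import Path
from typing import Iterator, List


def read_fasta_records(path: Path) -> Iterator[str]:
    """Yield each FASTA record (header plus sequence lines) as one string."""
    record: List[str] = []
    with path.open() as handle:
        for line in handle:
            if line.startswith(">") and record:
                yield "".join(record)
                record = []
            if line.strip():  # drop blank lines
                record.append(line if line.endswith("\n") else line + "\n")
    if record:
        yield "".join(record)


def split_fasta(path: Path, out_dir: Path, per_file: int = 500) -> List[Path]:
    """Write chunks of at most `per_file` records; return the chunk paths."""
    if per_file < 1:
        raise ValueError("per_file must be at least 1")
    out_dir.mkdir(parents=True, exist_ok=True)
    chunks: List[Path] = []
    batch: List[str] = []

    def flush() -> None:
        # Hypothetical naming scheme: chunk_0001.fasta, chunk_0002.fasta, ...
        chunk_path = out_dir / f"chunk_{len(chunks) + 1:04d}.fasta"
        chunk_path.write_text("".join(batch))
        chunks.append(chunk_path)
        batch.clear()

    for record in read_fasta_records(path):
        batch.append(record)
        if len(batch) == per_file:
            flush()
    if batch:
        flush()  # final, possibly smaller, chunk
    return chunks
```

With 25000 query sequences, 500 per file gives 50 chunk files and 1000 per file gives 25, instead of 25000 single-sequence jobs.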

As to the number of threads for each BLAST job, more is better,
but what to pick will depend on your cluster and how often there
are threads free on nodes. I would suggest trying 4, 8 or 16 threads.
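
To make the thread setting concrete, here is a small sketch that builds a BLAST+ command line with `-num_threads`. It assumes a protein search with `blastp` against a local database named `nr` and tabular output (`-outfmt 6`); the function name and file names are hypothetical, and in Galaxy the thread count would normally come from the admin's job configuration rather than from a script like this.

```python
from typing import List


def build_blast_command(query: str, out: str, threads: int = 8,
                        program: str = "blastp", db: str = "nr") -> List[str]:
    """Return an argument list for one BLAST+ job using `threads` threads."""
    if threads < 1:
        raise ValueError("threads must be at least 1")
    return [
        program,
        "-query", query,
        "-db", db,
        "-num_threads", str(threads),
        "-outfmt", "6",  # tabular output, assumed here for illustration
        "-out", out,
    ]


# Example: one command per chunk file, 8 threads each.
command = build_blast_command("chunk_0001.fasta", "chunk_0001.tsv", threads=8)
```

The list can be passed to `subprocess.run(command, check=True)` on a machine where BLAST+ and the database are installed.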

I hope that helps.

Peter


On Thu, Aug 30, 2018 at 3:50 PM Jochen Bick <[hidden email]> wrote:

>
> Thanks Peter,
>
> so my idea was to split my problem into single BLAST jobs and run each
> of them on only one core...
> My file has 25000 sequences and I'm BLASTing them against all NCBI
> proteins (nr). This just takes too long, I guess because the database
> is also very big? I tested this on the first 10 sequences and it took
> about 10 minutes. But maybe this is still not faster than running them
> all at once? How many cores would you give such a job?
>
> Cheers Jochen

Re: collections with more than 25,000 items

Jochen Bick
Thanks this helps a lot!

Cheers Jochen


Re: collections with more than 25,000 items

Mohammad Heydarian-2
In reply to this post by Peter Cock
We initially had issues with Collections containing thousands of datasets; these were related to the limit on the number of jobs in the Slurm queue. Greatly increasing this limit fixed our issue.


Cheers, 
Mo Heydarian


