resubmission on out of memory


resubmission on out of memory

Matthias Bernt
Dear list,

I recall that it's possible to configure a tool such that out-of-memory
conditions (and run time) can be recognized (by regexp matching
on stdout/stderr). Can this be used to trigger job resubmission on the
cluster?

Could someone please point me to some kind of documentation, if this is
the case?

Best,
Matthias

--

-------------------------------------------
Matthias Bernt
Bioinformatics Service
Molekulare Systembiologie (MOLSYB)
Helmholtz-Zentrum für Umweltforschung GmbH - UFZ/
Helmholtz Centre for Environmental Research GmbH - UFZ
Permoserstraße 15, 04318 Leipzig, Germany
Phone +49 341 235 482296,
[hidden email], www.ufz.de

Sitz der Gesellschaft/Registered Office: Leipzig
Registergericht/Registration Office: Amtsgericht Leipzig
Handelsregister Nr./Trade Register Nr.: B 4703
Vorsitzender des Aufsichtsrats/Chairman of the Supervisory Board:
MinDirig Wilfried Kraus
Wissenschaftlicher Geschäftsführer/Scientific Managing Director:
Prof. Dr. Dr. h.c. Georg Teutsch
Administrative Geschäftsführerin/ Administrative Managing Director:
Prof. Dr. Heike Graßmann
-------------------------------------------
___________________________________________________________
Please keep all replies on the list by using "reply all"
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:
  https://lists.galaxyproject.org/

To search Galaxy mailing lists use the unified search at:
  http://galaxyproject.org/search/

Re: resubmission on out of memory

John Chilton-4
Something like this is possible, with some caveats. Memory and walltime
errors can be detected - not by regex matching in tools, though, but by
the job runner. I think the SLURM runner implements detection of
out-of-memory errors and timeouts - I don't think most of the other
runners do.

When I started hacking on this feature, there was no documentation for
it and I wanted to understand how it worked and verify that it worked
so I wrote a test case. The problem is the test case tests a bunch of
different features all at once - so it will be a lot to walk through
and you will need to understand dynamic job destinations and such:

https://github.com/galaxyproject/galaxy/blob/dev/test/integration/resubmission_job_conf.xml
https://github.com/galaxyproject/galaxy/commit/0559cff6e94b250ddd98275b119ab51b36491e34

That said let me see if I can come up with a simple example:


<job_conf>
  <plugins>
    <!-- Set up a SLURM runner here, or update another runner to detect
         these conditions and set it up here. -->
  </plugins>

  <destinations default="small_fast_host">
    <destination id="small_fast_host" runner="slurm">
      <param name="native_specification">SHORT_WALLTIME_SMALL_MEMORY_OPTS_FOR_YOUR_CLUSTER</param>
      <!-- Walltime error detected by the runner: retry with a longer walltime. -->
      <resubmit condition="walltime_reached" destination="longer_walltime_dest" />
      <!-- Memory error detected by the runner: retry with more memory. -->
      <resubmit condition="memory_limit_reached" destination="bigger_memory_dest" />
      <!-- Job failed almost immediately: wait a bit, then retry on the same destination.
           "attempt" is assumed to be the variable name, matching the delay expression. -->
      <resubmit condition="seconds_running &lt; 5 and attempt &lt; 3" delay="attempt * 1.5" destination="small_fast_host" />
    </destination>
    <destination id="longer_walltime_dest" runner="slurm">
      <param name="native_specification">LONGER_WALLTIME_FOR_YOUR_CLUSTERS</param>
    </destination>
    <destination id="bigger_memory_dest" runner="slurm">
      <param name="native_specification">BIGGER_MEMORY_FOR_YOUR_CLUSTERS</param>
    </destination>
  </destinations>

  <tools />
</job_conf>

Here you would fill in the native_specification for each destination
to redirect jobs as needed. Everything goes through an initial
destination (though you could parameterize this and have any number of
initial destinations). That destination is going to resubmit under 3
different conditions. If a walltime error is detected by the job
runner, it will resubmit to a destination that you have to configure
with a longer walltime (with id="longer_walltime_dest") - perhaps this
is a different cluster with longer wait times and correspondingly longer
walltimes. Likewise, if a memory error is detected, it will resubmit
to "bigger_memory_dest" (perhaps a special part of your cluster with
larger memory servers or a large shared memory machine). Finally, to
show off some coolness I added: if the job fails right away (within
the first 5 seconds), it will delay the job a bit and then retry the
submission, as long as the attempt count is still below the limit in
the condition (3 in the example above). This may be good at working
around random cluster failures during submission if things get busy.
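
To make the three rules above concrete, here is a small Python model of the decision they describe. It is only a sketch and not Galaxy's implementation: the FailedJob class and pick_resubmit function are hypothetical names, and it assumes that rules are checked in the order they are listed, that the first match wins, that the attempt counter starts at 1, and that the delay is in seconds.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class FailedJob:
    """Hypothetical summary of a failed job, as a runner might report it."""
    walltime_reached: bool = False
    memory_limit_reached: bool = False
    seconds_running: float = 0.0
    attempt: int = 1  # assumed to be 1 for the first submission


def pick_resubmit(job: FailedJob) -> Optional[Tuple[str, float]]:
    """Return (destination id, delay in seconds) for the first matching rule.

    Returns None when no rule matches, i.e. the job simply fails.
    """
    if job.walltime_reached:
        return ("longer_walltime_dest", 0.0)
    if job.memory_limit_reached:
        return ("bigger_memory_dest", 0.0)
    # Quick failure: retry on the same destination with a growing delay.
    if job.seconds_running < 5 and job.attempt < 3:
        return ("small_fast_host", job.attempt * 1.5)
    return None


if __name__ == "__main__":
    print(pick_resubmit(FailedJob(memory_limit_reached=True)))
    print(pick_resubmit(FailedJob(seconds_running=2, attempt=2)))
    print(pick_resubmit(FailedJob(seconds_running=2, attempt=3)))
```

With these assumptions, a job that dies after 2 seconds on its second attempt goes back to "small_fast_host" after a 3 second delay, and the same failure on the third attempt is not resubmitted.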

The test case covers allowing users to supply parameters to assist
with finding destinations and controlling resubmission, as well as dynamic
destinations and how they may interact with these concepts.

Like you mentioned - it would be wonderful if tools could look at
their output and determine if memory problems were encountered - I
guess this is tracked here
(https://github.com/galaxyproject/galaxy/issues/3107). It is a medium
priority for me - so I may get to it at some point. This sort of thing
is important when scaling up analyses.
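
For illustration only, this is roughly what such an output check could look like as standalone Python. It is not an existing Galaxy feature or API: the function name looks_like_oom is hypothetical, and the patterns are assumptions, since real messages depend on the tool, the language runtime and the scheduler.

```python
import re

# Illustrative patterns only; extend or replace them for the tools you run.
OOM_PATTERNS = [
    re.compile(r"out of memory", re.IGNORECASE),
    re.compile(r"MemoryError"),                    # Python
    re.compile(r"std::bad_alloc"),                 # C++
    re.compile(r"java\.lang\.OutOfMemoryError"),   # JVM
]


def looks_like_oom(output_text: str) -> bool:
    """Return True if any known out-of-memory pattern occurs in the text."""
    return any(pattern.search(output_text) for pattern in OOM_PATTERNS)


if __name__ == "__main__":
    stderr_text = "terminate called after throwing an instance of 'std::bad_alloc'"
    print(looks_like_oom(stderr_text))
```

A check like this is only as good as its pattern list, which is one reason detection by the job runner is the more reliable route where it is available.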

-John



On Tue, Sep 19, 2017 at 4:26 PM, Matthias Bernt <[hidden email]> wrote:

> Dear list,
>
> I recall that it's possible to configure a tool such that out-of-memory
> conditions (and run time) can be recognized (by regexp matching on
> stdout/stderr). Can this be used to trigger job resubmission on the
> cluster?
>
> Could someone please point me to some kind of documentation, if this is the
> case?
>
> Best,
> Matthias

Bug in "View all histories", history "disappears"

Christopher Previti
Dear all,

I noticed a bug in Galaxy (17.05), specifically when using "View all histories"  to switch between different histories.
It happened to me multiple times that the history I switched away from was not shown anymore.
Searching for the history by name made it show up again, but still not ideal.
I have about 10 histories that I cycle through on a regular basis. Doesn't the system support that many?

Cheers,
Christopher


--
Dr. Christopher Previti
Genomics and
Proteomics Core Facility
High Throughput Sequencing (W190)
Bioinformatician

German Cancer Research Center (DKFZ)
Foundation under Public Law
Im Neuenheimer Feld 580
69120 Heidelberg
Germany
Room: B2.102 (INF580/TP3)
Phone: +49 6221 42-4434

christopher.previti@...
www.dkfz.de

Management Board: Prof. Dr. Michael Baumann, Prof. Dr. Josef Puchta
VAT-ID No.: DE143293537

Vertraulichkeitshinweis: Diese Nachricht ist ausschließlich für die Personen bestimmt, an die sie adressiert ist.
Sie kann vertrauliche und/oder nur für den/die Empfänger bestimmte Informationen enthalten. Sollten Sie nicht
der bestimmungsgemäße Empfänger sein, kontaktieren Sie bitte den Absender und löschen Sie die Mitteilung.
Jegliche unbefugte Verwendung der Informationen in dieser Nachricht ist untersagt.



    


Re: Bug in "View all histories", history "disappears"

Marius van den Beek
Hello Christopher,

This is a longstanding bug that we recently fixed (initial description https://github.com/galaxyproject/galaxy/issues/3519, fix https://github.com/galaxyproject/galaxy/pull/4610).
The fix will be included in Galaxy release 17.09, and it has also been backported to release 16.10 and newer. If you update your Galaxy instance to the
latest commit, this problem should be gone.

Best,
Marius

On 27 September 2017 at 09:26, Previti <[hidden email]> wrote:
Dear all,

I noticed a bug in Galaxy (17.05), specifically when using "View all histories"  to switch between different histories.
It happened to me multiple times that the history I switched away from was not shown anymore.
Searching for the history by name made it show up again, but still not ideal.
I have about 10 histories that I cycle through on a regular basis. Doesn't the system support that many?

Cheers,
Christopher

