Plans for workflow & parallelisation work?

Peter van Heusden
Hi there

I see from the PRs landing in Galaxy and the comments on things like issue #1701 (https://github.com/galaxyproject/galaxy/issues/1701) that there's lots of work happening on the workflow side of Galaxy. This is an area of interest at SANBI too, so we'd like to coordinate development efforts as much as possible. To this end:

1) Are there forks to track so we can see what new code is landing?
2) Is there a roadmap for workflow work or perhaps can we have a Hangout to talk about this?
3) Specifically in terms of workflows and parallelisation: are there any plans to work on running workflows as opposed to just generating lots of jobs? I know this is a major change to how Galaxy works - it would mean something like submitting a workflow specification to a job runner that is located on the cluster, and then returning the results of workflow execution. 
4) Currently parallelisation in Galaxy is supported using two mechanisms: collections and dataset splitters/tasks. Are there plans on extending and harmonising Galaxy's parallelisation capabilities?

Thanks,
Peter



___________________________________________________________
Please keep all replies on the list by using "reply all"
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:
  https://lists.galaxyproject.org/

To search Galaxy mailing lists use the unified search at:
  http://galaxyproject.org/search/mailinglists/

Re: Plans for workflow & parallelisation work?

Peter Cock
On Mon, Feb 22, 2016 at 7:57 AM, Peter van Heusden <[hidden email]> wrote:
> Hi there
>
> ...
>
> 4) Currently parallelisation in Galaxy is supported using two mechanisms:
> collections and dataset splitters/tasks. Are there plans on extending and
> harmonising Galaxy's parallelisation capabilities?

I'm not sure there is anything formal, but chatting to John and others
at GCC2015 we recognised that the split/merge capabilities in the
Python datatype classes have a lot of functional overlap with
splitting datasets into collections and merging collections back
into datasets.

https://wiki.galaxyproject.org/Events/GCC2015/BoFs/DataSplittingAndParallelism

One idea we mooted was defining (pseudo) tools for dataset splitting
and merging using the existing datatype classes, with similar integration
into the framework as the datatype converter tools.

e.g. you could in principle merge a collection of text files using the
text datatype's merge functionality (which is essentially a cat
command).
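
To make the "essentially a cat" point concrete, here is a minimal
sketch of such a merge in plain Python. It is not Galaxy's datatype
code; the function name merge_text_files and its arguments are
hypothetical, chosen only for illustration.

```python
import shutil
from pathlib import Path


def merge_text_files(parts, merged_path):
    """Concatenate the files in `parts`, in order, into `merged_path`.

    Hypothetical helper: it mimics `cat part1 part2 ... > merged` and is
    not taken from Galaxy's datatype classes.
    """
    with open(merged_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                # Stream each part so large files are never held in memory.
                shutil.copyfileobj(src, out)
    return Path(merged_path)
```

The order of `parts` matters here, so a merge tool built this way would
need the collection's element order to be preserved.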

There are a lot of details to think about, particularly for splitting
where currently tool wrappers using parallelisation have some
control (e.g. split a large FASTA file into chunks of 1000 sequences),
which might need to be exposed in any UI for creating a collection
from a single file.
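
As a rough illustration of the kind of control meant here, the sketch
below splits FASTA lines into chunks of a given number of sequences.
It is not the existing Galaxy splitter; split_fasta and chunk_size are
hypothetical names, and chunk_size stands in for the setting a UI
would have to expose.

```python
def split_fasta(lines, chunk_size=1000):
    """Yield lists of lines, each holding at most `chunk_size` records.

    Hypothetical helper, not Galaxy's splitter. A record starts at a
    line beginning with ">" and runs until the next such line.
    """
    chunk, records = [], 0
    for line in lines:
        if line.startswith(">"):
            if records == chunk_size:
                # Current chunk is full: emit it and start a new one.
                yield chunk
                chunk, records = [], 0
            records += 1
        chunk.append(line)
    if chunk:
        # Emit the final, possibly shorter, chunk.
        yield chunk
```

Each yielded chunk could then be written out as one element of a
collection, e.g. `for i, chunk in enumerate(split_fasta(open(path)))`.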

Peter

Re: Plans for workflow & parallelisation work?

John Chilton
In reply to this post by Peter van Heusden
Peter -

My plans for pre-GCC workflow work are sort of outlined in this issue:
https://github.com/galaxyproject/planemo/issues/408 (I want an
abstract for GCC and BOSC like "Planemo – A Scientific Workflow SDK").

I've been doing most of my work out of this branch:
https://github.com/galaxyproject/galaxy/compare/dev...common-workflow-language:cwl.
It has my work in progress on CWL support, on collection operations
(rejected once from Galaxy here
https://github.com/galaxyproject/galaxy/pull/1313, but these are so
important that I'm going to take another stab at pushing them into
Galaxy), and on expression tools to produce values that will hopefully
tie back into workflows as connections for non-data parameters - both
as Galaxy native entities and CWL based entities.

There have been some completely valid complaints about the background
workflow scheduling being slow and buggy. These will need to be fixed
by 16.04, since all workflows will be executed this way as of then. I
hope also to take another pass at subworkflows - better tracking of
sources, allowing upgrading subworkflow steps, fixing glaring bugs
like https://github.com/galaxyproject/galaxy/issues/1739.

Peter C. mentioned splitting and joining files into/from collections
in workflows based on the datatype methods (so hooking into
parallelism) - I have some initial WIP on this here
https://github.com/jmchilton/galaxy/commit/c4d93acdb3b0f89b970b7c3d17b965be8ab3ba30
as part of this branch
https://github.com/jmchilton/galaxy/tree/split_merge_collections. I
spent a couple hours on it - I think if I spent a day or two on it I'd
have a usable prototype to hack on - I don't remember thinking there
were any big hurdles I was encountering in doing that. (So the answer
to your last question is a definitive yes.)

Sam started a bunch of work on completely replacing the workflow form
with an API driven one here
https://github.com/galaxyproject/galaxy/pull/1249. I know he hopes to
have that done in 16.04 - it will allow us to delete a bunch of paths
through the workflow code and should allow future developments to be
made more rapidly. It will also ensure everything is coming through
the API - which means Galaxy's test coverage of workflow stuff will be
much higher (given our depth of workflow API tests).

I'm happy to have a hangout to discuss this more, I consider the
planemo issue something of a roadmap for what I want to work on in the
first half of 2016 - but I might get pulled away or told the project
has other priorities.

As for scheduling workflows instead of jobs - this is intriguing and
really would probably be needed to get streaming working well in
Galaxy. So I would say - I want to work on it someday - but I probably
won't get to it in 2016. If others want to hack on it, that is
fantastic, but it is also a difficult feat, at least for scheduling
out and optimizing pieces of the workflow. (Kyle Ellrott, Dannon, and
I had some interesting ideas about scheduling whole workflows on local
Galaxy instances running on a cluster and just collecting the outputs
- that would be significantly more doable, given I sort of sculpted
the changes made to backgrounding workflows to preserve things for
doing that - though the work left is probably still a hard task.)

Hope this helps.

-John
