Enable all-vs-all collection analysis patterns. #17366

jmchilton · 2024-01-26T17:35:21Z

Galaxy matches corresponding datasets when multiple collections are used to map over a tool - this is a variation of a dot product pattern when mapping over collections. This can easily be adapted to perform all-vs-all mapping if we first produce two new input collections containing the Cartesian product (in math terms) or Cross Join (in SQL terms) of the inputs, where every combination is lined up in corresponding elements of the first and second list.

These are available in a list(n) x list(m) -> list(nxm) version (a Cartesian product that produces two flat lists) and a list(n) x list(m) -> list(n):list(m) version (a cross product that produces two nested lists).
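
To make the two shapes concrete, here is a minimal Python sketch of what the flat and nested variants produce. It only models collections as plain Python lists of element names; `flat_product` and `nested_product` are hypothetical helper names for illustration, not the tools added in this PR, and the element ordering shown is an assumption.

```python
from itertools import product


def flat_product(list_a, list_b):
    """Return two flat lists of length n*m whose elements line up pairwise."""
    pairs = list(product(list_a, list_b))
    out_a = [a for a, _ in pairs]
    out_b = [b for _, b in pairs]
    return out_a, out_b


def nested_product(list_a, list_b):
    """Return two nested lists (n outer, m inner) whose elements line up pairwise."""
    out_a = [[a for _ in list_b] for a in list_a]
    out_b = [[b for b in list_b] for _ in list_a]
    return out_a, out_b


# Hypothetical element identifiers, n = 2 and m = 3.
flat_a, flat_b = flat_product(["a1", "a2"], ["b1", "b2", "b3"])
nested_a, nested_b = nested_product(["a1", "a2"], ["b1", "b2", "b3"])
```

In both cases position `i` (or `[i][j]`) of the first output pairs with the same position of the second output, which is what lets ordinary element-wise mapping cover every combination.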

There are some cool pictures put together by Seven Bridges that demonstrate the CWL variants of these concepts. The middle part semantics are different but the inputs and resulting structures are the same:

Nested:

Flat:

After two lists have been run through one of these two tools - the result is two new lists that can be passed into another tool to perform all-against-all operations using Galaxy's normal collection mapping semantics. There is no extra work or thought needed here - it really is as simple as running the respective collections through one of these tools and then passing each output to the corresponding input of the next tool; the result will be an all-vs-all operation.
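
As a rough illustration of why this works, the sketch below mimics element-wise (dot product) mapping with `zip` and applies it to two pre-crossed lists. The names `map_over` and `compare` and the element identifiers are hypothetical and stand in for Galaxy's mapping machinery and a downstream two-input tool; this is not Galaxy code.

```python
from itertools import product


def map_over(tool, collection_a, collection_b):
    """Mimic dot-product mapping: pair element i of one list with element i of the other."""
    if len(collection_a) != len(collection_b):
        raise ValueError("mapped collections must have matching structure")
    return [tool(a, b) for a, b in zip(collection_a, collection_b)]


def compare(a, b):
    """Stand-in for a two-input tool; it only records which inputs were paired."""
    return f"{a}-vs-{b}"


queries = ["q1", "q2"]
targets = ["t1", "t2", "t3"]

# Step 1: build the two lined-up lists (the flat variant).
pairs = list(product(queries, targets))
crossed_queries = [q for q, _ in pairs]
crossed_targets = [t for _, t in pairs]

# Step 2: ordinary element-wise mapping now visits every combination once.
results = map_over(compare, crossed_queries, crossed_targets)
```

The mapping step itself is unchanged; all of the all-vs-all behavior comes from how the two input lists were constructed.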

The choice of which tool to use will depend on how you want to continue to process the all-against-all results after the next step in an analysis. My sense is the flat version is "easier" to think about and pick through manually and the nested version preserves more structure if additional collection operation tools will be used to filter or aggregate the results.

Some considerations:

Naming?

I have been calling them cross products because that is what CWL calls them - but they are Cartesian products not cross products. I guess SQL uses the terminology "Cross Join" which makes sense. I think part of the confusion is the related terminology and that mathematically they both are often represented with a big "X" symbol - but mathematically this operation is definitely not a cross product 😢.

I've called the tools cross products in this version cut at the PR but I think we should abandon the CWL naming and come up with more exact terminology.

Apply Rules?

I do not believe the Apply Rules tool semantics would allow these operations but certainly the Apply Rules tool could be used to convert the result of the flat version to the nested version or vice versa - so no metadata is really lost per se between the two versions. I think it is still worth including both versions though - they both have utility (both for instance are baked into CWL's workflow semantics - https://docs.sevenbridges.com/docs/about-parallelizing-tool-executions#nested-cross-product) and avoiding requiring complex Apply Rules programs for simple workflows is probably ideal.
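
The claim that the flat and nested results carry the same information can be sketched in plain Python. This is not Apply Rules syntax; it assumes, for illustration only, that each flat element can be traced back to a pair of source identifiers, modeled here as an `(outer_id, inner_id)` tuple. The function names and identifiers are hypothetical.

```python
def flat_to_nested(flat_elements):
    """Group flat ((outer_id, inner_id), value) entries into a nested mapping."""
    nested = {}
    for (outer_id, inner_id), value in flat_elements:
        nested.setdefault(outer_id, {})[inner_id] = value
    return nested


def nested_to_flat(nested):
    """Flatten a nested mapping back into ((outer_id, inner_id), value) entries."""
    return [
        ((outer_id, inner_id), value)
        for outer_id, inner in nested.items()
        for inner_id, value in inner.items()
    ]


# Hypothetical flat all-vs-all result for 2 x 2 inputs.
flat = [
    (("a1", "b1"), "r11"),
    (("a1", "b2"), "r12"),
    (("a2", "b1"), "r21"),
    (("a2", "b2"), "r22"),
]
nested = flat_to_nested(flat)
round_trip = nested_to_flat(nested)
```

Because the round trip returns the original entries, either structure can be rebuilt from the other as long as the source identifiers remain recoverable.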

One Tool vs Two?

Marius and I agree that a few simpler tools for these kinds of operations are better. The tool help can be more focused, and avoiding the conditional and conditional outputs makes the static analysis done, for instance, by the workflow editor simpler.

Editor/Tool Options vs Collection Operation Options

TODO: we've not generally gone in this direction and probably should steer clear.

How to test the changes?

(Select all options that apply)

I've included appropriate automated tests.

License

I agree to license these and all my past contributions to the core galaxy codebase under the MIT license.

mvdbeek · 2024-02-26T15:36:00Z

I gave this a try and it looks nice, I'd be in favor of merging this. https://github.com/galaxyproject/galaxy/compare/dev...mvdbeek:galaxy:cross_product?expand=1 contains a test and fixes for running database operation tools within conditional steps, can you pull this in if it looks good to you @jmchilton ?

mvdbeek · 2024-04-11T11:46:10Z

@jmchilton could you rebase and take it out of draft if there's nothing more to add ?


jmchilton · 2024-04-15T16:02:52Z

The tools have no help - but if you're willing to merge without that I'm willing to pull it out of draft 😅.

mvdbeek · 2024-04-15T16:20:14Z

Ah, a minimal help text would probably be a good idea 😆, while we figure out something graphical for the output section ?

jmchilton added kind/enhancement kind/feature area/dataset-collections labels Jan 26, 2024

jmchilton force-pushed the cross_product branch from eb8909b to 749462b Compare February 28, 2024 17:38

jmchilton force-pushed the cross_product branch from 749462b to 1702688 Compare March 25, 2024 18:58

jmchilton and others added 3 commits April 15, 2024 12:00

Add API test that fails because outputs aren't skipped

bbfc604

Mark outputs skipped if necessary

8807db4

jmchilton force-pushed the cross_product branch from 1702688 to 8807db4 Compare April 15, 2024 16:02

jmchilton marked this pull request as ready for review April 15, 2024 16:02

github-actions bot added this to the 24.1 milestone Apr 15, 2024

mvdbeek merged commit 0448539 into galaxyproject:dev May 14, 2024
54 checks passed

jdavcs added the highlight Included in user-facing release notes at the top label May 24, 2024
