Enable all-vs-all collection analysis patterns. #17366

Merged: 3 commits, May 14, 2024


jmchilton
Member

Galaxy matches corresponding datasets when multiple collections are used to map over a tool - this is a variation of a dot product pattern when mapping over collections. This can easily be adapted to perform all-vs-all mapping if we first produce two new input collections containing the Cartesian product (in math terms) or Cross Join (in SQL terms) of the inputs, so that every combination is lined up at corresponding positions in the first and second list.

These are available in a list(n) x list(m) -> list(nxm) version (a Cartesian product that produces two flat lists) and a list(n) x list(m) -> list(n):list(m) version (a Cartesian product that produces two nested lists).
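
As a rough illustration of the two output shapes, here is a minimal sketch in plain Python. It is not the Galaxy tool implementation: the function names `cross_product_flat` and `cross_product_nested` are hypothetical, and ordinary Python lists stand in for dataset collections.

```python
from itertools import product


def cross_product_flat(a, b):
    """list(n) x list(m) -> two flat lists of length n*m, aligned by position."""
    pairs = list(product(a, b))
    out_a = [x for x, _ in pairs]
    out_b = [y for _, y in pairs]
    return out_a, out_b


def cross_product_nested(a, b):
    """list(n) x list(m) -> two list(n):list(m) structures, aligned by position."""
    # Outer level follows `a`, inner level follows `b`.
    out_a = [[x for _ in b] for x in a]
    out_b = [[y for y in b] for _ in a]
    return out_a, out_b


flat_a, flat_b = cross_product_flat(["a1", "a2"], ["b1", "b2", "b3"])
nested_a, nested_b = cross_product_nested(["a1", "a2"], ["b1", "b2", "b3"])
```

In both variants the element at a given position in the first output pairs with the element at the same position in the second output, which is what lets ordinary matched mapping cover every combination.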

There are some cool pictures put together by Seven Bridges that demonstrate the CWL variants of these concepts. The middle part semantics are different but the inputs and resulting structures are the same:

Nested:

Flat:

After two lists have been run through one of these two tools - the result is two new lists that can be passed into another tool to perform all-against-all operations using Galaxy's normal collection mapping semantics. There is no extra work or thought needed here - it really is as simple as running the respective collections through one of these tools and then passing the output corresponding to the input to the next tool and the result will be an all-vs-all operation.
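
To make the "no extra work" claim concrete, the following sketch assumes a hypothetical pairwise tool `compare` and a helper `map_over` that mimics matched (dot product) mapping with `zip`. Neither name comes from Galaxy; they only model the behavior described above.

```python
from itertools import product


def map_over(tool, xs, ys):
    """Dot product mapping: run `tool` on corresponding elements of two lists."""
    if len(xs) != len(ys):
        raise ValueError("matched mapping needs lists of equal length")
    return [tool(x, y) for x, y in zip(xs, ys)]


def compare(x, y):
    """Hypothetical pairwise tool; stands in for any two-input Galaxy tool."""
    return f"{x}-vs-{y}"


a = ["a1", "a2"]
b = ["b1", "b2"]

# Flat Cartesian product step: two aligned lists of length len(a) * len(b).
pairs = list(product(a, b))
flat_a = [x for x, _ in pairs]
flat_b = [y for _, y in pairs]

# Ordinary matched mapping over the two outputs is now all-vs-all.
results = map_over(compare, flat_a, flat_b)
```

Mapping `compare` directly over `a` and `b` would give only the two matched pairs; mapping over `flat_a` and `flat_b` gives all four combinations.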

The choice of which tool to use will depend on how you want to continue to process the all-against-all results after the next step in an analysis. My sense is the flat version is "easier" to think about and pick through manually and the nested version preserves more structure if additional collection operation tools will be used to filter or aggregate the results.

Some considerations:

Naming?

I have been calling them cross products because that is what CWL calls them - but they are Cartesian products not cross products. I guess SQL uses the terminology "Cross Join" which makes sense. I think part of the confusion is the related terminology and that mathematically they both are often represented with a big "X" symbol - but mathematically this operation is definitely not a cross product 😢.

I've called the tools cross products in this version cut at the PR but I think we should abandon the CWL naming and come up with more exact terminology.

Apply Rules?

I do not believe the Apply Rules tool semantics would allow these operations but certainly the Apply Rules tool could be used to convert the result of the flat version to the nested version or vice versa - so no metadata is really lost per se between the two versions. I think it is still worth including both versions though - they both have utility (both for instance are baked into CWL's workflow semantics - https://docs.sevenbridges.com/docs/about-parallelizing-tool-executions#nested-cross-product) and avoiding requiring complex Apply Rules programs for simple workflows is probably ideal.
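
The reason nothing is lost between the two versions can be sketched in plain Python. This is not Apply Rules syntax: `flat_to_nested` and `nested_to_flat` are hypothetical helpers, and the sketch assumes the flat result is ordered with the second list varying fastest and that the inner length `m` is known.

```python
def flat_to_nested(flat, m):
    """Regroup a flat list of n*m items into n inner lists of m items each."""
    if m <= 0 or len(flat) % m != 0:
        raise ValueError("flat length must be a positive multiple of m")
    return [flat[i:i + m] for i in range(0, len(flat), m)]


def nested_to_flat(nested):
    """Flatten a list of lists back into one list, preserving order."""
    return [item for inner in nested for item in inner]


flat = ["a1-b1", "a1-b2", "a1-b3", "a2-b1", "a2-b2", "a2-b3"]
nested = flat_to_nested(flat, m=3)
round_trip = nested_to_flat(nested)
```

Because the round trip reproduces the original flat list, either shape can be recovered from the other; offering both tools simply saves users from writing that regrouping step themselves.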

One Tool vs Two?

Marius and I agree that a few simpler tools for these kinds of operations are better. The tool help can be more focused, and avoiding the conditional and conditional outputs makes the static analysis done, for instance, by the workflow editor simpler.

Editor/Tool Options vs Collection Operation Options

TODO: we've not generally gone in this direction and probably should steer clear.

How to test the changes?

(Select all options that apply)

License

  • I agree to license these and all my past contributions to the core galaxy codebase under the MIT license.

@mvdbeek
Member

mvdbeek commented Feb 26, 2024

I gave this a try and it looks nice, I'd be in favor of merging this. https://github.com/galaxyproject/galaxy/compare/dev...mvdbeek:galaxy:cross_product?expand=1 contains a test and fixes for running database operation tools within conditional steps, can you pull this in if it looks good to you @jmchilton ?

@mvdbeek
Member

mvdbeek commented Apr 11, 2024

@jmchilton could you rebase and take it out of draft if there's nothing more to add ?

jmchilton and others added 3 commits April 15, 2024 12:00
These are available in a list(n) x list(m) -> list(nxm) version (a cross product that produces two flat lists) and a list(n) x list(m) -> list(n):list(m) version (a cross product that produces two nested lists).

After two lists have been run through one of these two tools - the result is two new lists that can be passed into another tool to perform all-against-all operations using Galaxy's normal collection mapping semantics.

The choice of which to use will depend on how you want to continue to process the all-against-all results after the next step in an analysis. My sense is the flat version is "easier" to think about and pick through manually and the nested version preserves more structure if additional collection operation tools will be used to filter or aggregate the results.

Some considerations:

Apply Rules?

I do not believe the Apply Rules tool semantics would allow these operations but certainly the Apply Rules tool could be used to convert the result of the flat version to the nested version or vice versa - so no metadata is really lost per se between the two versions. I think it is still worth including both versions though - they both have utility (both for instance are baked into CWL's workflow semantics - https://docs.sevenbridges.com/docs/about-parallelizing-tool-executions#nested-cross-product) and avoiding requiring complex Apply Rules programs for simple workflows is probably ideal.

One Tool vs Two?

Marius and I agree that a few simpler tools for these kinds of operations are better. The tool help can be more focused, and avoiding the conditional and conditional outputs makes the static analysis done, for instance, by the workflow editor simpler.
@jmchilton jmchilton marked this pull request as ready for review April 15, 2024 16:02
@jmchilton
Member Author

The tools have no help - but if you're willing to merge without that I'm willing to pull it out of draft 😅.

@github-actions github-actions bot added this to the 24.1 milestone Apr 15, 2024
@mvdbeek
Member

mvdbeek commented Apr 15, 2024

Ah, a minimal help text would probably be a good idea 😆, while we figure out something graphical for the output section ?

@mvdbeek mvdbeek merged commit 0448539 into galaxyproject:dev May 14, 2024
54 checks passed
@jdavcs jdavcs added the highlight Included in user-facing release notes at the top label May 24, 2024
Labels
area/dataset-collections highlight Included in user-facing release notes at the top kind/enhancement kind/feature