Enable all-vs-all collection analysis patterns. #17366
I gave this a try and it looks nice; I'd be in favor of merging this. https://github.com/galaxyproject/galaxy/compare/dev...mvdbeek:galaxy:cross_product?expand=1 contains a test and fixes for running database operation tools within conditional steps. Can you pull this in if it looks good to you, @jmchilton?
@jmchilton could you rebase and take it out of draft if there's nothing more to add?
The tools have no help - but if you're willing to merge without that I'm willing to pull it out of draft 😅.
Ah, a minimal help text would probably be a good idea 😆, while we figure out something graphical for the output section?
Galaxy matches corresponding datasets when multiple collections are used to map over a tool - this is a variation of a dot product pattern when mapping over collections. This can easily be adapted to perform all-vs-all mapping if we first produce two new input collections containing the Cartesian product (in math terms) or Cross Join (in SQL terms) of the inputs, where every combination is lined up at corresponding positions in the first and second list.
These are available in a list(n) x list(m) -> list(nxm) version (a Cartesian product that produces two flat lists) and a list(n) x list(m) -> list(n):list(m) version (a Cartesian product that produces two nested lists).
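To make the two output shapes concrete, here is a minimal sketch that uses plain Python lists as stand-ins for Galaxy list collections. The function names and the element ordering (first list varying slowest) are assumptions for illustration only, not the tools' actual implementation, and element identifiers are ignored.

```python
from itertools import product


def flat_cross_product(a, b):
    """list(n) x list(m) -> two flat lists of length n*m, aligned by position.

    Hypothetical helper: the ordering here is an assumption.
    """
    pairs = list(product(a, b))
    return [x for x, _ in pairs], [y for _, y in pairs]


def nested_cross_product(a, b):
    """list(n) x list(m) -> two nested lists shaped list(n):list(m)."""
    out_a = [[x for _ in b] for x in a]  # each outer element repeats x, m times
    out_b = [list(b) for _ in a]         # each outer element is a copy of b
    return out_a, out_b


flat_a, flat_b = flat_cross_product(["a1", "a2"], ["b1", "b2", "b3"])
nested_a, nested_b = nested_cross_product(["a1", "a2"], ["b1", "b2", "b3"])
```

In both versions the two outputs have identical structure, so the element at any given position in the first output pairs with the element at the same position in the second.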
There are some cool pictures put together by Seven Bridges that demonstrate the CWL variants of these concepts. The middle part semantics are different but the inputs and resulting structures are the same:
Nested:
Flat:
After two lists have been run through one of these two tools, the result is two new lists that can be passed into another tool to perform all-against-all operations using Galaxy's normal collection mapping semantics. No extra work or thought is needed here: run the two collections through one of these tools, pass each output to the corresponding input of the next tool, and the result will be an all-vs-all operation.
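As a rough illustration of why ordinary element-wise mapping then yields all-vs-all results, the sketch below pairs elements by position, the way mapping over two matched collections does. `map_over` and `compare` are hypothetical stand-ins for Galaxy's mapping and for a two-input tool; plain lists stand in for collections.

```python
from itertools import product


def map_over(tool, left, right):
    """Element-wise ("dot product") mapping: pair elements at the same position."""
    if len(left) != len(right):
        raise ValueError("collections must have matching structure")
    return [tool(x, y) for x, y in zip(left, right)]


def compare(x, y):
    """Hypothetical two-input tool."""
    return f"{x}-vs-{y}"


samples = ["s1", "s2"]
references = ["r1", "r2", "r3"]

# Build the two aligned flat lists (the flat product), then map as usual.
pairs = list(product(samples, references))
left = [x for x, _ in pairs]
right = [y for _, y in pairs]

results = map_over(compare, left, right)
```

The mapping step itself is unchanged; only the inputs were expanded so that every combination appears at some shared position.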
The choice of which tool to use will depend on how you want to continue to process the all-against-all results after the next step in an analysis. My sense is the flat version is "easier" to think about and pick through manually and the nested version preserves more structure if additional collection operation tools will be used to filter or aggregate the results.
Some considerations:
Naming?
I have been calling them cross products because that is what CWL calls them - but they are Cartesian products, not cross products. I guess SQL uses the terminology "Cross Join", which makes sense. I think part of the confusion is the related terminology and the fact that mathematically both are often represented with a big "X" symbol - but mathematically this operation is definitely not a cross product 😢.
I've called the tools cross products in this version of the PR, but I think we should abandon the CWL naming and come up with more exact terminology.
Apply Rules?
I do not believe the Apply Rules tool semantics would allow these operations, but certainly the Apply Rules tool could be used to convert the result of the flat version to the nested version or vice versa - so no metadata is really lost per se between the two versions. I think it is still worth including both versions though - they both have utility (both, for instance, are baked into CWL's workflow semantics - https://docs.sevenbridges.com/docs/about-parallelizing-tool-executions#nested-cross-product) and avoiding requiring complex Apply Rules programs for simple workflows is probably ideal.
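The claim that nothing is lost between the two versions can be sketched with plain lists: regrouping a flat result into chunks of length m gives the nested shape, and flattening reverses it. This is not Apply Rules syntax; the helper names are hypothetical and the chunking assumes the flat list is ordered with the first input varying slowest.

```python
def flat_to_nested(flat, m):
    """Regroup a flat list of length n*m into n inner lists of length m."""
    if m <= 0 or len(flat) % m != 0:
        raise ValueError("flat length must be a positive multiple of m")
    return [flat[i:i + m] for i in range(0, len(flat), m)]


def nested_to_flat(nested):
    """Flatten list(n):list(m) back into a single list of length n*m."""
    return [item for inner in nested for item in inner]


flat = ["a1", "a1", "a1", "a2", "a2", "a2"]
nested = flat_to_nested(flat, 3)
round_trip = nested_to_flat(nested)
```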
One Tool vs Two?
Marius and I agree that a few simpler tools for these kinds of operations are better. The tool help can be more focused, and avoiding the conditional and conditional outputs makes the static analysis done, for instance, by the workflow editor simpler.
Editor/Tool Options vs Collection Operation Options
TODO: we've not generally gone in this direction and probably should steer clear.