-
Notifications
You must be signed in to change notification settings - Fork 241
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Move file virtualization in toil-wdl-runner to task boundaries #5028
base: master
Are you sure you want to change the base?
Move file virtualization in toil-wdl-runner to task boundaries #5028
Conversation
…luation. Also gets rid of monkeypatching in favor of a manual function call
…tartup and carry through mappings
…04-wdl-virtualize-only-at-task-boundaries
…zation in TaskWrapper for MiniWDL parity
…alid coerced-to-null files and raise if exception found
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm concerned that by leaving relative URLs from the input JSON as relative URLs in WDL File
s and then trying to figure out where they were meant to be relative to later, we've introduced a whole universe of cross-talk that we don't want to deal with. Maybe we can poll for where they ought to come from and turn them into absolute URLs before handing them to the workflow?
I like that we're ripping out the coercion monkey-patch and also nonexistent:
, which simplifies things. But are we sure that we retain support for using Toil's supported URI schemes when MiniWDL reads a file? File the_file = "gs://google-bucket-name/filename.txt"
ought to work if Toil has Google Storage support installed, even if MiniWDL doesn't know about reading Google Storage URLs. I think the coercion hook was doing that for us and I can't really tell if something is taking its place. If we're losing that feature and we think it is worth it, we should know we're losing it.
I'd also like to see a new comment at the top describing how the system works now, and what invariants or abstractions it imposes that new code is going to need to follow to keep it working. File
holds a URL or leader-local path at the workflow level. When we actually read it with read_lines()
at the workflow level, does MiniWDL fetch it? If we read_lines()
it twice, do we have a cache? If we load it into the file store when it enters a task, and it is passed out of the task unmodified, do we see the original URL in the task output (because everything at workflow level is always an original URL) or the imported file store one?
Is talking about "virtualized" (visible to WDL code) and "devirtualized" (visible to Python open()
) file names still the right way to understand what the code is doing? Or should we be using different terms or concepts to think about the packing and unpacking that happens when taking Bindings
in and out of tasks?
I also like the idea of wdl_options
to hold the global settings for the run, kind of like the CWL RuntimeContext
. But I think it needs a type so we can have a better handle on what's allowed to be in there.
src/toil/wdl/wdltoil.py
Outdated
if isinstance(value, WDL.Value.File): | ||
pass | ||
elif isinstance(value, WDL.Value.Array) and isinstance(expected_type, WDL.Type.Array): | ||
for elem, orig_elem in zip(value.value, original_value.value): | ||
map_over_files_in_value_check_null_type(elem, orig_elem, expected_type.item_type) | ||
elif isinstance(value, WDL.Value.Map) and isinstance(expected_type, WDL.Type.Map): | ||
for pair, orig_pair in zip(value.value, original_value.value): | ||
# The key of the map cannot be optional or else it is not serializable, so we only need to check the value | ||
map_over_files_in_value_check_null_type(pair[1], orig_pair[1], expected_type.item_type[1]) | ||
elif isinstance(value, WDL.Value.Pair) and isinstance(expected_type, WDL.Type.Pair): | ||
map_over_files_in_value_check_null_type(value.value[0], original_value.value[0], expected_type.left_type) | ||
map_over_files_in_value_check_null_type(value.value[1], original_value.value[1], expected_type.right_type) | ||
elif isinstance(value, WDL.Value.Struct) and isinstance(expected_type, WDL.Type.StructInstance): | ||
for (k, v), (_, orig_v) in zip(value.value.items(), original_value.value.items()): | ||
# The parameters method for WDL.Type.StructInstance returns the values rather than the dictionary | ||
# While dictionaries are ordered, this should be more robust; the else branch should never be hit | ||
if expected_type.members is not None: | ||
map_over_files_in_value_check_null_type(v, orig_v, expected_type.members[k]) | ||
elif isinstance(value, WDL.Value.Null): | ||
if not expected_type.optional: | ||
raise FileNotFoundError(errno.ENOENT, os.strerror(errno.ENOENT), original_value.value) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having one version of this traversal logic is what the map function was supposed to be for. But I guess here we're traversing two mirrored structures in parallel so we can't use the other code.
@stxue1 Another thing I thought of is, if we need to make We also might need to revise the batch system documentation, and maybe the inheritance hierarchy, since you would need all batch systems, even those coming from plugins, to always have the local job logic. |
We think this will also fix #4992. |
…04-wdl-virtualize-only-at-task-boundaries
…github.com:DataBiosphere/toil into issues/5004-wdl-virtualize-only-at-task-boundaries
…and permanent devirtualization of URLs
… stdlibs and combine_bindings issue with new virtualization representation
We seem to have a commandline options for CWL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the WDL options setup is not designed quite right, and it also looks like we're able to import but then forget about whole copies of files.
src/toil/wdl/wdltoil.py
Outdated
# execution_dir: Directory to use as the working directory for workflow code. | ||
# enforce_existence: If true, then if a file is detected as nonexistent, raise an error. Else, let it pass through | ||
# share_files_with: If set to an existing standard library instance, use the same file upload and download paths as it. | ||
WDL_OPTIONS = TypedDict('WDL_OPTIONS', {"execution_dir": NotRequired[str], "container": NotRequired[str], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be WDLOptions
since it is a type name, not a constant.
src/toil/wdl/wdltoil.py
Outdated
@@ -443,7 +460,17 @@ async def toil_read_source(uri: str, path: List[str], importer: Optional[WDL.Tre | |||
raise FileNotFoundError(errno.ENOENT, os.strerror(errno.ENOENT), uri) | |||
|
|||
|
|||
|
|||
def virtualized_equal(file1: WDL.Value.Base, file2: WDL.Value.Base) -> bool: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
file1
and file2
aren't necessarily intended to themselves be File
s, right? They are meant to be any WDL value that maybe contains File
s. So probably the variable names should be something that reflects that.
command_wdl_options: WDL_OPTIONS = {"task_path": task_path} | ||
if workdir_in_container is not None: | ||
command_wdl_options["execution_dir"] = workdir_in_container | ||
command_library = ToilWDLStdLibTaskCommand(file_store, task_container, wdl_options=command_wdl_options) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here also we're not matching the pattern of the WDL options being a sort of global or at least mostly-global context.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed it so the command_wdl_options
will modify a copy of the task wdl options
Co-authored-by: Adam Novak <anovak@soe.ucsc.edu>
Co-authored-by: Adam Novak <anovak@soe.ucsc.edu>
…github.com:DataBiosphere/toil into issues/5004-wdl-virtualize-only-at-task-boundaries
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are still a couple things I would change, but I think this is fine.
# WDL options to pass into the WDL jobs and standard libraries | ||
# task_path: Dotted WDL name of the part of the workflow this library is working for. | ||
# execution_dir: Directory to use as the working directory for workflow code. | ||
# container: The type of container to use when executing a WDL task. Carries through the value of the commandline --container option | ||
WDL_Context = TypedDict('WDL_Context', {"execution_dir": NotRequired[str], "container": NotRequired[str], | ||
"task_path": str, "namespace": str}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice to have docs for namespace.
src/toil/wdl/wdltoil.py
Outdated
|
||
def virtualized_equal(value1: WDL.Value.Base, value2: WDL.Value.Base) -> bool: | ||
""" | ||
Check if two WDL values are equal when taking account file virtualization. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Check if two WDL values are equal when taking account file virtualization. | |
Check if two WDL values are equal when taking into account file virtualization. |
src/toil/wdl/wdltoil.py
Outdated
self.container = container | ||
|
||
# Revert the _read and _write functions to the parent WDL.StdLib.Base implementation | ||
# This is because the task command standard library is used in MiniWDL's internals when executing a task | ||
# which we don't have much control over (miniwdl will create it's own file objects that represent files within the container) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# which we don't have much control over (miniwdl will create it's own file objects that represent files within the container) | |
# which we don't have much control over (miniwdl will create its own file objects that represent files within the container) |
# task_path: Dotted WDL name of the part of the workflow this library is working for. | ||
# execution_dir: Directory to use as the working directory for workflow code. | ||
# container: The type of container to use when executing a WDL task. Carries through the value of the commandline --container option | ||
WDL_Context = TypedDict('WDL_Context', {"execution_dir": NotRequired[str], "container": NotRequired[str], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wouldn't use an _
in the name here; it's a different convention than all the other types.
…y-at-task-boundaries' into issues/5004-wdl-virtualize-only-at-task-boundaries
@stxue1 It looks like conformance test 27 ("Legacy test for type_pair_with_files") is failing because it is trying to devirtualize a file to the same path twice. I think it can't deal with how that test sends the same path twice in the input JSON, so there are two |
This should solve #5004
This also reverts #4994 and introduces another fix for #4988; there should be behavioral parity with miniwdl
This should also resolve #5031
This is a bit of an overhaul of how we handle files in toil-wdl-runner; instead of immediately virtualizing upon seeing a
File
type (coerced or not), the WDL value is kept as a path until the last second. Only before tasks are sent off, the files will be virtualized. This way, the WDL side functionality should still be the same while allowing for File to String coercion as we no longer replace all File paths with our virtualized representation immediately.Monkeypatching is removed as we no longer virtualize at the coercion step, but before/after function calls and manually at task boundaries.
I don't think this will increase the amount of IO. One edge case is if the user changes the filename right before being put into a function; toil will virtualize the new filename for that function call and keep both the original and new virtualized file around in the jobstore until the job is completed. It may be worth removing the new virtualized file from the jobstore after use, but I'm not sure where to hook in the behavior.
The downside is that all jobs that are marked as local must always be local in the future. I explicitly set the flag in the class constructors where I could.I think the virtualization behavior should ensure local jobs on workers should work.The function
evaluate_bindings_from_decls
is now used for all decl parsing/evaluation to reduce duplicative code.Changelog Entry
To be copied to the draft changelog by merger:
Reviewer Checklist
issues/XXXX-fix-the-thing
in the Toil repo, or from an external repo.camelCase
that want to be insnake_case
.docs/running/{cliOptions,cwl,wdl}.rst
Merger Checklist