From 794f96be6c2d426b684c319d9a73051143834f99 Mon Sep 17 00:00:00 2001 From: "Michael R. Crusoe" <1330696+mr-c@users.noreply.github.com> Date: Mon, 17 Aug 2020 15:49:53 +0200 Subject: [PATCH] initial import of WPI-{data,control}+CWL analysis (#2) * initial import of WPI-data+CWL analysis * add interstital links * evaluate CWL v1.2 against the WPI control patterns --- README.md | 2 +- workflow_patterns_initiative/README.md | 18 ++ .../control/README.md | 254 ++++++++++++++++++ workflow_patterns_initiative/data/README.md | 138 ++++++++++ 4 files changed, 411 insertions(+), 1 deletion(-) create mode 100644 workflow_patterns_initiative/README.md create mode 100644 workflow_patterns_initiative/control/README.md create mode 100644 workflow_patterns_initiative/data/README.md diff --git a/README.md b/README.md index 422e143..633bc84 100644 --- a/README.md +++ b/README.md @@ -13,7 +13,7 @@ _Some patterns commonly encountered when writing CWL workflows_ - [Manipulating a list of files using expressions](#manipulating-a-list-of-files-using-expressions) - [Link input files to working directory](#link-input-files-to-working-directory) - [How to handle port type mismatches](#how-to-handle-port-type-mismatches) - +- [Which of the Workflow Patterns Initiative patterns does CWL support?](workflow_patterns_initiative/README.md) ## Manifest file via Javascript diff --git a/workflow_patterns_initiative/README.md b/workflow_patterns_initiative/README.md new file mode 100644 index 0000000..680d572 --- /dev/null +++ b/workflow_patterns_initiative/README.md @@ -0,0 +1,18 @@ +http://www.workflowpatterns.com/ + +Each sub-folder analyzes the features of CWL against a category of patterns found in the Workflow Patterns Initiative's website. 
+ +- [Control](control/README.md) patterns +- Resource patterns +- [Data](data/README.md) patterns +- Exception Handling patterns +- Presentation patterns +- Event Log Imperfection patterns + + +Note: the WPI's definition of a workflow is much broader than the type of workflows CWL aims to describe. + +CWL targets workflows made of command line tools, with no real-time communication between steps or with outside services. +WPI's definition of workflows includes (business) process modeling, which has many control-flow features. + +Therefore these are not to be treated as "to do" lists for CWL, but as a way to describe and demonstrate CWL's features and abilities in the larger workflow space. diff --git a/workflow_patterns_initiative/control/README.md b/workflow_patterns_initiative/control/README.md new file mode 100644 index 0000000..44f40a9 --- /dev/null +++ b/workflow_patterns_initiative/control/README.md @@ -0,0 +1,254 @@ + +# WPI's [Workflow Control Patterns](http://www.workflowpatterns.com/patterns/control/) and CWL + +Note: CWL is a data-driven (dataflow) workflow standard. These patterns use control-flow language and thus we speak of the equivalent data-flow constructs. +There is no concept of "thread of control" in CWL. + +Of the 43 WPI Workflow Control patterns, 8 patterns are supported by CWL v1.2 or earlier and 35 patterns are unsupported. + +[Back to the list of WPI pattern categories](../README.md) + +## "Basic Control Flow Patterns" + +* Pattern 1 (Sequence) + +Yes. CWL is a DAG-based workflow language with explicit dependencies between steps. + +CWL versions: all + +* Pattern 2 (Parallel Split) + +Yes. Any time a step in a CWL workflow has all of its inputs available, it is allowed to be executed. + +CWL versions: all + +* Pattern 3 (Synchronization) + +Yes. One can have a step in a CWL workflow that requires inputs from the result of multiple other steps. 
+Downstream steps that require the output of that step (and their descendants) will not be available for execution until the combination step has finished. + +CWL versions: all + +* Pattern 4 (Exclusive choice) + +Yes. No explicit construct, but can be achieved by marking the two downstream steps with the `when` CWL conditional step +execution feature where the logic for one step's `when` field is the inverse of the other step's `when` field. + +If there is a downstream workflow step that would inherit a value from one of the exclusive choice steps into the same incoming input port, then `pickValue: the_only_non_null` may be helpful to ensure that the two "exclusive" choices did not both execute due to a misconfiguration in their `when` logic. + +CWL versions: v1.2+ + +* Pattern 5 (Simple Merge) + +Yes. Any workflow can include another workflow as a step, thus allowing re-use. + +CWL versions: all + +## "Advanced Branching and Synchronization Patterns" + +* Pattern 6 (Multi-Choice) + +No, as there is not a construct in CWL to implement this directly, as required by the WPI definition of this pattern. +This can be emulated using the CWL v1.2 `when` field, similar to the method used to implement Pattern 4 (Exclusive choice), but with 3 or more steps using matched `when` markers. + +* Pattern 7 (Structured Synchronizing Merge) + +No, as there is not a construct in CWL to implement this directly, as required by the WPI definition of this pattern. +One way to achieve this is to hide the choice inside a sub-workflow which will provide a stable set of outputs to connect to other steps. +As this pattern implies the conditional execution of prior steps, the use of the `pickValue` construct in the merge step will likely be useful. + +* Pattern 8 (Multi-Merge) + +No, as CWL does not have a construct for the WPI Multi-Choice pattern, which is a prerequisite for the WPI definition of the Multi-Merge pattern. + +However, any CWL step can rely on inputs from multiple other steps. 
This functionality is available in all CWL versions. + +* Pattern 9 (Structured Discriminator) + +No. CWL does not support canceling the execution of tasks according to some criteria. + +* Pattern 28 (Blocking Discriminator) + +No. State in CWL is read-only, therefore this pattern is not supported as it would be impossible to reset the discriminator. +This pattern is likely not possible in CWL for other reasons as well. + +* Pattern 29 (Cancelling Discriminator) + +No. CWL does not support canceling the execution of tasks according to some criteria. +Additionally, all variables in CWL are read-only so there is no ability to "reset [a] construct". + +* Pattern 30 (Structured Partial Join) + +No. CWL does not support canceling the execution of tasks according to some dynamic criteria. + +* Pattern 31 (Blocking Partial Join) + +No. CWL does not support re-ordering of tasks. + +* Pattern 32 (Cancelling Partial Join) + +No. CWL does not support canceling the execution of tasks according to some criteria. + +* Pattern 33 (Generalized AND-Join) + +No. CWL is a dataflow workflow language, not a control-flow workflow language. All inputs must be known at the time of workflow enactment. + +* Pattern 37 (Local Synchronizing Merge) + +No. CWL is a dataflow workflow language, not a control-flow workflow language. + +A CWL step can not accumulate inputs (or continue to emit outputs) over time. +CWL steps are only executed once (unless `scatter` is used) and the results of that step are only made available once execution has finished. + +See also the evaluation of CWL against Pattern 7 (Structured Synchronizing Merge). + +* Pattern 38 (General Synchronizing Merge) + +Yes, as this was easy to implement due to two factors related to CWL being a dataflow workflow language and not a control-flow workflow language: + +1. CWL steps are only executed once (unless `scatter` is used) and the results of that step are only made available once execution has finished. + +2. 
A CWL step is available for execution when all of the upstream steps have been executed (or permanently skipped due to the use of the `when` construct). + +CWL versions: + without conditional steps: all + with conditional steps: v1.2+ + +* Pattern 41 (Thread Merge) + +No, CWL has no concept of threads at the workflow language level. The underlying applications may use POSIX threads, but that is not managed by CWL. + +* Pattern 42 (Thread Split) + +No, CWL has no concept of threads at the workflow language level. The underlying applications may use POSIX threads, but that is not managed by CWL. + +## "Multiple Instance Patterns" + +* Pattern 12 (Multiple Instances without Synchronization) + +No. While CWL has the `scatter` construct, using this implies an implicit "gather"ing of the results before downstream steps can use the results. +This means that execution of subsequent steps is delayed until the slowest of the `scatter`ed tasks has finished, even if that particular result isn't needed right away. + +Additionally, CWL does not have a `loop` construct, which would be another way one could implement this pattern. + +* Pattern 13 (Multiple Instances with a priori Design-Time Knowledge) + +No. There is no CWL construct to execute a step or task a specific number of times N, where N is a concrete number (like 23) specified by the workflow author. + +Additionally, CWL does not have a `loop` construct, which would be another way one could implement this pattern. + +However, this pattern could be emulated in a CWL editor or language that converts to CWL syntax by creating explicit steps (23 in our example) in the CWL workflow description. + +* Pattern 14 (Multiple Instances with a priori Run-Time Knowledge) + +Yes. CWL has the `scatter` construct, which compactly directs that a given step be executed multiple times, with most inputs fixed and only the specified `scatter`ed inputs varying. 
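+
+A minimal sketch of the `scatter` construct described above, assuming CWL v1.2. The step name `say`, the `messages` input, and the inline `echo` tool are hypothetical and chosen only for illustration:
+
+```yaml
+cwlVersion: v1.2
+class: Workflow
+requirements:
+  ScatterFeatureRequirement: {}   # required whenever a step uses `scatter`
+inputs:
+  messages: string[]              # hypothetical input; its length is known only at run time
+outputs:
+  echoed:
+    type: File[]                  # the implicit gather yields one File per scattered run
+    outputSource: say/out
+steps:
+  say:                            # hypothetical step name
+    run:
+      class: CommandLineTool
+      baseCommand: echo
+      inputs:
+        message:
+          type: string
+          inputBinding:
+            position: 1
+      outputs:
+        out:
+          type: stdout
+    scatter: message              # run the tool once per element of `messages`
+    in:
+      message: messages
+    out: [out]
+```
+
+The number of runs is fixed by the length of the array once the workflow starts, which is why this matches "a priori Run-Time Knowledge" rather than Pattern 15.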
+ +* Pattern 15 (Multiple Instances without a priori Run-Time Knowledge) + +No. CWL lacks any construct that can spawn additional tasks dynamically according to criteria. + +* Pattern 34 (Static Partial Join for Multiple Instances) + +No. CWL does not have the ability to allow execution of a downstream step when N of M tasks have completed. + +* Pattern 35 (Cancelling Partial Join for Multiple Instances) + +No. CWL does not support canceling the execution of tasks according to some criteria. + +* Pattern 36 (Dynamic Partial Join for Multiple Instances) + +No. CWL lacks any construct that can spawn additional tasks dynamically according to criteria. + +## "State-based Patterns" + +* Pattern 16 (Deferred Choice) + +No. The CWL v1.2+ conditional workflow step ability is based only upon explicit inputs to the workflow step. + +Additionally, in CWL there is no concept of an "operating environment". + +* Pattern 17 (Interleaved Parallel Routing) + +No. Ordering of tasks in CWL is not fixed, but the dependency graph is explicit. + +There is no CWL construct to limit the number of parallel tasks being executed; that is up to the workflow engine. + +* Pattern 18 (Milestone) + +No. In CWL, parameter values are read-only; they can not change over time. + +It is not possible to query the state of another CWL task. + +A step in a CWL workflow can only receive specific outputs from other steps that have already completed, and this wiring is fixed prior to workflow execution. + +* Pattern 39 (Critical Section) + +No. Subsections of CWL workflows can not be bidirectionally connected. One could depend on another (by putting the other in a sub-workflow, if it isn't one already). + +It is not possible to query the state of another CWL task. + +A step in a CWL workflow can only receive specific outputs from other steps that have already completed, and this wiring is fixed prior to workflow execution. + +* Pattern 40 (Interleaved Routing) + +No. 
Ordering of tasks in CWL is not fixed, but the dependency graph is explicit. + +There is no CWL construct to limit the number of parallel tasks being executed; that is up to the workflow engine. + +## "Cancellation and Force Completion Patterns" + +* Pattern 19 (Cancel Task) + +No. CWL does not support canceling the execution of tasks according to some criteria. + +* Pattern 20 (Cancel Case) + +No. CWL does not support canceling the execution of tasks according to some criteria. + +Additionally, CWL does not have the concept of a "case". + +* Pattern 25 (Cancel Region) + +No. CWL does not support canceling the execution of tasks according to some criteria. + +* Pattern 26 (Cancel Multiple Instance Task) + +No. CWL does not support canceling the execution of tasks according to some criteria. + +* Pattern 27 (Complete Multiple Instance Task) + +No. CWL does not support canceling the execution of tasks according to some criteria. + +## "Iteration Patterns" + +* Pattern 10 (Arbitrary Cycles) + +No. CWL has no loop construct. + +* Pattern 21 (Structured Loop) + +No. CWL has no loop construct. + +* Pattern 22 (Recursion) + +No. CWL does not support recursion. + +## "Termination Patterns" + +* Pattern 11 (Implicit Termination) + +Yes. CWL workflows are finished when all of the required outputs are available. + +* Pattern 43 (Explicit Termination) + +No. CWL does not have an explicit "end node" construct. + +## "Trigger Patterns" + +* Pattern 23 (Transient Trigger) + +No. CWL does not support the concept of triggers or signals. + +* Pattern 24 (Persistent Trigger) + +No. CWL does not support the concept of triggers or signals. 
diff --git a/workflow_patterns_initiative/data/README.md b/workflow_patterns_initiative/data/README.md new file mode 100644 index 0000000..74cbc41 --- /dev/null +++ b/workflow_patterns_initiative/data/README.md @@ -0,0 +1,138 @@ + +# WPI's [Workflow Data Patterns](http://www.workflowpatterns.com/patterns/data/) and CWL + +Of the 40 WPI Workflow Data patterns, 17 patterns are supported by CWL v1.2 or earlier and 23 patterns are unsupported. + +[Back to the list of WPI pattern categories](../README.md) + +* Pattern 1 (Task Data) + +"Data can be explicitly declared at task level with task level scoping" -> Yes, you can have a variable or value defined in a CommandLineTool, ExpressionTool, or Operation that is not visible to other CWL Processes. They are initialized with a value provided by the Process author. + +* Pattern 2 (Block Data) + +"Data can be explicitly declared at block task level with block task level scoping" -> Yes, one can declare an extra variable at the step level or sub-workflow level. + +"Facilities exist for formal parameter passing to and from a block" -> Sub-Processes can inherit a step level variable when it is connected to one of the Process’s inputs. They cannot access an arbitrary step-level input by name or any other mechanism, only through their own defined input parameters. Arbitrary access to block level parameters from another block is not possible in CWL. The only outputs from a step are the outputs from the underlying Process that have been marked for export. For sub-workflows, their steps may connect workflow-level data (input parameters) to specific sub-Processes by identifier. The only data from a sub-workflow that is available to a sibling Process or the parent Workflow are the explicit output parameters that connect to specific step outputs in the sub-Workflow. + +* Pattern 3 (Scope Data) + +No. 
CWL does not support Scope Data as scopes are defined as not creating a new address space; the closest approximation (a CWL sub-Workflow) does create a new address space. + +* Pattern 4 (Multiple Instance Data) + +"The data element is capable of being replicated or partitioned across multiple tasks": Yes. + +"Each of these data instances exist in their own address space": Yes. + +A Process can be run with unique data inputs multiple times in a CWL workflow and multiple times in the same step when using the ‘scatter’ feature. Each run of that Process has access to only the specific data that was connected to it, and not to the data from other runs. + +"The instances are able to be accessed from a higher level in the process hierarchy" Only after execution is finished, and only for the explicit output values, yes. + +* Pattern 5 (Case Data) + +"Direct tool support for data elements at case level with case level scoping. Case data visible to all components of the case.": No, in CWL each Process only has access to its own defined inputs. Those inputs might be connected to Workflow-scope inputs (or the outputs from other steps) but there is no “global” or case namespace accessible from all levels of a workflow. + +* Pattern 6 (Folder Data) No. As CWL does not have a Case Data concept, it can not have a Folder Data concept either. + +* Pattern 7 (Workflow Data) + +"Workflow data visible to all components of a workflow". Yes, all steps can reference workflow level inputs (including inputs with default values that are rarely overridden by users) in setting up the inputs to their Process. However, inside those Processes there is no visibility into the step level or workflow level namespaces. Only those data values that have been propagated to the Process via one of its own named inputs are available. + +* Pattern 8 (Environment Data) + +No. There is no CWL construct for Environment Data. 
If network access is allowed by the executor, then the applications executed by CommandLineTools can access external data, though this is not advised for repeatability and resiliency reasons. Other sources of data gathering besides IP network access by CommandLineTools are possible, but not guaranteed by the CWL standards. They would certainly not be portable and as such are not recommended. + +* Pattern 9 (Task to Task) + +Yes, CWL supports Task to Task data communication via Distinct data channels. That is, CWL Processes can inherit data from another Process and their outputs can in turn become inputs to other Processes, all defined by the `in` mapping in each CWL Workflow step. Concurrency is only an issue when `InplaceUpdateRequirement` is implemented. + +* Pattern 10 (Block Task to Sub-Workflow Decomposition) + +"Data elements available to a block task are able to be passed to or are accessible in the associated sub-workflow". Yes, step level inputs are connected to sub-Workflows. + +"There is some degree of control over which elements at block task level are made accessible in the sub-workflow". Yes, only explicit step level values are connected to the pre-defined sub-workflow inputs. + +* Pattern 11 (Sub-Workflow Decomposition to Block Task) + +"Data elements at sub-workflow level can be passed to or made accessible in the corresponding block task". Yes, via the ‘outputs’ section in the CWL (sub-)Workflow step definition. + +* Pattern 12 (To Multiple Instance Task) + +"Multiple instance tasks directly supported". Yes. + +"Data elements can be passed from an atomic task to all instances of a multiple instance task". Yes, via ‘scatter’. + +"Workflow handles synchronization of data passing and any necessary data replication". Yes, this is a defined responsibility of a CWL-compliant workflow engine. + +"Facilities are available to allocate sections of an aggregate data element to specific task instances". Yes. 
+ +"Data elements in each task instance are independent of those in other task instances". Yes. + +* Pattern 13 (From Multiple Instance Task) + +"Multiple instance tasks directly supported". Yes. + +"Data elements can be aggregated from multiple task instances and forwarded to subsequent task instance(s)". Yes, via an implicit gather after the execution of a CWL Workflow step marked with “scatter”. + +"Workflow handles synchronization of data passing and any necessary data replication". Yes, this is a defined responsibility of a CWL-compliant workflow engine. + +* Pattern 14 (Case to Case) + +No, CWL explicitly and purposefully does not support interactions between concurrently executed Processes. CWL is not a service orchestration language. There is no CWL construct to say that two or more Processes should execute simultaneously or be overlapping. The software run by CommandLineTool processes might communicate data with an external service that might allow for exchanging data between CommandLineTool Processes that might be coincidentally executing concurrently. However, this is strongly discouraged, nor is it portable. There is a proposal (NOTE: https://github.com/common-workflow-language/cwltool#running-mpi-based-tools-that-need-to-be-launched) to add an "MPIRequirement" to a future version of the CWL standards. It is available via a special flag in the CWL reference runner and achieves similar functionality for a single CommandLineTool that should be executed on many nodes concurrently, with intra-node communication set up by a system compliant with the MPI standard. + +* Pattern 15 (Task to Environment - Push), Pattern 16 (Environment to Task - Pull), Pattern 19 (Case to Environment - Push), Pattern 20 (Environment to Case - Pull), Pattern 23 (Workflow to Environment - Push), Pattern 24 (Environment to Workflow - Pull) + +No. This is not a feature of CWL. 
A CommandLineTool Process could communicate with an external service or resource via IP networking, but this is not explicitly supported by CWL and is not recommended. Interactions with stateful services are features of other workflow languages, like the now-defunct (NOTE: https://lists.apache.org/thread.html/r19322d54fd6aae5778aff46717dea2fbd37c3b64571300ad9cee0191%40%3Cdev.taverna.apache.org%3E +) Taverna, but they entail significant complexity and implementation costs to handle error states and other common challenges. + +* Pattern 17 (Environment to Task - Push), Pattern 18 (Task to Environment - Pull), Pattern 21 (Environment to Case - Push), Pattern 22 (Case to Environment - Pull), Pattern 25 (Environment to Workflow - Push), Pattern 26 (Workflow to Environment - Pull) + +No. While the "NetworkAccess" requirement enables a CommandLineTool to be marked as requiring network (IP) access the CWL standards state that “Enabling network access does not imply a publically routable IP address or the ability to accept inbound connections.” + +* Pattern 27 (Data Transfer by Value - Incoming) + +"Workflow components are able to accept data elements passed to them as value": Yes. The Processes connected to each step in a CWL workflow receive data by value to each of their required input parameters and zero or more of their optional input parameters. + +* Pattern 28 (Data Transfer by Value - Outgoing) + +"Workflow components are able to pass data elements to subsequent components by value": Yes, the results from a CWL Process are available as named outputs for connecting to the inputs of other Processes (and as final Workflow outputs). + +* Pattern 29 (Data Transfer - Copy In/Copy Out) + +No. In CWL we do not overwrite prior data except when the optional ‘InplaceUpdateRequirement’ is used. Even then there isn’t a copy, the specified Files and Directories are directly modified by the CommandLineTool. 
+ +* Pattern 30 (Data Transfer by Reference - Unlocked), Pattern 31 (Data Transfer by Reference - With Lock) + +No. While the CWL object model does use IRI/URI ‘locations’ to identify specific Files and Directories, those underlying bitstreams are invariant unless ‘InplaceUpdateRequirement’ is used. Those ‘locations’ are transformed to local file paths just prior to CommandLineTool execution. + +* Pattern 32 (Data Transformation - Input) + +Yes. There are many opportunities to transform data in CWL. In CWL Workflow step definitions one can extract a subset or perform another transformation using the ‘valueFrom’ field and a CWL expression or CWL parameter reference. + +* Pattern 33 (Data Transformation - Output) + +No. As a workaround, another Process (ExpressionTool, CommandLineTool, or sub-Workflow) can be used to modify an output before further use. Within a CommandLineTool Process the ‘outputEval’ field can do modifications. + +* Pattern 34 (Task Precondition - Data Existence) + +"Direct precondition support for evaluation data element existence at task instance level": Yes. This is the basis for task dependency in CWL. + +* Pattern 35 (Task Precondition - Data Value) + +No. Once a Process has completed, all of its named outputs are available to sibling Workflow steps. While these can be evaluated using the new `when` field, this only decides if execution can take place. It does not delay execution, as there is no capacity to update the inputs later, only to make new inputs under different steps. + +* Pattern 36 (Task Postcondition - Data Existence), Pattern 37 (Task Postcondition - Data Value) + +No. CommandLineTools and ExpressionTools terminate when their underlying tools finish execution. (sub-)Workflows terminate when all steps required for the outputs have finished. 
While the outputs can be examined in a CWL Expression and an exception thrown if they do not meet the given requirements, this does not cause continued execution or re-execution, but a permanent failure of the Process. + +* Pattern 38 (Event-Based Task Trigger), Pattern 39 (Data-Based Task Trigger) + +No, this is not part of the CWL standards. But another system could initiate the execution of a CWL Process based upon an event outside the CWL Process description itself (or via an unofficial extension to the CWL standards). + +* Pattern 40 (Data-Based Routing) + +"Any data element accessible at case level can be utilised in a routing construct". Yes, any output from a sibling CWL step can be used to decide if a CWL step should be executed. "Direct workflow support": Yes, using the `when` field in a CWL workflow step definition. + +"Support for both exclusive choice and multi-choice constructs". These can be emulated, but not directly enforced. +
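+A minimal sketch of emulating an exclusive choice with the `when` field and `pickValue`, assuming CWL v1.2. The step names `upper` and `lower`, the `use_upper` and `message` inputs, and the inline ExpressionTools are hypothetical and chosen only for illustration:
+
+```yaml
+cwlVersion: v1.2
+class: Workflow
+requirements:
+  InlineJavascriptRequirement: {}       # needed for the `!` in the second `when` and for the ExpressionTools
+  MultipleInputFeatureRequirement: {}   # needed because `result` lists two sources
+inputs:
+  use_upper: boolean                    # hypothetical routing value
+  message: string
+outputs:
+  result:
+    type: string
+    outputSource: [upper/out, lower/out]
+    pickValue: the_only_non_null        # error unless exactly one branch produced a value
+steps:
+  upper:
+    when: $(inputs.use_upper)           # runs only if use_upper is true
+    run:
+      class: ExpressionTool
+      inputs:
+        message: string
+      outputs:
+        out: string
+      expression: '${ return {"out": inputs.message.toUpperCase()}; }'
+    in:
+      use_upper: use_upper
+      message: message
+    out: [out]
+  lower:
+    when: $(!inputs.use_upper)          # the inverse condition of `upper`
+    run:
+      class: ExpressionTool
+      inputs:
+        message: string
+      outputs:
+        out: string
+      expression: '${ return {"out": inputs.message.toLowerCase()}; }'
+    in:
+      use_upper: use_upper
+      message: message
+    out: [out]
+```
+
+Nothing in the language forces the two `when` expressions to be exact inverses; `pickValue: the_only_non_null` only reports a failure after the fact, which is what "emulated, but not directly enforced" means here.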