6. Manage Pipeline

Pipelines represent a sequence of tasks that are executed together in order to process some data. They help to automate complex tasks that consist of many sub-tasks.

This chapter describes the Pipeline space GUI and the main working scenarios.

Pipeline object GUI

Since a pipeline is one of the CP objects stored in the "Library" space, the Pipeline workspace is separated into two panes:

  1. "Hierarchy" view pane
  2. "Details" view pane.

Note: you can also view general information and some details about a specific pipeline via CLI. See 14.4 View pipeline definitions via CLI.

"Details" view pane

The "Details" view pane displays content of a selected object. In case of a pipeline, you will see:

  • a list of pipeline versions with the description and date of the last update;
  • controls specific to this space.

CP_ManagePipeline

"Details" controls

Control Description
Displays icon This icon includes:
  • "Attributes" control (1) opens the Attributes pane. Here you can see a list of "key=value" attributes of the pipeline. For more info see here.
    Note: if the selected pipeline has any defined attribute, the Attributes pane is shown by default.
  • Issues shows/hides the issues of the current pipeline for discussion. To learn more see here.
"Gear" icon This control (2) allows renaming the pipeline, changing its description, deleting it, and editing repository settings, i.e. the Repository address and Token.
Git repository "Git repository" control (3) shows a git repository address where pipeline versions are stored, which could be copied and pasted into a browser address field:
CP_ManagePipeline
If the SSH protocol is required for your purposes, click the HTTPS/SSH selector and choose the SSH item:
CP_ManagePipeline
In that case, you will get a reformatted SSH address:
CP_ManagePipeline
A user can also work directly with git from the console on the running node. For more information on how to configure the Git client to work with the Cloud Pipeline, see here.
To clone a pipeline, a user shall have READ permission; to push, WRITE permission is also needed. For more info see 13. Permissions. A typical clone/push session is sketched just after this table.
Release "Release" control (4) is used to tag a particular pipeline version with a name. A draft pipeline version has the control only.
Note: you can edit the last pipeline version only.
Run Each pipeline version item of the selected pipeline's list has a "Run" control (5) to launch a pipeline version.
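
As an illustration, once the address from the "Git repository" control is copied, a clone/push session from the console of a running node might look like the sketch below (the repository URL and file name are hypothetical):

```bash
# Clone the pipeline repository (requires READ permission);
# use the address shown by the "Git repository" control.
git clone https://cp-git.example.com/pipelines/my-pipeline.git
cd my-pipeline

# Edit a script, then push the change back (requires WRITE permission).
git add src/my_script.sh
git commit -m "Tweak the processing step"
git push
```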

Pipeline versions GUI

The pipeline version interface displays full information about a pipeline version: supporting documentation, code files, configurations, history of version runs, etc.

Pipeline controls

The following buttons are available to manage this space.

Control Description
Run This button launches a pipeline version. When a user clicks the button, the "Launch a pipeline" page opens.
"Gear" icon This control allows renaming the pipeline, changing its description, deleting it, and editing repository settings, i.e. the Repository address and Token.
Git repository Shows the git repository address where pipeline versions are stored, which can be copied and pasted into a browser address field.
A user can also work directly with git from the console on the running node. For more information on how to configure the Git client to work with the Cloud Pipeline, see here.
To clone a pipeline, a user shall have READ permission; to push, WRITE permission is also needed. For more info see 13. Permissions.

Pipeline launching page

"Launch a pipeline" page shows parameters of a default configuration the pipeline version. This page has the same view as a "Configuration" tab of a pipeline version. Here you can select any other configuration from the list and/or change parameters for this specific run (changes in configuration will be applied only to this specific run).

Pipeline version tabs

The pipeline version space differs dramatically from the Pipeline space. To open it, just click a pipeline version.
All the information is organized into the following tabs in the "Details" view pane.

DOCUMENTS

The "Documents" tab contains documentation associated with the pipeline, e.g. README, pipeline description, etc. See an example here.
NoteREADME.md file is created automatically and contains default text which could be easily edited by a user.

Documents tab controls

CP_ManagePipeline

The following buttons are available to manage this section:

Control Description
Upload (a) This control (a) allows uploading documentation files.
Delete (b) "Delete" control (b) helps to delete a file.
Rename (c) To rename a file, a user shall use the "Rename" control (c).
Download (d) This control (d) allows downloading a pipeline documentation file to your local machine.
Edit (e) "Edit" control (e) helps a user to edit any text file (e.g. README) right here in a text editor using Markdown.

CODE

This section contains a list of scripts to run a pipeline. Here you can create new files and folders, and upload files. Each script file can be edited (see details here).
Note: the .json configuration file can also be edited in the Configuration tab via the GUI.

Code tab controls

CP_ManagePipeline

The following controls are available:

Control Description
Plus button (a) This control creates a new folder in a pipeline version. The folder name shall be specified.
+ New file (b) Creates a new file in the current folder.
Upload (c) Uploads files from your local file system to a pipeline version.
Rename (d) Each file or folder has a "Rename" control which allows renaming a file/folder.
Delete (e) Each file or folder has a "Delete" control which deletes a file/folder.

The list of system files

All newly created pipelines have at least 2 starting files, no matter which pipeline template you've chosen.
The only exception is a newly created DEFAULT pipeline, which has 1 starting file (config.json).

main_file

This file contains the pipeline scenario. By default, it is named after the pipeline, but this may be changed in the configuration file.
Note: the main_file is usually the entry point for pipeline execution.
To create your own scenario, edit the default template of the main file (see details here).

Example: below is a piece of the main_file of the Gromacs pipeline:
CP_ManagePipeline

config.json

This file contains pipeline execution parameters. You cannot rename or delete it, because it is used in pipeline scripts and they will not work without it.
Note: it is advised to modify pipeline execution settings via the CONFIGURATION tab (e.g. if you want to change default settings for pipeline execution) or via the Launch pipeline page (e.g. if you want to change pipeline settings for the current run). Manual config.json editing should be left to advanced users (primarily developers), since the JSON format is not validated in this case.
Note: all attributes from config.json are available as environment variables for pipeline execution.

The config.json file for every pipeline template has the following settings:

Setting Description
main_file A name of the main file for that pipeline.
instance_size Instance type in terms of the specific Cloud Provider, which defines the amount of RAM (in Gb) and the number of CPU and GPU cores (e.g. m4.xlarge for an AWS EC2 instance).
instance_disk An instance's disk size in Gb.
docker_image A name of the Docker image that will be used in the current pipeline.
cmd_template Command line template that will be executed at the running instance in the pipeline.

cmd_template can use environment variables:
- to address the main_file parameter value, use the following construction: [main_file]
- to address all other parameters, the usual Linux environment variable style shall be used (e.g. $docker_image)
parameters Pipeline execution parameters (e.g. a path to the data storage with input data). A parameter has a name and a set of attributes. The possible keys for each parameter are:
- "type" - specifies the type of the parameter,
- "value" - specifies the default value of the parameter,
- "required" - specifies whether this parameter must be set ("required": true) or not ("required": false),
- "no_override" - specifies whether the parameter's default value is immutable in a detached configuration that uses this pipeline:
    - if a parameter has a default value and "no_override" is true - the parameter field will be read-only
    - if a parameter has a default value and "no_override" is false or not set - the parameter field will be writable
    - if a parameter has no default value - the "no_override" option is ignored and the parameter field will be writable

Example: config.json file of the Gromacs pipeline:
CP_ManagePipeline
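
For illustration, a minimal config.json combining the settings described above might look like the following sketch (the instance type, Docker image and parameter names are hypothetical, not the actual Gromacs values):

```json
{
    "main_file": "my_pipeline.sh",
    "instance_size": "m4.large",
    "instance_disk": "50",
    "docker_image": "library/my-tool:latest",
    "cmd_template": "bash [main_file] $INPUT_DATA",
    "parameters": {
        "INPUT_DATA": {
            "type": "input",
            "value": "s3://my-bucket/input/",
            "required": true
        },
        "RESULTS_DIR": {
            "type": "output",
            "value": "s3://my-bucket/results/",
            "required": false,
            "no_override": false
        }
    }
}
```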

Note: in addition to main_file and config.json, you can add any number of files to the CODE section and combine them into one whole scenario.

CONFIGURATION

This section represents the pipeline execution parameters that are set in the config.json file. The parameters can be changed here, and the config.json file will be changed accordingly. See how to edit a configuration here.

CP_ManagePipeline

A configuration specifies:

Section Control Description
Name Pipeline and its configuration names.
Estimated price per hour Control shows machine-hour prices. If you hover over the "Info" icon, you'll see the maximum, minimum and average price for a particular pipeline version run as well as the price per hour.
Exec environment This section lists execution environment parameters.
Docker image A name of a Docker image to use for a pipeline execution (e.g. "library/gromacs-gpu").
Node type An instance type in terms of the specific Cloud Provider: CPU, RAM, GPU (e.g. 2 CPU cores, 8 Gb RAM, 0 GPU cores).
Disk Size of the disk (in Gb) that will be attached to the instance.
Configure cluster button On click, a pop-up window will be shown:
CP_ManagePipeline
Here you can configure a cluster or an auto-scaled cluster. A cluster is a collection of instances which are connected so that they can be used together on a task. In both cases, a number of additional worker nodes are launched along with a main node acting as the cluster head (total number of pipelines = "number of worker nodes" + 1). See v.0.14 - 7.2. Launch Detached Configuration for details.

In the case of a cluster, the exact count of worker nodes is directly defined by the user before launching the task and cannot change during the run.

In the case of an auto-scaled cluster, the max count of worker nodes is defined by the user before launching the task, but the actually used count of worker nodes can change during the run depending on the job queue load. See Appendix C. Working with autoscaled cluster runs for details.

To configure a cluster:
  • in the opened window, click the Cluster button
  • specify the number of child nodes (workers count)
  • if you want to use a GridEngine server for the cluster, tick the Enable GridEngine checkbox. Ticking that checkbox automatically adds the CP_CAP_SGE system parameter with the value true.
    Note: you may set this and other system parameters manually - see the example of using system parameters here.
  • if you want to use Apache Spark for the cluster, tick the Enable Apache Spark checkbox. Ticking that checkbox automatically adds the CP_CAP_SPARK system parameter with the value true. See the example of using Apache Spark here.
  • if you want to use Slurm for the cluster, tick the Enable Slurm checkbox. Ticking that checkbox automatically adds the CP_CAP_SLURM system parameter with the value true. See the example of using Slurm here.
  • click the OK button:
    CP_ManagePipeline
When a user selects the Cluster option, information on total cluster resources is shown. Resources are calculated as (CPU/RAM/GPU)*(NUMBER_OF_WORKERS+1):
CP_ManagePipeline
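For example, assuming a node type with 2 CPU cores, 8 Gb RAM, 0 GPU and 3 worker nodes, the displayed totals would be 2*(3+1) = 8 CPU cores, 8*(3+1) = 32 Gb RAM and 0 GPU.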

To configure an auto-scaled cluster:
  • in the opened window, click the Auto-scaled cluster button
  • specify the number of child nodes (workers count) in the Auto-scaled up to field and click the OK button:
    CP_ManagePipeline
    Note: that number means the total count of "auto-scaled" nodes - it is the max count of worker nodes that can be attached to the main node to work together as a cluster. These nodes will be attached to the cluster only if some jobs stay in the waiting state longer than a specific time. These nodes will also be dropped from the cluster when the job queue is empty, or when all jobs are running and some nodes stay idle longer than a specific time.
    Note: for the timeout periods for scale-up and scale-down of an auto-scaled cluster, see here.
  • additionally, you may enable hybrid mode for the auto-scaled cluster - it allows scaling up the cluster (attaching a worker node) with an instance type distinct from the master's - a worker is picked based on the amount of unsatisfied CPU requirements of all pending jobs. For that behavior, set the "Enable Hybrid cluster" checkbox. For more details see here.
  • additionally, you may specify a number of "persistent" child nodes (workers count) - click Setup default child nodes count, input the Default child nodes number and click the Ok button:
    CP_ManagePipeline
    • These default child nodes will never be "scaled down" during the run, regardless of the job queue load. In the example above, the total count of "auto-scaled" nodes is 3, and 1 of them is "persistent".
      Note: the total count of child nodes must always be greater than the count of default ("persistent") child nodes.
    • if you don't want to use default ("persistent") child nodes in your auto-scaled cluster - click the Reset button opposite the Default child nodes field.
  • additionally, you may choose a price type for workers that will be attached during the run - via the "Workers price type" dropdown list - the workers' price type can automatically be the same as the master node's (by default) or forcibly specified regardless of the master's type:
    CP_ManagePipeline
When a user selects the Auto-scaled cluster option, information on total cluster resources is shown as an interval - from the "min" configuration to the "max" configuration:
"min" configuration resources are calculated as (CPU/RAM/GPU)*(NUMBER_OF_DEFAULT_WORKERS+1)
"max" configuration resources are calculated as (CPU/RAM/GPU)*(TOTAL_NUMBER_OF_WORKERS+1)

E.g. for an auto-scaled cluster with 2 child nodes and no default ("persistent") child nodes (NUMBER_OF_DEFAULT_WORKERS = 0; TOTAL_NUMBER_OF_WORKERS = 2):
CP_ManagePipeline

E.g. for an auto-scaled cluster with 2 child nodes and 1 default ("persistent") child node (NUMBER_OF_DEFAULT_WORKERS = 1; TOTAL_NUMBER_OF_WORKERS = 2):
CP_ManagePipeline
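
As a worked example, assuming a 2 CPU / 8 Gb RAM / 0 GPU node type, the first configuration above would show an interval from 2 CPU / 8 Gb RAM (no persistent workers, master only) to 6 CPU / 24 Gb RAM (2 workers + master), while the second would show 4 CPU / 16 Gb RAM (1 persistent worker + master) to 6 CPU / 24 Gb RAM.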

Note: in some specific configurations, such as hybrid auto-scaled clusters, the amount of resources can vary beyond the shown interval.
Note: if you don't want to use any cluster - click the Single node button and then click the OK button.
Cloud Region A specific region for a compute node placement.
Please note that if a non-default region is selected, certain CP features may be unavailable:
  • FS mounts from another region cannot be used (e.g. the "EU West" region cannot use FS mounts from "US East"). Regular storages will still be available
  • If a specific tool used for a run requires an on-premise license server (e.g. monolix, matlab, schrodinger, etc.) - such instances shall be run in a region that hosts those license servers.
Note: if a specific platform deployment has a number of Cloud Providers registered (e.g. AWS+Azure, GCP+Azure), auxiliary Cloud Provider icons will also be displayed in that control, e.g.:
CP_ManagePipeline
For single-Provider deployments, only Cloud Region icons are displayed.
Run capabilities Allows launching a pipeline with pre-configured additional software package(s), e.g. Docker-In-Docker, Singularity and others. For details see here.
CP_ManagePipeline
Advanced
Price type Choose the Spot or On-demand instance type. You can see information about price types by hovering over the "Info" icon and make your choice based on it.
Timeout (min) After this time the pipeline will shut down (optional). Before the shutdown, all contents of the $ANALYSIS_DIR directory will be copied to output storages.
Limit mounts Allows specifying the storages that should be mounted. For details see here.
Cmd template A shell command that will be executed to start a pipeline.
"Start idle" The flag sets cmd_template to sleep infinity. For more information about starting a job in this mode refer to 15. Interactive services.
Parameters This section lists pipeline specific parameters that can be used during a pipeline run. Pipeline parameters can be assigned the following types:
  • String - generic scalar value (e.g. Sample name).
  • Boolean - boolean value.
  • Path - path in a data storage hierarchy.
  • Input - path in a data storage hierarchy. During pipeline initialization, this path will be used to download input data from a storage onto the calculation node for processing.
  • Output - path in a data storage hierarchy. During pipeline finalization, this path will be used to upload resulting data to a storage.
  • Common - path in a data storage hierarchy. Similar to the "Input" type, but this data will not be erased from the calculation node when a pipeline is finished (this is useful for reference data, as it can be reused by further pipeline runs that share the same reference).
Note: You can use Project attribute values as parameters for the Run:
  1. Click an empty parameter value field.
  2. Enter "project."
  3. In the drop-down list select the Project attribute value:
    CP_ManagePipeline
Add parameter This control helps to add an additional parameter to a configuration.

Configuration tab controls

Control Description
Add To create a customized configuration for the pipeline, click the + ADD button in the upper-right corner of the screen. For more details see here.
Save This button saves changes in a configuration.

HISTORY

This section contains information about all runs of the current pipeline version. Run info is organized into a table with the following columns:

CP_ManagePipeline

  • Run - each record of that column contains two rows: the upper one is the run name, which consists of the pipeline name and the run id; the lower one is the Cloud Region.

Note: if a specific platform deployment has a number of Cloud Providers registered (e.g. AWS+Azure, GCP+Azure), the corresponding text information also includes the Provider name, e.g.:
CP_ManagePipeline

  • Parent-run - id of the run that executed the current run (this field is non-empty only for runs that are executed by other runs).
  • Pipeline - each record of that column contains two rows: the upper one is the pipeline name; the lower one is the pipeline version.
  • Docker image - base docker image name.
  • Started - the time the pipeline started running.
  • Completed - the time the pipeline finished execution.
  • Elapsed - each record of that column contains two rows: the upper one is the pipeline running time; the lower one is the run's estimated price, which is calculated based on the run duration, region and instance type.
  • Owner - the user who launched the run.

You can filter runs by clicking the filter icon. Using the filter control, you can choose whether to display runs for the current pipeline version or for all pipeline versions.

History tab controls

Control Description
PAUSE (a) To pause a running pipeline, press this control. This control is available only for on-demand instances.
STOP (b) To stop a running pipeline, press this control.
LOG (c) "Log" control opens detailed information about the run. You'll be redirected to the "Runs" space (see 11. Manage Runs).
RESUME (d) To resume a paused pipeline, press this control. This control is available only for on-demand instances.
TERMINATE (e) Terminates the node without waiting for the pipeline to resume. This control is available only for on-demand instances that were paused.
RERUN (f) This control reruns completed pipeline runs.

Pipeline run's states

Icons on the left represent the current state of the pipeline runs:

  • CP_ManagePipeline - Queued state ("sandglass" icon) - a run is waiting in the queue for the available compute node.
  • CP_ManagePipeline - Initializing state ("rotating" icon) - a run is being initialized.
  • CP_ManagePipeline - Pulling state ("download" icon) - the pipeline Docker image is being downloaded to the node.
  • CP_ManagePipeline - Running state ("play" icon) - a pipeline is running. The node is coming up and pipeline input data is being downloaded to the node before the "InitializeEnvironment" service task appears.
  • CP_ManagePipeline - Paused state ("pause" icon) - a run is paused. At this moment the compute node is already stopped but keeps its state. Such a run may be resumed.
  • CP_ManagePipeline - Success state ("OK" icon) - successful pipeline execution.
  • CP_ManagePipeline - Failed state ("caution" icon) - unsuccessful pipeline execution.
  • CP_ManagePipeline - Stopped state ("clock" icon) - a pipeline was manually stopped.

CP_ManagePipeline

Also, help tooltips are provided when hovering over a run state icon, e.g.:
CP_ManagePipeline
CP_ManagePipeline

STORAGE RULES

This section displays a list of rules used to upload data to the output data storage once a pipeline has finished. It helps to store only the data you need and minimize the amount of interim data in data storages.

CP_ManagePipeline

Info is organized into a table with the following columns:

  • The Mask column contains a relative path from the $ANALYSIS_DIR folder (see the Default environment variables section below for more information). The mask uses bash syntax to specify the data that you want to upload from the $ANALYSIS_DIR. Data from the specified path will be uploaded from the pipeline node to the bucket.
    Note: by default, the whole $ANALYSIS_DIR folder is uploaded to the cloud bucket (the default Mask is "*"). For example, the "*.txt*" mask specifies that all files with the .txt extension need to be uploaded from the $ANALYSIS_DIR to the data storage.
    Note: be accurate when specifying masks - if the wildcard mask ("*") is specified, all files will be uploaded, no matter what additional masks are specified.
  • The Created column shows the date and time of rule creation.
  • The Move to Short-Term Storage column indicates whether pipeline output data will be moved to short-term storage.

Storage rules tab control

Control Description
Add new rule (a) This control allows adding a new data managing rule.
Delete (b) To delete a data managing rule, press this control.

GRAPH

This section represents the sequence of pipeline tasks as a directed graph.

Tasks are graph vertices; edges represent execution order. A task can be executed only when all its input edges - associated tasks - are completed (see more information about creating a pipeline with the GRAPH section here).
Note: this tab is available only for Luigi and WDL pipelines.
Note: if the main_file has mistakes, the pipeline workflow won't be visualized.

CP_ManagePipeline

Graph tab controls

When a PipelineBuilder graph is loaded, the following layout controls become available to the user.

CP_ManagePipeline

Control Description
Save saves changes.
Revert reverts all changes to the last saved state.
Layout performs graph linearization, making the graph more readable.
Fit zooms graph to fit the screen.
Show links enables/disables workflow-level links to the tasks. It is disabled by default, as it overwhelms the visualization for large workflows.
Zoom out zooms graph out.
Zoom in zooms graph in.
Search element allows finding a specific object in the graph.
Fullscreen expands graph to the full screen.

Default environment variables

Pipeline scripts (e.g. the main_file) use default environment variables for pipeline execution. These variables are set in internal CP scripts:

  • RUN_ID - pipeline run ID.
  • PIPELINE_NAME - pipeline name.
  • COMMON_DIR - directory where pipeline common data (parameter with "type": "common") will be stored.
  • ANALYSIS_DIR - directory where output data of the pipeline (parameter with "type": "output") will be stored.
  • INPUT_DIR - directory where input data of the pipeline (parameter with "type": "input") will be stored.
  • SCRIPTS_DIR - directory where all pipeline scripts and config.json file will be stored.
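
For instance, a shell-based main_file might reference these variables as in the sketch below (the tool name and file names are hypothetical):

```bash
#!/bin/bash
# Input data for "input"-type parameters has already been downloaded here by the platform.
ls "$INPUT_DIR"

# Run a processing step and write the results to the analysis directory;
# its contents will be uploaded to the output storage according to the storage rules.
my_tool --in "$INPUT_DIR/sample.fastq" --out "$ANALYSIS_DIR/result.txt"

echo "Run $RUN_ID of pipeline $PIPELINE_NAME finished"
```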