Merge pull request #247 from pepkit/dev
v1.2.0
stolarczyk authored May 26, 2020
2 parents bb4f0e5 + f8328b0 commit ebdab21
Showing 108 changed files with 4,331 additions and 4,184 deletions.
6 changes: 3 additions & 3 deletions .gitignore
@@ -1,9 +1,7 @@
# ignore test results
tests/test/*
oldtests/test/*

# toy/experimental files
*.csv
*.tsv
*.pkl

# ignore eggs
@@ -69,10 +67,12 @@ open_pipelines/
*RESERVE*

doc/
site/
build/
dist/
looper.egg-info/
loopercli.egg-info/
__pycache__/


*ipynb_checkpoints*
6 changes: 5 additions & 1 deletion .travis.yml
@@ -3,14 +3,18 @@ python:
- "2.7"
- "3.5"
- "3.6"
- "3.7"
- "3.8"
os:
- linux
install:
- pip install --upgrade six
- pip install .
- pip install -r requirements/requirements-dev.txt
- pip install -r requirements/requirements-test.txt
script: pytest --cov=looper
script: pytest tests -x -vv --cov=looper
after_success:
- coveralls
branches:
only:
- dev
3 changes: 2 additions & 1 deletion MANIFEST.in
@@ -1,4 +1,5 @@
include requirements/*
include README.md
include logo_looper.svg
include looper/jinja_templates/*
include looper/jinja_templates/*
include looper/schemas/*
16 changes: 8 additions & 8 deletions docs/README.md
@@ -4,16 +4,16 @@

## What is looper?

`Looper` is a pipeline submitting engine. `Looper` deploys any command-line pipeline for each sample in a project organized in [standard PEP format](https://pepkit.github.io/docs/home/). You can think of `looper` as providing a single user interface to running, summarizing, monitoring, and otherwise managing all of your sample-intensive research projects the same way, regardless of data type or pipeline used.
Looper is a job submitting engine. Looper deploys arbitrary shell commands for each sample in a [standard PEP project](https://pepkit.github.io/docs/home/). You can think of looper as providing a single user interface to running, monitoring, and managing all of your sample-intensive research projects the same way, regardless of data type or pipeline used.

## What makes looper better?

`Looper`'s key strength is that it **decouples job handling from the pipeline process**. In a typical pipeline, job handling (managing how individual jobs are submitted to a cluster) is delicately intertwined with actual pipeline commands (running the actual code for a single compute job). The `looper` approach is modular, following the [the unix principle](https://en.wikipedia.org/wiki/Unix_philosophy): `looper` *only* manages job submission. This approach leads to several advantages compared with the traditional integrated approach:
Looper **decouples job handling from the pipeline process**. In a typical pipeline, job handling (managing how individual jobs are submitted to a cluster) is delicately intertwined with actual pipeline commands (running the actual code for a single compute job). In contrast, the looper approach is modular: looper *only* manages job submission. This approach leads to several advantages compared with the traditional integrated approach:

1. running a pipeline on just one or two samples/jobs is simpler, and does not require a full-blown distributed compute environment.
2. pipelines do not need to independently re-implement job handling code, which is shared.
3. every project uses a universal structure (expected folders, file names, and sample annotation format), so datasets can more easily move from one pipeline to another.
4. users must learn only a single interface that works with any of their projects for any pipeline.
1. pipelines do not need to independently re-implement job handling code, which is shared.
2. every project uses a universal structure, so datasets can move from one pipeline to another.
3. users must learn only a single interface that works with any project for any pipeline.
4. running just one or two samples/jobs is simpler, and does not require a distributed compute environment.



@@ -24,13 +24,13 @@ Releases are posted as [GitHub releases](https://github.com/pepkit/looper/releas


```console
pip install --user loopercli
pip install --user looper
```

Update with:

```console
pip install --user --upgrade loopercli
pip install --user --upgrade looper
```

If the `looper` executable is not automatically in your `$PATH`, add the following line to your `.bashrc` or `.profile`:
27 changes: 27 additions & 0 deletions docs/changelog.md
@@ -2,7 +2,34 @@

This project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html) and [Keep a Changelog](https://keepachangelog.com/en/1.0.0/) format.

## [1.2.0] - 2020-05-26

**This version introduced backwards-incompatible changes.**

### Added
- Commands:
- `init`; initializes `.looper.yaml` file
- `inspect`; inspects `Project` or `Sample` objects
- `table`; writes summary stats table
- `runp`; runs project level pipelines
- Input schemas and output schemas
- `--settings` argument to specify compute resources as a YAML file
- Option to preset CLI options in a dotfile
- `--command-extra` and `--command-extra-override` arguments that append a specified string to pipeline commands
- Option to specify destination of sample YAML in pipeline interface
- `--pipeline_interfaces` argument that allows pipeline interface specification via CLI

### Changed
- `looper summarize` to `looper report`
- Pipeline interface format changed drastically
- The PyPi name changed from 'loopercli' to 'looper'
- resources section in pipeline interface replaced with `size_dependent_attributes` or `dynamic_variables_command_template`.
- `--compute` can be used to specify arguments other than resources
- `all_input_files` and `required_input_files` keys in pipeline interface moved to the input schema and renamed to `files` and `required_files`
- pipeline interface specification

## [0.12.6] -- 2020-02-21

### Added
- possibility to execute library module as a script: `python -m looper ...`

60 changes: 0 additions & 60 deletions docs/cluster-computing.md

This file was deleted.

56 changes: 56 additions & 0 deletions docs/concentric-templates.md
@@ -0,0 +1,56 @@
# Looper's concentric template system

## Introduction

To build job scripts, looper uses a 2-level template system consisting of an inner template wrapped by an outer template. The inner template is called a *command template*, which produces the individual commands to execute. The outer template is the *submission template*, which wraps the commands in environment handling code. This layered design allows us to decouple the computing environment from the pipeline, which improves portability.

## The command template

The command template is specified by a pipeline in the pipeline interface. A very basic command template could be something like this:

```console
pipeline_command {sample.input_file} --arg
```

In the simplest case, looper can run the pipeline by simply running these commands. This example contains no information about the computing environment, such as SLURM submission directives.
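For instance, with a hypothetical sample whose `input_file` attribute points to `/data/sample1.fastq`, the populated command might look like this:

```console
pipeline_command /data/sample1.fastq --arg
```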

## The submission template

To extend to submitting the commands to a cluster, it may be tempting to add these details directly to the command template, which would cause the jobs to be submitted to SLURM instead of run directly. However, this would restrict the pipeline to *only* running via SLURM, since the submission code would be tightly coupled to the command code. Instead, looper retains flexibility by introducing a second template layer, the *submission template*. The submission template is specified at the level of the computing environment. A submission template can also be as simple or complex as required. For a command to be run in a local computing environment, a basic template will suffice:

```console
#! /usr/bin/bash

{CODE}
```

A more complicated template could submit a job to a SLURM cluster:

```console
#!/bin/bash
#SBATCH --job-name='{JOBNAME}'
#SBATCH --output='{LOGFILE}'
#SBATCH --mem='{MEM}'
#SBATCH --cpus-per-task='{CORES}'
#SBATCH --time='{TIME}'
echo 'Compute node:' `hostname`
echo 'Start time:' `date +'%Y-%m-%d %T'`

srun {CODE}
```

## The advantages of concentric templates

Looper first populates the command template, and then uses the resulting output to populate the `{CODE}` variable in the submission template. This decoupling provides substantial advantages:

1. The commands can be run on any computing environment by simply switching the submission template.
2. The submission template can be used for any computing environment parameters, such as containers.
3. The submission template only has to be defined once *per environment*, so many pipelines can use it.
4. We can [group multiple individual commands](grouping-jobs.md) into a single submission script.
5. The submission template is universal and can be handled by dedicated submission template software.

In fact, looper uses [divvy](http://divvy.databio.org) to handle submission templates. The divvy submission templates can be used for interactive submission of jobs, or used by other software.

## Populating templates

The task of running jobs can be thought of as simply populating the templates with variables. To do this, Looper provides [variables from several sources](variable-namespaces.md).
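
To make this concrete, here is a sketch of what a final job script might look like once both templates above have been populated; the job name, log file, resource values, and sample path are all hypothetical:

```console
#!/bin/bash
#SBATCH --job-name='sample1_pipeline'
#SBATCH --output='sample1_pipeline.log'
#SBATCH --mem='8G'
#SBATCH --cpus-per-task='4'
#SBATCH --time='02:00:00'
echo 'Compute node:' `hostname`
echo 'Start time:' `date +'%Y-%m-%d %T'`

srun pipeline_command /data/sample1.fastq --arg
```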
41 changes: 17 additions & 24 deletions docs/config-files.md
@@ -6,46 +6,39 @@ We've organized these files so that each handles a different level of infrastructure

- Environment
- Project
- Sample
- Pipeline

This makes the system very adaptable and portable, but for a newcomer, it is not always easy to map each file to its purpose.
So, here's an explanation of each for you to use as a reference until you are familiar with the whole ecosystem.
Which ones you need to know about will depend on whether you're a **pipeline *user*** (running pipelines on your project)
or a **pipeline *developer*** (building your own pipeline).
Which ones you need to know about will depend on whether you're a pipeline *user* (running pipelines on your project)
or a pipeline *developer* (building your own pipeline).


## Pipeline users

Users (non-developers) of pipelines only need to be aware of one or two config files:
Users (non-developers) of pipelines only need to be aware of one or two config files.

- The [project config](define-your-project): This file is specific to each project and
contains information about the project's metadata, where the processed files should be saved,
and other variables that allow you to configure the pipelines specifically for this project.
It follows the standard `looper` format (now referred to as `PEP`, or "*portable encapsulated project*" format).
### Project configuration

If you are planning to submit jobs to a cluster, then you need to know about a second config file:
- The [`PEPENV` config](cluster-computing.md): This file tells `looper` how to use compute resource managers, like SLURM.
After initial setup it typically requires little (if any) editing or maintenance.
[**project config**](defining-a-project.md) -- this file is specific to each project and contains information about the project's metadata, where the processed files should be saved, and other variables that allow you to configure the pipelines specifically for this project. It follows the standard Portable Encapsulated Project format, or PEP for short.
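
As a rough sketch only (the file names and paths here are hypothetical, not part of this changeset), a minimal project config following the PEP format might look something like:

```yaml
pep_version: 2.0.0
sample_table: samples.csv
looper:
  output_dir: /path/to/results
```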

That should be all you need to worry about as a pipeline user.
If you need to adjust compute resources or want to develop a pipeline or have more advanced project-level control
over pipelines, you'll need knowledge of the config files used by pipeline developers.
### Environment configuration

[**environment config**](http://divvy.databio.org/en/latest/configuration/) -- if you are planning to submit jobs to a cluster, then you need to be aware of environment configuration. This task is farmed out to [divvy](http://divvy.databio.org/en/latest/), a computing resource configuration manager. Follow the divvy documentation to learn about ways to tweak the computing environment settings according to your needs.
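
As a loose illustration of the divvy-style layout (the template paths and package names below are hypothetical), an environment configuration defines compute packages, each pairing a submission template with a submission command:

```yaml
compute_packages:
  default:
    submission_template: templates/localhost_template.sub
    submission_command: sh
  slurm:
    submission_template: templates/slurm_template.sub
    submission_command: sbatch
```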

That should be all you need to worry about as a pipeline user. If you need to adjust compute resources or want to develop a pipeline or have more advanced project-level control over pipelines, you'll need knowledge of the config files used by pipeline developers.


## Pipeline developers

If you want to make a pipeline compatible with `looper`, tweak the way `looper` interacts with a pipeline for a given project,
or change the default cluster resources requested by a pipeline, you need to know about a configuration file that coordinates linking pipelines to a project.
- The [pipeline interface file](pipeline-interface.md):
This file has two sections:
- `protocol_mapping` tells looper which pipelines exist, and how to map each protocol (sample data type) to a pipeline
- `pipelines` describes options, arguments, and compute resources that define how `looper` should communicate with each pipeline.
### Pipeline configuration

If you want to make a pipeline compatible with looper, tweak the way looper interacts with a pipeline for a given project,
or change the default cluster resources requested by a pipeline, you need to know about a configuration file that coordinates linking pipelines to a project. This happens via the [pipeline interface file](pipeline-interface-specification.md).
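
As a sketch only (the pipeline name and command below are invented; consult the pipeline interface specification for the authoritative key set), a minimal sample-level pipeline interface might look like:

```yaml
pipeline_name: count_lines
pipeline_type: sample
command_template: >
  wc -l {sample.input_file} > {sample.sample_name}.count
```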

Finally, if you're using [the `pypiper` framework](https://github.com/databio/pypiper) to develop pipelines,
it uses a pipeline-specific configuration file, which is detailed in the [`pypiper` documentation](http://pypiper.readthedocs.io/en/latest/advanced.html#pipeline-config-files).
Finally, if you're using [the pypiper framework](https://github.com/databio/pypiper) to develop pipelines,
it uses a pipeline-specific configuration file, which is detailed in the [pypiper documentation](http://pypiper.readthedocs.io/en/latest/advanced.html#pipeline-config-files).

Essentially, each pipeline may provide a configuration file describing where software is,
and parameters to use for tasks within the pipeline. This configuration file is by default named like the pipeline name,
with a `.yaml` extension instead of `.py`. For example, by default `rna_seq.py` looks for an accompanying `rna_seq.yaml` file.
These files can be changed on a per-project level using the `pipeline_config` section of a [project configuration file](define-your-project).
with a `.yaml` extension instead of `.py`. For example, by default `rna_seq.py` looks for an accompanying `rna_seq.yaml` file.
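
For example, a hypothetical `rna_seq.yaml` accompanying `rna_seq.py` might record tool locations and task parameters along these lines:

```yaml
tools:
  samtools: /usr/local/bin/samtools
parameters:
  min_read_quality: 30
```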