Merge pull request #247 from pepkit/dev
v1.2.0
stolarczyk authored May 26, 2020
2 parents bb4f0e5 + f8328b0 commit ebdab21
Showing 108 changed files with 4,331 additions and 4,184 deletions.
6 changes: 3 additions & 3 deletions .gitignore
@@ -1,9 +1,7 @@
# ignore test results
tests/test/*
oldtests/test/*

# toy/experimental files
*.csv
*.tsv
*.pkl

# ignore eggs
@@ -69,10 +67,12 @@ open_pipelines/
*RESERVE*

doc/
site/
build/
dist/
looper.egg-info/
loopercli.egg-info/
__pycache__/


*ipynb_checkpoints*
6 changes: 5 additions & 1 deletion .travis.yml
@@ -3,14 +3,18 @@ python:
- "2.7"
- "3.5"
- "3.6"
- "3.7"
- "3.8"
os:
- linux
install:
- pip install --upgrade six
- pip install .
- pip install -r requirements/requirements-dev.txt
- pip install -r requirements/requirements-test.txt
script: pytest --cov=looper
script: pytest tests -x -vv --cov=looper
after_success:
- coveralls
branches:
only:
- dev
3 changes: 2 additions & 1 deletion MANIFEST.in
@@ -1,4 +1,5 @@
include requirements/*
include README.md
include logo_looper.svg
include looper/jinja_templates/*
include looper/jinja_templates/*
include looper/schemas/*
16 changes: 8 additions & 8 deletions docs/README.md
@@ -4,16 +4,16 @@

## What is looper?

`Looper` is a pipeline submitting engine. `Looper` deploys any command-line pipeline for each sample in a project organized in [standard PEP format](https://pepkit.github.io/docs/home/). You can think of `looper` as providing a single user interface to running, summarizing, monitoring, and otherwise managing all of your sample-intensive research projects the same way, regardless of data type or pipeline used.
Looper is a job submitting engine. Looper deploys arbitrary shell commands for each sample in a [standard PEP project](https://pepkit.github.io/docs/home/). You can think of looper as providing a single user interface to running, monitoring, and managing all of your sample-intensive research projects the same way, regardless of data type or pipeline used.

## What makes looper better?

`Looper`'s key strength is that it **decouples job handling from the pipeline process**. In a typical pipeline, job handling (managing how individual jobs are submitted to a cluster) is delicately intertwined with actual pipeline commands (running the actual code for a single compute job). The `looper` approach is modular, following the [the unix principle](https://en.wikipedia.org/wiki/Unix_philosophy): `looper` *only* manages job submission. This approach leads to several advantages compared with the traditional integrated approach:
Looper **decouples job handling from the pipeline process**. In a typical pipeline, job handling (managing how individual jobs are submitted to a cluster) is delicately intertwined with actual pipeline commands (running the actual code for a single compute job). In contrast, the looper approach is modular: looper *only* manages job submission. This approach leads to several advantages compared with the traditional integrated approach:

1. running a pipeline on just one or two samples/jobs is simpler, and does not require a full-blown distributed compute environment.
2. pipelines do not need to independently re-implement job handling code, which is shared.
3. every project uses a universal structure (expected folders, file names, and sample annotation format), so datasets can more easily move from one pipeline to another.
4. users must learn only a single interface that works with any of their projects for any pipeline.
1. pipelines do not need to independently re-implement job handling code, which is shared.
2. every project uses a universal structure, so datasets can move from one pipeline to another.
3. users must learn only a single interface that works with any project for any pipeline.
4. running just one or two samples/jobs is simpler, and does not require a distributed compute environment.



@@ -24,13 +24,13 @@ Releases are posted as [GitHub releases](https://github.com/pepkit/looper/releas


```console
pip install --user loopercli
pip install --user looper
```

Update with:

```console
pip install --user --upgrade loopercli
pip install --user --upgrade looper
```

If the `looper` executable is not automatically in your `$PATH`, add the following line to your `.bashrc` or `.profile`:
27 changes: 27 additions & 0 deletions docs/changelog.md
@@ -2,7 +2,34 @@

This project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html) and [Keep a Changelog](https://keepachangelog.com/en/1.0.0/) format.

## [1.2.0] - 2020-05-26

**This version introduced backwards-incompatible changes.**

### Added
- Commands:
- `init`; initializes `.looper.yaml` file
- `inspect`; inspects `Project` or `Sample` objects
- `table`; writes summary stats table
- `runp`; runs project level pipelines
- Input schemas and output schemas
- `--settings` argument to specify compute resources as a YAML file
- Option to preset CLI options in a dotfile
- `--command-extra` and `--command-extra-override` arguments that append a specified string to pipeline commands
- Option to specify destination of sample YAML in pipeline interface
- `--pipeline_interfaces` argument that allows pipeline interface specification via CLI

### Changed
- `looper summarize` to `looper report`
- Pipeline interface format changed drastically
- The PyPi name changed from 'loopercli' to 'looper'
- resources section in pipeline interface replaced with `size_dependent_attributes` or `dynamic_variables_command_template`.
- `--compute` can be used to specify arguments other than resources
- `all_input_files` and `required_input_files` keys in pipeline interface moved to the input schema and renamed to `files` and `required_files`
- pipeline interface specification

## [0.12.6] -- 2020-02-21

### Added
- possibility to execute library module as a script: `python -m looper ...`

60 changes: 0 additions & 60 deletions docs/cluster-computing.md

This file was deleted.

56 changes: 56 additions & 0 deletions docs/concentric-templates.md
@@ -0,0 +1,56 @@
# Looper's concentric template system

## Introduction

To build job scripts, looper uses a 2-level template system consisting of an inner template wrapped by an outer template. The inner template is called a *command template*, which produces the individual commands to execute. The outer template is the *submission template*, which wraps the commands in environment handling code. This layered design allows us to decouple the computing environment from the pipeline, which improves portability.

## The command template

The command template is specified by a pipeline in the pipeline interface. A very basic command template could be something like this:

```console
pipeline_command {sample.input_file} --arg
```

In the simplest case, looper can run the pipeline by simply running these commands. This example contains no information about the computing environment, such as SLURM submission directives.
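For instance, with a hypothetical sample whose `input_file` attribute points to `/data/sample1.fastq`, the populated command might look like this:

```console
pipeline_command /data/sample1.fastq --arg
```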

## The submission template

To extend to submitting the commands to a cluster, it may be tempting to add these details directly to the command template, which would cause the jobs to be submitted to SLURM instead of run directly. However, this would restrict the pipeline to *only* running via SLURM, since the submission code would be tightly coupled to the command code. Instead, looper retains flexibility by introducing a second template layer, the *submission template*. The submission template is specified at the level of the computing environment. A submission template can also be as simple or complex as required. For a command to be run in a local computing environment, a basic template will suffice:

```console
#! /usr/bin/bash

{CODE}
```

A more complicated template could submit a job to a SLURM cluster:

```console
#!/bin/bash
#SBATCH --job-name='{JOBNAME}'
#SBATCH --output='{LOGFILE}'
#SBATCH --mem='{MEM}'
#SBATCH --cpus-per-task='{CORES}'
#SBATCH --time='{TIME}'
echo 'Compute node:' `hostname`
echo 'Start time:' `date +'%Y-%m-%d %T'`

srun {CODE}
```

## The advantages of concentric templates

Looper first populates the command template, and then uses the resulting output to populate the `{CODE}` variable in the submission template. This decoupling provides substantial advantages:

1. The commands can be run on any computing environment by simply switching the submission template.
2. The submission template can be used for any computing environment parameters, such as containers.
3. The submission template only has to be defined once *per environment*, so many pipelines can use it.
4. We can [group multiple individual commands](grouping-jobs.md) into a single submission script.
5. The submission template is universal and can be handled by dedicated submission template software.

In fact, looper uses [divvy](http://divvy.databio.org) to handle submission templates. The divvy submission templates can be used for interactive submission of jobs, or used by other software.

## Populating templates

The task of running jobs can be thought of as simply populating the templates with variables. To do this, Looper provides [variables from several sources](variable-namespaces.md).
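
To make this concrete, here is a sketch of what a final job script might look like once both templates above have been populated; the job name, log file, resource values, and sample path are all hypothetical:

```console
#!/bin/bash
#SBATCH --job-name='sample1_pipeline'
#SBATCH --output='sample1_pipeline.log'
#SBATCH --mem='8G'
#SBATCH --cpus-per-task='4'
#SBATCH --time='02:00:00'
echo 'Compute node:' `hostname`
echo 'Start time:' `date +'%Y-%m-%d %T'`

srun pipeline_command /data/sample1.fastq --arg
```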
41 changes: 17 additions & 24 deletions docs/config-files.md
@@ -6,46 +6,39 @@ We've organized these files so that each handles a different level of infrastructure

- Environment
- Project
- Sample
- Pipeline

This makes the system very adaptable and portable, but for a newcomer, it is not always easy to map each file to its purpose.
So, here's an explanation of each for you to use as a reference until you are familiar with the whole ecosystem.
Which ones you need to know about will depend on whether you're a **pipeline *user*** (running pipelines on your project)
or a **pipeline *developer*** (building your own pipeline).
Which ones you need to know about will depend on whether you're a pipeline *user* (running pipelines on your project)
or a pipeline *developer* (building your own pipeline).


## Pipeline users

Users (non-developers) of pipelines only need to be aware of one or two config files:
Users (non-developers) of pipelines only need to be aware of one or two config files.

- The [project config](define-your-project): This file is specific to each project and
contains information about the project's metadata, where the processed files should be saved,
and other variables that allow you to configure the pipelines specifically for this project.
It follows the standard `looper` format (now referred to as `PEP`, or "*portable encapsulated project*" format).
### Project configuration

If you are planning to submit jobs to a cluster, then you need to know about a second config file:
- The [`PEPENV` config](cluster-computing.md): This file tells `looper` how to use compute resource managers, like SLURM.
After initial setup it typically requires little (if any) editing or maintenance.
[**project config**](defining-a-project.md) -- this file is specific to each project and contains information about the project's metadata, where the processed files should be saved, and other variables that allow you to configure the pipelines specifically for this project. It follows the standard Portable Encapsulated Project format, or PEP for short.
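
As a rough sketch only (the file names and paths here are hypothetical, not part of this changeset), a minimal project config following the PEP format might look something like:

```yaml
pep_version: 2.0.0
sample_table: samples.csv
looper:
  output_dir: /path/to/results
```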

That should be all you need to worry about as a pipeline user.
If you need to adjust compute resources or want to develop a pipeline or have more advanced project-level control
over pipelines, you'll need knowledge of the config files used by pipeline developers.
### Environment configuration

[**environment config**](http://divvy.databio.org/en/latest/configuration/) -- if you are planning to submit jobs to a cluster, then you need to be aware of environment configuration. This task is farmed out to [divvy](http://divvy.databio.org/en/latest/), a computing resource configuration manager. Follow the divvy documentation to learn about ways to tweak the computing environment settings according to your needs.
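
As a loose illustration of the divvy-style layout (the template paths and package names below are hypothetical), an environment configuration defines compute packages, each pairing a submission template with a submission command:

```yaml
compute_packages:
  default:
    submission_template: templates/localhost_template.sub
    submission_command: sh
  slurm:
    submission_template: templates/slurm_template.sub
    submission_command: sbatch
```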

That should be all you need to worry about as a pipeline user. If you need to adjust compute resources or want to develop a pipeline or have more advanced project-level control over pipelines, you'll need knowledge of the config files used by pipeline developers.


## Pipeline developers

If you want to make a pipeline compatible with `looper`, tweak the way `looper` interacts with a pipeline for a given project,
or change the default cluster resources requested by a pipeline, you need to know about a configuration file that coordinates linking pipelines to a project.
- The [pipeline interface file](pipeline-interface.md):
This file has two sections:
- `protocol_mapping` tells looper which pipelines exist, and how to map each protocol (sample data type) to a pipeline
- `pipelines` describes options, arguments, and compute resources that define how `looper` should communicate with each pipeline.
### Pipeline configuration

If you want to make a pipeline compatible with looper, tweak the way looper interacts with a pipeline for a given project,
or change the default cluster resources requested by a pipeline, you need to know about a configuration file that coordinates linking pipelines to a project. This happens via the [pipeline interface file](pipeline-interface-specification.md).
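
As a sketch only (the pipeline name and command below are invented; consult the pipeline interface specification for the authoritative key set), a minimal sample-level pipeline interface might look like:

```yaml
pipeline_name: count_lines
pipeline_type: sample
command_template: >
  wc -l {sample.input_file} > {sample.sample_name}.count
```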

Finally, if you're using [the `pypiper` framework](https://github.com/databio/pypiper) to develop pipelines,
it uses a pipeline-specific configuration file, which is detailed in the [`pypiper` documentation](http://pypiper.readthedocs.io/en/latest/advanced.html#pipeline-config-files).
Finally, if you're using [the pypiper framework](https://github.com/databio/pypiper) to develop pipelines,
it uses a pipeline-specific configuration file, which is detailed in the [pypiper documentation](http://pypiper.readthedocs.io/en/latest/advanced.html#pipeline-config-files).

Essentially, each pipeline may provide a configuration file describing where software is,
and parameters to use for tasks within the pipeline. This configuration file is by default named like the pipeline name,
with a `.yaml` extension instead of `.py`. For example, by default `rna_seq.py` looks for an accompanying `rna_seq.yaml` file.
These files can be changed on a per-project level using the `pipeline_config` section of a [project configuration file](define-your-project).
with a `.yaml` extension instead of `.py`. For example, by default `rna_seq.py` looks for an accompanying `rna_seq.yaml` file.
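
For example, a hypothetical `rna_seq.yaml` accompanying `rna_seq.py` might record tool locations and task parameters along these lines:

```yaml
tools:
  samtools: /usr/local/bin/samtools
parameters:
  min_read_quality: 30
```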