bindz-rbp


bindz-rbp is a computational workflow, implemented as a Snakemake pipeline, that predicts binding sites of RNA-binding proteins in a given input RNA sequence 🐍


General information

bindz-rbp predicts binding sites of distinct regulators in an RNA sequence by calculating posterior probabilities with MotEvo, given the sequence specificity of the regulators, represented as position-specific weight matrices (PWMs). It is intended to help in the analysis of individual reporter sequences by predicting regulators that may act on the sequence, as well as how their binding may be affected by specific mutations introduced into the reporter sequences. The tool scans the input sequence with a set of PWMs representing the binding specificity of individual RNA-binding proteins. The run time scales linearly with both the sequence length and the number of PWMs, so please test it on your machine before running it on batches of sequences.
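
To illustrate why the work grows with both the sequence length and the number of PWMs, here is a minimal sketch of a sliding-window scan. It is not MotEvo's algorithm (MotEvo computes posterior probabilities); it only scores each window with a simple log-odds sum. The matrix values, the uniform background, and the names `PWM`, `BACKGROUND`, and `scan` are assumptions made for this example.

```python
import math

# Hypothetical 3-position PWM: one dict of base probabilities per position.
PWM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "U": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "U": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "U": 0.7},
]
BACKGROUND = 0.25  # assumed uniform background base frequency


def scan(sequence, pwm, background=BACKGROUND):
    """Return (start, log-odds score) for every window of len(pwm) in sequence."""
    width = len(pwm)
    scores = []
    # One pass over the sequence: the cost is proportional to its length.
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(
            math.log2(pwm[i][base] / background)
            for i, base in enumerate(window)
        )
        scores.append((start, score))
    return scores


hits = scan("CCAGUCC", PWM)
best_start, best_score = max(hits, key=lambda hit: hit[1])
```

Scanning with N matrices repeats this pass N times, so the total cost is roughly proportional to (sequence length) × (number of PWMs).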

The tool is implemented as a Snakemake workflow.

[Figure: rule graph of the workflow]

The main outputs of the pipeline are:

  • combined_MotEvo_results.tsv: a tab-separated file which collects information related to all predicted binding sites of all analyzed motifs into one table.
  • binding_sites.bed: a simplified list of binding sites in BED format.
  • ProbabilityVsSequence.pdf: a visualisation of binding positions and probabilities in the form of a heatmap.
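
For downstream analysis, the BED output can be read with a few lines of Python. The sketch below assumes only the three mandatory BED columns (name, 0-based start, exclusive end) and keeps any further columns uninterpreted, since the exact column set of binding_sites.bed is not described here. `BedRecord`, `parse_bed`, and the example lines are made up for illustration.

```python
from typing import NamedTuple


class BedRecord(NamedTuple):
    chrom: str    # sequence name
    start: int    # 0-based, inclusive
    end: int      # 0-based, exclusive
    extra: tuple  # any further columns, left uninterpreted


def parse_bed(lines):
    """Parse tab-separated BED lines, skipping blanks and header lines."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        records.append(
            BedRecord(fields[0], int(fields[1]), int(fields[2]), tuple(fields[3:]))
        )
    return records


# Made-up example lines, not actual bindz-rbp output.
example = ["seq1\t10\t17\tmotif_A\n", "seq1\t42\t48\tmotif_B\n"]
sites = parse_bed(example)
lengths = [record.end - record.start for record in sites]
```

To read a real result, pass an open file handle instead of the example list, for instance `parse_bed(open("binding_sites.bed"))`, adjusting the path to your configured output directory.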

Installation instructions

Snakemake is a workflow management system that helps to create and execute data processing pipelines. It requires Python 3 and can be most easily installed via the bioconda channel from the anaconda cloud service.

Step 1: Download and install Miniconda3

To install the latest version of miniconda please execute:

[Linux]:

wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
source ~/.bashrc

[macOS]:

wget https://repo.continuum.io/miniconda/Miniconda3-latest-MacOSX-x86_64.sh
bash Miniconda3-latest-MacOSX-x86_64.sh
source ~/.bashrc

Step 2: Clone the repository

Cloning repositories requires git to be installed (available via conda):

conda install git

Clone this git repository into a desired location (here: bindz-rbp in the current working directory) with the following command:

git clone https://github.com/zavolanlab/bindz-rbp

Step 3: Build and activate virtual environment for bindz-rbp

To simplify installation, we have prepared a recipe for a conda virtual environment that contains all the software needed to run bindz-rbp. This environment can be created with the following script:

bash bindz-rbp/scripts/create-conda-environment-main.sh

The built conda environment may then be activated with:

conda activate bindz-rbp

Optional: Download and parse PWMs from ATtRACT database

This repository includes a snapshot of a database of position weight matrices for distinct RNA-binding proteins (ATtRACT: 26-08-2020). We suggest using the pre-formatted files that we have already prepared: resources/ATtRACT_hsa and resources/ATtRACT_mmu for Homo sapiens and Mus musculus, respectively.

However, if you would like to download and parse a newer version of the matrices from ATtRACT, the procedure is described below.

Please change directory to the pipeline's root directory:

cd bindz-rbp

To use position-specific weight matrices from the ATtRACT database of known RBP binding motifs, we provide two scripts:

  1. Download and extract the database into a directory ATtRACT under resources:

    bash scripts/download-ATtRACT-motifs.sh -o resources/ATtRACT
  2. Parse the database and reformat the PWMs into TRANSFAC format (currently supported species are Homo_sapiens and Mus_musculus):

    Homo sapiens

     mkdir resources/ATtRACT/ATtRACT_hsa
     python scripts/format-ATtRACT-motifs.py \
     --pwms resources/ATtRACT/pwm.txt \
     --names resources/ATtRACT/ATtRACT_db.txt \
     --organism Homo_sapiens \
     --outdir resources/ATtRACT/ATtRACT_hsa

    Mus musculus

     mkdir resources/ATtRACT/ATtRACT_mmu
     python scripts/format-ATtRACT-motifs.py \
     --pwms resources/ATtRACT/pwm.txt \
     --names resources/ATtRACT/ATtRACT_db.txt \
     --organism Mus_musculus \
     --outdir resources/ATtRACT/ATtRACT_mmu

    To print information about the script's arguments please type:

    python scripts/format-ATtRACT-motifs.py --help
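
The parsing step above can be pictured with a small sketch. It assumes a hypothetical text layout in which each matrix starts with a `>` header line whose first token is an identifier, followed by one row of base probabilities per motif position. The actual layout of pwm.txt is not described here and should be checked before relying on this; scripts/format-ATtRACT-motifs.py remains the authoritative parser. `parse_matrices` and the example data are made up for illustration.

```python
def parse_matrices(lines):
    """Parse '>'-headed blocks of whitespace-separated probability rows."""
    matrices = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Assumption: the first token after '>' is the matrix identifier.
            current = line[1:].split()[0]
            matrices[current] = []
        elif current is not None:
            matrices[current].append([float(value) for value in line.split()])
    return matrices


# Made-up example, not real ATtRACT data.
example = [
    ">M001 2",
    "0.7 0.1 0.1 0.1",
    "0.1 0.1 0.7 0.1",
]
pwms = parse_matrices(example)
```

Each entry of `pwms` maps an identifier to a list of rows, one row per motif position.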
    

Workflow execution

Please change directory to the pipeline's root directory:

cd bindz-rbp

All inputs, outputs, and parameters for the pipeline execution should be specified in a Snakemake configuration file in YAML format. Such a file can be created from our template located at workflow/config/config-template.yml. Assuming that you have created config.yml in the repository's root directory (and that it is the current working directory), the workflow can be executed on the local machine with:

snakemake \
    --snakefile="workflow/Snakefile" \
    --configfile="config.yml" \
    --use-conda \
    --cores=1 \
    --printshellcmds \
    --verbose
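
When the workflow is launched from a script, for example to process several configuration files in turn, the same command can be assembled in Python. This is a minimal sketch that uses only the flags shown above; the function name `build_snakemake_command` and its parameters are hypothetical and not part of bindz-rbp.

```python
import shlex


def build_snakemake_command(configfile="config.yml", cores=1):
    """Assemble the local-execution command shown above as an argument list."""
    return [
        "snakemake",
        "--snakefile=workflow/Snakefile",
        f"--configfile={configfile}",
        "--use-conda",
        f"--cores={cores}",
        "--printshellcmds",
        "--verbose",
    ]


command = build_snakemake_command()
printable = shlex.join(command)  # shell-quoted string, handy for logging

# To actually launch the workflow from the repository root:
#   import subprocess
#   subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell-quoting problems if a configuration path contains spaces.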

We also provide an integration test for the pipeline on a small input dataset to check whether the installation was successful:

bash tests/integration/execution/snakemake_local_run_conda_environments.sh

Contributing

This project lives off your contributions, be it in the form of bug reports, feature requests, discussions, or fixes and other code changes. 🙂

Please refer to the contributing guidelines if you are interested in contributing. Please mind the code of conduct for all interactions with the community.

Contact

For questions or suggestions regarding the code, please use the issue tracker. For any other inquiries, please contact us by email: zavolab-biozentrum@unibas.ch 📨

(c) 2022 Zavolan lab, Biozentrum, University of Basel