Skip to content

Commit

Permalink
Add read the docs and initial docs
Browse files Browse the repository at this point in the history
  • Loading branch information
endast committed Nov 22, 2023
1 parent c4d6773 commit 2b988f3
Show file tree
Hide file tree
Showing 13 changed files with 796 additions and 0 deletions.
20 changes: 20 additions & 0 deletions .readthedocs.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,20 @@
# .readthedocs.yaml
# Read the Docs configuration file
# See https://docs.readthedocs.io/en/stable/config-file/v2.html for details

# Required
version: 2

# Set the OS, Python version and other tools you might need
build:
os: ubuntu-22.04
tools:
python: "3.12"

sphinx:
configuration: docs/conf.py
fail_on_warning: true

python:
install:
- requirements: docs/requirements.txt
1 change: 1 addition & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,7 @@

Rare variant association testing using deep learning and data-driven burden scores

[![Documentation Status](https://readthedocs.org/projects/deeprvat/badge/?version=latest)](https://deeprvat.readthedocs.io/en/latest/?badge=latest)

## Installation

Expand Down
20 changes: 20 additions & 0 deletions docs/Makefile
Original file line number Diff line number Diff line change
@@ -0,0 +1,20 @@
# Minimal makefile for Sphinx documentation
#

# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS ?=
SPHINXBUILD ?= sphinx-build
SOURCEDIR = .
BUILDDIR = _build

# Put it first so that "make" without argument is like "make help".
help:
@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

.PHONY: help Makefile

# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
Binary file added docs/_static/annotation_pipeline_dag.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
253 changes: 253 additions & 0 deletions docs/_static/preprocess_rulegraph.svg
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
127 changes: 127 additions & 0 deletions docs/annotations.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,127 @@
# DeepRVAT Annotation pipeline

This pipeline is based on [snakemake](https://snakemake.readthedocs.io/en/stable/). It uses [bcftools + samstools](https://www.htslib.org/), as well as [perl](https://www.perl.org/), [deepRiPe](https://ohlerlab.mdc-berlin.de/software/DeepRiPe_140/) and [deepSEA](http://deepsea.princeton.edu/) as well as [VEP](http://www.ensembl.org/info/docs/tools/vep/index.html), including plugins for [primateAI](https://github.com/Illumina/PrimateAI) and [spliceAI](https://github.com/Illumina/SpliceAI). DeepRiPe annotations were acquired using [faatpipe repository by HealthML](https://github.com/HealthML/faatpipe)[[1]](#reference-1-target) and DeepSea annotations were calculated using [kipoi-veff2](https://github.com/kipoi/kipoi-veff2)[[2]](#reference-2-target), abSplice scores were computet using [abSplice](https://github.com/gagneurlab/absplice/)[[3]](#reference-3-target)

![dag](_static/annotation_pipeline_dag.png)
*Figure 1: Example DAG of annoation pipeline using only two bcf files as input.*

## Input

The pipeline uses left-normalized bcf files containing variant information, a reference fasta file as well as a text file that maps data blocks to chromosomes as input. It is expected that the bcf files contain the columns "CHROM" "POS" "ID" "REF" and "ALT". Any other columns, including genotype information are stripped from the data before annotation tools are used on the data. The variants may be split into several vcf files for each chromosome and each "block" of data. The filenames should then contain the corresponding chromosome and block number. The pattern of the file names, as well as file structure may be specified in the corresponding [config file](https://github.com/PMBio/deeprvat/blob/main/pipelines/config/deeprvat_annotation_config.yaml).

(requirements-target)=
## Requirements

BCFtools as well as HTSlib should be installed on the machine,
- [CADD](https://github.com/kircherlab/CADD-scripts/tree/master/src/scripts) as well as
- [VEP](http://www.ensembl.org/info/docs/tools/vep/script/vep_download.html),
- [absplice](https://github.com/gagneurlab/absplice/tree/master),
- [kipoi-veff2](https://github.com/kipoi/kipoi-veff2)
- [faatpipe](https://github.com/HealthML/faatpipe), and the
- [vep-plugins repository](https://github.com/Ensembl/VEP_plugins/)

will be installed by the pipeline together with the [plugins](https://www.ensembl.org/info/docs/tools/vep/script/vep_plugins.html) for primateAI and spliceAI. Annotation data for CADD, spliceAI and primateAI should be downloaded. The path to the data may be specified in the corresponding [config file](https://github.com/PMBio/deeprvat/blob/main/pipelines/config/deeprvat_annotation_config.yaml).
Download path:
- [CADD](http://cadd.gs.washington.edu/download): "All possible SNVs of GRCh38/hg38" and "gnomad.genomes.r3.0.indel.tsv.gz" incl. their Tabix Indices
- [SpliceAI](https://basespace.illumina.com/s/otSPW8hnhaZR): "genome_scores_v1.3"/"spliceai_scores.raw.snv.hg38.vcf.gz" and "spliceai_scores.raw.indel.hg38.vcf.gz"
- [PrimateAI](https://basespace.illumina.com/s/yYGFdGih1rXL) PrimateAI supplementary data/"PrimateAI_scores_v0.2_GRCh38_sorted.tsv.bgz"


## Output

The pipeline outputs one annotation file for VEP, CADD, DeepRiPe, DeepSea and Absplice for each input vcf-file. The tool further creates concatenated files for each tool and one merged file containing Scores from AbSplice, VEP incl. CADD, primateAI and spliceAI as well as principal components from DeepSea and DeepRiPe.

## Configure the annotation pipeline
The snakemake annotation pipeline is configured using a yaml file with the format akin to the [example file](https://github.com/PMBio/deeprvat/blob/main/pipelines/config/deeprvat_annotation_config.yaml).

The config above would use the following directory structure:
```shell

|-- reference
| |-- fasta file


|-- metadata
| |-- pvcf_blocks.txt

|-- preprocessing_workdir
| |--reference
| | |-- fasta file
| |-- norm
| | |-- bcf
| | | |-- bcf_input_files
| | | |-- ...
| | |-- variants
| | | |-- variants.tsv.gz

|-- output_dir
| |-- annotations
| | |-- tmp

|-- repo_dir
| |-- ensembl-vep
| | |-- cache
| | |-- plugins
| |-- abSplice
| |-- faatpipe
| |-- kipoi-veff2

|-- annotation_data
| |-- cadd
| |-- spliceAI
| |-- primateAI



```

Bcf files created by the [preprocessing pipeline](https://github.com/PMBio/deeprvat/blob/Annotations/deeprvat/preprocessing/README.md) are used as input data.
The pipeline also uses the variant.tsv file as well as the reference file from the preprocesing pipeline.
The pipeline beginns by installing the repositories needed for the annotations, it will automatically install all repositories in the `repo_dir` folder that can be specified in the config file relative to the annotation working directory.
The text file mapping blocks to chromosomes is stored in `metadata` folder. The output is stored in the `output_dir/annotations` folder and any temporary files in the `tmp` subfolder. All repositories used including VEP with its corresponding cache as well as plugins are stored in `repo_dir/ensempl-vep`.
Data for VEP plugins and the CADD cache are stored in `annotation data`.

## Running the annotation pipeline
### Preconfiguration
- Inside the annotation directory create a directory `repo_dir` and run the [annotation setup script](https://github.com/PMBio/deeprvat/blob/main/deeprvat/annotations/setup_annotation_workflow.sh)
```shell
setup_annotation_workflow.sh repo_dir/ensembl-vep/cache repo_dir/ensembl-vep/Plugins repo_dir
```
or manually clone the repositories mentioned in the [requirements](#requirements-target) into `repo_dir` and install the needed conda environments with
```shell
mamba env create -f repo_dir/absplice/environment.yaml
mamba env create -f repo_dir/kipoi-veff2/environment.minimal.linux.yml
mamba env create -f deeprvat/deeprvat_annotations.yml
```
If you already have some of the needed repositories on your machine you can edit the paths in the [config](https://github.com/PMBio/deeprvat/blob/main/pipelines/config/deeprvat_annotation_config.yaml).


- Inside the annotation directory create a directory `annotation_dir` and download/link the prescored files for CADD, SpliceAI, and PrimateAI (see [requirements](#requirements-target))


### Running the pipeline
After configuration and activating the `deeprvat_annotations` environment run the pipeline using snakemake:

```shell
snakemake -j <nr_cores> -s annotations.snakemake --configfile config/deeprvat_annotation.config --use-conda
```
## Running the annotation pipeline without the preprocessing pipeline

It is possible to run the annotation pipeline without having run the preprocessing prior to that.
However, the annotation pipeline requires some files from this pipeline that then have to be created manually.
- Left normalized bcf files from the input. These files do not have to contain any genotype information. "chrom, "pos", "ref" and "alt" columns will suffice.
- a reference fasta file will have to be provided
- A tab separated file containing all input variants "chrom, "pos", "ref" and "alt" entries each with a unique id.


## References

(reference-1-target)=
<a id="1">[1]</a> Monti, R., Rautenstrauch, P., Ghanbari, M. et al. Identifying interpretable gene-biomarker associations with functionally informed kernel-based tests in 190,000 exomes. Nat Commun 13, 5332 (2022). https://doi.org/10.1038/s41467-022-32864-2

(reference-2-target)=
<a id="2">[2]</a> Žiga Avsec et al., “Kipoi: accelerating the community exchange and reuse of predictive models for genomics,” bioRxiv, p. 375345, Jan. 2018, doi: 10.1101/375345.

(reference-3-target)=
<a id="3">[3]</a>N. Wagner et al., “Aberrant splicing prediction across human tissues,” Nature Genetics, vol. 55, no. 5, pp. 861–870, May 2023, doi: 10.1038/s41588-023-01373-3.

30 changes: 30 additions & 0 deletions docs/conf.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,30 @@
# Configuration file for the Sphinx documentation builder.
#
# For the full list of built-in configuration values, see the documentation:
# https://www.sphinx-doc.org/en/master/usage/configuration.html

# -- Project information -----------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#project-information

project = "DeepRVAT"
copyright = "2023, Clarke, B., Holtkamp, E., Öztürk, H., Mück, M., Wahlberg, M., Meyer, K., Brechtmann, F., Hölzlwimmer, F. R., Gagneur, J., & Stegle, O"
author = "Clarke, B., Holtkamp, E., Öztürk, H., Mück, M., Wahlberg, M., Meyer, K., Brechtmann, F., Hölzlwimmer, F. R., Gagneur, J., & Stegle, O"
release = "0.1.0"

# -- General configuration ---------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration

extensions = ["autodoc2", "myst_parser", 'sphinx_copybutton']
autodoc2_packages = [
"../deeprvat",
]

templates_path = ["_templates"]
exclude_patterns = ["_build", "Thumbs.db", ".DS_Store"]


# -- Options for HTML output -------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-html-output

html_theme = "sphinx_rtd_theme"
html_static_path = ["_static"]
29 changes: 29 additions & 0 deletions docs/index.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,29 @@
.. DeepRVAT documentation master file, created by
sphinx-quickstart on Wed Nov 22 10:24:36 2023.
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.
Welcome to DeepRVAT's documentation!
====================================

Rare variant association testing using deep learning and data-driven burden scores


.. toctree::
:maxdepth: 2
:caption: Contents:

usage.md
preprocessing.md
annotations.md
seed_gene_discovery.md

apidocs/index


Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`
35 changes: 35 additions & 0 deletions docs/make.bat
Original file line number Diff line number Diff line change
@@ -0,0 +1,35 @@
@ECHO OFF

pushd %~dp0

REM Command file for Sphinx documentation

if "%SPHINXBUILD%" == "" (
set SPHINXBUILD=sphinx-build
)
set SOURCEDIR=.
set BUILDDIR=_build

%SPHINXBUILD% >NUL 2>NUL
if errorlevel 9009 (
echo.
echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
echo.installed, then set the SPHINXBUILD environment variable to point
echo.to the full path of the 'sphinx-build' executable. Alternatively you
echo.may add the Sphinx directory to PATH.
echo.
echo.If you don't have Sphinx installed, grab it from
echo.https://www.sphinx-doc.org/
exit /b 1
)

if "%1" == "" goto help

%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
goto end

:help
%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%

:end
popd
Loading

0 comments on commit 2b988f3

Please sign in to comment.