updated readme, regenerated DAG
rvosa committed Jan 16, 2024
1 parent 62458ae commit 87f77d8
Showing 2 changed files with 1,240 additions and 679 deletions.
31 changes: 20 additions & 11 deletions README.md
@@ -6,11 +6,18 @@
# Bactria: BarCode TRee Inference and Analysis
This repository contains code and data for building very large, topologically-constrained
barcode phylogenies through a divide-and-conquer strategy. Such trees are useful as
reference materials in the comparable calculation of alpha and beta biodiversity metrics
across metabarcoding assays. The input data for the approach we develop here comes from
BOLD. The international database [BOLD Systems](https://www.boldsystems.org/index.php)
reference materials for curating barcode data by detecting rogue terminals (indicating
incorrect taxonomic annotation) and in the comparable calculation of alpha and beta
biodiversity metrics across metabarcoding assays.

The input data for the approach we develop here currently comes from BOLD data dumps.
The international database [BOLD Systems](https://www.boldsystems.org/index.php)
contains DNA barcodes for hundreds of thousands of species, with multiple barcodes per
species. Theoretically, this data could be filtered and aligned per DNA marker to make
species. The data dumps we use here are TSV files whose columns conform to the nascent
BCDM (barcode data model) vocabulary. As such, other data sources that conform to this
vocabulary could in the future be used as well, such as [UNITE](https://unite.ut.ee/).

Theoretically, such data could be filtered and aligned per DNA marker to make
phylogenetic trees. However, there are two limiting factors: building very large
phylogenies is computationally intensive, and barcodes are not considered ideal for
building big trees because they are short (providing insufficient signal to resolve large
@@ -27,7 +34,7 @@ chunks are then combined in a large synthesis by grafting them on a backbone mad
exemplar taxa from the subtrees. Here too, the OpenTOL is a source of phylogenetic
constraint.

In this repository this concept is prototyped for both animal species and plant species.
In this repository this concept is developed for both animal species and plant species.

## Installation

@@ -48,8 +55,9 @@ file and a `requirements.txt` file:
conda env create -f workflow/envs/environment.yml
```
This command will create a new Conda environment named bactria with the packages
specified in the environment.yml file. This file also includes pip packages specified in
the workflow/envs/requirements.txt file, which will be installed after the Conda packages.
specified in the environment.yml file. This step is largely a placeholder because
most of the dependency management is handled at the level of individual pipeline
steps, which each have their own environment specification.
3. **Activate the Environment:**
After creating the environment, activate it using the conda activate command:
```bash
@@ -66,7 +74,7 @@ file and a `requirements.txt` file:

## How to run

The pipeline is being implemented using snakemake, which is available within the conda
The pipeline is implemented using snakemake, which is available within the conda
environment that results from the installation. Before running the snakemake pipeline, set the
number of threads available on your computer in [config/config.yaml](config/config.yaml). The
marker gene used by the pipeline is also specified in config.yaml (default
@@ -75,15 +83,15 @@ COI-5P). Prior to execution, the BOLD data package to use (we used the
must be downloaded manually and stored in the [resources/](resources/) directory. If a BOLD release
from another date is used the file names in config.yaml need to be updated.

How to run the pipeline for all family alignments:
How to run the entire pipeline:

```bash
snakemake -j {number of threads}
snakemake -j {number of threads} --use-conda
```

Snakemake rules can be performed separately:
```bash
snakemake -R {Rule} -j {number of threads}
snakemake -R {Rule} -j {number of threads} --use-conda
```

Enter the same number at {number of threads} as you filled in previously in config/config.yaml.
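
The two invocations above differ only in the optional `-R {Rule}` part, so they can be assembled from a thread count and an optional rule name. The following is a minimal sketch, not part of the repository: the function name `build_snakemake_cmd` and the rule name `some_rule` are hypothetical, and the thread count is passed in by hand rather than read from config/config.yaml, since the name of the relevant config key is not shown here.

```shell
#!/usr/bin/env bash
# Sketch only: build_snakemake_cmd is a hypothetical helper, not repository code.

# Build the snakemake command line described in the README.
# $1 = number of threads (should match the value set in config/config.yaml)
# $2 = optional rule name to force-run via -R
build_snakemake_cmd() {
  local threads="$1"
  local rule="${2:-}"

  # Reject empty, non-numeric, or zero thread counts.
  case "$threads" in
    ''|*[!0-9]*|0) echo "threads must be a positive integer" >&2; return 1 ;;
  esac

  if [ -n "$rule" ]; then
    printf 'snakemake -R %s -j %s --use-conda\n' "$rule" "$threads"
  else
    printf 'snakemake -j %s --use-conda\n' "$threads"
  fi
}

# Print the commands instead of running them, so they can be inspected first.
build_snakemake_cmd 4
build_snakemake_cmd 4 some_rule   # 'some_rule' is a placeholder rule name
```

Printing the command rather than executing it keeps the sketch safe to try; appending snakemake's `-n` (dry-run) flag to the printed command is one way to preview what would be executed before committing to a full run.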
@@ -101,6 +109,7 @@ All of these subfolders contains further explanatory READMEs to explain their co

- [config](config/) - configuration files
- [doc](doc/) - documentation and background literature
- [logs](logs/) - where log files are written during pipeline runtime
- [resources](resources/) - external data resources (from BOLD and OpenTree) are downloaded here
- [results](results/) - intermediate and final results are generated here
- [workflow](workflow/) - script source code and driver snakefile