update to v1.1
agimpel committed Jul 5, 2024
1 parent 987fb41 commit 21968d9
Showing 62 changed files with 4,082 additions and 2,566 deletions.
65 changes: 12 additions & 53 deletions README.md
@@ -9,12 +9,20 @@


# Overview
`dt4dds` is a Python package providing a customizable, digital representation of the widely-used DNA data storage workflow involving array synthesis, PCR, Aging, and Sequencing-By-Synthesis. By modelling each part of such user-defined workflows with fully customizable experimental parameters, `dt4dds` enables data-driven experimental design and rational design of redundancy. `dt4dds` also includes a pipeline for comprehensively analyzing errors in sequencing data, both from experiments and simulation. This Python package is used in the following publication:
> Andreas L. Gimpel, Wendelin J. Stark, Reinhard Heckel, Robert N. Grass. Experimental quantification of errors and biases in DNA data storage for a digital twin. Manuscript in preparation.
`dt4dds` is a Python package providing a customizable, digital representation of the widely-used DNA data storage workflow involving array synthesis, PCR, Aging, and Sequencing-By-Synthesis. By modelling each part of such user-defined workflows with fully customizable experimental parameters, `dt4dds` enables data-driven experimental design and rational design of redundancy. `dt4dds` also includes a pipeline for comprehensively analyzing errors in sequencing data, both from experiments and simulation. This Python package is used in the following publications:

> Gimpel, A.L., Stark, W.J., Heckel, R., Grass R.N. A digital twin for DNA data storage based on comprehensive quantification of errors and biases. Nat Commun 14, 6026 (2023). https://doi.org/10.1038/s41467-023-41729-1
> Gimpel, A.L., Stark, W.J., Heckel, R., Grass R.N. Challenges for error-correction coding in DNA data storage: photolithographic synthesis and DNA decay. bioRxiv 2024.07.04.602085 (2024). https://doi.org/10.1101/2024.07.04.602085
The Jupyter notebooks and associated code used for generating the figures in the manuscript are found in the [dt4dds_notebooks repository](https://github.com/fml-ethz/dt4dds_notebooks).

## New: Challenges for DNA Data Storage

A C++ implementation of the Digital Twin for DNA Data Storage for two current challenges in error-correction coding for DNA - Photolithographic DNA Synthesis and DNA Decay - is available [in this GitHub repository](https://github.com/fml-ethz/dt4dds-challenges). More information is also provided in the following publication:

> Gimpel, A.L., Stark, W.J., Heckel, R., Grass R.N. Challenges for error-correction coding in DNA data storage: photolithographic synthesis and DNA decay. bioRxiv 2024.07.04.602085 (2024). https://doi.org/10.1101/2024.07.04.602085

# Web-based Tool
A web-based version of `dt4dds` with an easy-to-use graphical user interface and the most common workflows is available at [dt4dds.ethz.ch](https://dt4dds.ethz.ch).
@@ -25,21 +33,7 @@ A web-based version of `dt4dds` with an easy-to-use graphical user interface and
This package only requires a standard computer. Depending on the size and complexity of the simulated workflows, sufficient RAM to support the in-memory operations is required. The required amount of RAM can be reduced by decreasing the number of cores used for parallelization (see config below), at the cost of increased run time.

## Software requirements
This package is compatible with Windows, macOS and Linux. The package has been developed and tested on Ubuntu 20.04 using Python 3.10. The following Python packages are required:
```
numpy
scipy
pandas
biopython
edlib
numba
plotly
PyYAML
rapidfuzz
ruamel.yaml
tqdm
```
For the analysis pipeline, BBMap (v38.99) is required and assumed to be installed in `~/.local/bin/`.
This package is compatible with Windows, macOS and Linux. The package has been developed and tested on Ubuntu 20.04 using Python 3.10. The Python packages listed in [requirements.txt](/requirements.txt) are required.


# Installation guide
@@ -85,9 +79,6 @@ import dt4dds
At this point, you can set custom configuration options and enable logging, if desired:
```python
dt4dds.default_logging() # enable logging output

dt4dds.config.enable_multiprocessing = True # enable parallelization
dt4dds.config.n_processes = 2 # limit the number of parallel processes to 2
```
For the full documentation of configuration options, refer to [config.py](dt4dds/helpers/config.py).

@@ -156,7 +147,7 @@ For the full documentation of methods, refer to [pcr.py](dt4dds/processes/pcr.py
The `dt4dds.processes.Aging()` class models decay for a user-defined number of half-lives. It requires a `SeqPool` as input, and yields a `SeqPool` representative of the decayed oligo pool by invocation of its `process()` method:
```python
# set up aging for one half-life, add further non-default parameters as arguments if desired
aging = dt4dds.processes.Aging(fixed_decay_ratio=0.5)
aging = dt4dds.processes.Aging(n_halflives=1)

# perform aging and receive the oligo pool representation after decay
post_decay_pool = aging.process(pool)
@@ -205,37 +196,5 @@ For documentation of all defaults, refer to [defaults.py](dt4dds/settings/defaul
For documentation of all settings, refer to [settings.py](dt4dds/settings/settings.py).



## Analysing errors and biases

The pipeline for analysing errors and biases is provided in the `dt4dds.analysis` module. The convenience functions, provided as a command-line interface in [runner.py](dt4dds/bin/runner.py) and described below, require the sequencing datasets as gzipped FASTQ (i.e., `R1.fq.gz` and `R2.fq.gz`) and the reference sequences as a FASTA-formatted file (i.e., `design_files.fasta`) in a common directory.


### Characterising errors
To characterize the errors present in a single- or paired-read sequencing dataset in gzipped FASTQ format, the three convenience methods in [runner.py](dt4dds/bin/runner.py) can be used:
```bash
# analyse 100000 paired reads
python3 runner.py paired <path/to/folder/> -n 100000

# analyse 100000 single reads
python3 runner.py single <path/to/folder/> -n 100000

# analyse 100000 PhiX reads
python3 runner.py phix <path/to/folder/> -n 100000
```
In all cases, the scripts deposit the analysis results in the same folder, stratified by similarity group. The remaining command line arguments are described in [runner.py](dt4dds/bin/runner.py).

### Quantifying coverage
To generate coverage data for a single- or paired-read sequencing dataset, the two BBMap-based convenience methods in [runner.py](dt4dds/bin/runner.py) can be used:
```bash
# analyse coverage of paired reads
python3 runner.py scafstats_paired <path/to/folder/>

# analyse coverage of single reads
python3 runner.py scafstats_single <path/to/folder/>
```
In both cases, the scripts deposit the analysis results in the same folder, in the `scafstats.txt` file.


# License
This project is licensed under the GPLv3 license, see [here](LICENSE).
6 changes: 1 addition & 5 deletions demos/advanced.py
@@ -2,8 +2,6 @@

# set up config
dt4dds.default_logging()
dt4dds.config.enable_multiprocessing = False
dt4dds.config.n_processes = 1
dt4dds.config.show_progressbars = True


@@ -47,9 +45,7 @@
# Aging for one half-life
#
aging_settings = dt4dds.settings.defaults.Aging()
aging = dt4dds.processes.Aging(aging_settings(
fixed_decay_ratio=0.5,
))
aging = dt4dds.processes.Aging(aging_settings(), n_halflives=1)
pool = aging.process(pool)
pool.volume = 1
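
The replacement of `fixed_decay_ratio=0.5` with `n_halflives=1` in this hunk describes the same amount of decay if aging follows first-order kinetics, where the intact fraction after n half-lives is 0.5^n. The following is a minimal, self-contained sketch of that arithmetic. It does not call `dt4dds`; the helper names `remaining_fraction` and `halflives_for_remaining` are hypothetical, and reading the old `fixed_decay_ratio` as a plain fraction is an assumption (at 0.5 the decayed and remaining fractions coincide, so the demo value is unaffected either way).

```python
import math


def remaining_fraction(n_halflives: float) -> float:
    """Intact fraction after n half-lives, assuming first-order decay.

    Hypothetical helper for illustration; not part of the dt4dds API.
    """
    if n_halflives < 0:
        raise ValueError("n_halflives must be non-negative")
    return 0.5 ** n_halflives


def halflives_for_remaining(fraction: float) -> float:
    """Number of half-lives that leaves the given intact fraction.

    Inverse of remaining_fraction(); hypothetical helper for illustration.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    return -math.log2(fraction)


if __name__ == "__main__":
    # Print the intact fraction for the first few half-lives.
    for n in (0, 1, 2, 3):
        print(f"{n} half-lives -> {remaining_fraction(n):.3f} intact")
```

Under these assumptions, a target of 25 % intact oligos corresponds to two half-lives, so a conversion of this kind may help when porting v1.0.0 scripts that expressed decay as a ratio.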

2 changes: 0 additions & 2 deletions demos/basic.py
@@ -2,8 +2,6 @@

# set up config
dt4dds.default_logging()
dt4dds.config.enable_multiprocessing = False
dt4dds.config.n_processes = 1
dt4dds.config.show_progressbars = True


3 changes: 1 addition & 2 deletions dt4dds/__init__.py
@@ -1,4 +1,4 @@
__VERSION__ = '1.0.0'
__VERSION__ = '1.1.0'

from .helpers import logging
from .helpers import config
@@ -10,4 +10,3 @@

from . import processes
from . import properties
from . import analysis
2 changes: 0 additions & 2 deletions dt4dds/analysis/__init__.py
@@ -1,2 +0,0 @@
from . import errormapper
from .analysis import GroupAnalysis, SeriesAnalysis, ErrorAnalysis, ErrorFile, SeqErrorAnalysis, SeqGroupAnalysis, DistributionAnalysis