Many testing and benchmark programs require large files of sequence data that should be placed in this directory.
Below are instructions for downloading the necessary data. Make sure
you are in this directory (cd data).
This data is from the difference recurrence paper by Suzuki and Kasahara.
curl -OL https://github.com/Daniel-Liu-c0deb0t/diff-bench-paper/releases/download/v1.0/sequences.txt.gz
gunzip sequences.txt.gz
Since these reads are filtered to only have gaps smaller than 20bp, they are not representative of typical reads. Therefore, this dataset is rarely used.
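The download-and-decompress steps on this page can be wrapped in a small helper so that re-running them does not fetch a file that is already present. The following is only a sketch: the function name fetch_gz is hypothetical and not part of this repository, and the -f flag (which makes curl fail on HTTP errors instead of saving an error page) is an addition to the commands shown here.

```shell
# Hypothetical helper: download a .gz file into the current directory and
# decompress it, unless the decompressed file already exists.
fetch_gz() {
    url="$1"
    gz="${url##*/}"    # file name part of the URL, e.g. sequences.txt.gz
    out="${gz%.gz}"    # name after decompression, e.g. sequences.txt
    if [ -e "$out" ]; then
        echo "skip: $out already exists"
        return 0
    fi
    # -f: fail on HTTP errors, -O: keep the remote file name, -L: follow redirects
    curl -fOL "$url" && gunzip "$gz"
}
```

Calling fetch_gz with one of the release URLs on this page from inside the data directory would then replace the corresponding curl and gunzip pair.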
This data is from the BiWFA repository and has been reformatted.
curl -OL https://github.com/Daniel-Liu-c0deb0t/block-aligner/releases/download/v0.0.0/seq_pairs.10kbps.5000.txt.gz
curl -OL https://github.com/Daniel-Liu-c0deb0t/block-aligner/releases/download/v0.0.0/seq_pairs.50kbps.10000.txt.gz
gunzip seq_pairs.10kbps.5000.txt.gz
gunzip seq_pairs.50kbps.10000.txt.gz
These files contain pairs of reads that are alignable.
This data is from the Wavefront Aligner paper.
curl -OL https://github.com/Daniel-Liu-c0deb0t/block-aligner/releases/download/v0.0.0/real.illumina.b10M.txt.gz
curl -OL https://github.com/Daniel-Liu-c0deb0t/block-aligner/releases/download/v0.0.0/real.ont.b10M.txt.gz
gunzip real.illumina.b10M.txt.gz
gunzip real.ont.b10M.txt.gz
The Illumina, 1kbp Nanopore, and 25kbp Nanopore datasets are each just a list of reads, where every two reads form an alignable pair.
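To make the pairing concrete, here is a minimal sketch that walks through such a file two lines at a time and prints the length of each read in a pair. It assumes one read per line with no header lines or prefix characters, and that lines 1 and 2 form the first pair, lines 3 and 4 the second, and so on. The function name pair_lengths is hypothetical.

```shell
# Print "pair index<TAB>length of first read<TAB>length of second read".
# Assumption: one read per line; consecutive lines form a pair.
pair_lengths() {
    # "paste - -" joins every two input lines into one tab-separated line.
    paste - - < "$1" | awk -F '\t' '{ print NR "\t" length($1) "\t" length($2) }'
}
```

For example, pair_lengths real.illumina.b10M.txt | head would give a quick look at the first few pairs without loading the whole file.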
This data is generated with mmseqs2 and the Uniclust30 dataset.
Two datasets with different coverage percentages are used: 0.8 (the default in mmseqs2) and 0.95. Using a higher coverage helps gather
sequences that are "globally alignable", as mmseqs2 uses local alignment.
The dataset with the lower coverage percent is expected to be more challenging.
Scripts for generating the data: 0.8 coverage and 0.95 coverage.
curl -OL https://github.com/Daniel-Liu-c0deb0t/block-aligner/releases/download/v0.0.0/uc30.tar.gz
curl -OL https://github.com/Daniel-Liu-c0deb0t/block-aligner/releases/download/v0.0.0/uc30_0.95.tar.gz
tar -xvf uc30.tar.gz
tar -xvf uc30_0.95.tar.gz
This data is generated with mmseqs2 and the SCOPe dataset.
This data is used for aligning sequences to profiles (position-specific scoring matrices) of protein domains.
mkdir scop && cd scop
curl -OL https://github.com/Daniel-Liu-c0deb0t/block-aligner/releases/download/v0.0.0/scop.tar.gz
tar -xvf scop.tar.gz
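Before running tests or benchmarks, it can help to confirm that the decompressed files are in place. The sketch below checks only the files produced by the gunzip commands on this page; the contents of the tar archives are not listed here, so they are not checked. The function name check_data_files is hypothetical, and it takes the path to the data directory as its argument (defaulting to the current directory).

```shell
# Hypothetical check: report any expected data file that is missing or empty.
check_data_files() {
    dir="${1:-.}"
    missing=0
    for f in sequences.txt \
             seq_pairs.10kbps.5000.txt \
             seq_pairs.50kbps.10000.txt \
             real.illumina.b10M.txt \
             real.ont.b10M.txt; do
        # -s is true only if the file exists and has a size greater than zero.
        if [ ! -s "$dir/$f" ]; then
            echo "missing or empty: $f"
            missing=$((missing + 1))
        fi
    done
    # Succeed only when every file was found.
    [ "$missing" -eq 0 ]
}
```

Since the last step above changes into the scop subdirectory, run check_data_files .. from there, or change back to the data directory first.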