Skip to content

Commit

Permalink
format for webpgae
Browse files Browse the repository at this point in the history
  • Loading branch information
hasindu2008 committed Jan 16, 2025
1 parent 402e284 commit 61c86ae
Showing 1 changed file with 131 additions and 154 deletions.
285 changes: 131 additions & 154 deletions docs/archive-lossy.md
Original file line number Diff line number Diff line change
@@ -1,101 +1,84 @@
# A Guide for Archiving Lossy Data

Lossy compression of raw nanopore signal data can be a great way to save disk
space without significantly impacting basecalling accuracy. This makes it
particularly suitable for archiving. Naturally, one may be concerned that this
conversion would significantly deteriorate the quality of their data. To remedy
such concerns, this guide outlines a number of sanity checks which when
successful give confidence in the lossy conversion.
Lossy compression of raw nanopore signal data can be a great way to save disk space without significantly impacting basecalling accuracy. This makes it particularly suitable for archiving. Naturally, one may be concerned that this conversion would significantly deteriorate the quality of their data. To remedy such concerns, this guide outlines a number of sanity checks which when successful give confidence in the lossy conversion.

## The Conversion

To lossy compress your data, set the following variables

```bash
SLOW5_FILE=data.blow5 # path to original data
SLOW5_LOSSY_FILE=lossy.blow5 # path to lossy output
NUM_THREADS=8
SLOW5_FILE=data.blow5 # path to original data
SLOW5_LOSSY_FILE=lossy.blow5 # path to lossy output
NUM_THREADS=8
```

and run:

```bash
slow5tools degrade "$SLOW5_FILE" -o "$SLOW5_LOSSY_FILE" -t "$NUM_THREADS"
slow5tools degrade "$SLOW5_FILE" -o "$SLOW5_LOSSY_FILE" -t "$NUM_THREADS"
```

If the command fails with the message "No suitable bits suggestion", this is
because your dataset type has not yet been profiled by our team. Submit an issue
on GitHub <https://github.com/hasindu2008/slow5tools/issues> with your dataset
attached.
If the command fails with the message "No suitable bits suggestion", this is because your dataset type has not yet been profiled by our team. Submit an issue on GitHub <https://github.com/hasindu2008/slow5tools/issues> with your dataset attached.

## Read Count

The simplest sanity check is to ensure that the number of reads is the same for
the original and lossy compressed data. The following shell snippet does the
check:
The simplest sanity check is to ensure that the number of reads is the same for the original and lossy compressed data. The following shell snippet does the check:

```bash
get_num_reads() {
slow5tools stats "$1" | grep 'number of records' | awk '{print $NF}'
}
get_num_reads() {
slow5tools stats "$1" | grep 'number of records' | awk '{print $NF}'
}

NUM_SLOW5_READS=$(get_num_reads "$SLOW5_FILE")
NUM_SLOW5_LOSSY_READS=$(get_num_reads "$SLOW5_LOSSY_FILE")
NUM_SLOW5_READS=$(get_num_reads "$SLOW5_FILE")
NUM_SLOW5_LOSSY_READS=$(get_num_reads "$SLOW5_LOSSY_FILE")

echo "Read count: $NUM_SLOW5_READS in SLOW5, $NUM_SLOW5_LOSSY_READS in lossy SLOW5"
echo "Read count: $NUM_SLOW5_READS in SLOW5, $NUM_SLOW5_LOSSY_READS in lossy SLOW5"

if [ "$NUM_SLOW5_READS" -ne "$NUM_SLOW5_LOSSY_READS" ]
then
echo 'Failed sanity check: Read count differs' >&2
exit 1
fi
if [ "$NUM_SLOW5_READS" -ne "$NUM_SLOW5_LOSSY_READS" ]
then
echo 'Failed sanity check: Read count differs' >&2
exit 1
fi
```

## Uniqueness

Next, we should ensure that there are no duplicate read IDs in the lossy data.
The simplest method is to index the lossy file:
Next, we should ensure that there are no duplicate read IDs in the lossy data. The simplest method is to index the lossy file:

```bash
slow5tools index "$SLOW5_LOSSY_FILE"
slow5tools index "$SLOW5_LOSSY_FILE"
```

This will fail if a duplicate read ID is encountered, or additionally if the file
is corrupted or truncated. However, a more comprehensive method is to use `slow5tools
view` which decompresses and parses the entire file, and thus does a more
detailed check for data corruption. You can achieve this using the following
shell pipeline:
This will fail if a duplicate read ID is encountered, or additionally if the file is corrupted or truncated. However, a more comprehensive method is to use `slow5tools view` which decompresses and parses the entire file, and thus does a more detailed check for data corruption. You can achieve this using the following shell pipeline:

```bash
slow5tools view -t "$NUM_THREADS" -K 20000 "$SLOW5_LOSSY_FILE" | awk '{print $1}' | grep -v '^\#\|^\@' | sort | uniq -c | awk '{if($1!=1){print "Duplicate read ID found",$2; exit 1}}'
slow5tools view -t "$NUM_THREADS" -K 20000 "$SLOW5_LOSSY_FILE" | awk '{print $1}' | grep -v '^\#\|^\@' | sort | uniq -c | awk '{if($1!=1){print "Duplicate read ID found",$2; exit 1}}'
```

## Basecalling

The most important sanity check is to ensure that basecalling accuracy has not
been adversely affected.
The most important sanity check is to ensure that basecalling accuracy has not been adversely affected.

First, basecall the data and obtain the BAM files. For example, using
[slow5-dorado](https://github.com/hiruna72/slow5-dorado/releases/) :
First, basecall the data and obtain the BAM files. For example, using [slow5-dorado](https://github.com/hiruna72/slow5-dorado/releases/) :

```bash
MODEL=dna_r9.4.1_e8_hac@v3.3 # path to basecalling model
BAM=data.bam # path to bam output
BAM_LOSSY=lossy.bam # path to lossy bam output
MODEL=dna_r9.4.1_e8_hac@v3.3 # path to basecalling model
BAM=data.bam # path to bam output
BAM_LOSSY=lossy.bam # path to lossy bam output

slow5-dorado basecaller "$MODEL" "$SLOW5_FILE" > "$BAM"
slow5-dorado basecaller "$MODEL" "$SLOW5_LOSSY_FILE" > "$BAM_LOSSY"
slow5-dorado basecaller "$MODEL" "$SLOW5_FILE" > "$BAM"
slow5-dorado basecaller "$MODEL" "$SLOW5_LOSSY_FILE" > "$BAM_LOSSY"
```

Next, obtain the identity scores using this [identitydna.sh](https://github.com/hasindu2008/biorand/blob/master/bin/identitydna.sh) shell script for DNA and [identityrna.sh](https://github.com/hasindu2008/biorand/blob/master/bin/identityrna.sh). For example:

```bash
GENOME=hg38noAlt.idx # path to fasta/idx genome
SCORE=score.tsv # path to score output
SCORE_LOSSY=score_lossy.tsv # path to lossy score output
GENOME=hg38noAlt.idx # path to fasta/idx genome
SCORE=score.tsv # path to score output
SCORE_LOSSY=score_lossy.tsv # path to lossy score output

identitydna.sh "$GENOME" "$BAM" > "$SCORE"
identitydna.sh "$GENOME" "$BAM_LOSSY" > "$SCORE_LOSSY"
identitydna.sh "$GENOME" "$BAM" > "$SCORE"
identitydna.sh "$GENOME" "$BAM_LOSSY" > "$SCORE_LOSSY"
```

Check that both identity scores satisfy the following inequalities:
Expand All @@ -108,45 +91,45 @@ Check that both identity scores satisfy the following inequalities:
This can be achieved using the following shell snippet:

```bash
die() {
echo "$1" >&2
exit 1
}

assert() {
x=$(echo "if ($1) 1" | bc)
if [ "$x" != 1 ]
then
die "failed: $x"
fi
}

score_chk() {
SCORE_HEADER='sample mean sstdev q1 median q3 n'
path=$1

hdr=$(head -n1 "$path")
if [ "$hdr" != "$SCORE_HEADER" ]
then
die 'invalid header'
fi

data=$(tail -n1 "$path")
mean=$(echo "$data" | cut -f2)
med=$(echo "$data" | cut -f5)
n=$(echo "$data" | cut -f7)

assert "$mean >= 0.93"
assert "$med >= 0.97"
assert "$n >= $NUM_SLOW5_READS"
assert "$n <= (1.2 * $NUM_SLOW5_READS)"
}

# quicker than section "Read Count"
NUM_SLOW5_READS=$(slow5tools skim --rid "$SLOW5_LOSSY_FILE" | wc -l)

score_chk "$SCORE"
score_chk "$SCORE_LOSSY"
die() {
echo "$1" >&2
exit 1
}

assert() {
x=$(echo "if ($1) 1" | bc)
if [ "$x" != 1 ]
then
die "failed: $x"
fi
}

score_chk() {
SCORE_HEADER='sample mean sstdev q1 median q3 n'
path=$1

hdr=$(head -n1 "$path")
if [ "$hdr" != "$SCORE_HEADER" ]
then
die 'invalid header'
fi

data=$(tail -n1 "$path")
mean=$(echo "$data" | cut -f2)
med=$(echo "$data" | cut -f5)
n=$(echo "$data" | cut -f7)

assert "$mean >= 0.93"
assert "$med >= 0.97"
assert "$n >= $NUM_SLOW5_READS"
assert "$n <= (1.2 * $NUM_SLOW5_READS)"
}

# quicker than section "Read Count"
NUM_SLOW5_READS=$(slow5tools skim --rid "$SLOW5_LOSSY_FILE" | wc -l)

score_chk "$SCORE"
score_chk "$SCORE_LOSSY"
```

Finally, check that the following pairwise inequalities are satisfied:
Expand All @@ -159,115 +142,109 @@ Finally, check that the following pairwise inequalities are satisfied:
Continuing from the previous shell snippet:

```bash
score_pair_chk() {
path=$1
path_lossy=$2

data=$(tail -n1 "$path")
mean=$(echo "$data" | cut -f2)
med=$(echo "$data" | cut -f5)
n=$(echo "$data" | cut -f7)

data_lossy=$(tail -n1 "$path_lossy")
mean_lossy=$(echo "$data_lossy" | cut -f2)
med_lossy=$(echo "$data_lossy" | cut -f5)
n_lossy=$(echo "$data_lossy" | cut -f7)

assert "($mean - $mean_lossy) <= 0.001"
assert "($med - $med_lossy) <= 0.001"
assert "($n - $n_lossy) >= 0"
assert "($n - $n_lossy) <= (0.001 * $n)"
}

score_pair_chk "$SCORE" "$SCORE_LOSSY"
score_pair_chk() {
path=$1
path_lossy=$2

data=$(tail -n1 "$path")
mean=$(echo "$data" | cut -f2)
med=$(echo "$data" | cut -f5)
n=$(echo "$data" | cut -f7)

data_lossy=$(tail -n1 "$path_lossy")
mean_lossy=$(echo "$data_lossy" | cut -f2)
med_lossy=$(echo "$data_lossy" | cut -f5)
n_lossy=$(echo "$data_lossy" | cut -f7)

assert "($mean - $mean_lossy) <= 0.001"
assert "($med - $med_lossy) <= 0.001"
assert "($n - $n_lossy) >= 0"
assert "($n - $n_lossy) <= (0.001 * $n)"
}

score_pair_chk "$SCORE" "$SCORE_LOSSY"
```

## Methylation

Another related sanity check is to see whether the methylation frequencies have
been adversely affected. We can achieve this by obtaining their Pearson
correlation coefficient and making sure that it is above a certain threshold.
Another related sanity check is to see whether the methylation frequencies have been adversely affected. We can achieve this by obtaining their Pearson correlation coefficient and making sure that it is above a certain threshold.

Again, basecall the data, but this time use modification calling. For example,
using slow5-dorado:
Again, basecall the data, but this time use modification calling. For example, using slow5-dorado:

```bash
MODEL=dna_r9.4.1_e8_hac@v3.3 # path to basecalling model
BASES=5mCG_5hmCG # modified base codes (m6A_DRACH for rna)
BAM=meth.bam # path to bam output
BAM_LOSSY=meth_lossy.bam # path to lossy bam output
MODEL=dna_r9.4.1_e8_hac@v3.3 # path to basecalling model
BASES=5mCG_5hmCG # modified base codes (m6A_DRACH for rna)
BAM=meth.bam # path to bam output
BAM_LOSSY=meth_lossy.bam # path to lossy bam output

slow5-dorado basecaller "$MODEL" "$SLOW5_FILE" --modified-bases "$BASES" > "$BAM"
slow5-dorado basecaller "$MODEL" "$SLOW5_LOSSY_FILE" --modified-bases "$BASES" > "$BAM_LOSSY"
slow5-dorado basecaller "$MODEL" "$SLOW5_FILE" --modified-bases "$BASES" > "$BAM"
slow5-dorado basecaller "$MODEL" "$SLOW5_LOSSY_FILE" --modified-bases "$BASES" > "$BAM_LOSSY"
```

Then map the BAM files to the reference genome and index them:

```bash
map() {
bam=$1
genome=$2
map() {
bam=$1
genome=$2

samtools fastq -TMM,ML "$bam" | minimap2 -x map-ont -a -y -Y --secondary=no "$genome" - | samtools sort -@32 -
}
samtools fastq -TMM,ML "$bam" | minimap2 -x map-ont -a -y -Y --secondary=no "$genome" - | samtools sort -@32 -
}

GENOME=hg38noAlt.fa # path to genome fasta/idx
BAM_MAP=meth_map.bam # path to mapped bam output
BAM_LOSSY_MAP=meth_lossy_map.bam # path to mapped lossy bam output
GENOME=hg38noAlt.fa # path to genome fasta/idx
BAM_MAP=meth_map.bam # path to mapped bam output
BAM_LOSSY_MAP=meth_lossy_map.bam # path to mapped lossy bam output

map "$BAM" "$GENOME" > "$BAM_MAP"
map "$BAM_LOSSY" "$GENOME" > "$BAM_LOSSY_MAP"
map "$BAM" "$GENOME" > "$BAM_MAP"
map "$BAM_LOSSY" "$GENOME" > "$BAM_LOSSY_MAP"

samtools index "$BAM_MAP"
samtools index "$BAM_LOSSY_MAP"
samtools index "$BAM_MAP"
samtools index "$BAM_LOSSY_MAP"
```

Acquire the methylation frequencies using [minimod](https://github.com/warp9seq/minimod):

```bash
MODS=mods.mm.tsv # path to meth frequencies output
MODS_LOSSY=mods_lossy.mm.tsv # path to lossy meth frequencies output
MODS=mods.mm.tsv # path to meth frequencies output
MODS_LOSSY=mods_lossy.mm.tsv # path to lossy meth frequencies output

minimod mod-freq "$GENOME" "$BAM_MAP" > "$MODS"
minimod mod-freq "$GENOME" "$BAM_LOSSY_MAP" > "$MODS_LOSSY"
minimod mod-freq "$GENOME" "$BAM_MAP" > "$MODS"
minimod mod-freq "$GENOME" "$BAM_LOSSY_MAP" > "$MODS_LOSSY"
```

Finally, obtain the Pearson correlation coefficient using [this Python script](https://github.com/warp9seq/minimod/blob/main/test/compare.py)


```bash
corr=$(python3 compare.py "$MODS" "$MODS_LOSSY")
corr=$(python3 compare.py "$MODS" "$MODS_LOSSY")
```

and ensure that it is above a chosen threshold (say 0.95):

```bash
assert "$corr >= 0.95" # using function from section "Basecalling"
assert "$corr >= 0.95" # using function from section "Basecalling"
```

## Subsetting

For the sections which deal with basecalling, a significant time saving can be
made at the expense of completeness by taking a random subset of the SLOW5 reads
from the original and lossy data, and then proceeding with the relevant sanity
checks.
For the sections which deal with basecalling, a significant time saving can be made at the expense of completeness by taking a random subset of the SLOW5 reads from the original and lossy data, and then proceeding with the relevant sanity checks.

To obtain a random subset of the original and lossy data:

```bash
SIZE=500000 # subset size
READID_SUB=rids # path to read ids subset output
SLOW5_SUB=subset.blow5 # path to subset output
SLOW5_LOSSY_SUB=lossy_subset.blow5 # path to lossy subset output

slow5tools skim --rid "$SLOW5_LOSSY_FILE" | sort -R | head -n "$SIZE" > "$READID_SUB"
slow5tools get -l "$READID_SUB" -o "$SLOW5_SUB" -t "$NUM_THREADS" "$SLOW5_FILE"
slow5tools get -l "$READID_SUB" -o "$SLOW5_LOSSY_SUB" -t "$NUM_THREADS" "$SLOW5_LOSSY_FILE"
SIZE=500000 # subset size
READID_SUB=rids # path to read ids subset output
SLOW5_SUB=subset.blow5 # path to subset output
SLOW5_LOSSY_SUB=lossy_subset.blow5 # path to lossy subset output

slow5tools skim --rid "$SLOW5_LOSSY_FILE" | sort -R | head -n "$SIZE" > "$READID_SUB"
slow5tools get -l "$READID_SUB" -o "$SLOW5_SUB" -t "$NUM_THREADS" "$SLOW5_FILE"
slow5tools get -l "$READID_SUB" -o "$SLOW5_LOSSY_SUB" -t "$NUM_THREADS" "$SLOW5_LOSSY_FILE"
```

Then proceed as normal, using

```bash
SLOW5_FILE=SLOW5_SUB
SLOW5_LOSSY_FILE=SLOW5_LOSSY_SUB
SLOW5_FILE=SLOW5_SUB
SLOW5_LOSSY_FILE=SLOW5_LOSSY_SUB
```

0 comments on commit 61c86ae

Please sign in to comment.