0.4.0 (#14)
* json report to stdout, can be disabled with `--no-json`
* rename
* small optimisations
Sam-Sims authored Oct 6, 2023
1 parent 473a539 commit 09f827e
Showing 7 changed files with 168 additions and 223 deletions.
2 changes: 1 addition & 1 deletion .gitignore
@@ -1,4 +1,4 @@
/target
/test-data
/benchmarks/raw
/benchmarks_raw
Cargo.lock
14 changes: 8 additions & 6 deletions Cargo.toml
@@ -1,15 +1,15 @@
[package]
name = "krakenxtract"
version = "0.3.0"
name = "kractor"
version = "0.4.0"
edition = "2021"
authors = ["Samuel Sims"]
description = "Extract reads from a FASTQ file based on taxonomic classification via Kraken2."
readme = "README.md"
repository = "https://github.com/Sam-Sims/krakenXtract"
repository = "https://github.com/Sam-Sims/kractor"
license = "MIT"
keywords = ["kraken", "bioinformatics", "taxonomy", "fastq"]
keywords = ["kraken", "bioinformatics", "taxonomy", "fastq", "metagenomics"]
categories = ["command-line-utilities", "science"]
documentation = "https://docs.rs/crate/krakenxtract/"
documentation = "https://docs.rs/crate/kractor/"

# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html

@@ -18,8 +18,10 @@ chrono = "0.4.31"
clap = { version = "4.3.19", features = ["derive"] }
crossbeam = "0.8.2"
env_logger = "0.10.0"
lazy_static = "1.4.0"
log = "0.4.20"
niffler = { version = "2.5.0", features = ["gz_cloudflare_zlib"] }
niffler = { version = "2.5.0", default-features = false, features = ["gz", "bz2", "gz_cloudflare_zlib"] }
noodles = { version = "0.51.0", features = ["fastq", "fasta"] }
serde_json = "1.0.107"


82 changes: 40 additions & 42 deletions README.md
@@ -1,43 +1,34 @@
# krakenXtract

[![Release](https://github.com/Sam-Sims/krakenXtract/actions/workflows/release.yaml/badge.svg)](https://github.com/Sam-Sims/krakenXtract/actions/workflows/release.yaml)
![GitHub release (with filter)](https://img.shields.io/github/v/release/sam-sims/krakenxtract)
![crates.io](https://img.shields.io/crates/v/krakenxtract
[![Release](https://github.com/Sam-Sims/Kractor/actions/workflows/release.yaml/badge.svg)](https://github.com/Sam-Sims/Kractor/actions/workflows/release.yaml)
![GitHub release (with filter)](https://img.shields.io/github/v/release/sam-sims/Kractor)
![crates.io](https://img.shields.io/crates/v/kractor
)

Extract reads from a FASTQ file based on taxonomic classification via Kraken2.

# Kractor

## Motivation
**kra**ken extr**actor**

Heavily inspired by the great [KrakenTools](https://github.com/jenniferlu717/KrakenTools).
Kractor extracts sequencing reads based on taxonomic classifications obtained via [Kraken2](https://github.com/DerrickWood/kraken2). It consumes paired or unpaired `fastq[.gz/.bz]` files as input alongside a Kraken2 standard output, and can optionally consume a Kraken2 report to extract all taxonomic parents and children of a given taxid. It is fast by default and outputs `fast[q/a]` files that can optionally be compressed.

Having been wanting to experiment with Rust for a while, this is essentially an implementation of the `extract_kraken_reads.py` script, [re-implemented](https://www.reddit.com/media?url=https%3A%2F%2Fi.redd.it%2Fgood-for-you-crab-v0-5v9ygeh9r1c91.jpg%3Fs%3Dd759db5275e32c6e2bd5c22bddbd783acca46247) in Rust.
Kractor significantly speeds up processing compared to KrakenTools for both paired and unpaired reads. Paired reads are processed approximately 21x faster for compressed fastqs and 10x faster for uncompressed; unpaired reads are approximately 4x faster for both compressed and uncompressed inputs.

The main motivation was to provide a speedup when extracting a large number of reads from large FASTQ files - and to learn Rust!
For additional details, refer to the [benchmarks](benchmarks/benchmarks.md)

## Current features

- Extract all reads from a `fastq` file based on a taxonomic id
- Extract all the parents or the children of the specified taxon id
- Supports single or paired-end `fastq` files
- Supports both uncompressed or `gzip` inputs and outputs.
- Multithreaded
- ~4.4x speed up compared to KrakenTools
## Motivation

## Benchmarks (WIP)
Heavily inspired by the great [KrakenTools](https://github.com/jenniferlu717/KrakenTools).

For more detail see [benchmarks](benchmarks/benchmarks.md)
At the time of writing, KrakenTools is a single-threaded Python implementation, which limits its speed when processing large, paired-end fastq files. The main motivation was to speed up parsing and extracting (writing) a large volume of reads - and also to learn Rust!

## Installation

### Precompiled:
Github release: [0.3.0](https://github.com/Sam-Sims/krakenXtract/releases/tag/v0.3.0)
### Binaries:

Precompiled binaries for Linux, macOS and Windows are attached to the latest release [0.4.0](https://github.com/Sam-Sims/Kractor/releases/tag/v0.4.0)

### Cargo:
Requires [cargo](https://www.rust-lang.org/tools/install)
```
cargo install krakenxtract
cargo install kractor
```
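Once installed, the binary should be on your `PATH`; a quick sanity check using clap's built-in help:

```bash
# Verify the install by printing the CLI help text
kractor --help
```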

### Build from source:
@@ -49,33 +40,33 @@ To install please refer to the rust documentation: [docs](https://www.rust-lang.
#### Clone the repository:

```bash
git clone https://github.com/Sam-Sims/krakenxtract
git clone https://github.com/Sam-Sims/Kractor
```

#### Build and add to path:

```bash
cd kraken-extract
cd Kractor
cargo build --release
export PATH=$PATH:$(pwd)/target/release
```

All executables will be in the directory kraken-extract/target/release.
All executables will be in the directory Kractor/target/release.
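Alternatively (assuming a default cargo setup), the locally built binary can be installed into `~/.cargo/bin` instead of modifying `PATH`:

```bash
# Run from the repository root after cloning
cargo install --path .
```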

## Usage

![Alt text](screenshot.png)
### Basic Usage:

```bash
krakenXtract -k <kraken_output> -i <fastq_file> -t <taxonomic_id> -o <output_file>
kractor -k <kraken_output> -i <fastq_file> -t <taxonomic_id> -o <output_file> > kractor_report.json
```
Or, if you have paired-end Illumina reads:
```bash
krakenXtract -k <kraken_output> -i <R1_fastq_file> -i <R2_fastq_file> -t <taxonomic_id> -o <R1_output_file> -o <R2_output_file>
kractor -k <kraken_output> -i <R1_fastq_file> -i <R2_fastq_file> -t <taxonomic_id> -o <R1_output_file> -o <R2_output_file>
```
If you want to extract all children of a taxon:
```bash
krakenXtract -k <kraken_output> -r <kraken_report> -i <fastq_file> -t <taxonomic_id> --children -o <output_file>
kractor -k <kraken_output> -r <kraken_report> -i <fastq_file> -t <taxonomic_id> --children -o <output_file>
```

### Arguments:
@@ -86,7 +77,7 @@

`-i, --input`

This option will specify the input files containing the reads you want to extract from. They can be compressed - (`gzip`, `bzip`, `lzma`, `zstd`). Paired end reads can be specified by:
This option will specify the input files containing the reads you want to extract. They can be compressed (`gz`, `bz2`). Paired-end reads can be specified by:

Using `--input` twice: `-i <R1_fastq_file> -i <R2_fastq_file>`

@@ -100,7 +91,7 @@ Using `--input` once but passing both files: `-i <R1_fastq_file> <R2_fastq_file>`
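For example, the two forms are equivalent (file names and taxid are placeholders):

```bash
# Paired-end input, passing --input twice
kractor -k out.kraken -t 590 -i R1.fastq.gz -i R2.fastq.gz -o R1_out.fq -o R2_out.fq

# Paired-end input, passing --input once with both files
kractor -k out.kraken -t 590 -i R1.fastq.gz R2.fastq.gz -o R1_out.fq -o R2_out.fq
```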

This option will specify the output files containing the extracted reads. The order of the output files is assumed to be the same as the input.

By default the compression will be inferred from the output file extension for supported file types (`gzip`, `bzip`, `lzma` and `zstd`). If the output type cannot be inferred, plaintext will be output.
By default the compression format will be inferred from the output file extension for supported file types (`gz`, `bz2`). If the format cannot be inferred, plaintext will be output.
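For example (placeholder file names), only the output extension differs between these two commands:

```bash
# .fastq.gz extension -> gzip-compressed output
kractor -k out.kraken -i reads.fastq.gz -t 590 -o extracted.fastq.gz

# plain .fastq extension -> plaintext output
kractor -k out.kraken -i reads.fastq.gz -t 590 -o extracted.fastq
```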

#### Kraken Output

@@ -124,17 +115,15 @@ This option will manually set the compression mode used for the output file and

Valid values are:

- `gz` to output gzip
- `bz` to output bzip
- `lzma` to output lzma
- `zstd` to output zstd
- `gz` to output gz
- `bz2` to output bz2
- `none` to not apply compression
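A hypothetical example, assuming the long flag is spelled `--compression-mode` (as listed under future plans) and using placeholder file names:

```bash
# Force gzip output even though the .fq extension would imply plaintext
kractor -k out.kraken -i reads.fastq -t 590 -o extracted.fq --compression-mode gz
```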

#### Compression level

`-l, --level`

This option will set the compression level to use if compressing the output. Should be a value between 1-9 with 1 being the fastest but largest file size and 9 is for slowest, but best file size. By default this is set at 6, but for the highest speeds 2 is a good trade off for speed/filesize.
This option will set the compression level to use if compressing the output. It should be a value between 1 and 9, with 1 being the fastest but largest file size and 9 the slowest but smallest file size. By default this is set to 2, which is a good trade-off between speed and file size.
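For example, to favour a smaller file over speed (placeholder file names):

```bash
# Maximum compression: slowest, but smallest gzip output
kractor -k out.kraken -i reads.fastq -t 590 -o extracted.fq.gz -l 9
```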

#### Output fasta

@@ -166,27 +155,36 @@ This will extract all the reads classified as descendants or subtaxa of `--taxid`

This will output every read except those matching the taxid. Works with `--parents` and `--children`

#### Skip report

`--no-json`

This will skip the JSON report that is written to stdout upon program completion.
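A brief sketch of both modes, assuming log messages go to stderr (the `env_logger` default) so stdout carries only the report:

```bash
# Capture the JSON report by redirecting stdout
kractor -k out.kraken -i reads.fastq -t 590 -o extracted.fq > kractor_report.json

# Suppress the report entirely
kractor -k out.kraken -i reads.fastq -t 590 -o extracted.fq --no-json
```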

## Future plans

- [x] Support unzipped fastq files
- [x] Support paired end FASTQ files
- [x] `--include-parents` and `--include-children` arguments
- [ ] Supply multiple taxonomic IDs to extract
- [x] Exclude taxonomic IDs
- [ ] `--append`
- [x] `--compression-mode`
- [x] More verbose output
- [ ] Proper benchmarks
- [x] Output fasta format (for blast??)
- [x] Benchmarks
- [x] Output fasta
- [x] Output non `gz`
- [ ] Tests

## Version

- 0.3.0
- 0.4.0

## Changelog

### 0.4.0
- JSON report printed to stdout upon successful completion (can be disabled with `--no-json`)
- Renamed

### 0.3.0

- Support for paired-end files
87 changes: 67 additions & 20 deletions benchmarks/benchmarks.md
@@ -1,67 +1,114 @@
# Benchmarks:

Benchmarks were run in triplicate with 1 warmup run - using [hyperfine](https://github.com/sharkdp/hyperfine).
Benchmarks were run 5 times each with 1 warmup run, using [hyperfine](https://github.com/sharkdp/hyperfine).

They were run on my own PC running Ubuntu in WSL2 (Plans to test in native linux):
They were run on my own PC running Ubuntu 23.04:

- i7 12700k - 12 cores
- 32GB memory with 16GB Assigned to WSL2
- 32GB memory
- M.2 SSD drive

Hyperfine command:
```bash
hyperfine --warmup 1 -r 3 --export-json --export-markdown
hyperfine --warmup 1 -r 5 --export-json --export-markdown
```
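As an illustration, a full invocation for Benchmark 1 might look like the following (export file names are placeholders; the two command strings are taken from the tables below):

```bash
hyperfine --warmup 1 -r 5 \
  --export-json benchmark1.json \
  --export-markdown benchmark1.md \
  'extract_kraken_reads.py -s Zymo-GridION-EVEN-BB-SN.fq -k out.kraken -o krakentools.fq -t 96241 --fastq-output' \
  'kractor -i Zymo-GridION-EVEN-BB-SN.fq -k out.kraken -o kractor.fq -t 96241'
```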
## Unpaired

## Benchmark 1:
ONT long read sequencing, outputting a plain (non-compressed) fastq file
Unpaired tests were based on the [Zymo mock community](https://github.com/LomanLab/mockcommunity) Zymo-GridION-EVEN-BB-SN dataset.

### Benchmark 1:
ONT long read sequencing, input `fastq` format, outputting a non-compressed `fastq` file. Extracting all reads classified as *Bacillus spizizenii*.

*Inputs:*

| File type | Platform | Total reads | Reads to extract | Kraken output size| Output|
|-----------|----------|-------------|------------------|-------------------|-------|
|`.fastq.gz`|ONT |5,454,495 |3,489,386 |1.8GB |`.fastq`
|`.fastq` |ONT |3,491,078 |490,984 |885MB |`.fastq`

*Commands run:*

| Tool | Command |
|------|---------|
| krakenXtract | `./target/release/krakenxtract --kraken test-data/output.kraken --fastq test-data/ont_test.fastq.gz --taxid 666 --output krakenxtract_benchmark.fq --no-compress` |
| KrakenTools | `extract_kraken_reads.py -k test-data/output.kraken -s test-data/ont_test.fastq.gz -o krakentools_benchmark.fq --fastq-output -t 666` |
| `KrakenTools` | `extract_kraken_reads.py -s Zymo-GridION-EVEN-BB-SN.fq -k out.kraken -o krakentools.fq -t 96241 --fastq-output` |
| `Kractor` | `kractor -i Zymo-GridION-EVEN-BB-SN.fq -k out.kraken -o kractor.fq -t 96241` |


*Results:*
| Tool | Mean [s] | Min [s] | Max [s] | Relative |
|:---|---:|---:|---:|---:|
| `krakenXtract` | 80.585 ± 9.795 | 72.964 | 91.633 | 1.00 |
| `KrakenTools` | 354.722 ± 8.613 | 349.398 | 364.659 | 4.40 ± 0.55 |
| `krakenTools` | 254.881 ± 9.482 | 242.553 | 263.158 | 1.00 |
| `Kractor` | 58.253 ± 5.651 | 48.358 | 62.222 | 4.38 ± 0.45 |

### Benchmark 2:
ONT long read sequencing, input `fastq.gz` format, outputting a non-compressed `fastq` file. Extracting all reads classified as *Bacillus spizizenii*.


*Inputs:*

| File type | Platform | Total reads | Reads to extract | Kraken output size| Output|
|-----------|----------|-------------|------------------|-------------------|-------|
|`.fastq.gz`|ONT |3,491,078 |490,984 |885MB |`.fastq`

*Commands run:*

Welch t-test: t = -36.4, p = 4e-06
| Tool | Command |
|------|---------|
| KrakenTools | `extract_kraken_reads.py -s Zymo-GridION-EVEN-BB-SN.fq.gz -k out.kraken -o krakentools.fq -t 96241 --fastq-output` |
| Kractor | `kractor -i Zymo-GridION-EVEN-BB-SN.fq.gz -k out.kraken -o kractor.fq -t 96241` |

There is a difference between the two benchmarks (p < 0.05).

![Alt text](img/ont-benchmark-whisker.png)
*Results:*
| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|---:|---:|---:|---:|
| `krakenTools` | 376.592 ± 3.905 | 373.343 | 383.315 | 1.00 |
| `Kractor` | 100.044 ± 3.599 | 98.044 | 106.449 | 3.76 ± 0.14 |

## Paired

Paired end tests were based on SRA accession: SRR19995508

## Benchmark 2:
ONT long read sequencing, outputting a compressed fastq file.
### Benchmark 3:
Illumina paired end sequencing, input `fastq` format, outputting a non-compressed `fastq` file.

*Inputs:*

| File type | Platform | Total reads | Reads to extract | Kraken output size| Output|
|-----------|----------|-------------|------------------|-------------------|-------|
|`.fastq`|Illumina paired |53,526,611 |1,646,117 |9.3GB |`.fastq`

*Commands run:*

| Tool | Command |
|------|---------|
| KrakenTools | `extract_kraken_reads.py -s SRR19995508_R1.fastq -s2 SRR19995508_R2.fastq -o R1_tools.fq -o2 R2_tools.fq -k out.kraken -t 590 --fastq-output` |
| Kractor | `kractor -i SRR19995508_R1.fastq -i SRR19995508_R2.fastq -k out.kraken -t 590 -o R1_kractor.fq -o R2_kractor.fq` |


*Results:*
| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|---:|---:|---:|---:|
| `krakenTools` | 898.306 ± 14.203 | 884.229 | 920.653 | 1.00 |
| `Kractor` | 94.198 ± 2.317 | 90.852 | 96.474 | 9.54 ± 0.28 |

NOTE: Kraken Tools has no option to output a compressed fastq.
### Benchmark 4:
Illumina paired end sequencing, input `fastq.gz` format, outputting a non-compressed `fastq` file.

*Inputs:*

| File type | Platform | Total reads | Reads to extract | Kraken output size| Output|
|-----------|----------|-------------|------------------|-------------------|-------|
|`.fastq.gz`|ONT |5,454,495 |3,489,386 |1.8GB |`.fastq.gz`
|`.fastq.gz`|Illumina paired |53,526,611 |1,646,117 |9.3GB |`.fastq`

*Commands run:*

| Tool | Command |
|------|---------|
| krakenXtract | `./target/release/krakenxtract --kraken test-data/output.kraken --fastq test-data/ont_test.fastq.gz --taxid 666 --output krakenxtract_benchmark.fq.gz` |
| KrakenTools | `extract_kraken_reads.py -s SRR19995508_R1.fastq.gz -s2 SRR19995508_R2.fastq.gz -o R1_tools.fq -o2 R2_tools.fq -k out.kraken -t 2 --fastq-output` |
| Kractor | `kractor -i SRR19995508_R1.fastq.gz -i SRR19995508_R2.fastq.gz -k out.kraken -t 2 -o R1_kractor.fq -o R2_kractor.fq` |


*Results:*
| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|---:|---:|---:|---:|
| krakenXtract | 78.042 ± 1.622 | 76.361 | 79.597 | 1.00 |
| `krakenTools` | 1033.379 ± 25.238 | 1005.720 | 1068.522 | 1.00 |
| `Kractor` | 49.071 ± 0.179 | 48.857 | 49.334 | 21.06 ± 0.52 |
Binary file added screenshot.png
11 changes: 6 additions & 5 deletions src/cli.rs
@@ -1,5 +1,5 @@
use clap::Parser;
use log::{debug, error, info, trace, warn};
use log::error;

#[derive(Parser, Debug)]
#[command(
@@ -38,7 +38,7 @@ pub struct Cli {
#[arg(
short = 'l',
long = "level",
default_value = "6",
default_value = "2",
value_parser(validate_compression_level)
)]
pub compression_level: niffler::Level,
@@ -54,6 +54,9 @@ pub struct Cli {
// Output reads in FASTA format
#[arg(long, action)]
pub output_fasta: bool,
// Don't output the JSON report
#[arg(long = "no-json")]
pub no_json: bool,
// Verbose
#[arg(short)]
pub verbose: bool,
@@ -85,9 +88,7 @@ impl Cli {
fn validate_compression(s: &str) -> Result<niffler::compression::Format, String> {
match s {
"gz" => Ok(niffler::compression::Format::Gzip),
"bz" => Ok(niffler::compression::Format::Bzip),
"lzma" => Ok(niffler::compression::Format::Lzma),
"zst" => Ok(niffler::compression::Format::Zstd),
"bz2" => Ok(niffler::compression::Format::Bzip),
"none" => Ok(niffler::compression::Format::No),
_ => Err(format!("Unknown compression type: {}", s)),
}
