0.4.0 (#14)
* json report to stdout, can be disabled with `--no-json`
* rename
* small optimisations
Sam-Sims authored Oct 6, 2023
1 parent 473a539 commit 09f827e
Showing 7 changed files with 168 additions and 223 deletions.
2 changes: 1 addition & 1 deletion .gitignore
@@ -1,4 +1,4 @@
/target
/test-data
/benchmarks/raw
/benchmarks_raw
Cargo.lock
14 changes: 8 additions & 6 deletions Cargo.toml
@@ -1,15 +1,15 @@
[package]
name = "krakenxtract"
version = "0.3.0"
name = "kractor"
version = "0.4.0"
edition = "2021"
authors = ["Samuel Sims"]
description = "Extract reads from a FASTQ file based on taxonomic classification via Kraken2."
readme = "README.md"
repository = "https://github.com/Sam-Sims/krakenXtract"
repository = "https://github.com/Sam-Sims/kractor"
license = "MIT"
keywords = ["kraken", "bioinformatics", "taxonomy", "fastq"]
keywords = ["kraken", "bioinformatics", "taxonomy", "fastq", "metagenomics"]
categories = ["command-line-utilities", "science"]
documentation = "https://docs.rs/crate/krakenxtract/"
documentation = "https://docs.rs/crate/kractor/"

# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html

@@ -18,8 +18,10 @@ chrono = "0.4.31"
clap = { version = "4.3.19", features = ["derive"] }
crossbeam = "0.8.2"
env_logger = "0.10.0"
lazy_static = "1.4.0"
log = "0.4.20"
niffler = { version = "2.5.0", features = ["gz_cloudflare_zlib"] }
niffler = { version = "2.5.0", default-features = false, features = ["gz", "bz2", "gz_cloudflare_zlib"] }
noodles = { version = "0.51.0", features = ["fastq", "fasta"] }
serde_json = "1.0.107"


82 changes: 40 additions & 42 deletions README.md
@@ -1,43 +1,34 @@
# krakenXtract

[![Release](https://github.com/Sam-Sims/krakenXtract/actions/workflows/release.yaml/badge.svg)](https://github.com/Sam-Sims/krakenXtract/actions/workflows/release.yaml)
![GitHub release (with filter)](https://img.shields.io/github/v/release/sam-sims/krakenxtract)
![crates.io](https://img.shields.io/crates/v/krakenxtract
[![Release](https://github.com/Sam-Sims/Kractor/actions/workflows/release.yaml/badge.svg)](https://github.com/Sam-Sims/Kractor/actions/workflows/release.yaml)
![GitHub release (with filter)](https://img.shields.io/github/v/release/sam-sims/Kractor)
![crates.io](https://img.shields.io/crates/v/kractor
)

Extract reads from a FASTQ file based on taxonomic classification via Kraken2.

# Kractor

## Motivation
**kra**ken extr**actor**

Heavily inspired by the great [KrakenTools](https://github.com/jenniferlu717/KrakenTools).
Kractor extracts sequencing reads based on taxonomic classifications obtained via [Kraken2](https://github.com/DerrickWood/kraken2). It consumes paired or unpaired `fastq[.gz/.bz]` files as input alongside a Kraken2 standard output, and can optionally consume a Kraken2 report to extract all taxonomic parents and children of a given taxid. It is fast by default and outputs `fast[q/a]` files that can optionally be compressed.

Having been wanting to experiment with Rust for a while, this is essentially an implementation of the `extract_kraken_reads.py` script, [re-implemented](https://www.reddit.com/media?url=https%3A%2F%2Fi.redd.it%2Fgood-for-you-crab-v0-5v9ygeh9r1c91.jpg%3Fs%3Dd759db5275e32c6e2bd5c22bddbd783acca46247) in Rust.
Kractor significantly speeds up processing compared to KrakenTools for both paired and unpaired reads. Paired reads are processed approximately 21x faster for compressed fastqs and 10x faster for uncompressed; unpaired reads are approximately 4x faster for both compressed and uncompressed inputs.

The main motivation was to provide a speedup when extracting a large number of reads from large FASTQ files - and to learn Rust!
For additional details, refer to the [benchmarks](benchmarks/benchmarks.md)

## Current features

- Extract all reads from a `fastq` file based on a taxonomic id
- Extract all the parents or the children of the specified taxon id
- Supports single or paired-end `fastq` files
- Supports both uncompressed or `gzip` inputs and outputs.
- Multithreaded
- ~4.4x speed up compared to KrakenTools
## Motivation

## Benchmarks (WIP)
Heavily inspired by the great [KrakenTools](https://github.com/jenniferlu717/KrakenTools).

For more detail see [benchmarks](benchmarks/benchmarks.md)
At the time of writing, KrakenTools is a single-threaded Python implementation, which limits its speed when processing large, paired-end fastq files. The main motivation was to speed up parsing and extracting (writing) a large volume of reads - and also to learn Rust!

## Installation

### Precompiled:
Github release: [0.3.0](https://github.com/Sam-Sims/krakenXtract/releases/tag/v0.3.0)
### Binaries:

Precompiled binaries for Linux, macOS and Windows are attached to the latest release [0.4.0](https://github.com/Sam-Sims/Kractor/releases/tag/v0.4.0)

### Cargo:
Requires [cargo](https://www.rust-lang.org/tools/install)
```
cargo install krakenxtract
cargo install kractor
```
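Once installed, the binary should be on your `PATH`; a quick sanity check using clap's built-in help:

```bash
# Verify the install by printing the CLI help text
kractor --help
```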

### Build from source:
@@ -49,33 +40,33 @@ To install please refer to the rust documentation: [docs](https://www.rust-lang.
#### Clone the repository:

```bash
git clone https://github.com/Sam-Sims/krakenxtract
git clone https://github.com/Sam-Sims/Kractor
```

#### Build and add to path:

```bash
cd kraken-extract
cd Kractor
cargo build --release
export PATH=$PATH:$(pwd)/target/release
```

All executables will be in the directory kraken-extract/target/release.
All executables will be in the directory Kractor/target/release.
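Alternatively (assuming a default cargo setup), the locally built binary can be installed into `~/.cargo/bin` instead of modifying `PATH`:

```bash
# Run from the repository root after cloning
cargo install --path .
```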

## Usage

![Alt text](screenshot.png)
### Basic Usage:

```bash
krakenXtract -k <kraken_output> -i <fastq_file> -t <taxonomic_id> -o <output_file>
kractor -k <kraken_output> -i <fastq_file> -t <taxonomic_id> -o <output_file> > kractor_report.json
```
Or, if you have paired-end Illumina reads:
```bash
krakenXtract -k <kraken_output> -i <R1_fastq_file> -i <R2_fastq_file> -t <taxonomic_id> -o <R1_output_file> -o <R2_output_file>
kractor -k <kraken_output> -i <R1_fastq_file> -i <R2_fastq_file> -t <taxonomic_id> -o <R1_output_file> -o <R2_output_file>
```
If you want to extract all children of a taxon:
```bash
krakenXtract -k <kraken_output> -r <kraken_report> -i <fastq_file> -t <taxonomic_id> --children -o <output_file>
kractor -k <kraken_output> -r <kraken_report> -i <fastq_file> -t <taxonomic_id> --children -o <output_file>
```

### Arguments:
@@ -86,7 +77,7 @@

`-i, --input`

This option will specify the input files containing the reads you want to extract from. They can be compressed - (`gzip`, `bzip`, `lzma`, `zstd`). Paired end reads can be specified by:
This option will specify the input files containing the reads you want to extract. They can be compressed (`gz`, `bz2`). Paired-end reads can be specified by:

Using `--input` twice: `-i <R1_fastq_file> -i <R2_fastq_file>`

@@ -100,7 +91,7 @@ Using `--input` once but passing both files: `-i <R1_fastq_file> <R2_fastq_file>`
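For example, the two forms are equivalent (file names and taxid are placeholders):

```bash
# Paired-end input, passing --input twice
kractor -k out.kraken -t 590 -i R1.fastq.gz -i R2.fastq.gz -o R1_out.fq -o R2_out.fq

# Paired-end input, passing --input once with both files
kractor -k out.kraken -t 590 -i R1.fastq.gz R2.fastq.gz -o R1_out.fq -o R2_out.fq
```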

This option will specify the output files containing the extracted reads. The order of the output files is assumed to be the same as the input.

By default the compression will be inferred from the output file extension for supported file types (`gzip`, `bzip`, `lzma` and `zstd`). If the output type cannot be inferred, plaintext will be output.
By default the compression format will be inferred from the output file extension for supported file types (`gz`, `bz2`). If the format cannot be inferred, plaintext will be output.
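For example (placeholder file names), only the output extension differs between these two commands:

```bash
# .fastq.gz extension -> gzip-compressed output
kractor -k out.kraken -i reads.fastq.gz -t 590 -o extracted.fastq.gz

# plain .fastq extension -> plaintext output
kractor -k out.kraken -i reads.fastq.gz -t 590 -o extracted.fastq
```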

#### Kraken Output

@@ -124,17 +115,15 @@ This option will manually set the compression mode used for the output file and

Valid values are:

- `gz` to output gzip
- `bz` to output bzip
- `lzma` to output lzma
- `zstd` to output zstd
- `gz` to output gz
- `bz2` to output bz2
- `none` to not apply compression
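A hypothetical example, assuming the long flag is spelled `--compression-mode` (as listed under future plans) and using placeholder file names:

```bash
# Force gzip output even though the .fq extension would imply plaintext
kractor -k out.kraken -i reads.fastq -t 590 -o extracted.fq --compression-mode gz
```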

#### Compression level

`-l, --level`

This option will set the compression level to use if compressing the output. Should be a value between 1-9 with 1 being the fastest but largest file size and 9 is for slowest, but best file size. By default this is set at 6, but for the highest speeds 2 is a good trade off for speed/filesize.
This option will set the compression level to use if compressing the output. It should be a value between 1 and 9, with 1 being the fastest but largest file size and 9 the slowest but smallest file size. By default this is set to 2, which is a good trade-off between speed and file size.
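For example, to favour a smaller file over speed (placeholder file names):

```bash
# Maximum compression: slowest, but smallest gzip output
kractor -k out.kraken -i reads.fastq -t 590 -o extracted.fq.gz -l 9
```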

#### Output fasta

@@ -166,27 +155,36 @@ This will extract all the reads classified as descendants or subtaxa of `--taxid`

This will output every read except those matching the taxid. Works with `--parents` and `--children`

#### Skip report

`--no-json`

This will skip the JSON report that is written to stdout upon program completion.
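A brief sketch of both modes, assuming log messages go to stderr (the `env_logger` default) so stdout carries only the report:

```bash
# Capture the JSON report by redirecting stdout
kractor -k out.kraken -i reads.fastq -t 590 -o extracted.fq > kractor_report.json

# Suppress the report entirely
kractor -k out.kraken -i reads.fastq -t 590 -o extracted.fq --no-json
```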

## Future plans

- [x] Support unzipped fastq files
- [x] Support paired end FASTQ files
- [x] `--include-parents` and `--include-children` arguments
- [ ] Supply multiple taxonomic IDs to extract
- [x] Exclude taxonomic IDs
- [ ] `--append`
- [x] `--compression-mode`
- [x] More verbose output
- [ ] Proper benchmarks
- [x] Output fasta format (for blast??)
- [x] Benchmarks
- [x] Output fasta
- [x] Output non `gz`
- [ ] Tests

## Version

- 0.3.0
- 0.4.0

## Changelog

### 0.4.0
- JSON report printed to stdout upon successful completion (can be disabled with `--no-json`)
- Renamed

### 0.3.0

- Support for paired-end files
87 changes: 67 additions & 20 deletions benchmarks/benchmarks.md
@@ -1,67 +1,114 @@
# Benchmarks:

Benchmarks were run in triplicate with 1 warmup run - using [hyperfine](https://github.com/sharkdp/hyperfine).
Benchmarks were run 5 times each with 1 warmup run, using [hyperfine](https://github.com/sharkdp/hyperfine).

They were run on my own PC running Ubuntu in WSL2 (Plans to test in native linux):
They were run on my own PC running Ubuntu 23.04:

- i7 12700k - 12 cores
- 32GB memory with 16GB Assigned to WSL2
- 32GB memory
- M.2 SSD drive

Hyperfine command:
```bash
hyperfine --warmup 1 -r 3 --export-json --export-markdown
hyperfine --warmup 1 -r 5 --export-json --export-markdown
```
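As an illustration, a full invocation for Benchmark 1 might look like the following (export file names are placeholders; the two command strings are taken from the tables below):

```bash
hyperfine --warmup 1 -r 5 \
  --export-json benchmark1.json \
  --export-markdown benchmark1.md \
  'extract_kraken_reads.py -s Zymo-GridION-EVEN-BB-SN.fq -k out.kraken -o krakentools.fq -t 96241 --fastq-output' \
  'kractor -i Zymo-GridION-EVEN-BB-SN.fq -k out.kraken -o kractor.fq -t 96241'
```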
## Unpaired

## Benchmark 1:
ONT long read sequencing, outputting a plain (non-compressed) fastq file
Unpaired tests were based on the [Zymo mock community](https://github.com/LomanLab/mockcommunity) Zymo-GridION-EVEN-BB-SN dataset.

### Benchmark 1:
ONT long read sequencing, input `fastq` format, outputting a non-compressed `fastq` file. Extracting all reads classified as *Bacillus spizizenii*.

*Inputs:*

| File type | Platform | Total reads | Reads to extract | Kraken output size| Output|
|-----------|----------|-------------|------------------|-------------------|-------|
|`.fastq.gz`|ONT |5,454,495 |3,489,386 |1.8GB |`.fastq`
|`.fastq` |ONT |3,491,078 |490,984 |885MB |`.fastq`

*Commands run:*

| Tool | Command |
|------|---------|
| krakenXtract | `./target/release/krakenxtract --kraken test-data/output.kraken --fastq test-data/ont_test.fastq.gz --taxid 666 --output krakenxtract_benchmark.fq --no-compress` |
| KrakenTools | `extract_kraken_reads.py -k test-data/output.kraken -s test-data/ont_test.fastq.gz -o krakentools_benchmark.fq --fastq-output -t 666` |
| `KrakenTools` | `extract_kraken_reads.py -s Zymo-GridION-EVEN-BB-SN.fq -k out.kraken -o krakentools.fq -t 96241 --fastq-output` |
| `Kractor` | `kractor -i Zymo-GridION-EVEN-BB-SN.fq -k out.kraken -o kractor.fq -t 96241` |


*Results:*
| Tool | Mean [s] | Min [s] | Max [s] | Relative |
|:---|---:|---:|---:|---:|
| `krakenXtract` | 80.585 ± 9.795 | 72.964 | 91.633 | 1.00 |
| `KrakenTools` | 354.722 ± 8.613 | 349.398 | 364.659 | 4.40 ± 0.55 |
| `krakenTools` | 254.881 ± 9.482 | 242.553 | 263.158 | 1.00 |
| `Kractor` | 58.253 ± 5.651 | 48.358 | 62.222 | 4.38 ± 0.45 |

### Benchmark 2:
ONT long read sequencing, input `fastq.gz` format, outputting a non-compressed `fastq` file. Extracting all reads classified as *Bacillus spizizenii*.


*Inputs:*

| File type | Platform | Total reads | Reads to extract | Kraken output size| Output|
|-----------|----------|-------------|------------------|-------------------|-------|
|`.fastq.gz`|ONT |3,491,078 |490,984 |885MB |`.fastq`

*Commands run:*

Welch t-test: t = -36.4, p = 4e-06
| Tool | Command |
|------|---------|
| KrakenTools | `extract_kraken_reads.py -s Zymo-GridION-EVEN-BB-SN.fq.gz -k out.kraken -o krakentools.fq -t 96241 --fastq-output` |
| Kractor | `kractor -i Zymo-GridION-EVEN-BB-SN.fq.gz -k out.kraken -o kractor.fq -t 96241` |

There is a difference between the two benchmarks (p < 0.05).

![Alt text](img/ont-benchmark-whisker.png)
*Results:*
| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|---:|---:|---:|---:|
| `krakenTools` | 376.592 ± 3.905 | 373.343 | 383.315 | 1.00 |
| `Kractor` | 100.044 ± 3.599 | 98.044 | 106.449 | 3.76 ± 0.14 |

## Paired

Paired end tests were based on SRA accession: SRR19995508

## Benchmark 2:
ONT long read sequencing, outputting a compressed fastq file.
### Benchmark 3:
Illumina paired end sequencing, input `fastq` format, outputting a non-compressed `fastq` file.

*Inputs:*

| File type | Platform | Total reads | Reads to extract | Kraken output size| Output|
|-----------|----------|-------------|------------------|-------------------|-------|
|`.fastq`|Illumina paired |53,526,611 |1,646,117 |9.3GB |`.fastq`

*Commands run:*

| Tool | Command |
|------|---------|
| KrakenTools | `extract_kraken_reads.py -s SRR19995508_R1.fastq -s2 SRR19995508_R2.fastq -o R1_tools.fq -o2 R2_tools.fq -k out.kraken -t 590 --fastq-output` |
| Kractor | `kractor -i SRR19995508_R1.fastq -i SRR19995508_R2.fastq -k out.kraken -t 590 -o R1_kractor.fq -o R2_kractor.fq` |


*Results:*
| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|---:|---:|---:|---:|
| `krakenTools` | 898.306 ± 14.203 | 884.229 | 920.653 | 1.00 |
| `Kractor` | 94.198 ± 2.317 | 90.852 | 96.474 | 9.54 ± 0.28 |

NOTE: Kraken Tools has no option to output a compressed fastq.
### Benchmark 4:
Illumina paired end sequencing, input `fastq.gz` format, outputting a non-compressed `fastq` file.

*Inputs:*

| File type | Platform | Total reads | Reads to extract | Kraken output size| Output|
|-----------|----------|-------------|------------------|-------------------|-------|
|`.fastq.gz`|ONT |5,454,495 |3,489,386 |1.8GB |`.fastq.gz`
|`.fastq.gz`|Illumina paired |53,526,611 |1,646,117 |9.3GB |`.fastq`

*Commands run:*

| Tool | Command |
|------|---------|
| krakenXtract | `./target/release/krakenxtract --kraken test-data/output.kraken --fastq test-data/ont_test.fastq.gz --taxid 666 --output krakenxtract_benchmark.fq.gz` |
| KrakenTools | `extract_kraken_reads.py -s SRR19995508_R1.fastq.gz -s2 SRR19995508_R2.fastq.gz -o R1_tools.fq -o2 R2_tools.fq -k out.kraken -t 2 --fastq-output` |
| Kractor | `kractor -i SRR19995508_R1.fastq.gz -i SRR19995508_R2.fastq.gz -k out.kraken -t 2 -o R1_kractor.fq -o R2_kractor.fq` |


*Results:*
| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|---:|---:|---:|---:|
| krakenXtract | 78.042 ± 1.622 | 76.361 | 79.597 | 1.00 |
| `krakenTools` | 1033.379 ± 25.238 | 1005.720 | 1068.522 | 1.00 |
| `Kractor` | 49.071 ± 0.179 | 48.857 | 49.334 | 21.06 ± 0.52 |
Binary file added screenshot.png
11 changes: 6 additions & 5 deletions src/cli.rs
@@ -1,5 +1,5 @@
use clap::Parser;
use log::{debug, error, info, trace, warn};
use log::error;

#[derive(Parser, Debug)]
#[command(
@@ -38,7 +38,7 @@ pub struct Cli {
#[arg(
short = 'l',
long = "level",
default_value = "6",
default_value = "2",
value_parser(validate_compression_level)
)]
pub compression_level: niffler::Level,
@@ -54,6 +54,9 @@ pub struct Cli {
// Output reads in FASTA format
#[arg(long, action)]
pub output_fasta: bool,
// Don't output the JSON report
#[arg(long = "no-json")]
pub no_json: bool,
// Verbose
#[arg(short)]
pub verbose: bool,
@@ -85,9 +88,7 @@ impl Cli {
fn validate_compression(s: &str) -> Result<niffler::compression::Format, String> {
match s {
"gz" => Ok(niffler::compression::Format::Gzip),
"bz" => Ok(niffler::compression::Format::Bzip),
"lzma" => Ok(niffler::compression::Format::Lzma),
"zst" => Ok(niffler::compression::Format::Zstd),
"bz2" => Ok(niffler::compression::Format::Bzip),
"none" => Ok(niffler::compression::Format::No),
_ => Err(format!("Unknown compression type: {}", s)),
}
