LongReadSum supports FASTA, FASTQ, BAM, FAST5, and sequencing_summary.txt file formats for quick generation of QC data in HTML and text format.
- Installation using Anaconda (recommended)
- Installation using Docker
- Building from source
- General usage for common filetypes
- Revision history
- Getting help
- Citing LongReadSum
First, install Anaconda.
Next, create a new environment. This installation has been tested with Python 3.9 on Linux 64-bit.
conda create -n longreadsum python=3.9
conda activate longreadsum
LongReadSum and its dependencies can then be installed using the following command:
conda install -c wglab -c conda-forge -c jannessp -c bioconda longreadsum=1.4.0
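To verify the installation, print the help message for any supported filetype, for example:
longreadsum bam --help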
First, install Docker. Pull the latest image from Docker Hub, which contains the latest LongReadSum release and its dependencies.
docker pull genomicslab/longreadsum
On Windows:
docker run -v C:/Users/.../DataDirectory:/mnt/ -it genomicslab/longreadsum bam -i /mnt/input.bam -o /mnt/output
Note that the -v argument is required for Docker to find the input file. Use a directory under C:/Users/ to ensure volume files are mounted correctly. In the above example, the local directory C:/Users/.../DataDirectory containing the input file input.bam is mapped to the directory /mnt/ in the Docker container. Thus, the input file and output directory arguments are relative to the /mnt/ directory, but the output files will also be saved locally in C:/Users/.../DataDirectory under the specified subdirectory output.
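On Unix/Linux, the same command applies with a local path in place of the Windows directory; the path below is a placeholder:
# Mount a local data directory (placeholder path) to /mnt/ in the container
docker run -v /path/to/DataDirectory:/mnt/ -it genomicslab/longreadsum bam -i /mnt/input.bam -o /mnt/output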
To get the latest updates in longreadsum, you can build from source. First install Anaconda. Then follow the instructions below to install LongReadSum and its dependencies:
# Pull the latest updates
git clone https://github.com/WGLab/LongReadSum
cd LongReadSum
# Create the longreadsum environment, install dependencies, and activate
conda env create -f environment.yml
conda activate longreadsum
# Build the program
make
Activate the conda environment and then run with arguments:
conda activate longreadsum
longreadsum <FILETYPE> [arguments]
Specify the filetype followed by parameters:
longreadsum <FILETYPE> -i $INPUT_FILE -o $OUTPUT_DIRECTORY
To see all parameters for a filetype, run:
longreadsum <FILETYPE> --help
This section describes parameters common to all filetypes:
Parameter | Description | Default |
---|---|---|
-i, --input | A single input filepath | |
-I, --inputs | Multiple comma-separated input filepaths | |
-P, --pattern | Use pattern matching (*) to specify multiple input files. Enclose the pattern in double quotes. | |
-g, --log | Log file path | log_output.log |
-G, --log-level | Logging level (1: DEBUG, 2: INFO, 3: WARNING, 4: ERROR, 5: CRITICAL) | 2 |
-o, --outputfolder | Output directory | output_longreadsum |
-t, --threads | The number of threads used | 1 |
-Q, --outprefix | Output file prefix | QC_ |
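As an illustration, several of these common parameters can be combined in one run; the file paths, prefix, and log name below are placeholders:
# Two comma-separated BAM inputs, 4 threads, custom output prefix and log file
longreadsum bam -I sample1.bam,sample2.bam -o qc_output -t 4 -Q HG002_ -g hg002_qc.log
# Pattern matching across a directory of FASTQ files (enclose the pattern in double quotes)
longreadsum fq -P "fastq_dir/*.fastq" -o qc_output -G 1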
This section describes how to generate QC reports for BAM files from whole-genome sequencing (WGS) with alignments to a linear reference genome such as GRCh38 (data shown is HG002 sequenced with ONT Kit V14 PromethION R10.4.1 from https://labs.epi2me.io/askenazi-kit14-2022-12/).
longreadsum bam -i $INPUT_FILE -o $OUTPUT_DIRECTORY
This section describes how to generate QC reports for BAM files with MM and ML base modification tags (data shown is HG002 sequenced with ONT MinION R9.4.1 from https://labs.epi2me.io/gm24385-5mc/).
Parameter | Description | Default |
---|---|---|
--mod | Run base modification analysis on the BAM file | False |
--modprob | Base modification filtering threshold. Above/below this value, the base is considered modified/unmodified. | 0.8 |
--ref | The reference genome FASTA file to use for identifying CpG sites (optional) | |
longreadsum bam -i $INPUT_FILE -o $OUTPUT_DIRECTORY --mod --ref $REF_GENOME --modprob 0.8
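Since --ref is optional, a minimal invocation that only enables base modification analysis (with the default 0.8 filtering threshold) is:
longreadsum bam -i $INPUT_FILE -o $OUTPUT_DIRECTORY --mod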
This section describes how to generate QC reports for ONT RRMS BAM files and associated CSVs (data shown is HG002 RRMS using ONT R9.4.1).
Parameter | Description | Default |
---|---|---|
-c, --csv | CSV file containing read IDs to extract from the BAM file* | |
The CSV file should contain a read_id column with the read IDs in the BAM file, and a decision column with the accepted/rejected status of each read. Accepted reads will have stop_receiving in the decision column, while rejected reads will have unblock:
batch_time,read_number,channel,num_samples,read_id,sequence_length,decision
1675186897.6034577,93,4,4011,f943c811-3f97-4971-8aed-bb9f36ffb8d1,361,unblock
1675186897.7544408,80,68,4025,fab0c19d-8085-454c-bfb7-c375bbe237a1,462,unblock
1675186897.7544408,93,127,4028,5285e0ba-86c0-4b5d-ba27-5783acad6105,438,unblock
1675186897.7544408,103,156,4023,65d8befa-eec0-4496-bf2b-aa1a84e6dc5e,362,stop_receiving
...
longreadsum rrms -i $INPUT_FILE -o $OUTPUT_DIRECTORY -c $RRMS_CSV
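As a quick sanity check before running, the accepted/rejected counts in the CSV can be tallied with a short sketch like the one below, assuming the column layout shown above (the CSV filename is a placeholder):
# Count reads per decision value (column 7), skipping the header line
awk -F, 'NR>1 {counts[$7]++} END {for (d in counts) print d, counts[d]}' rrms_decisions.csv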
This section describes how to generate QC reports for TIN (transcript integrity number) scores from RNA-Seq BAM files (data shown is Adult GTEx v9 long-read RNA-seq data sequenced with ONT cDNA-PCR protocol from https://www.gtexportal.org/home/downloads/adult-gtex/long_read_data).
This outputs a TSV file with TIN scores for each transcript:
geneID chrom tx_start tx_end TIN
ENST00000456328.2 chr1 11868 14409 2.69449577083296
ENST00000450305.2 chr1 12009 13670 0.00000000000000
ENST00000488147.2 chr1 14695 24886 94.06518975035769
ENST00000619216.1 chr1 17368 17436 0.00000000000000
ENST00000473358.1 chr1 29553 31097 0.00000000000000
...
A TSV file with TIN score summary statistics:
Bam_file TIN(mean) TIN(median) TIN(stddev)
/mnt/isilon/wang_lab/perdomoj/data/GTEX/GTEX-14BMU-0526-SM-5CA2F_rep.FAK93376.bam 67.06832655372376 74.24996965188242 26.03788585287367
A summary table in the HTML report:
Parameter | Description | Default |
---|---|---|
--genebed | Gene BED12 file required for calculating TIN scores | |
--sample-size | Sample size for TIN calculation | 100 |
--min-coverage | Minimum coverage for TIN calculation | 10 |
longreadsum bam -i $INPUT_FILE -o $OUTPUT_DIRECTORY --genebed $BED_FILE --min-coverage <COVERAGE> --sample-size <SIZE>
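The per-transcript TSV can then be filtered downstream; for example, the sketch below keeps transcripts with TIN >= 70, assuming the five-column layout shown above (the output filename is a placeholder):
# Keep transcripts with TIN >= 70 (column 5), skipping the header line
awk 'NR>1 && $5 >= 70' QC_tin_scores.tsv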
Download an example HTML report here (data is Adult GTEx v9 long-read RNA-seq data sequenced with ONT cDNA-PCR protocol from https://www.gtexportal.org/home/downloads/adult-gtex/long_read_data).
This section describes how to generate QC reports for PacBio BAM files without alignments (data shown is HG002 sequenced with PacBio Revio HiFi long reads obtained from https://www.pacb.com/connect/datasets/#WGS-datasets).
longreadsum bam -i $INPUT_FILE -o $OUTPUT_DIRECTORY
This section describes how to generate QC reports for ONT POD5 (signal) files and their corresponding basecalled BAM files (data shown is HG002 using ONT R10.4.1 and LSK114 downloaded from the tutorial https://github.com/epi2me-labs/wf-basecalling).
Note
The interactive signal-base correspondence plots in the HTML report use a lot of memory (RAM), which can make your web browser slow. Thus, by default only a few reads are randomly sampled; the user can also specify a list of read IDs (e.g. from a specific region of interest).
Parameter | Description | Default |
---|---|---|
-b, --basecalls | The basecalled BAM file to use for signal extraction | |
-r, --read_ids | A comma-separated list of read IDs to extract from the file | |
-R, --read-count | Set the number of reads to randomly sample from the file | 3 |
# Individual file:
longreadsum pod5 -i $INPUT_FILE -o $OUTPUT_DIRECTORY --basecalls $INPUT_BAM [--read-count <COUNT> | --read_ids <IDS>]
# Directory:
longreadsum pod5 -P "$INPUT_DIRECTORY/*.pod5" -o $OUTPUT_DIRECTORY --basecalls $INPUT_BAM [--read-count <COUNT> | --read_ids <IDS>]
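To inspect specific reads of interest instead of a random sample, a comma-separated list of read IDs can be passed with -r; the IDs below are placeholders:
# Extract two specific reads by ID (placeholder IDs) from a POD5 file
longreadsum pod5 -i $INPUT_FILE -o $OUTPUT_DIRECTORY --basecalls $INPUT_BAM -r "read_id_1,read_id_2"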
This section describes how to generate a signal and basecalling QC report from ONT FAST5 files with signal and basecall information (data shown is HG002 sequenced with ONT MinION R9.4.1 from https://labs.epi2me.io/gm24385-5mc/).
Note
The interactive signal-base correspondence plots in the HTML report use a lot of memory (RAM), which can make your web browser slow. Thus, by default only a few reads are randomly sampled; the user can also specify a list of read IDs (e.g. from a specific region of interest).
Parameter | Description | Default |
---|---|---|
-r, --read_ids | A comma-separated list of read IDs to extract from the file | |
-R, --read-count | Set the number of reads to randomly sample from the file | 3 |
# Individual file:
longreadsum f5s -i $INPUT_FILE -o $OUTPUT_DIRECTORY [--read-count <COUNT> | --read_ids <IDS>]
# Directory:
longreadsum f5s -P "$INPUT_DIRECTORY/*.fast5" -o $OUTPUT_DIRECTORY [--read-count <COUNT> | --read_ids <IDS>]
This section describes how to generate QC reports for sequence data from ONT FAST5 files (data shown is HG002 sequenced with ONT MinION R9.4.1 from https://labs.epi2me.io/gm24385-5mc/).
longreadsum f5 -i $INPUT_FILE -o $OUTPUT_DIRECTORY
This section describes how to generate QC reports for ONT basecall summary (sequencing_summary.txt) files (data shown is HG002 sequenced with ONT PromethION R10.4 from https://labs.epi2me.io/gm24385_q20_2021.10/, filename gm24385_q20_2021.10/analysis/20210805_1713_5C_PAH79257_0e41e938/guppy_5.0.15_sup/sequencing_summary.txt).
longreadsum seqtxt -i $INPUT_FILE -o $OUTPUT_DIRECTORY
This section describes how to generate QC reports for FASTQ files (data shown is HG002 ONT 2D from the GIAB FTP index).
longreadsum fq -i $INPUT_FILE -o $OUTPUT_DIRECTORY
This section describes how to generate QC reports for FASTA files (data shown is HG002 ONT 2D from GIAB FTP index).
longreadsum fa -i $INPUT_FILE -o $OUTPUT_DIRECTORY
For release history, please visit the releases page at https://github.com/WGLab/LongReadSum/releases.
Please refer to the LongReadSum issue pages for posting your issues. We will respond to your questions quickly. Your comments are critical to improving our tool and will benefit other users.
Perdomo, J. E., Ahsan, M. U., Liu, Q., Fang, L. & Wang, K. LongReadSum: A fast and flexible quality control and signal summarization tool for long-read sequencing data. bioRxiv, 2024.08.05.606643, doi:10.1101/2024.08.05.606643 (2024).