Modkit

A bioinformatics tool for working with modified bases from Oxford Nanopore. Its primary purpose is converting modBAM files to bedMethyl files using best practices, but it can also manipulate modBAM files and generate summary statistics. Detailed documentation and a quick-start guide can be found in the online documentation.

Installation

Pre-compiled binaries are provided for Linux from the release page. We recommend the use of these in most circumstances.

Building from source

The provided packages should be used where possible. We understand that some users may wish to compile the software from its source code. To build modkit from source, use cargo:

git clone https://github.com/nanoporetech/modkit.git
cd modkit
cargo install --path .
# or
cargo install --git https://github.com/nanoporetech/modkit.git

Usage

Modkit comprises a suite of tools for manipulating modified-base data stored in BAM files. Modified base information is stored in the MM and ML tags (see section 1.7 of the SAM tags specification). These tags are produced by contemporary basecallers of data from Oxford Nanopore Technologies sequencing platforms.

Constructing bedMethyl tables

A primary use of modkit is to create summary counts of modified and unmodified bases in an extended bedMethyl format. bedMethyl files tabulate the counts of base modifications from every sequencing read over each reference genomic position.

In its simplest form modkit creates a bedMethyl file using the following:

modkit pileup path/to/reads.bam output/path/pileup.bed --log-filepath pileup.log

No reference sequence is required. A single file (described below) with base count summaries will be created. The final argument here specifies an optional log file output.

The program performs best-practices filtering and manipulation of the raw data stored in the input file. For further details see filtering modified-base calls.

For user convenience the counting process can be modulated using several additional transforms and filters. The most basic of these is to report only counts from reference CpG dinucleotides. This option requires a reference sequence in order to locate the CpGs in the reference:

modkit pileup path/to/reads.bam output/path/pileup.bed --cpg --ref path/to/reference.fasta
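To make concrete what "locating the CpGs in the reference" means, here is a minimal Python sketch that finds the 0-based position of every C followed by a G in a sequence string. The function name and the toy sequence are hypothetical and purely illustrative. This is not modkit's implementation.

```python
def find_cpg_positions(sequence: str) -> list[int]:
    """Return the 0-based position of each C that is immediately followed by a G."""
    seq = sequence.upper()  # treat soft-masked (lowercase) bases the same way
    positions = []
    start = seq.find("CG")
    while start != -1:
        positions.append(start)
        start = seq.find("CG", start + 1)
    return positions


# Toy reference sequence (made up for illustration).
reference = "ACGTTCGGACCGA"
print(find_cpg_positions(reference))  # [1, 5, 10]
```

Only reference positions found this way would be reported when restricting output to CpG dinucleotides, which is why a reference FASTA is needed for this option.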

The program also contains a range of presets which combine several options for ease of use. The traditional preset,

modkit pileup path/to/reads.bam output/path/pileup.bed \
  --ref path/to/reference.fasta \
  --preset traditional

performs three transforms:

restricts output to locations where there is a CG dinucleotide in the reference,
reports only C and 5mC counts, using procedures to take into account counts of other forms of cytosine modification (notably 5hmC), and
aggregates data across strands. The strand field of the output will be marked as '.' indicating that the strand information has been lost.

Using this option is equivalent to running with the options:

modkit pileup --cpg --ref <reference.fasta> --ignore h --combine-strands

For more information on the individual options see the Advanced Usage help document.
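The strand aggregation step can be sketched in Python as follows. This is a simplified illustration, assuming that at a CpG site the positive-strand cytosine at position p is paired with the negative-strand cytosine at position p + 1 and that their counts are summed. The function name, the input layout, and the counts are hypothetical; modkit's actual procedure may differ in detail.

```python
from collections import defaultdict


def combine_strands(counts):
    """Sum per-strand CpG counts into one record per CpG.

    counts maps (position, strand) -> (n_mod, n_canonical).
    Assumption: a '-' strand C at position p + 1 belongs to the CpG
    whose '+' strand C sits at position p.
    """
    combined = defaultdict(lambda: [0, 0])
    for (pos, strand), (n_mod, n_canonical) in counts.items():
        cpg_start = pos if strand == "+" else pos - 1
        combined[cpg_start][0] += n_mod
        combined[cpg_start][1] += n_canonical
    return {pos: tuple(values) for pos, values in sorted(combined.items())}


# Made-up counts for two CpG sites; the second has only '+' strand coverage.
per_strand = {
    (100, "+"): (8, 2),
    (101, "-"): (6, 4),
    (250, "+"): (1, 9),
}
print(combine_strands(per_strand))  # {100: (14, 6), 250: (1, 9)}
```

After this step there is no single strand to report for a combined record, which is why the strand field is written as '.'.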

Description of bedMethyl output

Below is a description of the bedMethyl columns generated by modkit pileup. A brief description of the bedMethyl specification can be found on ENCODE.

Definitions:

N_mod - Number of calls passing filters that were classified as a residue with a specified base modification.
N_canonical - Number of calls passing filters that were classified as the canonical base rather than modified. The exact base must be inferred from the modification code. For example, if the modification code is m (5mC) then the canonical base is cytosine. If the modification code is a, the canonical base is adenine.
N_{other_mod} - Number of calls passing filters that were classified as modified, but with a modification different from the one listed in this row (on the same canonical base). For example, for a given cytosine there may be 3 reads with h calls, 1 with a canonical call, and 2 with m calls. In the bedMethyl row for h, N_{other_mod} would be 2. In the m row, N_{other_mod} would be 3.
N_{valid_cov} - The valid coverage: N_{valid_cov} = N_mod + N_{other_mod} + N_canonical. Also used as the score in the bedMethyl.
N_diff - Number of reads with a base other than the canonical base for this modification. For example, in a row for h the canonical base is cytosine; if there are 2 reads with C->A substitutions, N_diff will be 2.
N_delete - Number of reads with a deletion at this reference position.
N_fail - Number of calls where the probability of the call was below the threshold. The threshold can be set on the command line or computed from the data (usually failing the lowest 10th percentile of calls).
N_nocall - Number of reads aligned to this reference position, with the correct canonical base, but without a base modification call. This can happen, for example, if the model requires a CpG dinucleotide and the read has a CG->CH substitution such that no modification call was produced by the basecaller.
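The relationship between N_mod, N_{other_mod}, N_canonical and N_{valid_cov} can be checked with a short Python sketch that reproduces the worked example from the definitions (3 h calls, 1 canonical call, 2 m calls at one cytosine). The function name and input representation are hypothetical, and the canonical call is written as `-` only for convenience.

```python
from collections import Counter


def tally_rows(calls, canonical_code="-"):
    """Build one set of counts per modification code from passing calls.

    calls is a list of call codes at a single reference position,
    e.g. ['h', 'm', '-'], where '-' stands for a canonical call.
    """
    counts = Counter(calls)
    n_canonical = counts.get(canonical_code, 0)
    mod_codes = sorted(code for code in counts if code != canonical_code)
    rows = {}
    for code in mod_codes:
        n_mod = counts[code]
        # Calls for every other modification on the same canonical base.
        n_other_mod = sum(counts[c] for c in mod_codes if c != code)
        rows[code] = {
            "N_mod": n_mod,
            "N_other_mod": n_other_mod,
            "N_canonical": n_canonical,
            "N_valid_cov": n_mod + n_other_mod + n_canonical,
        }
    return rows


calls = ["h"] * 3 + ["-"] + ["m"] * 2
rows = tally_rows(calls)
print(rows["h"])  # {'N_mod': 3, 'N_other_mod': 2, 'N_canonical': 1, 'N_valid_cov': 6}
print(rows["m"])  # {'N_mod': 2, 'N_other_mod': 3, 'N_canonical': 1, 'N_valid_cov': 6}
```

Both rows share the same N_{valid_cov} because they describe the same set of passing calls from two points of view.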

bedMethyl column descriptions

column	name	description	type
1	chrom	name of reference sequence from BAM header	str
2	start position	0-based start position	int
3	end position	0-based exclusive end position	int
4	modified base code	single letter code for modified base	str
5	score	Equal to N_{valid_cov}.	int
6	strand	'+' for positive strand, '-' for negative strand, '.' when strands are combined	str
7	start position	included for compatibility	int
8	end position	included for compatibility	int
9	color	included for compatibility, always 255,0,0	str
10	N_{valid_cov}	See definitions above.	int
11	fraction modified	N_mod / N_{valid_cov}	float
12	N_mod	See definitions above.	int
13	N_canonical	See definitions above.	int
14	N_{other_mod}	See definitions above.	int
15	N_delete	See definitions above.	int
16	N_fail	See definitions above.	int
17	N_diff	See definitions above.	int
18	N_nocall	See definitions above.	int
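For downstream analysis, a row with the 18 columns above can be read into a typed record. The Python sketch below assumes the fields are separated by whitespace (tabs or spaces) and uses a hand-written example row, not real modkit output. The class and function names are hypothetical.

```python
from typing import NamedTuple


class BedMethylRow(NamedTuple):
    chrom: str
    start: int
    end: int
    mod_code: str
    score: int
    strand: str
    n_valid_cov: int
    fraction_modified: float
    n_mod: int
    n_canonical: int
    n_other_mod: int
    n_delete: int
    n_fail: int
    n_diff: int
    n_nocall: int


def parse_bedmethyl_line(line: str) -> BedMethylRow:
    """Parse one 18-column row; columns 7-9 (compatibility fields) are skipped."""
    fields = line.split()
    if len(fields) != 18:
        raise ValueError(f"expected 18 fields, found {len(fields)}")
    counts = [int(value) for value in fields[11:18]]  # columns 12-18
    return BedMethylRow(
        fields[0], int(fields[1]), int(fields[2]), fields[3],
        int(fields[4]), fields[5], int(fields[9]), float(fields[10]),
        *counts,
    )


# Hand-written illustrative row matching the 'm' row of the example in the definitions.
example = "\t".join([
    "chr1", "100", "101", "m", "6", "+", "100", "101", "255,0,0",
    "6", "0.3333", "2", "1", "3", "0", "0", "0", "0",
])
row = parse_bedmethyl_line(example)
print(row.n_mod, row.n_valid_cov, row.n_mod / row.n_valid_cov)
```

A useful sanity check when reading such files is that the score equals N_{valid_cov}, and that N_{valid_cov} equals N_mod + N_{other_mod} + N_canonical.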

Description of columns in `modkit summary`:

Totals table

The lines of the totals table are prefixed with a # character.

row	name	description	type
1	bases	comma-separated list of canonical bases with modification calls.	str
2	total_reads_used	total number of reads from which base modification calls were extracted	int
3+	count_reads_{base}	total number of reads that contained base modifications for {base}	int
4+	filter_threshold_{base}	filter threshold used for {base}	float

Modification calls table

The modification calls table follows immediately after the totals table.

column	name	description	type
1	base	canonical base with modification call	char
2	code	base modification code, or `-` for canonical	char
3	pass_count	total number of passing (confidence >= threshold) calls for the modification in column 2	int
4	pass_frac	fraction of passing (>= threshold) calls for the modification in column 2	float
5	all_count	total number of calls for the modification code in column 2	int
6	all_frac	fraction of all calls for the modification in column 2	float
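The difference between the pass_* and all_* columns can be illustrated with a small Python sketch. It assumes that pass_frac is computed relative to all passing calls for the canonical base and all_frac relative to all calls for that base; this denominator choice is an assumption made for illustration, as are the function name, the confidence values, and the threshold.

```python
from collections import Counter


def summarize_calls(calls, threshold):
    """Tabulate calls for one canonical base.

    calls is a list of (code, confidence) pairs, where code is a
    modification code or '-' for canonical.
    """
    all_counts = Counter(code for code, _ in calls)
    pass_counts = Counter(code for code, conf in calls if conf >= threshold)
    total_all = sum(all_counts.values())
    total_pass = sum(pass_counts.values())
    table = []
    for code in sorted(all_counts):
        pass_count = pass_counts.get(code, 0)
        all_count = all_counts[code]
        table.append({
            "code": code,
            "pass_count": pass_count,
            "pass_frac": pass_count / total_pass if total_pass else 0.0,
            "all_count": all_count,
            "all_frac": all_count / total_all,
        })
    return table


# Made-up calls for cytosine with an illustrative threshold of 0.7.
calls = [("m", 0.95), ("m", 0.60), ("-", 0.99), ("-", 0.90), ("h", 0.55)]
for entry in summarize_calls(calls, threshold=0.7):
    print(entry)
```

In this toy input the single h call falls below the threshold, so it contributes to all_count but not to pass_count.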

Advanced usage examples

For complete usage instructions please see the command-line help of the program or the Advanced usage help documentation. Some more commonly required examples are provided below.

To combine multiple base modification calls into one, for example to combine basecalls for both 5hmC and 5mC into a count for "all cytosine modifications" (with code C), the --combine-mods option can be used:

modkit pileup path/to/reads.bam output/path/pileup.bed --combine-mods
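The effect of combining modifications on the counts can be sketched in Python. This simplified illustration relabels every modified call with the combined code C and then counts; it operates on final call labels only and is not a description of how modkit handles the underlying probabilities. The function name and input are hypothetical.

```python
from collections import Counter


def combine_mods(calls, combined_code="C", canonical_code="-"):
    """Relabel every modified call with one combined code and count the result."""
    relabelled = [
        canonical_code if code == canonical_code else combined_code
        for code in calls
    ]
    counts = Counter(relabelled)
    n_mod = counts.get(combined_code, 0)
    n_canonical = counts.get(canonical_code, 0)
    return {
        "code": combined_code,
        "N_mod": n_mod,
        "N_other_mod": 0,  # no other modification remains after combining
        "N_canonical": n_canonical,
        "N_valid_cov": n_mod + n_canonical,
    }


# Same toy position as in the definitions: 3 h calls, 1 canonical, 2 m calls.
calls = ["h"] * 3 + ["-"] + ["m"] * 2
print(combine_mods(calls))
# {'code': 'C', 'N_mod': 5, 'N_other_mod': 0, 'N_canonical': 1, 'N_valid_cov': 6}
```

The valid coverage is unchanged by combining; only the split between N_mod and N_{other_mod} is.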

In standard usage the --preset traditional option can be used as outlined in the Usage section. By more directly specifying individual options we can perform something similar without loss of information for 5hmC data stored in the input file:

modkit pileup path/to/reads.bam output/path/pileup.bed --cpg --ref path/to/reference.fasta \
    --combine-strands

To produce a bedGraph file for each modification in the BAM file the --bedgraph option can be given. Counts for the positive and negative strands will be put in separate files.

modkit pileup path/to/reads.bam output/directory/path --bedgraph <--prefix string>

The --prefix [str] option sets a prefix for the output file names.

Licence and Copyright

Modkit is distributed under the terms of the Oxford Nanopore Technologies, Ltd. Public License, v. 1.0. If a copy of the License was not distributed with this file, You can obtain one at http://nanoporetech.com
