MIST is a metagenomic intra-species typing technique that was developed primarily for clinical specimens with low pathogen loads. Hopefully, it will aid in strain-level diagnoses of bacterial infections as well as public health epidemiology and surveillance. Its algorithm contains the following three characteristics.
- Based on average nucleotide identity (ANI), reference genomes are clustered into hierarchical levels to resolve the ambiguous definition of “strain”.
- Maximum likelihood estimation is conducted upon the reads’ mismatch values to infer the compositional abundance.
- Read ambiguity is used to infer the abundance uncertainty, and the similarity to reference genomes is used to predict the presence of novel strains.
MIST contains four modules:
Index
,Cluster
,Species
,Strain
,Bootstrap
. Its workflow is depicted in the figure below. MIST depends on the counts of reads that are matched to the pan-genomes of each pathogen species for species-level typing (panel A). Next, MIST prepares a hierarchical database of reference genomes based on ANI grouping for strain-level typing (panel B). By matching reads to all of the species' reference genomes, the scores for each alignment are transformed to posterior probabilities that indicate the likelihood of the sequence read of being allocated to each cluster. The probability matrix is then employed using the maximum likelihood estimation (MLE) to infer the abundance of each cluster.
Note: Red box, the module Species
; yellow box, the modules Index
and Cluster
; purple box, the module Strain
.
- Linux system
- Python = 3.6
- GCC >= 4.8
- Bowtie 2
- FastANI
- Python modules:
networkx
,pandas
,matplotlib
,numpy
,scipy
,scikit-learn
,joblib
,click
.
- Pre-built pangenome: This folder is used in the
species
module, which include Bowtie-indexed pan-genomes of14
bacterial species (listed below). - Pre-built Bowtie indexed reference genomes: For each species, their complete genomes deposited from NCBI Genbank database are downloaded and are Bowtie-indexed with the module
index
. These pre-built Bowtie index files are used in combination with the pre-built clustering files in the modulestrain
.
- Pre-built clustering files: This folder contains the clustering files (obtained from the module
cluster
) for the14
pathogens and is used in the modulestrain
. Same as above, these clustering files can only be used in combination with the Bowtie-indexed reference genomes.
Note: In additional to the pre-built database above, you can customize your own database by the following steps:
- Downloading reference genomes of a certain species in FASTA format.
- Cluster these reference genomes by the
cluster
module. - Bowtie-index these reference genomes by the
index
module.
$ conda create -n MIST -c conda-forge -c bioconda python=3.6 fastANI bowtie2
$ conda activate MIST
$ git clone https://github.com/pandafengye/MIST.git
$ cd MIST
$ pip install -r requirements.txt --default-timeout=1000 # Install related python dependencies
$ python MIST.py # Test install
MIST contains four modules: index
, cluster
, species
, strain
and bootstrap
.
This module functions to index the reference genomes with Bowtie2 indexer (Bowtie2-build). Once the reference genomes are indexed, users will not need to re-index the genome before each analysis of metagenomics datasets.
$ python MIST.py index --refdir Example_Dir/input/ref_dir/ --output Example_Dir/output/
-i, --refdir PATH
Path to the reference genome folder; All reference genomes should be in FASTA format and put in the same folder; each file represents one reference genome, with a *.fa prefix.
-o, --output PATH
Output folder saving the index files for reference genomes. The base name of the index is the same as the reference genome.
This module functions to assign reference genomes into clusters at user-defined levels. MIST calls FastANI program to calculate ANI for estimation of pairwise genetic distance, based on which the reference genomes are divided into clusters. Same as the index
module, once the clusters are established, users are not required to run this module before each independent job.
$ python MIST.py cluster --threads 8 --refdir Example_Dir/input/ref_dir/ --cutoff 0.98,0.99,0.999 --output Example_Dir/output/
-t, --threads INTEGER Number of threads for ANI
-i, --refdir PATH
input folder of reference genomes
-o, --output PATH
matrix file of the clustered reference genomes
-s, --cutoff TEXT
list of similarity thresholds (between 0 and 1); separated by comma (e.g. 0.98,0.99,0.999)
This module functions to perform species-level typing. MIST calls Bowtie2 to map the user’s mNGS reads (in .fastq format) against the pan-genomes of each bacterial species and estimate the abundance by counting the reads mapped to each species. The species-specific reads are extracted from the resulting SAM file for the downstream strain-level typing.
$ python MIST.py species --threads 8 --pair_1 Example_Dir/input/read/example_data1.1.fq --pair_2 Example_Dir/input/read/example_data1.2.fq --database Pre-built-pangenome/ --output Example_Dir/output/
-p, --threads INTEGER
Number of threads for Bowtie2 (default: 8)
-1, --pair_1 PATH
input fq file with #1 mate, paired with pair_2 file
-2, --pair_2 PATH
input fq file with #2 mate, paired with pair_1 file
-d, --database PATH
input bowtie2 index file for the pan-genome sequences
-o, --output PATH
output folder which contains: 1) read counts for each pathogen species (_MIST_species_count.txt); 2) reads specific to each pathogen species (_MIST.*.fq).
The pre-built pan-genome index file is available at here. For the reads specific to each pathogen species (_MIST.*.fq), 0.1x sequencing coverage of bacterial genome (e.g. 5000 100-bp reads for a 5-Mb bacterial genome) is usually sufficient for MIST to do strain-level typing. Too many reads (e.g., > 50000 reads) for the subsequent mapping and maximum likelihood estimation would otherwise cause long running time. Users can extract a subset (5000) of reads with the command such as "head –n 20000 _MIST.*.fq > input.fq".
This module functions to map metagenomic sequences against reference genomes using Bowtie2, and to measure the relative abundance of each cluster in the metagenomics dataset.
$ python MIST.py strain --threads 8 --indexpath Example_Dir/output/_MIST_index/ --cluster_output Example_Dir/output/_MIST_ref_cluster.csv --pair_1 Example_Dir/input/read/example_data1.1.fq --pair_2 Example_Dir/input/read/example_data1.2.fq --read_length 200 --genome_size 5000000 --output Example_Dir/output/
-p, --threads INTEGER
Number of threads for Bowtie2 (default: 8)
-c, --cluster_output PATH
input file; the matrix of the clustered reference genomes
-i, --indexpath PATH
input folder of index files for reference genomes; produced by MIST-index module.
-U, --single_end PATH
input single-end fq file;can be the output produced by MIST-species module.
-1, --pair_1 PATH
input fq file with #1 mate, paired with pair_2 file
-2, --pair_2 PATH
input fq file with #2 mate, paired with pair_1 file; either choose the paired input or the single-end input.
-l, --read_length FLOAT
read length
-g, --genome_size INTEGER
genome size (optional)
-o, --output PATH
output folder for mismatch matrix file and alignment output files. A folder _MIST_map_alignment, which contains the mapped .sam files corresponding to each reference genome; a file _MIST_map_Mismatch_matrix.csv, which contains the number of mismatches derived from each read mapping against each reference genome.
This module functions to perform bootstrapping for abundance estimation
$ python MIST.py bootstrap --cluster_output Example_Dir/output/_MIST_ref_cluster.csv --mismatch Example_Dir/output/_MIST_strain/_MIST_map_Mismatch_matrix.csv --bootstrap_numbers 100 --read_length 100 --output_dir Example_Dir/output/
-c, --cluster_output PATH input file; the matrix of the clustered reference genomes (result of the cluster module)
-m, --mismatch PATH input file; _MIST_map_Mismatch_matrix.csv (result of the strain module)
-n, --bootstrap_numbers INTEGER number of bootstrap replications [default: 100]
-l, --read_length INTEGER read length [default: 100]
-o, --output_dir PATH output file of abundance estimation
All output files and directories are described in the table below.
File/Directory | Description | Module |
---|---|---|
_MIST_index/ |
Directory containing the index files for reference genomes. | Index |
_MIST_species/ |
Directory containing contains read counts for each pathogen species and reads specific to each pathogen species. | Species |
_MIST_strain/ |
Directory containing contains output for strain-level typing. | Strain |
_MIST_ref_cluster.csv |
The matrix file of the clustered reference genomes. | Cluster |
_MIST_map_Mismatch_matrix.csv |
The number of mismatches derived from each read mapping against each reference genome. | Strain |
_MIST_0.98_measure.csv |
Estimated abundance for each cluster at the 98% ANI level. | Strain |
_MIST_0.99_measure.csv |
Estimated abundance for each cluster at the 99% ANI level. | Strain |
_MIST_0.999_measure.csv |
Estimated abundance for each cluster at the 99.9% ANI level. | Strain |
In this example, we will demonstrate how to use MIST to identify bacterial strains in the mNGS dataset in a standard manner.
- Step 1: Species-level typing. Assume you have a pair of mNGS reads, which are derived from paired-end sequncing. Here we use example_data1.1.fq and example_data1.2.fq, which are located in the
Example_Dir/input/read/
. We also download the Bowtie-indexed pangenomes of the14
common bacterial species, and map the FASTQ reads against the genomes using the modulespecies
.
$ python MIST.py species --threads 8 --pair_1 Example_Dir/input/read/example_data1.1.fq --pair_2 Example_Dir/input/read/example_data1.2.fq --database Pre-built-pangenome/ --output Example_Dir/output/
In the output folder Example_Dir/output/_MIST_species/
, the result file species_count.txt
reveals that in the mNGS reads there are a total of 18628
E. coli reads as well as a few from other species. You may assume that E. coli is the causative pathogen and choose E. coli for the subsequent strain-level typing accordingly. The result file _MIST.Escherichia_coli.fq
stores the E. coli reads retrieved from the mNGS dataset.
No. | Species | Read count |
---|---|---|
0 |
Escherichia_coli | 18628 |
1 |
Salmonella_enterica | 224 |
2 |
Klebsiella_pneumoniae | 89 |
3 |
Acinetobacter_baumannii | 4 |
- Step 2: Strain-level typing. We need to first download the pre-built Bowtie-indexed E. coli reference genomes and the pre-built E. coli clustering files (place them in the directory
Example_Dir/input/
), and run the moduleStrain
.
python MIST.py strain --threads 8 --indexpath Example_Dir/input/Escherichia_coli_MIST_index/ --single_end Example_Dir/output/_MIST_species/_MIST.Escherichia_coli.fq --read_length 100 --cluster_output Example_Dir/input/Escherichia_coli.MIST_ref_cluster.csv --genome_size 5000000 --output Example_Dir/output/
Note: The input file _MIST.Escherichia_coli.fq
is derived from Step 1
.
The input clustering file Escherichia_coli.MIST_ref_cluster.csv
looks like this, in which each reference genome (each row) will be assigned to a certain cluster at a certain ANI threshold (e.g., 0.98
, 0.99
and 0.999
).
Strain | 0.98 | 0.99 | 0.999 |
---|---|---|---|
AE005174.fa |
0 | 0 | 0 |
AE005674.fa |
0 | 1 | 1 |
AE014073.fa |
0 | 1 | 1 |
AE014075.fa |
1 | 9 | 2 |
AM946981.fa |
0 | 2 | 125 |
In the output folder Example_Dir/output/_MIST_strain/
, the most important result files _MIST_0.98_measure.csv
, _MIST_0.99_measure.csv
, and _MIST_0.999_measure.csv
correspond to the ANI levels listed in the clustering file. In the _MIST_0.999_measure.csv
, for example, we can see that there are two clusters in the query reads, with cluster 357
accounting for 51.07%
and cluster 4
accounting for 47.57%
. The column unique best reads
and shared best reads
mean the number of reads with maximum mapping score unique to the cluster and the number of reads with maximum mapping score shared with other clusters, respectively. The average similarity of these reads against the reference genome of the clusters is listed in the column similarity
. The depth is calculated based on the number of the best reads (unique plus shared) and the input genome size.
Number | Cluster | Abundance | Unique_best_reads | Shared_best_reads | Similarity | Depth |
---|---|---|---|---|---|---|
0 |
357 | 0.5107 | 0 | 11412 | 0.9968 | 0.2282 |
1 |
4 | 0.4757 | 0 | 10869 | 0.9969 | 0.2174 |
When the organism in which you are interested is not in the list of pre-built database, you need to customize your own database. For example, after you run Step 1 of Example 1, you have speculated E. coli is the probable pathogen, and assume the pre-built database of E. coli is not provided, please do as follows.
- Step 1: Suppose you retrieve five E. coli genomes (in FASTA format) from NCBI or other database and save them under the directory
Example_Dir/input/ref_dir/
. Firstly, build the Bowtie2-index files for the genomes by using the moduleindex
.
$ python MIST.py index --refdir Example_Dir/input/ref_dir/ --output Example_Dir/output/
- Step 2: Then you will see in the
Example_Dir/output
directory there are five subdirectories, which correspond to the five E. coli genomes. Secondly, assign the reference genomes into clusters at certain ANI levels by running the modulecluster
. If you have no idea of how to set the ANI thresholds, just try 0.98, 0.99, 0.999, 0.9999.
$ python MIST.py cluster --threads 8 --refdir Example_Dir/input/ref_dir/ --cutoff 0.98,0.99,0.999 --output Example_Dir/output/
In the output clustering file Example_Dir/output/_MIST_ref_cluster.csv
, you will see that, at the 98%
ANI level, the genome CP002729.fa
, CU928160.fa
, CP017979.fa
and AP012030.fa
belong to the same cluster; but at the 99%
level, CU928160.fa
splits and forms its independent cluster.
Genome | 0.98 | 0.99 | 0.999 |
---|---|---|---|
CP002729.fa |
0 | 1 | 2 |
CU928160.fa |
0 | 0 | 0 |
CP017979.fa |
0 | 1 | 1 |
AP012030.fa |
0 | 1 | 1 |
CU928163.fa |
1 | 2 | 3 |
- Step 3: Run the module
Strain
using the preparedindex
output directory and the_MIST_ref_cluster.csv
file.
$ python MIST.py strain --threads 8 --indexpath Example_Dir/output/_MIST_index/ --cluster_output Example_Dir/output/_MIST_ref_cluster.csv --pair_1 Example_Dir/input/read/example_data1.1.fq --pair_2 Example_Dir/input/read/example_data1.2.fq --read_length 100 --genome_size 5000000 --output Example_Dir/output/
Occasionally the strain in the clinical samples is not represented by the database, which we call a novel strain. Under such circumstance, MIST may still assign it to a certain cluster in the database, but you can predict if there is a novel strain based on the similarity provided by the output. In this example, we use the query reads derived from Shigella dysenteriae. This is a pathovar of E. coli and is distantly related to the five genomes in Example 2. We perform the strain-level typing using the database prepared in Example 2.
$ python MIST.py strain --threads 8 --indexpath Example_Dir/output/_MIST_index/ --pair_1 Example_Dir/input/read/example_data2.1.fq --pair_2 Example_Dir/input/read/example_data2.2.fq --read_length 100 --cluster_output Example_Dir/output/_MIST_ref_cluster.csv --output Example_Dir/output/
Where example_data2.1.fq
and example_data2.2.fq
are the paired Shigella reads. In the _MIST_0.98_measure.csv
, it shows the reads are assigned to cluster 0
and 1
, with a similarity of 0.9786
and 0.9762
. Because this estimate is performed at the 98%
ANI level, the query reads and the assigned cluster should have a similarity above 98%
. Hence, a similarity below 0.98
indicates that the query reads actually do not belong to the clusters but represent a novel one. Notably, this speculation is based on the hypothesis that the sequencing errors are so few that can be ignored, since a low similarity can also result from high rate of sequencing errors.
Number | Cluster | Abundance | Unique_best_reads | Shared_best_reads | Similarity |
---|---|---|---|---|---|
0 |
0 | 0.8401 | 0 | 757 | 0.9786 |
1 |
1 | 0.1599 | 2 | 298 | 0.9762 |
-
How long does it take to process a mNGS sample?
MIST is thorough and accurate, but not particularly fast. The MIST-strain step of the pipeline can take a while to complete. Two main factors influence the running time: the number of reads (more reads take longer to align) and the size of the pre-built database (i.e., the number of reference genomes). By default the modulestrain
uses8
threads only, but using more threads (with the --threads option) may speed up the calculation. -
Does MIST support the data from long sequencing platform, i.e., Oxford Nanopore?
To our knowledge, genetic differences across strains are mostly represented in strain-specific indel regions and SNPs. Long-read sequencing may well address the former, but calls SNPs poorly with a limited sequencing depth, in particular for mNGS dataset which is usually characterized by an extremely low bacterial load. We are quite confident of the future of long-read sequencing. MIST can be quite easily adjusted from short-read sequencing to long-read sequencing as its algorithm is relatively independent from the sequencing technology.