--sample-id example_sample
: Set a sample ID, or override the alignment file sample ID. This will be included in JSON output, but not TSV output.--min-reads ##
: Minimum number of supporting reads needed to make a call. Default: 4--min-allele-reads ##
: Minimum number of supporting reads needed to call a specific allele size. Default: 2--max-reads ##
: Maximum number of supporting reads to use for calling a locus. Default: 250--min-avg-phred ##
: Minimum average PHRED score for relevant bases (flanking region + tandem repeat). Read segments with average PHRED scores below this (common with a threshold of ~13 and ONT Ultra Long reads, for example) will be skipped. Default: 13--min-read-align-score #.#
: Minimum normalized read alignment score (fractional;0.0
to1.0
) needed to include a read in a call. A good value for pure tandem repeats is 0.9. A good value for much more lenient genotyping is anywhere from 0.0-0.4. Default: 0.9--max-rcn-iters ##
: Maximum number of read copy-number counting iterations to perform. Loci which require a lot of iterations are probably impure tandem repeats, for which the resulting copy number will not be very accurate anyway. Default: 50--flank-size ##
: Size of the flanking region to use on either side of a region to properly anchor reads. Default: 70--realign
or-a
: Whether to perform local re-alignment to attempt recovery of soft-clipped reads. Some aligners may soft-clip around large insertions, e.g. with an expansion (I've noticed this with pbmm2/minimap2). Currently recommended for HiFi or ONT R10 only, since this step aggressively filters out realignments with many mismatches or small indels. Enabling this slows down calling, so it may not be suitable for a very large catalog of tandem repeats.--hq
: Whether to treat provided reads as "high quality", i.e., fairly close to the actual true sequence. Used when detecting expansions, to skip a smoothing filter that may ignore disparate, rare expansion-like read counts. Use for CCS reads or similar data (e.g., R10 nanopore data) ONLY! Default: off--use-hp
: Whether to incorporateHP
tags from a haplotagged alignment file. This should speed up runtime and will potentially improve calling results. This flag is experimental, and has not been tested extensively.--skip-supplementary
or--skip-supp
: Skip supplementary alignments. Default: off--skip-secondary
or--skip-sec
: Skip secondary alignments. Default: off--incorporate-snvs [path]
or--snv [path]
: A path to a VCF with SNVs to incorporate into the calling process and final output. This file is just used as an SNV loci catalog; STRkit itself will perform the SNV calling. Empirically improves calling quality a small amount, speeds up runtime, and gives nearby SNV calls for downstream analysis. You can find dbSNP VCFs athttps://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/
. The file for GRCh38 is called00-common_all.vcf.gz
as of time of writing.--snv-min-base-qual [int]
or--min-sbq [int]
: Minimum PHRED quality score for bases of SNVs to use for phasing. Default: 20--targeted
or-t
: Turn on targeted genotyping mode, which re-weights longer reads differently. Use this option if the alignment file contains targeted reads that do not reflect normal mapping patterns, e.g. from PacBio No-Amp Targeted Sequencing. Default: off--respect-ref
or-e
: Turn off reference TR region 'coordinate extension' from what is specified in the catalog. TR boundaries can be blurry, so by default we give STRkit an opportunity to extend the provided region to improve mapped indel capturing and to be consistent with the approach we use to count repeat copies in non-reference samples. Turning this off should give results closer to other STR callers, at the cost of potentially missing variation.--count-kmers
or-k
: Turn on motif-sized k-mer counting at the allele level, with-k peak
, or at the read level, with-k read
, or both with-k both
. If the flag is provided with no value, it will default topeak.
Note that k-mer counts will only be reported if a--json
path is specified. This feature can be used to detect motif composition differences between alleles or samples. Default:none
--consensus
or-c
: Turn on consensus calculation for alleles. This adds runtime, but gives a better idea of STR structure and is useful for comparing alleles beyond copy number. If--vcf
is set, this option is forced on. Default: off--vcf-anchor-size
: Number of bases upstream (5') of the tandem repeat to include in the VCF output. This can include small indels, and having a size above1
may be beneficial or detrimental to the use case at hand, but is nice for benchmarking and in case of slight misalignment. This is clamped to being in the range of[1, flank_size]
. Default: 5--num-bootstrap ###
or-b
: Now many bootstrap re-samplings to perform. Default: 100--sex-chr ??
or-x
: Sex chromosome configuration. Without this, loci in sex chromosomes will not be genotyped. Can be any configuration of Xs and Ys; only count matters. Default: none--json [path]
or-j
: Path to output JSON call data to. JSON call data is more detailed than thestdout
TSV output. If the value passed isstdout
, the JSON data will be written tostdout
instead of a file. Default: none--indent-json
or-i
: If passed alongside--json [x]
, the JSON output will be indented to be more human-readable but less compact. Default: off--vcf [path]
: Path to output VCF-formatted call data to. Setting this option forces the--consensus
option as well in order to output true REF/ALT values, which slows down runtime somewhat. If the value passed isstdout
, the VCF data will be written tostdout
instead of a file. Default: none--no-tsv
: Suppresses TSV output tostdout
. Without--json
, no output will be generated, which isn't very helpful. Default: TSV output on--seed
: Seed the random number generator used for all random sampling, Gaussian mixture modeling, etc. Useful for replicability.--log-level [level]
: Log level. Value must be oferror
,warning
,info
, anddebug
. Be careful with thedebug
log level, as it can produce gigabytes of logs for a large run. Default:info
.
We have tested STRkit on three different clusters associated with the Digital Research Alliance of Canada (formerly Compute Canada).
Usage is pretty straightforward; for our use cases we set up a Python virtual environment
with the strkit
package installed, and ran a SLURM batch job which looks something like:
#!/bin/bash
#SBATCH --mem=16G
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=10
#SBATCH --time=1-00
#SBATCH --account=rrg-xxxxx
module load python/3.9
cd /home/xxxxx || exit
source env/bin/activate
strkit call \
--loci /path/to/catalog \
--ref /path/to/ref.fa.gz \
--processes 10 \
--seed 342 \
path/to/sample.bam
deactivate