Releases: pachterlab/kallisto

lr-kallisto patches and customizable pseudoalignment features

17 Sep 05:45
0397342

lr-kallisto:

  • Threshold for rate of unmapped kmers per read can be set via --thresholds (this option wasn't functional previously)

New pseudoalignment features (only for exploratory/research use), including:

  • Can use --no-jump to skip the pseudoalignment jumping logic
  • Can use --union to do set union rather than set intersection over equivalence classes
  • Setting --priors when running quant-tcc now works
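
To make the --union bullet concrete: pseudoalignment normally intersects the transcript sets (equivalence classes) of the k-mers in a read, whereas --union keeps every transcript seen by any k-mer. The toy sketch below only illustrates those two set operations with standard shell tools. The transcript IDs and the function names ec_intersection and ec_union are hypothetical, and this is not kallisto's implementation.

```shell
# Toy equivalence classes for two k-mers of one read (hypothetical IDs).
# Assumes each set lists every transcript ID at most once.
kmer1="t1 t2 t3"
kmer2="t2 t3 t4"

# Intersection: IDs that appear in both sets (appear twice after merging).
ec_intersection() {
  printf '%s\n' $1 $2 | sort | uniq -d | paste -sd' ' -
}

# Union: IDs that appear in either set.
ec_union() {
  printf '%s\n' $1 $2 | sort -u | paste -sd' ' -
}

ec_intersection "$kmer1" "$kmer2"   # prints: t2 t3
ec_union "$kmer1" "$kmer2"          # prints: t1 t2 t3 t4
```

The union is never smaller than the intersection, so reads that would otherwise be left without a compatible transcript can still be assigned, at the cost of specificity.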

lr-kallisto

20 Jul 19:16
fa01edd


Main update:

  • Added support for long reads (via the --long option in kallisto bus and kallisto quant-tcc)

Other:

  • Allow compilation with larger k-mer size (set MAX_KMER_SIZE=64 when doing cmake)
  • Can disable optimizations (set COMPILATION_ARCH=OFF and ENABLE_AVX2=OFF when doing cmake)

Improving index features and memory

01 Nov 12:47
34e5814

The kallisto index version is now 13 (kallisto v0.50.0 used index version 12)

New features (kallisto index):

  • Can input priors for the EM algorithm
  • D-list has an overhang option
  • D-list is now stored in a hash table rather than part of the graph
  • Fix some compilation issues in Bifrost
  • Can specify custom k-mers to be D-listed by having an empty fasta header
  • Can specify custom k-mers to be indexed by using --distinguish and assigning each input fasta entry a numerical ID (zero-indexed) in the fasta header

New features (kallisto bus technologies):

  • Kallisto technology string can now have format -x bc:umi:cdna%strand%parity
  • Split-seq defaults to --fr-stranded
  • STORM-seq and VASA-seq now supported as technology options

New kallisto index

27 Jun 10:54
a38143d

kallisto index

The improved kallisto index reduces memory consumption for large FASTA files and features a d-list option to improve k-mer mapping specificity. Additionally, new input and output features have been added as well as support for sample barcodes (which can be recorded in addition to cell barcodes).

New features

  • kallisto quant-tcc: This new command can run the EM algorithm on a supplied transcripts-compatibility counts (TCC) matrix file, such as that generated by "bustools count", to generate transcript-level estimates. When a gene-mapping file is supplied, gene-level abundances will also be outputted. Effective length normalization will only be performed if a kallisto index is supplied and if fragment length information is provided.
  • New technologies were added to "kallisto bus": -x SmartSeq3 (--tag can be used to supply a 5′ tag sequence that identifies UMI-containing reads), -x BDWTA (BD Rhapsody), -x Visium (10x Visium), -x SPLIT-SEQ (SPLiT-seq preprocessing), and -x Bulk (for preprocessing non-demultiplexed Bulk RNA-seq files)
  • "kallisto bus" can be run with -x BULK specified: In this case, it will either process a batch file (supplied via --batch) like in the old "kallisto pseudo" or will process fastQ files supplied directly on the command line, treating each fastQ file or each pair of fastQ files (if --paired is specified) as an individual sample. This is useful for generating BUS files when each sample is in a separate fastQ file. With bustools and kallisto quant-tcc, this feature effectively deprecates the old "kallisto pseudo".
  • Strand-specificity is now enabled by default for 10X, SureCell, CelSeq, BD Rhapsody, and Smart-seq3 UMI technologies (unstranded is default for other technologies) and the user can override this by supplying --fr-stranded, --rf-stranded, and --unstranded options.
  • Various performance improvements (mostly in regards to data ingestion throughput)
  • A minimal form of the kallisto index is outputted in a file named index.saved and a file containing fragment length distributions (flens.txt) is outputted when "kallisto bus" is run on paired-end reads (which can be specified via the option --paired). This is so kallisto quant-tcc can perform effective length normalization should the need arise.
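
As a rough illustration of the batch-file workflow described above, the sketch below writes one line per sample in the form "id read1 read2". That layout follows the convention used by the old "kallisto pseudo" and is an assumption here; the sample IDs, FASTQ names, index name, output directory, and the helper make_batch are all hypothetical placeholders. Check the "kallisto bus" usage text for the exact format your version expects.

```shell
# Emit one batch-file line per sample ID: "<id> <id>_1.fastq.gz <id>_2.fastq.gz".
# File naming is a placeholder convention, not something kallisto requires.
make_batch() {
  for id in "$@"; do
    printf '%s %s_1.fastq.gz %s_2.fastq.gz\n' "$id" "$id" "$id"
  done
}

make_batch sampleA sampleB > batch.txt
cat batch.txt

# The batch file could then be passed to kallisto bus, for example
# (index.idx and bus_out are hypothetical names; command left commented out):
# kallisto bus -x BULK --paired --batch batch.txt -i index.idx -o bus_out
```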

New index

  • A new index is used that is incompatible with the old index, and users should upgrade to this new index for kallisto v0.50.0
  • With the new index, users can set the minimizer length (--min-size) which can tune indexing runtime+memory performance
  • --max-ec-size has been added so that users can cap the size of equivalence classes (i.e. the number of transcripts compatible with a given k-mer); k-mers that exceed this size aren't considered in the pseudoalignment. This can reduce memory usage and increase runtime performance (with some loss of information if --max-ec-size is too small).
  • --threads option now enabled for kallisto index to allow indices to be created in a multithreaded fashion (to improve runtime)
  • --d-list can be used to supply a FASTA file from which distinguishing flanking k-mers will be extracted (to act as a general k-mer filter for improving mapping specificity)
  • --distinguish option is added (where no polyA trimming, etc. occur) and each target is indexed as-is with the targets distinguished from one another by the target name (e.g. two targets can have the same name and be indexed together as a single target)
  • kallisto inspect can output more information: minimizer length, number of unitigs, max EC size, number of ECs discarded (i.e. over the --max-ec-size threshold), and number of D-listed elements (DFKs)
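
The effect of --max-ec-size can be pictured as a simple filter: k-mers whose equivalence class exceeds the cap are dropped from consideration. The sketch below applies that idea to toy text input where each line is a k-mer followed by its compatible transcripts. The input layout and the helper filter_by_ec_size are hypothetical, and kallisto applies the cap inside its index rather than on text like this.

```shell
# Keep only lines whose equivalence class (fields after the k-mer)
# has at most $1 transcripts.
filter_by_ec_size() {
  awk -v max="$1" 'NF - 1 <= max'
}

# Toy input: k-mer, then the transcripts compatible with it.
printf '%s\n' \
  'ACGTA t1 t2' \
  'CGTAC t1 t2 t3 t4 t5' \
  'GTACG t3' | filter_by_ec_size 3
# prints the ACGTA and GTACG lines; CGTAC (5 transcripts) is discarded
```

A small cap discards more k-mers, which is the "loss of information" trade-off mentioned in the bullet.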

New input features

  • --inleaved option added to kallisto bus to support reading in interleaved FASTQ input
  • Streaming FASTQ reads directly into kallisto bus is enabled by supplying - in lieu of FASTQ files
  • -x technology string: the Bustools technology string can read RX:Z: UMIs from FASTQ header comments by supplying something like 0,0,8:RX:1,0,0 (i.e. RX can be supplied in the UMI portion of the technology string)
  • --numReads can be set to terminate after a certain number of reads have been processed
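
To show how a technology string such as 0,0,8:RX:1,0,0 breaks down, the sketch below splits it on colons into its barcode, UMI, and cDNA parts. The helper parse_tech is hypothetical. Reading each numeric triplet as "file index, start, stop" is the usual interpretation and is treated as an assumption here.

```shell
# Split a barcode:umi:cdna technology string into labelled parts.
parse_tech() {
  old_ifs=$IFS
  IFS=:
  set -- $1          # re-split the string on ':' into $1, $2, $3
  IFS=$old_ifs
  printf 'barcode=%s\numi=%s\ncdna=%s\n' "$1" "$2" "$3"
}

parse_tech '0,0,8:RX:1,0,0'
# prints:
# barcode=0,0,8
# umi=RX
# cdna=1,0,0
```

In this example the UMI slot holds RX instead of a coordinate triplet, which is how the string signals that UMIs come from the RX:Z: tag in the FASTQ header comment.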

New sample barcode feature

  • --batch-barcodes in kallisto bus will encode the batch ID as a unique nucleotide sequence in the hidden metadata of the barcode column of the BUS file (i.e. serving as a sample barcode).
  • --batch in kallisto bus now allows a technology string to be supplied (if --batch-barcodes is not supplied, only the barcodes extracted from the technology string are stored in the BUS file [i.e. sample barcodes aren't recorded]; if -1 is supplied in the barcode part of the technology string, only the batch-specific barcodes [i.e. sample barcodes] are stored directly in the BUS file, not in the hidden metadata unless --batch-barcodes is supplied)

New output features

  • kallisto quant-tcc command can output exactly what “kallisto quant” does (including with bootstraps for sleuth) for each barcode into separate abundance.tsv files (if --matrix-to-files is specified) or into separate directories, each containing an abundance.tsv file (if --matrix-to-directories is specified). Also, h5ad will be produced if compiled with that option (unless --plaintext is supplied to quant-tcc).

Other new features

  • Progress is outputted every 1M reads
  • --aa option enabled in kallisto bus and kallisto index for amino acid mapping to nucleotide (functionalities to be described in a paper)

New compilation options

  • HTSLIB is no longer enabled by default; need to use cmake .. -DUSE_BAM=ON
  • Zlib is still compatible and used by default but the better zlib-ng is included and can be used if the given cmake option is supplied.
  • Compilation flags to enable all features are as follows: cmake .. -DZLIBNG=ON -DUSE_BAM=ON -DBUILD_FUNCTESTING=ON -DUSE_HDF5=ON

End of support for existing bulk RNAseq features

  • --bias, --fusion, --genomebam, and --pseudobam in kallisto quant and kallisto bus are no longer supported -- users should use v0.48.0 for use of these features.
  • --gfa, --gtf, and --bed options in kallisto inspect are no longer supported -- users should use v0.48.0 for use of these features.

Increase in generalizability of "kallisto bus"

17 Jan 05:02
83bde90

New features

  • kallisto quant-tcc: This new command can run the EM algorithm on a supplied transcripts-compatibility counts (TCC) matrix file, such as that generated by "bustools count", to generate transcript-level estimates. When a gene-mapping file is supplied, gene-level abundances will also be outputted. Effective length normalization will only be performed if a kallisto index is supplied and if fragment length information is provided.
  • New technologies were added to "kallisto bus": -x SmartSeq3 (--tag can be used to supply a 5′ tag sequence that identifies UMI-containing reads), -x BDWTA (BD Rhapsody), -x Visium (10x Visium), -x SPLIT-SEQ (SPLiT-seq preprocessing), and -x Bulk (for preprocessing non-demultiplexed Bulk RNA-seq files)
  • "kallisto bus" can be run with no technology specified: In this case, it will either process a batch file (supplied via --batch) like in the old "kallisto pseudo" or will process fastQ files supplied directly on the command line, treating each fastQ file or each pair of fastQ files (if --paired is specified) as an individual sample. This is useful for generating BUS files when each sample is in a separate fastQ file. With bustools and kallisto quant-tcc, this feature effectively deprecates the old "kallisto pseudo".
  • Strand-specificity is now enabled by default for 10X, SureCell, CelSeq, BD Rhapsody, and Smart-seq3 UMI technologies (unstranded is default for other technologies) and the user can override this by supplying --fr-stranded, --rf-stranded, and --unstranded options.
  • Various performance improvements (mostly in regards to data ingestion throughput)
  • A minimal form of the kallisto index is outputted in a file named index.saved and a file containing fragment length distributions (flens.txt) is outputted when "kallisto bus" is run on paired-end reads (which can be specified via the option --paired). This is so kallisto quant-tcc can perform effective length normalization should the need arise.

Deprecation

  • "kallisto pseudo" is now deprecated and will be removed in a future release; users should supply batch files of fastQ file names to "kallisto bus" instead

Fixes

  • Issue #319 : header import
  • Issue #272 : "kallisto quant" and "kallisto pseudo" inconsistency (now fixed)

Phasing out HDF5

12 Feb 23:44


For this release, HDF5 is not a required dependency for running kallisto bus for single cell RNA-seq analysis. It is still required for compatibility with sleuth and other downstream tools. By default kallisto will not be built with HDF5 support; this can be enabled by running

cmake  .. -DUSE_HDF5=ON

The binaries for this release are compiled with HDF5 built in, but we will switch away from HDF5 in future versions (coordinated with sleuth).

When running kallisto quant without HDF5 support

  • quant without bootstrapping will create the same files as before, except for abundance.h5
  • quant with bootstrapping, -b, will not perform bootstrapping but displays the following warning
    Warning: kallisto was not compiled with HDF5 support so no bootstrapping will be performed. Run quant with --plaintext option or recompile with HDF5 support to obtain bootstrap estimates.
  • quant with -b k and --plaintext will create the bootstrap values in files bs_abundance_i.tsv for i=0..k-1
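
For reference, the naming pattern bs_abundance_i.tsv for i=0..k-1 expands as in the sketch below. The helper bootstrap_files is hypothetical and only prints the expected names; it does not run kallisto or check that the files exist.

```shell
# Print the bootstrap file names expected for k bootstraps (i = 0 .. k-1).
bootstrap_files() {
  k=$1
  i=0
  while [ "$i" -lt "$k" ]; do
    printf 'bs_abundance_%s.tsv\n' "$i"
    i=$((i + 1))
  done
}

bootstrap_files 3
# prints:
# bs_abundance_0.tsv
# bs_abundance_1.tsv
# bs_abundance_2.tsv
```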

For users relying on HDF5 support, we recommend compiling kallisto with HDF5 or downloading the kallisto binaries.

Over the next releases HDF5 will gradually be phased out and information on bootstraps will be replaced with a new format.

Changes

  • kallisto pseudo outputs a file of transcript ids
  • Fixes #240
  • kallisto bus allows having sequence split across more than one file, closes #226

inDrops v3 and BUS parsing of BAM files

04 Nov 17:18

This release adds options for parsing the inDrops technology (versions 2 and 3 are new) as well as specifying input from BAM files rather than raw FASTQ files.

New BUS technology options

12 Jun 00:33

This version adds the option of specifying an arbitrary single cell technology for the bus command in kallisto.

10xv3 and bug fixes

23 Feb 21:42

This release adds 10xv3 as a technology option for the bus command.

Bug fixes

  • #201 Pseudobam was not being run unless bootstrap was also performed
  • #199 Error when reading UMI files for the pseudo mode.
  • -l flag for bus was inactive.

BUS

16 Nov 14:42

Changes from v0.44.0

BUS

kallisto can now process raw FASTQ files for single cell RNA-Seq and create output in BUS format, which can be further processed using bustools.

To process single cell data, run kallisto with the bus command. To see a list of supported technologies, run it with the --list option:

> kallisto bus --list 
List of supported single cell technologies

short name       description
----------       -----------
10Xv1            10X chemistry version 1
10Xv2            10X chemistry verison 2
DropSeq          DropSeq
inDrop           inDrop
CELSeq           CEL-Seq
CELSeq2          CEL-Seq version 2
SCRBSeq          SCRB-Seq