- Improved automatic estimation of sequencing error for estimating unknown abundances/coverages.
Explanation:
The `-u` option estimates the percentage of sequences that are "unknown" (i.e. not captured by the database) as well as the "true" coverage. This requires knowledge of the sequencing error. Previous versions failed when the sample was too diverse relative to sequencing depth (e.g. low-throughput sequencing or complex metagenomes such as ocean or soil).
New fallback added: for short reads only, if the diversity is too high relative to sequencing depth (average k-mer depth < 3), then 99.5% is used as a fallback sequence identity estimate.
- A new column called `kmers_reassigned` is now in the profile output. This states how many k-mers are lost due to reassignment for that particular genome.
- `-1`, `-2` options are now available for `sylph profile`. You can now do `sylph profile database.syldb -1 1.fq -2 2.fq ...`
- Scalable cuckoo filters are now used for read deduplication for memory savings.
- Deduplication algorithm improved. v0.5.0 worked poorly on highly (>15%) duplicated read sets.
- Shorter reads can be sketched now: down to 32 bp, instead of 63 bp before.
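The fallback rule described above for `-u` can be sketched as a small decision function. This is only an illustration of the rule as stated in these notes, not sylph's implementation: the function name, its arguments, and the idea that the regular estimator hands in a value (or `None`) are assumptions.

```python
from typing import Optional

FALLBACK_IDENTITY = 0.995   # 99.5%, as stated in the release note
MIN_AVG_KMER_DEPTH = 3.0    # "avg k-mer depth < 3" threshold from the release note


def pick_sequence_identity(avg_kmer_depth: float,
                           estimated_identity: Optional[float],
                           short_reads: bool = True) -> Optional[float]:
    """Return the sequence identity to use for unknown/coverage estimation.

    Hypothetical helper: falls back to 99.5% only for short reads whose
    average k-mer depth is below the threshold; otherwise it passes through
    whatever the regular estimator produced.
    """
    if short_reads and avg_kmer_depth < MIN_AVG_KMER_DEPTH:
        return FALLBACK_IDENTITY
    return estimated_identity
```

For example, `pick_sequence_identity(2.0, None)` returns `0.995`, while a deeply sequenced sample keeps its own estimate.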
In previous versions, sylph underperformed on real Illumina datasets. See #5.
This is because many real Illumina datasets have a non-trivial number of duplicate reads. Duplicate reads mess up sylph's statistical model.
For the single and paired sketching options, a new deduplication routine has been added. This will be described in version 2 of our preprint.
This increases sketching memory by 3-4x but greatly improves performance on real datasets with more than 1-2% duplication, especially for low-abundance genomes.
For paired-end Illumina reads with non-trivial (> 1%) duplication, sylph can now
- detect many more low-abundance species below 0.3x coverage
- give better coverage/abundance estimates for low-abundance species
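To make the idea of read deduplication concrete, here is a minimal sketch that drops exact duplicate read pairs and reports the duplication fraction. It is not sylph's routine, whose details are deferred to the preprint: the function name is hypothetical, duplicates are assumed to be exact sequence matches, and a plain Python `set` stands in for the probabilistic, memory-saving filter a real tool would use.

```python
from typing import Iterable, List, Tuple

ReadPair = Tuple[str, str]  # (read 1 sequence, read 2 sequence)


def deduplicate_pairs(pairs: Iterable[ReadPair]) -> Tuple[List[ReadPair], float]:
    """Keep the first occurrence of each read pair; return (unique pairs, duplicate fraction).

    Assumption: two pairs are duplicates only if both sequences match exactly.
    """
    seen = set()            # exact set, used here instead of a cuckoo filter
    unique: List[ReadPair] = []
    total = 0
    for pair in pairs:
        total += 1
        if pair in seen:
            continue        # duplicate: skip so it is not counted twice
        seen.add(pair)
        unique.append(pair)
    dup_fraction = 0.0 if total == 0 else 1.0 - len(unique) / total
    return unique, dup_fraction
```

An exact set grows with the number of distinct reads, which is the memory cost that approximate membership structures are meant to reduce.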
- Sequence sketches (sylsp) have changed formats. Sequences will need to be re-sketched.
- `--read-length` option removed and incorporated into the sketches by default (suggested by @fplaza).
- New warning when `-o` is specified and only reads are sketched (#7)
- You can now rename sylph samples by specifying a sample naming file with `--sample-names` or `--lS` (suggested by @jolespin)
- Newline delimited files are available in `profile` and `query` now (suggested by @jolespin)
- small changes for help text, options, and output texts.
- Renamed `sylph contain` to `sylph query`.
- Methods for sketching are drastically different now. E.g. we use `-g genome1.fa genome2.fa` for specifying genomes and `-r read1.fa read2.fq` for specifying reads when sketching.
- `-u` or `--estimate-unknown` options are now present for estimating unknown organisms in the sample.
- When using `-u`, associated options `--read-seq-id` and `--read-len` are available for calculating true coverages with sylph, i.e., coverages concordant with read mapping.
- Coverage calculation is slightly different now.
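Why read length and sequence identity matter for a "true" coverage can be seen from a generic back-of-the-envelope relation between k-mer coverage and read-mapping coverage. The sketch below is that textbook relation under stated assumptions, not sylph's confirmed formula: the function name is hypothetical, `k=31` is an assumed k-mer size, and errors are assumed independent per base.

```python
def kmer_to_true_coverage(kmer_cov: float, read_len: int,
                          seq_identity: float, k: int = 31) -> float:
    """Convert an observed k-mer coverage into an approximate base-level coverage.

    Assumptions (not taken from the release notes):
      * a read of length L yields L - k + 1 k-mers, so k-mer coverage is
        reduced by the factor (L - k + 1) / L;
      * a k-mer is usable only if all k bases are correct, which happens
        with probability seq_identity ** k.
    """
    if read_len < k:
        raise ValueError("read length must be at least k")
    if not 0.0 < seq_identity <= 1.0:
        raise ValueError("sequence identity must be in (0, 1]")
    frac_positions = (read_len - k + 1) / read_len
    frac_error_free = seq_identity ** k
    return kmer_cov / (frac_positions * frac_error_free)
```

Under these assumptions, shorter reads and lower identity both push the corrected coverage further above the raw k-mer coverage.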
Continuing development of sylph taxonomic profiling.
- `--pseudotax` option in previous version is now a new command called `profile`.
- Databases are enabled for profiling by default.
- Changed file suffixes to `syldb` and `sylsp`.
- Default parameter changes. --min-spacing is set to 30 now.
- Made profiling faster with some algorithmic tweaks.
- Coverage calculated slightly differently
- Many small software changes with respect to threading and outputs
- Sylph's *.sylqueries are no longer compatible with older versions of sylph (< v0.2). Files will need to be resketched.
- Fixed a major bug for the `--pseudotax` option that required redesigning file formats. Please use `--enable-pseudotax` when using `contain --pseudotax` from now on.
- `--pseudotax` option gives relative abundances now. We are gaining some confidence that this approach gives a rough, but surprisingly decent taxonomic classification.
- Changed how `Eff_cov` is calculated. We just use the median coverage now, except when we apply coverage-adjustment.
- Fixed command line ambiguity for sketching outputs. `-s` has been replaced with `-d` for `sylph sketch`.
- Sylph outputs the results after processing every sample, instead of batching results, now.
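As a small illustration of the `Eff_cov` change noted above, a median over per-k-mer multiplicities is far less sensitive to a few repeat-inflated k-mers than a mean. This is a sketch only: the function name is hypothetical, which k-mers enter the median in sylph is not specified in these notes, and the coverage-adjustment case is not modeled.

```python
from statistics import median
from typing import Sequence


def median_kmer_coverage(kmer_counts: Sequence[int]) -> float:
    """Median multiplicity of a genome's k-mers as observed in the sample.

    Assumption: `kmer_counts` holds one count per matched k-mer.
    """
    if not kmer_counts:
        raise ValueError("no k-mer counts given")
    return float(median(kmer_counts))


counts = [1, 2, 2, 3, 40]                 # one k-mer inflated by a repeat
robust = median_kmer_coverage(counts)     # stays at the typical depth
naive = sum(counts) / len(counts)         # pulled upward by the outlier
```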
- Added `--pseudotax` option, similar to the `-w` option in mash screen, where k-mers are assigned to the highest ANI genome so redundancy is removed. The output is a very rough taxonomic classification of the sample.
- Some fixes and parameter changes from the v0.0.x releases.
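The winner-take-all idea behind `--pseudotax` (each k-mer goes to the highest-ANI genome that contains it) can be sketched as follows. This is a toy illustration, not sylph's code: the function name, the dictionary inputs, and the tie-breaking by genome name are assumptions. The per-genome count of lost k-mers is the kind of quantity that a column such as `kmers_reassigned` reports.

```python
from typing import Dict, Set, Tuple


def reassign_kmers(genome_kmers: Dict[str, Set[str]],
                   genome_ani: Dict[str, float]
                   ) -> Tuple[Dict[str, Set[str]], Dict[str, int]]:
    """Assign every shared k-mer to the highest-ANI genome that contains it.

    Returns (k-mers kept per genome, number of k-mers each genome lost).
    """
    # Visit genomes from highest to lowest ANI; ties are broken by name (assumption).
    order = sorted(genome_kmers, key=lambda g: (-genome_ani[g], g))
    claimed: Set[str] = set()
    kept: Dict[str, Set[str]] = {}
    lost: Dict[str, int] = {}
    for genome in order:
        kept[genome] = genome_kmers[genome] - claimed   # drop k-mers already taken
        lost[genome] = len(genome_kmers[genome]) - len(kept[genome])
        claimed |= kept[genome]
    return kept, lost
```

With two genomes sharing k-mers, the higher-ANI genome keeps all of its k-mers and the other keeps only those unique to it, which is how redundancy between closely related genomes is removed.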