diff --git a/CHANGELOG.md b/CHANGELOG.md
index 8c0baa8..5cf94e4 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,6 +1,6 @@
 # Changelog
 
-### v0.9.3 - 2023-05-16
+### v0.9.3 - 2023-07-16
 
 - `kmcp compute/split-genomes`:
     - fix a bug in chunk computation when splitting circular genomes (`--circular`).
@@ -18,6 +18,10 @@
         20:00:55.295 [INFO] 99.3084% (923820/930254) reads matched
         20:00:55.295 [INFO] 100.0000% (923820/923820) matched reads belong to the 2 references in the profile
 
+- new tutorials:
+    - [Detecting specific pathogens](https://bioinf.shenwei.me/kmcp/tutorial/detecting-pathogens)
+    - [Detecting contaminated sequences](https://bioinf.shenwei.me/kmcp/tutorial/detecting-contaminated-seqs)
+
 ### v0.9.2 - 2023-05-16
 
 - `kmcp profile/cos2simi/filter/index-info/merge-regions/query-fpr`:
diff --git a/README.md b/README.md
index 5d1eb76..0368529 100644
--- a/README.md
+++ b/README.md
@@ -35,6 +35,7 @@ https://bioinf.shenwei.me/kmcp
 - Tutorials
     - [Taxonomic profiling](https://bioinf.shenwei.me/kmcp/tutorial/profiling)
     - [Detecting specific pathogens](https://bioinf.shenwei.me/kmcp/tutorial/detecting-pathogens)
+    - [Detecting contaminated sequences](https://bioinf.shenwei.me/kmcp/tutorial/detecting-contaminated-seqs)
     - [Sequence and genome searching](https://bioinf.shenwei.me/kmcp/tutorial/searching)
 - [Usage](https://bioinf.shenwei.me/kmcp/usage)
 - [Benchmarks](https://bioinf.shenwei.me/kmcp/benchmark)
diff --git a/docs/database.md b/docs/database.md
index 79e75ef..fb41fcf 100644
--- a/docs/database.md
+++ b/docs/database.md
@@ -139,7 +139,7 @@ Mapping file:
     no rank 31
     isolate 26
 
-Masking prophage regions and removing plasmid sequences (optional):
+Masking prophage regions and removing plasmid sequences with [genomad](https://github.com/apcamargo/genomad) (optional):
 
     conda activate genomad
 
diff --git a/docs/download.md b/docs/download.md
index 99069c9..0d77a89 100644
--- a/docs/download.md
+++ b/docs/download.md
@@ -17,26 +17,37 @@ ARM architecture is supported, but `kmcp search` would be slower.
 
 ## Current Version
 
-### [v0.9.2](https://github.com/shenwei356/kmcp/releases/tag/v0.9.2) - 2023-05-16 [![Github Releases (by Release)](https://img.shields.io/github/downloads/shenwei356/kmcp/v0.9.2/total.svg)](https://github.com/shenwei356/kmcp/releases/tag/v0.9.2)
+### [v0.9.3](https://github.com/shenwei356/kmcp/releases/tag/v0.9.3) - 2023-07-16 [![Github Releases (by Release)](https://img.shields.io/github/downloads/shenwei356/kmcp/v0.9.3/total.svg)](https://github.com/shenwei356/kmcp/releases/tag/v0.9.3)
+
+- `kmcp compute/split-genomes`:
+    - fix a bug in chunk computation when splitting circular genomes (`--circular`).
+- `kmcp search/merge`:
+    - append simple stats to the search result as comment lines, including the number of input and matched queries. e.g.,
+
+        # input queries: 930254
+        # matched queries: 923820
+        # matched percentage: 99.3084%
 
-- `kmcp profile/cos2simi/filter/index-info/merge-regions/query-fpr`:
-    - **rename/unify the long flag `--out-prefix` to `--out-file`**.
 - `kmcp profile`:
-    - fix the number of reads belonging to references in the profile when no matches are found, which should be 0 instead of 1.
-- new command:
-    - `kmcp utils index-density`: plotting the element density of bloom filters for an index file.
-    An audience was concerned about it, but the results showed the elements (1s) are uniformly distributed in all BFs.
+    - fix metaphlan out format. [#34](https://github.com/shenwei356/kmcp/issues/34)
+    - show stats of the number of input and matched queries in log. It would be helpful to hint at whether the reference genomes cover all microorganisms in the sample.
+
+        20:00:55.295 [INFO] 99.3084% (923820/930254) reads matched
+        20:00:55.295 [INFO] 100.0000% (923820/923820) matched reads belong to the 2 references in the profile
 
+- new tutorials:
+    - [Detecting specific pathogens](https://bioinf.shenwei.me/kmcp/tutorial/detecting-pathogens)
+    - [Detecting contaminated sequences](https://bioinf.shenwei.me/kmcp/tutorial/detecting-contaminated-seqs)
 
 ### Links
 
 OS |Arch |File, 中国镜像 |Download Count
 :------|:---------|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-Linux |**64-bit**|[**kmcp_linux_amd64.tar.gz**](https://github.com/shenwei356/kmcp/releases/download/v0.9.2/kmcp_linux_amd64.tar.gz), [中国镜像](http://app.shenwei.me/data/kmcp/kmcp_linux_amd64.tar.gz) |[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/kmcp/latest/kmcp_linux_amd64.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/kmcp/releases/download/v0.9.2/kmcp_linux_amd64.tar.gz)
-Linux |arm64 |[**kmcp_linux_arm64.tar.gz**](https://github.com/shenwei356/kmcp/releases/download/v0.9.2/kmcp_linux_arm64.tar.gz), [中国镜像](http://app.shenwei.me/data/kmcp/kmcp_linux_arm64.tar.gz) |[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/kmcp/latest/kmcp_linux_arm64.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/kmcp/releases/download/v0.9.2/kmcp_linux_arm64.tar.gz)
-macOS |**64-bit**|[**kmcp_darwin_amd64.tar.gz**](https://github.com/shenwei356/kmcp/releases/download/v0.9.2/kmcp_darwin_amd64.tar.gz), [中国镜像](http://app.shenwei.me/data/kmcp/kmcp_darwin_amd64.tar.gz) |[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/kmcp/latest/kmcp_darwin_amd64.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/kmcp/releases/download/v0.9.2/kmcp_darwin_amd64.tar.gz)
-macOS |arm64 |[**kmcp_darwin_arm64.tar.gz**](https://github.com/shenwei356/kmcp/releases/download/v0.9.2/kmcp_darwin_arm64.tar.gz), [中国镜像](http://app.shenwei.me/data/kmcp/kmcp_darwin_arm64.tar.gz) |[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/kmcp/latest/kmcp_darwin_arm64.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/kmcp/releases/download/v0.9.2/kmcp_darwin_arm64.tar.gz)
-Windows|**64-bit**|[**kmcp_windows_amd64.exe.tar.gz**](https://github.com/shenwei356/kmcp/releases/download/v0.9.2/kmcp_windows_amd64.exe.tar.gz), [中国镜像](http://app.shenwei.me/data/kmcp/kmcp_windows_amd64.exe.tar.gz)|[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/kmcp/latest/kmcp_windows_amd64.exe.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/kmcp/releases/download/v0.9.2/kmcp_windows_amd64.exe.tar.gz)
+Linux |**64-bit**|[**kmcp_linux_amd64.tar.gz**](https://github.com/shenwei356/kmcp/releases/download/v0.9.3/kmcp_linux_amd64.tar.gz), [中国镜像](http://app.shenwei.me/data/kmcp/kmcp_linux_amd64.tar.gz) |[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/kmcp/latest/kmcp_linux_amd64.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/kmcp/releases/download/v0.9.3/kmcp_linux_amd64.tar.gz)
+Linux |arm64 |[**kmcp_linux_arm64.tar.gz**](https://github.com/shenwei356/kmcp/releases/download/v0.9.3/kmcp_linux_arm64.tar.gz), [中国镜像](http://app.shenwei.me/data/kmcp/kmcp_linux_arm64.tar.gz) |[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/kmcp/latest/kmcp_linux_arm64.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/kmcp/releases/download/v0.9.3/kmcp_linux_arm64.tar.gz)
+macOS |**64-bit**|[**kmcp_darwin_amd64.tar.gz**](https://github.com/shenwei356/kmcp/releases/download/v0.9.3/kmcp_darwin_amd64.tar.gz), [中国镜像](http://app.shenwei.me/data/kmcp/kmcp_darwin_amd64.tar.gz) |[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/kmcp/latest/kmcp_darwin_amd64.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/kmcp/releases/download/v0.9.3/kmcp_darwin_amd64.tar.gz)
+macOS |arm64 |[**kmcp_darwin_arm64.tar.gz**](https://github.com/shenwei356/kmcp/releases/download/v0.9.3/kmcp_darwin_arm64.tar.gz), [中国镜像](http://app.shenwei.me/data/kmcp/kmcp_darwin_arm64.tar.gz) |[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/kmcp/latest/kmcp_darwin_arm64.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/kmcp/releases/download/v0.9.3/kmcp_darwin_arm64.tar.gz)
+Windows|**64-bit**|[**kmcp_windows_amd64.exe.tar.gz**](https://github.com/shenwei356/kmcp/releases/download/v0.9.3/kmcp_windows_amd64.exe.tar.gz), [中国镜像](http://app.shenwei.me/data/kmcp/kmcp_windows_amd64.exe.tar.gz)|[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/kmcp/latest/kmcp_windows_amd64.exe.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/kmcp/releases/download/v0.9.3/kmcp_windows_amd64.exe.tar.gz)
 
 *Notes:*
 
@@ -137,6 +148,16 @@ fish:
 
 ## Release History
 
+### [v0.9.2](https://github.com/shenwei356/kmcp/releases/tag/v0.9.2) - 2023-05-16 [![Github Releases (by Release)](https://img.shields.io/github/downloads/shenwei356/kmcp/v0.9.2/total.svg)](https://github.com/shenwei356/kmcp/releases/tag/v0.9.2)
+
+- `kmcp profile/cos2simi/filter/index-info/merge-regions/query-fpr`:
+    - **rename/unify the long flag `--out-prefix` to `--out-file`**.
+- `kmcp profile`:
+    - fix the number of reads belonging to references in the profile when no matches are found, which should be 0 instead of 1.
+- new command:
+    - `kmcp utils index-density`: plotting the element density of bloom filters for an index file.
+    An audience was concerned about it, but the results showed the elements (1s) are uniformly distributed in all BFs.
+
 ### [v0.9.1](https://github.com/shenwei356/kmcp/releases/tag/v0.9.1) - 2022-12-26 [![Github Releases (by Release)](https://img.shields.io/github/downloads/shenwei356/kmcp/v0.9.1/total.svg)](https://github.com/shenwei356/kmcp/releases/tag/v0.9.1)
 
 - `kmcp search`
diff --git a/docs/tutorial/detect-contaminated-seqs/index.md b/docs/tutorial/detecting-contaminated-seqs/index.md
similarity index 87%
rename from docs/tutorial/detect-contaminated-seqs/index.md
rename to docs/tutorial/detecting-contaminated-seqs/index.md
index 96cce40..549c9f1 100644
--- a/docs/tutorial/detect-contaminated-seqs/index.md
+++ b/docs/tutorial/detecting-contaminated-seqs/index.md
@@ -3,7 +3,7 @@
 ## tools
 
 - kmcp: https://github.com/shenwei356/kmcp
-- seqkit: >= v2.5.0, https://github.com/shenwei356/seqkit/issues/390#issuecomment-1633495130
+- seqkit: >= v2.5.0 which has the new command `seqkit merge-slides`.
 - taxonkit: https://github.com/shenwei356/taxonkit
 - csvtk: https://github.com/shenwei356/csvtk
 
@@ -18,7 +18,7 @@
 ## hardware
 
 - RAM >= 64GB
-- #CPUs >= 32 preferred.
+- CPUs >= 32 preferred.
 
 ## steps
 
@@ -32,12 +32,12 @@ and performing metagenomoic profiling with them.
 
     # search against GTDB databases
     # !!! if the KMCP databases are in a network-attached storage disk (NAS),
-    # !!! please add the flag "-w" to kmcp
+    # !!! please add the flag "-w" to "kmcp search"
     seqkit sliding -g -s 50 -W 200 $input \
-        | kmcp search -d ~/ws/data/kmcp2023/gtdb.part_1.kmcp/ -o $input.kmcp@gtdb.part_1.tsv.gz
+        | kmcp search -w -d ~/ws/data/kmcp2023/gtdb.part_1.kmcp/ -o $input.kmcp@gtdb.part_1.tsv.gz
 
     seqkit sliding -g -s 50 -W 200 $input \
-        | kmcp search -d ~/ws/data/kmcp2023/gtdb.part_2.kmcp/ -o $input.kmcp@gtdb.part_2.tsv.gz
+        | kmcp search -w -d ~/ws/data/kmcp2023/gtdb.part_2.kmcp/ -o $input.kmcp@gtdb.part_2.tsv.gz
 
     # merge seach results
     kmcp merge -o $input.kmcp.tsv.gz $input.kmcp@gtdb.part_*.tsv.gz
@@ -74,7 +74,7 @@ Checking contaminated regions
         | sed 1d | head -n 1 | sed "s/;/\n/g") \
         -o $input.kmcp.tsv.gz.binning.filtered.tsv
 
-    # merge regions
+    # merge regions. seqkit v2.5.0 is needed.
     seqkit merge-slides $input.kmcp.tsv.gz.binning.filtered.tsv --quiet \
         -o $input.kmcp.tsv.gz.cont.tsv
 
@@ -90,14 +90,16 @@ Checking contaminated regions
     csvtk join -Ht $input.kmcp.tsv.gz.cont.tsv <(seqkit fx2tab -ni -l $input) \
         | awk '{print $0"\t"($3-$2)"\t"($3-$2)/$4}' \
         | csvtk join -Ht - $input.kmcp.tsv.gz.binning.filtered.tsv.taxa \
-        | csvtk add-header -Ht -n chr,begin,end,contig_len,len,frac,taxa \
-        | csvtk sort -t -k frac:nr \
+        | csvtk add-header -Ht -n chr,begin,end,contig_len,len,proportion,taxa \
+        | csvtk sort -t -k proportion:nr \
         | tee $input.kmcp.tsv.gz.cont.details.tsv \
         | csvtk pretty -t
 
-    chr                        begin    end      contig_len   len    frac          taxa
+    chr                        begin    end      contig_len   len    proportion    taxa
     ------------------------   ------   ------   ----------   ----   -----------   --------------------------------------------------------
     SAMN02360712.contig00044   0        1151     1151         1151   1             177416(Francisella tularensis subsp. tularensis SCHU S4)
     SAMN02360712.contig00012   163600   163900   164357       300    0.00182529    1028746(Christiangramia aestuarii)
     SAMN02360712.contig00008   64850    65150    279605       300    0.00107294    1028746(Christiangramia aestuarii)
     SAMN02360712.contig00002   362200   362500   622965       300    0.000481568   1028746(Christiangramia aestuarii)
+
+We can see the whole (proportion: 1) contig `SAMN02360712.contig00044` is from a totally different species, which should be a contaminated sequence.
diff --git a/docs/tutorial/index.md b/docs/tutorial/index.md
index ac8c370..6b51fc8 100644
--- a/docs/tutorial/index.md
+++ b/docs/tutorial/index.md
@@ -2,5 +2,5 @@
 
 - [Taxonomic profiling](profiling)
 - [Detecting specific pathogens](detecting-pathogens)
-- [Detecting contaminated sequences](detect-contaminated-seqs)
+- [Detecting contaminated sequences](detecting-contaminated-seqs)
 - [Sequence and genome searching](searching)
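
Note on the renamed `proportion` column in the tutorial hunk: it appears to be the merged region length divided by the contig length, as produced by the `awk` step in the pipeline. The following is a minimal sketch of that one step in isolation, assuming only a POSIX shell with `awk` and `printf`. The function name `add_len_and_proportion` is hypothetical (it is not part of kmcp, seqkit, or csvtk), and the two input rows (chr, begin, end, contig_len) are copied from the table in the hunk.

```shell
# Hypothetical helper wrapping the awk expression used in the tutorial.
# Input:  tab-separated chr, begin, end, contig_len
# Output: the same row plus len (= end - begin) and proportion (= len / contig_len)
add_len_and_proportion() {
    awk '{print $0"\t"($3-$2)"\t"($3-$2)/$4}'
}

# Two rows taken from the table in the tutorial hunk.
printf 'SAMN02360712.contig00044\t0\t1151\t1151\n' | add_len_and_proportion
printf 'SAMN02360712.contig00012\t163600\t163900\t164357\n' | add_len_and_proportion
```

A proportion of 1 means the merged region spans the whole contig, which is how `SAMN02360712.contig00044` stands out in the sorted table.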