diff --git a/README.md b/README.md index 65ec5d3..964622a 100755 --- a/README.md +++ b/README.md @@ -15,7 +15,7 @@ GoldRush iterates through the input long reads to produce a "golden path" of rea 2. **[GoldPolish](https://github.com/bcgsc/goldpolish)** (aka GoldRush-Edit): polishing the genome 3. **[Tigmint-long](https://github.com/bcgsc/tigmint)**: correcting the genome 4. **[GoldChain](https://github.com/bcgsc/ntlink)** (aka GoldRush-Link): scaffolding the genome - +5. **[GoldPolish-Target](https://github.com/bcgsc/goldpolish)**: targeted polishing of the genome ## Credits @@ -37,51 +37,50 @@ goldrush run reads=reads G=gsize Commands: - run run default GoldRush pipeline: GoldRush-Path + Polisher (GoldPolish by default) + Tigmint-long + ntLink ( -default 5 rounds) - goldrush-path run GoldRush-Path - path-polish run GoldRush-Path, then GoldPolish - path-tigmint run GoldRush-Path, then GoldPolish, then Tigmint-long - path-tigmint-ntLink run GoldRush-Path, then GoldPolish, then Tigmint-long, then ntLink (default 5 rounds) + run run default GoldRush pipeline: GoldRush-Path + Polisher (GoldPolish by default) + Tigmint-long + ntLink (default 5 rounds) + GoldPolish-Target + goldrush-path run GoldRush-Path + path-polish run GoldRush-Path, then GoldPolish + path-tigmint run GoldRush-Path, then GoldPolish, then Tigmint-long + path-tigmint-ntLink run GoldRush-Path, then GoldPolish, then Tigmint-long, then ntLink (default 5 rounds) + path-tigmint-ntLink-target run GoldRush-Path, then GoldPolish, then Tigmint-long, then ntLink (default 5 rounds), then GoldPolish-Target General options (required): - reads read name [reads]. File must have .fq or .fastq extension, but do not include the suffix in the supplied read name - G haploid genome size (bp) (e.g. '3e9' for human genome) + reads read name [reads]. File must have .fq or .fastq extension, but do not include the suffix in the supplied read name + G haploid genome size (bp) (e.g. 
'3e9' for human genome) General options (optional): - t number of threads [48] - z minimum size of contig (bp) to scaffold [1000] - track_time If 1 then track the run time and memory usage, if 0 then don't [0] + t number of threads [48] + z minimum size of contig (bp) to scaffold [1000] + track_time If 1 then track the run time and memory usage, if 0 then don't [0] GoldRush-Path options: - k base k value to generate hash [22] - w weight of spaced seed (number of 1's) [16] - tile tile size [1000] - b during insertion, number of consecutive tiles to be inserted with the same ID [10] - u minimum number of unassigned tiles for the read to be considered unassigned [5] - a maximum number of tiles that can be assigned, minimum number of overlapping tiles kept after trimming [1] - o occupancy of the miBF [0.1] - x threshold for number of hits in miBF for a given frame to be considered assigned [10] - h number of seed patterns to use [3] - m minimum read length [20000] - M maximum number of silver paths to generate [5] - r ratio of full genome in golden path [0.9] - P minimum average phred score for each read [15] - d remove reads with greater or equal than d difference between average phred quality of first half and second half of the read [5] - p prefix to use for the output paths [goldrush_asm] + k base k value to generate hash [22] + w weight of spaced seed (number of 1's) [16] + tile tile size [1000] + b during insertion, number of consecutive tiles to be inserted with the same ID [10] + u minimum number of unassigned tiles for the read to be considered unassigned [5] + a maximum number of tiles that can be assigned, minimum number of overlapping tiles kept after trimming [1] + o occupancy of the miBF [0.1] + x threshold for number of hits in miBF for a given frame to be considered assigned [10] + h number of seed patterns to use [3] + m minimum read length [20000] + M maximum number of silver paths to generate [5] + r ratio of full genome in golden path [0.9] + P 
minimum average phred score for each read [15] + d remove reads with greater or equal than d difference between average phred quality of first half and second half of the read [5] + p prefix to use for the output paths [goldrush_asm] Tigmint-long options: - span min number of spanning molecules [2] - dist maximum distance between alignments to be considered the same molecule [500] + span min number of spanning molecules [2] + dist maximum distance between alignments to be considered the same molecule [500] ntLink options: - k_ntLink k-mer size for minimizers [40] - w_ntLink window size for minimizers [250] - rounds number of rounds of ntLink [5] + k_ntLink k-mer size for minimizers [40] + w_ntLink window size for minimizers [250] + rounds number of rounds of ntLink [5] GoldPolish options: - polisher_mapper Whether to use ntlink or minimap2 for mappings [minimap2] - shared_mem Shared memory path where polishing occurs [/dev/shm] + shared_mem Shared memory path where polishing occurs [/dev/shm] Notes: - GoldRush-Path generates silver paths before generating the golden path @@ -143,6 +142,8 @@ GoldRush has been tested on *Linux* operating systems (centOS7, ubuntu-20.04) * [Tigmint 1.2.6+](https://github.com/bcgsc/tigmint) * [ntLink 1.3.3+](https://github.com/bcgsc/ntlink) * [minimap2](https://github.com/lh3/minimap2) + * [snakemake](https://github.com/snakemake/snakemake) + * [intervaltree](https://github.com/chaimleib/intervaltree) ## Installation ### Installing using conda: diff --git a/azure-pipelines.yml b/azure-pipelines.yml index 73285ff..3b2dadc 100644 --- a/azure-pipelines.yml +++ b/azure-pipelines.yml @@ -15,7 +15,7 @@ jobs: - script: | source activate goldrush_CI conda install --yes -c conda-forge mamba python=3.10 - mamba install --yes -c conda-forge -c bioconda compilers meson gperftools sdsl-lite boost-cpp sparsehash btllib libdivsufsort minimap2 tigmint ntlink miller + mamba install --yes -c conda-forge -c bioconda compilers meson gperftools sdsl-lite 
boost-cpp sparsehash btllib libdivsufsort minimap2 tigmint ntlink miller snakemake intervaltree displayName: Install dependencies - script: | source activate goldrush_CI @@ -47,8 +47,8 @@ jobs: displayName: Create Anaconda environment - script: | source activate goldrush_CI - conda install --yes -c conda-forge mamba python=3.10 - mamba install --yes -c conda-forge -c bioconda compilers meson gperftools sdsl-lite boost-cpp sparsehash btllib libdivsufsort minimap2 tigmint ntlink miller + conda install --yes -c conda-forge mamba=1.5.10 python=3.10 + mamba install --yes -c conda-forge -c bioconda compilers meson gperftools sdsl-lite boost-cpp sparsehash btllib libdivsufsort minimap2 tigmint ntlink miller snakemake intervaltree displayName: Install dependencies - script: | source activate goldrush_CI diff --git a/bin/goldrush b/bin/goldrush index b8da7ba..895b2c6 100755 --- a/bin/goldrush +++ b/bin/goldrush @@ -87,6 +87,12 @@ cut=250 k_ntLink=40 w_ntLink=250 rounds=5 +soft_mask=True + +# Default GoldPolish-Target parameters +target_flank_length=64 +target_k_ntlink=88 +target_w_ntlink=1000 # Development mode - retains intermediate files. Specify dev=True to enable. 
dev=False @@ -139,12 +145,13 @@ help: @echo "" @echo " Commands:" @echo "" - @echo " run run default GoldRush pipeline: GoldRush-Path + Polisher (GoldPolish by default) + Tigmint-long + ntLink (default 5 rounds)" + @echo " run run default GoldRush pipeline: GoldRush-Path + Polisher (GoldPolish by default) + Tigmint-long + ntLink (default 5 rounds) + GoldPolish-Target" @echo "" - @echo " goldrush-path run GoldRush-Path" - @echo " path-polish run GoldRush-Path, then $(polisher_logs)" - @echo " path-tigmint run GoldRush-Path, then $(polisher_logs), then Tigmint-long" - @echo " path-tigmint-ntLink run GoldRush-Path, then $(polisher_logs), then Tigmint-long, then ntLink (default 5 rounds)" + @echo " goldrush-path run GoldRush-Path" + @echo " path-polish run GoldRush-Path, then $(polisher_logs)" + @echo " path-tigmint run GoldRush-Path, then $(polisher_logs), then Tigmint-long" + @echo " path-tigmint-ntLink run GoldRush-Path, then $(polisher_logs), then Tigmint-long, then ntLink (default 5 rounds)" + @echo " path-tigmint-ntLink-target run GoldRush-Path, then $(polisher_logs), then Tigmint-long, then ntLink (default 5 rounds), then GoldPolish-Target" @echo "" @echo " General options (required):" @echo " reads read name [reads]. 
File must have .fq or .fastq extension, but do not include the suffix in the supplied read name" @@ -182,7 +189,6 @@ help: @echo " rounds number of rounds of ntLink [$(rounds)]" @echo "" @echo " GoldPolish options:" - @echo " polisher_mapper Whether to use ntlink or minimap2 for mappings [$(polisher_mapper)]" @echo " shared_mem Shared memory path where polishing occurs [/dev/shm] " @echo "" @echo "Notes:" @@ -205,14 +211,15 @@ run: ln -sf $(prefix)/$(p2).$(polished_infix).fa ln -sf $(prefix)/$(p2).$(polished_infix).span$(span).dist$(dist).tigmint.fa ln -sf $(prefix)/$(p2).$(polished_infix).span$(span).dist$(dist).tigmint.fa.k$(k_ntLink).w$(w_ntLink).ntLink-$(rounds)rounds.fa + ln -sf $(prefix)/$(p2).$(polished_infix).span$(span).dist$(dist).tigmint.fa.k$(k_ntLink).w$(w_ntLink).ntLink-$(rounds)rounds.polished.fa echo "You can find intermediate files and the outputs for each GoldRush stage within the $(prefix) subdirectory." - echo "A soft link to your final assembly is available at: $(p2).$(polished_infix).span$(span).dist$(dist).tigmint.fa.k$(k_ntLink).w$(w_ntLink).ntLink-$(rounds)rounds.fa" + echo "A soft link to your final assembly is available at: $(p2).$(polished_infix).span$(span).dist$(dist).tigmint.fa.k$(k_ntLink).w$(w_ntLink).ntLink-$(rounds)rounds.polished.fa" -run-in-dir: path-tigmint-ntLink check-G check-reads clean +run-in-dir: path-tigmint-ntLink-target check-G check-reads clean path-polish: $(polisher) check-G check-reads clean path-tigmint: tigmint check-G check-reads clean - path-tigmint-ntLink: ntLink_all_rounds ntLink_softlink clean +path-tigmint-ntLink-target: goldpolish_target clean check-G: ifndef G @@ -225,7 +232,6 @@ ifeq ($(long_reads),) $(error $(ERROR_MESSAGE)) endif - # Run GoldRush-Path goldrush-path: $(p2).fa check-G check-reads clean @@ -282,13 +288,19 @@ $(p2).$(polished_infix).span$(span).dist$(dist).tigmint.fa: $(p2).$(polished_inf ntLink_all_rounds: 
$(p2).$(polished_infix).span$(span).dist$(dist).tigmint.fa.k$(k_ntLink).w$(w_ntLink).z$z.ntLink.gap_fill.$(rounds)rounds.fa check-G check-reads %.fa.k$(k_ntLink).w$(w_ntLink).z$z.ntLink.gap_fill.$(rounds)rounds.fa: %.fa $(long_reads) - $(time) ntLink_rounds run_rounds_gaps target=$< t=$t k=$(k_ntLink) w=$(w_ntLink) z=$z rounds=$(rounds) reads=$(long_reads) + $(time) ntLink_rounds run_rounds_gaps target=$< t=$t k=$(k_ntLink) w=$(w_ntLink) z=$z soft_mask=$(soft_mask) rounds=$(rounds) reads=$(long_reads) ifneq ($(dev), True) - ntLink_rounds clean target=$< t=$t k=$(k_ntLink) w=$(w_ntLink) z=$z rounds=$(rounds) reads=$(long_reads) + ntLink_rounds clean target=$< t=$t k=$(k_ntLink) w=$(w_ntLink) z=$z soft_mask=$(soft_mask) rounds=$(rounds) reads=$(long_reads) endif ntLink_softlink: $(p2).$(polished_infix).span$(span).dist$(dist).tigmint.fa.k$(k_ntLink).w$(w_ntLink).ntLink-$(rounds)rounds.fa check-G check-reads %.k$(k_ntLink).w$(w_ntLink).ntLink-$(rounds)rounds.fa: %.k$(k_ntLink).w$(w_ntLink).z$z.ntLink.gap_fill.$(rounds)rounds.fa ln -sf $(lastword $^) $@ - echo "Done GoldRush-Path + $(polisher_logs) + Tigmint-long + $(rounds) ntLink rounds! Your final assembly can be found in: $@" + echo "Done GoldRush-Path + $(polisher_logs) + Tigmint-long + $(rounds) ntLink rounds! Your post-ntLink assembly can be found in: $@" + +# Run GoldPolish-Target after ntLink rounds +goldpolish_target: $(p2).$(polished_infix).span$(span).dist$(dist).tigmint.fa.k$(k_ntLink).w$(w_ntLink).ntLink-$(rounds)rounds.polished.fa check-G check-reads +%.k$(k_ntLink).w$(w_ntLink).ntLink-$(rounds)rounds.polished.fa: %.k$(k_ntLink).w$(w_ntLink).ntLink-$(rounds)rounds.fa + $(time) goldpolish --target --k-ntlink $(target_k_ntlink) --w-ntlink $(target_w_ntlink) -l $(target_flank_length) $< $(long_reads) $@ + echo "Done GoldRush-Path + $(polisher_logs) + Tigmint-long + $(rounds) ntLink rounds + GoldPolish-Target! 
Your final assembly can be found in: $@" diff --git a/subprojects/.DS_Store b/subprojects/.DS_Store new file mode 100644 index 0000000..6abb19a Binary files /dev/null and b/subprojects/.DS_Store differ diff --git a/subprojects/goldpolish b/subprojects/goldpolish index 7051db4..3dc7e8a 160000 --- a/subprojects/goldpolish +++ b/subprojects/goldpolish @@ -1 +1 @@ -Subproject commit 7051db478f7874bf600b0f183c727596a8646793 +Subproject commit 3dc7e8a040220c23e30dc50b4fd28ebd4b4e9e20 diff --git a/tests/goldrush_test_demo.sh b/tests/goldrush_test_demo.sh index 40283e3..2b46499 100755 --- a/tests/goldrush_test_demo.sh +++ b/tests/goldrush_test_demo.sh @@ -11,7 +11,7 @@ goldrush run reads=test_reads G=1e6 t=4 p=goldrush_test -B l50=$(abyss-fac goldrush_test_golden_path.goldpolish-polished.span2.dist500.tigmint.fa.k40.w250.ntLink-5rounds.fa |awk '{print $3}' |tail -n1) -if [ -e goldrush_test_golden_path.goldpolish-polished.span2.dist500.tigmint.fa.k40.w250.ntLink-5rounds.fa ] && [ ${l50} -eq 1 ]; then +if [ -e goldrush_test_golden_path.goldpolish-polished.span2.dist500.tigmint.fa.k40.w250.ntLink-5rounds.polished.fa ] && [ ${l50} -eq 1 ]; then echo -e "\nTest successful!" else echo -e "\nTest failed - please check your installation"