bcgsc/ntJoin

ntJoin

Scaffolding draft assemblies using reference assemblies and minimizer graphs

Description of the algorithm

ntJoin takes a target assembly and one or more 'reference' assemblies as input, and uses information from the reference(s) to scaffold the target assembly. The 'reference' assemblies can be true reference assembly builds or different draft genome assemblies.

Instead of costly sequence alignments, ntJoin takes a more lightweight approach, using minimizer graphs to produce a mapping between the input assemblies.

Main steps in the algorithm:

  1. Generate an ordered minimizer sketch for each contig of each input assembly
  2. Filter the minimizers to only retain minimizers that are:
    • Unique within each assembly
    • Found in all assemblies (target + all references)
  3. Build a minimizer graph
    • Nodes: minimizers
    • Edges: between minimizers that are adjacent in at least one of the assemblies. Edge weights are the sum of weights of the assemblies that support an edge.
  4. Filter the graph based on the minimum edge weight (n)
  5. For each node that is a branch node (degree > 2), filter the incident edges with an increasing edge threshold
  6. Each linear path is converted to a list of oriented target assembly contig regions to scaffold together
  7. Target assembly scaffolds are printed out
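
Steps 1 to 4 above can be illustrated with a small Python sketch. This is a toy illustration under stated assumptions, not ntJoin's implementation: the function names are hypothetical, the "smallest" k-mer in a window is chosen lexicographically rather than by hash value, the window is counted in k-mers, strand is ignored, and adjacency is assumed to be evaluated after filtering.

```python
from collections import Counter, defaultdict


def minimizer_sketch(seq, k, w):
    """Step 1 (toy): ordered minimizers of one contig.

    Assumptions: lexicographic order stands in for hashing, w counts
    consecutive k-mers, and reverse complements are ignored.
    """
    kmers = [(seq[i:i + k], i) for i in range(len(seq) - k + 1)]
    if not kmers:
        return []
    sketch, last_pos = [], None
    for start in range(max(len(kmers) - w + 1, 1)):
        kmer, pos = min(kmers[start:start + w])
        if pos != last_pos:  # skip a minimizer already recorded for the previous window
            sketch.append(kmer)
            last_pos = pos
    return sketch


def shared_unique_minimizers(assemblies):
    """Step 2: keep minimizers unique within each assembly and present in all.

    assemblies maps assembly name -> {contig name: ordered minimizer list}.
    """
    keep = None
    for contigs in assemblies.values():
        counts = Counter(mx for sketch in contigs.values() for mx in sketch)
        unique = {mx for mx, count in counts.items() if count == 1}
        keep = unique if keep is None else keep & unique
    return keep or set()


def build_graph(assemblies, weights, keep):
    """Step 3: edge weight = sum of weights of the assemblies supporting the edge."""
    edges = defaultdict(int)
    for name, contigs in assemblies.items():
        supported = set()
        for sketch in contigs.values():
            kept = [mx for mx in sketch if mx in keep]
            supported.update(frozenset(pair) for pair in zip(kept, kept[1:]))
        for edge in supported:
            edges[edge] += weights[name]
    return dict(edges)


def filter_edges(edges, n):
    """Step 4: drop edges below the minimum edge weight n."""
    return {edge: weight for edge, weight in edges.items() if weight >= n}


if __name__ == "__main__":
    # Hand-made sketches: letters stand in for minimizers.
    assemblies = {
        "target": {"t1": ["A", "B"], "t2": ["C", "D"]},
        "ref": {"r1": ["A", "B", "E", "C", "D"]},
    }
    keep = shared_unique_minimizers(assemblies)
    graph = build_graph(assemblies, {"target": 1, "ref": 2}, keep)
    for edge, weight in sorted((sorted(e), wt) for e, wt in graph.items()):
        print(edge, weight)
```

In this toy input, the edge between B and C is supported only by the reference, and it is what links the two target contigs t1 and t2. Raising n above the reference weight would remove it.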

Credits

Original concept: Rene Warren

Design and implementation: Lauren Coombe

Citing ntJoin

Thank you for your Stars and for using, developing and promoting this free software!

If you use ntJoin in your research, please cite:

Lauren Coombe, Vladimir Nikolic, Justin Chu, Inanc Birol, Rene L Warren: ntJoin: Fast and lightweight assembly-guided scaffolding using minimizer graphs. Bioinformatics (2020) doi: https://doi.org/10.1093/bioinformatics/btaa253.

Usage

Usage: ntJoin assemble target=<target scaffolds> references='List of reference assemblies' reference_weights='List of weights per reference assembly'

Options:
target              Target assembly to be scaffolded in fasta format
references          List of reference files (separated by a space, in fasta format)
target_weight       Weight of target assembly [1]
reference_weights   List of weights of reference assemblies
prefix              Prefix of intermediate output files [out.k<k>.w<w>.n<n>]
t                   Number of threads [4]
k                   K-mer size for minimizers [32]
w                   Window size for minimizers (bp) [1000]
n                   Minimum graph edge weight [1]
g                   Minimum gap size (bp) [20]
G                   Maximum gap size (bp) (0 if no maximum) [0]
m                   Minimum percentage of increasing/decreasing minimizer positions to orient contig [90]
mkt                 If True, use Mann-Kendall Test to predict contig orientation (computationally-intensive, overrides 'm') [False]
agp                 If True, output AGP file describing output scaffolds [False]
no_cut              If True, will not cut contigs at putative misassemblies [False]
overlap             If True, attempts to detect and trim overlaps between joined sequences [True]
time                If True, will log the time for each step [False]
reference_config    Config file with reference assemblies and reference weights as comma-separated values (See README for example)
                    This is optional, and will override the 'references' and 'reference_weights' variables if specified

Notes: 
	- Ensure the lists of reference assemblies and weights are in the same order, and that both are space-separated
	- Ensure all assembly files are in the current working directory

Running ntJoin help prints the help documentation.

Examples

Typical ntJoin usage:

  • Target assembly to scaffold: my_scaffolds.fa
  • Assembly to use as 'reference': assembly_ref1.fa
  • Giving the target assembly a weight of '1' and the reference assembly a weight of '2'
  • Using k=32, w=500
  • Ensure that all input assembly files are in or have soft-links to the current working directory
ntJoin assemble target=my_scaffolds.fa target_weight=1 references='assembly_ref1.fa' reference_weights='2' k=32 w=500
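
When ntJoin is launched from a script, the command above can be assembled programmatically. The following is a minimal sketch, assuming a hypothetical helper named ntjoin_command; it uses only the parameters documented in the usage message and checks that the reference and weight lists have the same length, as the notes require.

```python
import shlex


def ntjoin_command(target, references, reference_weights, **params):
    """Build an 'ntJoin assemble' argument list (hypothetical helper).

    references and reference_weights must be in the same order; each list is
    joined with spaces into a single argument, as the usage notes describe.
    """
    if len(references) != len(reference_weights):
        raise ValueError("need exactly one weight per reference assembly")
    args = [
        "ntJoin",
        "assemble",
        f"target={target}",
        "references=" + " ".join(references),
        "reference_weights=" + " ".join(str(wt) for wt in reference_weights),
    ]
    # Remaining documented options (e.g. target_weight, k, w, n) pass through as name=value.
    args += [f"{name}={value}" for name, value in params.items()]
    return args


if __name__ == "__main__":
    cmd = ntjoin_command(
        "my_scaffolds.fa", ["assembly_ref1.fa"], [2], target_weight=1, k=32, w=500
    )
    print(shlex.join(cmd))
    # To run it, something like subprocess.run(cmd, check=True) could be used
    # from the directory that holds the assembly files.
```

Passing the list to subprocess without a shell keeps a space-separated reference list as a single argument, which matches what the quoted form does on the command line.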

Using a config file to specify references (optional):

  • Alternatively, the reference(s) and reference weight(s) can be specified in a comma-separated config file with one row per reference assembly/weight:
reference1.fa,reference1_weight
reference2.fa,reference2_weight
  • It is important to ensure that there are no commas in the name of your reference fasta file
  • Then, the ntJoin command would use the file specified by reference_config for determining the reference(s) and reference weight(s) instead of references and reference_weights
    • If both the reference_config and the references variables are specified, reference_config will override the other variables
  • Example config files can be found in the tests directory: test_config_single.csv, test_config_multiple.csv
  • As with the typical ntJoin usage, ensure that all input assembly files are in or have soft-links to the current working directory, and do not use absolute/relative paths in the config file
ntJoin assemble target=my_scaffolds.fa target_weight=1 reference_config=config_file.csv k=32 w=500
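
A config file in the format described above can be generated with a few lines of Python. This is a minimal sketch, assuming a hypothetical helper named write_reference_config; it enforces the two constraints stated above (no commas in file names, no absolute or relative paths) before writing one row per reference.

```python
from pathlib import Path


def write_reference_config(path, references):
    """Write 'reference.fa,weight' rows, one per reference (hypothetical helper).

    references is a list of (fasta_file_name, weight) pairs.
    """
    rows = []
    for fasta, weight in references:
        if "," in fasta:
            raise ValueError(f"comma in reference file name: {fasta!r}")
        if Path(fasta).name != fasta:
            raise ValueError(f"use a bare file name, not a path: {fasta!r}")
        rows.append(f"{fasta},{weight}")
    # Nothing is written unless every row passed validation.
    Path(path).write_text("\n".join(rows) + "\n")


if __name__ == "__main__":
    write_reference_config(
        "config_file.csv", [("reference1.fa", 2), ("reference2.fa", 2)]
    )
```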

Output files

  • Scaffolded target assembly (<target assembly>.k<k>.w<w>.n<n>.all.scaffolds.fa)
  • Path file describing how target assembly was scaffolded (<prefix>.path)
  • Unfiltered minimizer graph in dot format (<prefix>.mx.dot)
  • If agp=True specified, AGP describing how target assembly was scaffolded (<prefix>.agp)
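
For pipelines that need to locate these files afterwards, the naming patterns above can be expanded as in the sketch below. It is only an illustration: expected_outputs is a hypothetical helper, and it assumes that <target assembly> stands for the full target file name as given to target=, which should be checked against an actual run.

```python
def expected_outputs(target, k=32, w=1000, n=1, prefix=None, agp=False):
    """Expand the documented output file name patterns (hypothetical helper).

    Defaults for k, w, n and prefix follow the usage message.
    Assumption: <target assembly> is the target file name, extension included.
    """
    if prefix is None:
        prefix = f"out.k{k}.w{w}.n{n}"
    files = [
        f"{target}.k{k}.w{w}.n{n}.all.scaffolds.fa",  # scaffolded target assembly
        f"{prefix}.path",                             # how the target was scaffolded
        f"{prefix}.mx.dot",                           # unfiltered minimizer graph
    ]
    if agp:
        files.append(f"{prefix}.agp")
    return files


if __name__ == "__main__":
    for name in expected_outputs("my_scaffolds.fa", k=32, w=500, agp=True):
        print(name)
```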

Parameter considerations

  • We recommend setting the reference weight(s) to be higher than the target weight
  • If you are using a reference-grade assembly as the reference, set n=2; otherwise, use the default n=1
  • When using no_cut=True, if you find that the output genome size is inflated, it is likely due to large gaps being incorporated into your output assembly. You can set the maximum gap size (G) to offset this.

Overlap feature

  • As of version 1.1.0, ntJoin can detect and trim overlaps between joined sequences. This feature is controlled by the overlap parameter and is on by default (overlap=True). To turn this behaviour off, specify overlap=False
  • Ensure that none of the sequences in your target assembly have terminal N characters. If they do, strip them from the sequence prior to running ntJoin
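
Terminal N characters can be stripped with a short preprocessing step before running ntJoin. The sketch below is one possible way to do it in plain Python, not a tool shipped with ntJoin; the function names and the output file name are hypothetical, and sequences are written back on a single line.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def strip_terminal_ns(seq):
    """Remove leading and trailing N/n characters; internal Ns are kept."""
    return seq.strip("Nn")


def clean_fasta(in_path, out_path):
    """Write a copy of in_path with terminal Ns stripped from every sequence."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for header, seq in read_fasta(src):
            dst.write(f">{header}\n{strip_terminal_ns(seq)}\n")


if __name__ == "__main__":
    # Hypothetical file names for illustration.
    clean_fasta("my_scaffolds.fa", "my_scaffolds.noterminalN.fa")
```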

Installation Instructions

Installing ntJoin using Conda

conda install -c bioconda -c conda-forge ntjoin=1.1.5

Installing ntJoin using Brew

ntJoin can be installed using Homebrew on macOS or Linuxbrew on Linux:

brew install brewsci/bio/ntjoin

Installing ntJoin from the source code

curl -L --output ntJoin-1.1.5.tar.gz https://github.com/bcgsc/ntJoin/releases/download/v1.1.5/ntJoin-1.1.5.tar.gz && tar xvzf ntJoin-1.1.5.tar.gz 

Dependencies

Python dependencies can be installed with:

pip3 install -r requirements.txt

Testing your Installation

See tests/test_installation.sh to test your ntJoin installation and see an example command.

License

ntJoin Copyright (c) 2020 British Columbia Cancer Agency Branch. All rights reserved.

ntJoin is released under the GNU General Public License v3

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, version 3.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.

For commercial licensing options, please contact Patrick Rebstein prebstein@bccancer.bc.ca