nf-LO is a Nextflow workflow for generating genome alignment files compatible with the UCSC liftOver utility for converting genomic coordinates between assemblies. It can automatically pull genomes directly from NCBI or iGenomes (or the user can provide fasta files) and supports four different aligners (lastz, blat, minimap2, GSAlign). Together these provide solutions for both different-species (lastz and minimap2) and same-species (blat and GSAlign) alignments, with both standard and ultra-fast algorithms, from a source to a target genome. It comes with a series of presets for aligning genomes according to their genomic distance (near, medium and far).
See CHANGELOG for more details.
UPDATE 05/2024: The --aligner minimap2 mode now runs in multiple processes, splitting the target genome into fragments of at least --tgtSize bases. Individual contigs and scaffolds are not fragmented, so each chunk contains entire sequences, unless the --mm2_lowmem option is provided. The old approach is still accessible through the --mm2_full_alignment option. The anaconda recipe with the dependencies has been updated, so please re-create the container where needed. This optimization makes it possible to perform a minimap2 liftover of the panTro6 to the hg38 genomes on a 16-core Ryzen 7 8700G 64 GB Ubuntu machine in under half an hour.
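As a minimal sketch of the chunked minimap2 mode described above, the snippet below assembles a command line and prints it for review rather than running it. The FASTA paths and the chunk size are illustrative assumptions, not recommended settings; only the flags themselves come from this README.

```shell
# Hypothetical inputs: replace with your own FASTA files.
SOURCE=panTro6.fa.gz
TARGET=hg38.fa.gz
# Minimum number of bases per target chunk (illustrative value only).
TGT_SIZE=10000000

CMD="nextflow run evotools/nf-LO --source $SOURCE --target $TARGET --aligner minimap2 --tgtSize $TGT_SIZE -profile conda"

# Dry run: print the command so it can be checked before launching it.
echo "$CMD"

# Variants described in the update note above:
#   append --mm2_lowmem           to allow chunks that split individual sequences
#   append --mm2_full_alignment   to use the previous, non-chunked approach
```

Once the printed command looks right, it can be launched with `sh -c "$CMD"`.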
UPDATE 14/12/2022: NCBI/iGenomes accessions now have to be provided in the --source/--target fields, with the appropriate --igenomes_source/--ncbi_source and --igenomes_target/--ncbi_target flags used as modifiers.
UPDATE 08/06/2022: Fixed a bug in which lastz would not align small fragmented genomes, as well as small contigs, in the source assembly. Anyone interested in these small contigs should discard the previous version of nf-LO using nextflow drop evotools/nf-LO, and repeat the analyses.
UPDATE 07/06/2022: Added the possibility of providing customized conservation scores in the q-format via the --qscores flag.
You can find more details on the usage of nf-LO in the readthedocs or in the wiki pages. These also include a simple step-by-step tutorial to run the analyses on your own genomes.
Nextflow first needs to be installed. To do so, follow the instructions here
curl -s https://get.nextflow.io | bash
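Because Nextflow runs on the Java virtual machine, it can help to confirm which Java version is on the PATH before running the installer. The check below is a small sketch; `java_major` is a hypothetical helper written for this example and is not part of Nextflow or nf-LO.

```shell
# Extract the major version from a Java version string such as
# "1.8.0_292" (Java 8) or "17.0.2" (Java 17).
java_major() {
  case "$1" in
    1.*) echo "$1" | cut -d. -f2 ;;
    *)   echo "$1" | cut -d. -f1 | cut -d- -f1 ;;
  esac
}

if command -v java >/dev/null 2>&1; then
  # `java -version` writes to stderr; the version is the quoted field on line 1.
  ver=$(java -version 2>&1 | head -n 1 | cut -d'"' -f2)
  echo "Found Java major version: $(java_major "$ver")"
else
  echo "java not found on PATH"
fi
```

Compare the reported major version with the minimum required by the Nextflow release you install.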
Note: Nextflow requires Java 8 or later. We suggest installing, depending on your preferences:
The workflow natively supports four different ways to provide dependencies:
- Anaconda: this is the recommended and easiest way.
- Docker: you can create a Docker image locally by using the Dockerfile and environment.yml files in the folder
- Singularity: you can create a Singularity sif image locally by using the singularity.def and environment.yml files in the folder
- Local installation: we provide an assets/install.sh script that will take care of installing all the dependencies.
Anaconda is the easiest way to run almost all components of the workflow, the only exception being mafTools.
This can be installed locally using the assets/install_maftools.sh script, which takes care of the installation on your Linux or macOS machine.
The Singularity and Docker containers contain mafTools.
If you need further information on the installation of the dependencies, you can have a look at the specific wiki page.
After obtaining nextflow, to run the nf-LO workflow to align the S. cerevisiae and S. pombe genomes pulled directly from iGenomes simply type:
./nextflow run evotools/nf-LO --source EF2 --target sacCer3 --igenomes_source --igenomes_target --distance far --aligner minimap2 -profile conda -latest --outdir ./my_liftover_minimap2
This command will use anaconda to obtain the required dependencies and output a chain file compatible with the liftOver utility to the my_liftover_minimap2 folder. See below for more information on how to alternatively use docker, or to manually install the required tools.
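Once a chain file exists, it can be passed to the UCSC liftOver utility, whose arguments are the input file, the chain file, the output file and a file for unmapped records. The sketch below is an illustration only: the chain file name and the BED file are hypothetical placeholders, so check the output folder for the actual file name produced by your run.

```shell
# Hypothetical paths: adjust to the files produced by your own run.
CHAIN=./my_liftover_minimap2/liftover.chain
IN_BED=regions.bed   # intervals in source-genome coordinates

LIFT_STATUS=skipped
# Only run when the tool and both input files are actually available.
if command -v liftOver >/dev/null 2>&1 && [ -f "$CHAIN" ] && [ -f "$IN_BED" ]; then
  # liftOver <input> <chain> <output> <unmapped>
  liftOver "$IN_BED" "$CHAIN" lifted.bed unmapped.bed && LIFT_STATUS=done
fi
echo "liftOver step: $LIFT_STATUS"
```

Records that cannot be converted are written to unmapped.bed, which is worth inspecting when comparing aligner settings.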
By default, nf-LO will attempt to use all available cores minus one and the total amount of memory reserved by the Java virtual machine. For most installations, this means that the workflow will use up to 3 GB of memory and almost all accessible cores. Users can customize these values if the memory and/or CPUs requested are not enough, or if the workflow is running on a cluster system. To do so, specify the settings as follows:
- --max_cpus: maximum number of cpus requested and used by the tasks (e.g. --max_cpus 4 will use at most 4 cpus for a single job)
- --max_time: maximum time to use for a single job (e.g. --max_time 12.h will run a task for at most 12 hours)
- --max_memory: maximum memory used by a single job (e.g. --max_memory 16.GB will use at most 16 GB of ram for a job)
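The limits above can also be set explicitly rather than relying on the defaults. The following sketch derives a CPU cap of "all cores minus one" from the machine and prints a test-profile command using it; the memory and time values are simply the examples quoted above, not tuned recommendations.

```shell
# Number of online cores; getconf works on both Linux and macOS.
total=$(getconf _NPROCESSORS_ONLN)

# Leave one core free for the system, but never request fewer than one.
cpus=$((total - 1))
if [ "$cpus" -lt 1 ]; then
  cpus=1
fi

# Dry run: print the command with explicit per-job limits.
echo "nextflow run evotools/nf-LO -profile test,conda --max_cpus $cpus --max_memory 16.GB --max_time 12.h"
```

Copy the printed line into the terminal to launch the test run with those caps.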
nf-LO comes with a series of pre-defined profiles:
- standard: this profile runs all dependencies using anaconda
- local: runs using local executables instead of containerized/conda dependencies (see manual installation for further details)
- conda: runs the dependencies within conda
- uge: runs using UGE scheduling system
- sge: runs using SGE scheduling system
- Additional profiles: see additional profiles supported here
There are three different ways a user can specify genomes to align. Note that in each case the source genome is the genome of origin, from which you wish to lift the positions. The target genome is the genome to which you wish to lift the positions. We recommend using soft-masked genomes to reduce the computation time for aligners such as lastz.
The source and target genomes can be specified as local or remote (un)compressed fasta files using the --source and --target flags.
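As an illustration of mixing local and remote inputs, the sketch below classifies each genome argument before printing the command. `genome_kind` is a hypothetical helper for this example, and both paths are placeholders; treating http, https and ftp prefixes as remote is an assumption of this sketch.

```shell
# Classify a genome argument as remote URL, existing local file, or missing.
genome_kind() {
  case "$1" in
    http://*|https://*|ftp://*) echo "remote" ;;
    *) if [ -f "$1" ]; then echo "local"; else echo "missing"; fi ;;
  esac
}

SOURCE=./genomes/source.fa.gz          # hypothetical local compressed fasta
TARGET=https://example.org/target.fa   # hypothetical remote fasta

# Report what each input looks like so typos in local paths surface early.
for g in "$SOURCE" "$TARGET"; do
  echo "$(genome_kind "$g"): $g"
done

# Dry run: print the command instead of launching it.
echo nextflow run evotools/nf-LO --source "$SOURCE" --target "$TARGET" -profile conda
```

A "missing" result points to a wrong local path, which is cheaper to catch here than after the workflow has started.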
nf-LO can download fasta files from NCBI directly using the datasets API. Users provide a GCA/GCF code in the --source/--target field, and add the --ncbi_source and --ncbi_target flags as follows:
nextflow run evotools/nf-LO --source "GCF_001549955.1" --target "GCF_011751205.1" --ncbi_source --ncbi_target -profile conda
nf-LO can also download genomes from the iGenomes site. Users provide the iGenomes genome identifier (e.g. equCab2) in the --source/--target field, and add the --igenomes_source and --igenomes_target flags as follows:
nextflow run evotools/nf-LO --source "equCab2" --target "dm6" --igenomes_source --igenomes_target -profile conda
Note that it is possible to mix source and target flags, for example using --igenomes_source with --ncbi_target.
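To make the mixing of flags concrete, the sketch below picks a modifier flag for each side from the shape of the value. `modifier_for` is a hypothetical helper, and its rules (GCA_/GCF_ prefixes mean NCBI, path-like values mean a fasta file, anything else is treated as an iGenomes name) are a heuristic assumed for this example, not logic taken from nf-LO.

```shell
# Print the modifier flag for one side ("source" or "target"), or nothing
# when the value looks like a fasta path or URL.
modifier_for() {
  side=$1
  value=$2
  case "$value" in
    GCA_*|GCF_*) echo "--ncbi_$side" ;;
    */*|*.fa|*.fasta|*.fna|*.gz) echo "" ;;
    *) echo "--igenomes_$side" ;;
  esac
}

SRC=equCab2           # iGenomes-style name, reused from the example above
TGT=GCF_011751205.1   # NCBI accession, reused from the example above

# Unquoted substitutions are intentional: an empty modifier simply disappears.
echo nextflow run evotools/nf-LO --source "$SRC" --target "$TGT" \
  $(modifier_for source "$SRC") $(modifier_for target "$TGT") -profile conda
```

The pairing of these two genomes is only there to show the flag syntax and is not meant as a sensible alignment.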
The workflow provides custom configurations for the different algorithms and distances. NOTE: the alignment stage heavily affects the results of the chaining process, so we strongly recommend testing several configurations, including custom ones. To see the presets available and how to fine-tune the pipeline go to our Alignments wiki page. The chain/net generation can also be fine-tuned to achieve better results (see Chain/Netting).
UPDATE 07/06/2022: it is now possible to specify customized conservation scores as q files (see here for examples) using the --qscores option and providing the correct input file.
If you're running the workflow on a local workstation, single node or a local server, we recommend defining the maximum number of cores and amount of memory for each job.
You can set these using --max_cpus NCPU and --max_memory 'MEM.GB', where NCPU is the maximum number of cpus per task and MEM is the maximum amount of memory for a single task.
To test the pipeline locally, simply run:
nextflow run evotools/nf-LO -profile test,conda
This will download and run the pipeline on the two toy genomes provided and generate liftover files. If you have all dependencies installed locally you can omit conda from the profile configuration.
Alternatively, you can run it on your own genomes using a command like this:
nextflow run evotools/nf-LO \
--source genome1 \
--target genome2 \
--annotation myfile.gff \
--annotation_format gff \
--distance near \
--aligner lastz \
--tgtSize 10000000 \
--tgtOvlp 100000 \
--srcSize 20000000 \
--liftover_algorithm crossmap \
--outdir ./my_liftover \
--publish_dir_mode copy \
--max_cpus 8 \
--max_memory 32.GB \
-profile conda
This analysis will run using genome1 and genome2 as source and target, respectively. The source genome will be fragmented in chunks of 20Mb, whereas the target will be fragmented in 10Mb chunks overlapping 100Kb. It will use lastz as the aligner using the preset for closely related genomes (near). The output files will be copied into the folder my_liftover.
- How do I liftover between two haplotypes of the same genome? You can lift over positions between haplotypes of the same individual (i.e. having the sequences named *_hap* or *_alt*) by providing the --haplotypes option.
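Before enabling the haplotype mode, it may be useful to check whether a fasta file contains sequences following the naming pattern mentioned above. The sketch below counts such headers; `count_haplotype_seqs` and the file name are hypothetical, and the pattern is only an approximation of the *_hap* / *_alt* convention.

```shell
# Count fasta headers whose sequence name contains _hap or _alt.
count_haplotype_seqs() {
  # grep -c still prints 0 when nothing matches but exits non-zero,
  # so the exit status is deliberately ignored.
  grep -Ec '^>[^ ]*_(hap|alt)' "$1" || true
}

GENOME=genome_with_haplotypes.fa   # hypothetical uncompressed fasta
if [ -f "$GENOME" ]; then
  n=$(count_haplotype_seqs "$GENOME")
  echo "$n haplotype/alternate sequences found in $GENOME"
else
  echo "no such file: $GENOME"
fi
```

If the count is above zero, adding --haplotypes to the usual nextflow run command is the option described in the answer above.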
To cite nf-LO, please refer to:
nf-LO: A scalable, containerised workflow for genome-to-genome lift over
Andrea Talenti, James Prendergast
Genome Biology and Evolution, 2021, evab183, https://doi.org/10.1093/gbe/evab183
Adaptive seeds tame genomic sequence comparison. Kiełbasa SM, Wan R, Sato K, Horton P, Frith MC. Genome Res. 2011 21(3):487-93; http://dx.doi.org/10.1101/gr.113985.110
Harris, R.S. (2007) Improved pairwise alignment of genomic DNA. Ph.D. Thesis, The Pennsylvania State University
Li, H. (2018). Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics, 34:3094-3100. http://dx.doi.org/10.1093/bioinformatics/bty191
Kent WJ. BLAT - the BLAST-like alignment tool. Genome Res. 2002 Apr;12(4):656-64
Zhao, H., Sun, Z., Wang, J., Huang, H., Kocher, J.-P., & Wang, L. (2013). CrossMap: a versatile tool for coordinate conversion between genome assemblies. Bioinformatics (Oxford, England), btt730
Lin, HN., Hsu, WL. GSAlign: an efficient sequence alignment tool for intra-species genomes. BMC Genomics 21, 182 (2020). https://doi.org/10.1186/s12864-020-6569-1