A simple Nextflow workflow designed to map short sequences to human genome and perform a few basic analyses of the mapping results. The short sequences may be of different origin, e.g. sgRNA sequences or sequencing short reads (for simplicity such short sequences are referred to as "reads" in this documentation). If you feel that the provided here help is not enough, see detailed comments in the workflow files.
- Environment setup
1.1. Manual environment configuration
1.2. Automatic environment setup with Docker - Workflow detailed description
2.1. Workflow configuration details
2.2. Workflow design - Running the workflow
The workflow is intendent to be run in Bash on Linux operating systems. Miniconda or Anaconda installation is required. The workflow has been tested using Miniconda installation (conda 23.5.0) and the following packages:
- python 3.10.11
- pip 23.1.2
- numpy 1.24.3
- pandas 2.0.2
- pysam 0.21.0
- pyensembl 2.2.8
- r-base 4.2.0
- bioconductor-tcgabiolinks 2.25.3
- bowtie2 2.5.1
- samtools 1.17
- nextflow 23.04.1
To run the workflow three steps must be taken. Firstly, Nextflow must be installed in the conda environment the pipeline will be launched from (e.g. the base environment):
conda install -c bioconda -c conda-forge nextflow==23.04.1
Then workflow-py
and workflow-r
environments should be created using conda/workflow-py.txt
and conda/workflow-r.txt
files, respectively:
conda create --name workflow-py --file conda/workflow-py.txt
conda create --name workflow-r --file conda/workflow-r.txt
Important: params.condaEnvPy
and params.condaEnvR
in nextflow.config
file must indicated the path of the workflow-py
and workflow-r
, respectively. The default settings are params.condaEnvPy = '/miniconda3/envs/workflow-py'
and params.condaEnvR = '/miniconda3/envs/workflow-r'
, and it is fit for usage in a Docker container. If you use the workflow in another way, please remember to change those to valid paths of existing conda environments.
Finally, the pyensembl
package is supposed to be installed using pip
into the workflow-py
environment:
conda activate workflow-py
pip install pyensembl==2.2.8
conda deactivate
or
<path_to_workflow-py_directory>/bin/pip install pyensembl==2.2.8
The following setup was tested using Docker ver. 24.0.2. In the main workflow directory there is a Dockerfile
. It allows for creation of a Docker image (based on ubuntu:22.04 image from the Docker library) as well as for a completely automatic setup of the workflow environment. All the workflow files are gathered in /hg-mapping
image location. To create such an image (named here workflow-ubuntu:22.04
) clone the repository and run the following command in the workflow main directory where the Dockerfile
is present:
docker build -t workflow-ubuntu:22.04 ./
Then you can create a container and run it, e.g. interactively like this:
docker run -it workflow-ubuntu:22.04
You can download a ready-to-use workflow-ubuntu:22.04
image here (2.6 GB).
Below you will find a tree of all workflow files that are provided. When the workflow is launched, the output files will be published in a subdirectory named output
.
<workflow_location>/
├── conda/
│ ├── workflow-py.txt
│ └── workflow-r.txt
├── input/
│ ├── library.fa
│ └── TCGA_samples.txt
├── templates/
│ ├── analyse_genes.py
│ ├── analyse_mapping.py
│ └── fetch_matrix.r
├── nextflow.config
└── main.nf
The workflow is ready to run and perform its tasks using files that are listed in the table below. The file paths are assigned to Nextflow parameters (params) in the nextflow.config
file. All output files will be published in the output
subdirectory. For more information see comments in the workflow files (main.nf
and nextflow.config
) as well as the next section.
In order to process other than provided files, change the paths assigned to Nexflow params in the nexflow.config
file. Should you need to process input files and save the output files from/to external locations (in respect to the container), consider using bind mount mechanism when running docker run
command (the -v
/--volume
parameter) to bind external locations to the input
and output
workflow subdirectories inside of a container.
Nextflow parameter | File path | Use |
---|---|---|
params.readsFile |
input/library.fa |
Reads to be mapped delivered in a FASTA or a FASTQ file. |
params.samplesTxtFile |
input/TCGA_samples.txt |
A list of TCGA sample ids in a text file. |
params.genomeFastaFile |
Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz remote (foreign) file from Ensembl FTP location |
FASTA file with human genome GRCh38 (release 109) primary assembly reference sequences (not masked) for mapping the reads to. |
params.genomeGtfFile |
Homo_sapiens.GRCh38.109.gtf.gz remote (foreign) file from Ensembl FTP location |
GTF file providing annotations for the reference sequences. |
params.chromosomeYFile |
Homo_sapiens.GRCh38.dna.chromosome.Y.fa.gz remote (foreign) file from Ensembl FTP location |
A short and thus quickly processable reference sequence of the human chromosome Y in a FASTA file that is intended to be used for testing purposes (when swapped with params.genomeFastaFile in main.nf file). |
The workflow consists of the following stages/processes:
No. | Process name | Task description |
---|---|---|
1. | buildIndex |
Using bowtie-build , builds reference sequence index from sequences in the input params.genomeFastaFile . Uses index/genome as the index prefix and saves the index to the output subdirectory. |
2. | mapReads |
Using bowtie2 , maps reads from the params.readsFile to params.genomeFastaFile reference. Saves the results to a gzipped SAM file output/mapping.sam.gz . |
3. | filterMapping |
Using samtools view , filters the mapping results in respect to MAPQ values (>= 30). Saves the results to a gzipped SAM file output/mapping_filtered.sam.gz . |
4. | analyseMapping |
Using templates/analyse_mapping.py Python script that utilises Pysam module, analyses filtered mapping results in order to calculate the end positions of mapped reads (based on CIGAR values) and the strand reads were mapped to (based on FLAG values). Saves the results to a gzipped TSV file mapping_analysis.tsv.gz . Next to QNAME, FLAG, RNAME, POS, MAPQ, CIGAR columns from the SAM input file (names are converted to lower case: qname , flag , rname , pos , mapq , cigar ), renders the end (based on CIGAR) and strand (based on FLAG) columns that denote respectively the end locations of reads within the reference sequence and the strand of the reference sequence reads were mapped to. The pos and end are 1-based and both inclusive, which corresponds to GenBank notation. |
5. | analyseGenes |
Using templates/analyse_genes.py Python script that utilises PyEnsembl module, obtains information of genes the input reads were mapped within. It uses params.genomeGtfFile that indicates the location of the file with annotations for the reference sequences. Saves the results to a gzipped TSV file gene_analysis.tsv.gz . The output file contains qname column (a read sequence id) next to gene_names and gene_ids columns that contain respectively gene names and their ids obtained from Ensembl database. If there is more than one gene in the locus where a given read was mapped, names/ids are separated by a semicolon followed by space ('; ' ). The resulting data may be used to check whether the gene name provided in a read sequence id (qname) may be found among names obtained from Ensembl database based on a read location. |
6. | fetchMatrix |
Using templates/fetch_matrix.r R script that utilises TCGAbiolinks R Bioconductor module, obtains expression matrices for samples, the name of which are given in the params.samplesTxtFile . Saves the results to a gzipped TSV file gene_matrix.tsv.gz . The first column of the output file is an index column that contains gene ids (selected during the previous stage), and the remaining columns contain expression data for the samples in the order their ids are provided in the input params.samplesTxtFile . |
Once the environment is ready and the configuration optionally adjusted to your needs, run the workflow:
nextflow main.nf