hg-mapping

A simple Nextflow workflow designed to map short sequences to human genome and perform a few basic analyses of the mapping results. The short sequences may be of different origin, e.g. sgRNA sequences or sequencing short reads (for simplicity such short sequences are referred to as "reads" in this documentation). If you feel that the provided here help is not enough, see detailed comments in the workflow files.

1. Environment setup

The workflow is intendent to be run in Bash on Linux operating systems. Miniconda or Anaconda installation is required. The workflow has been tested using Miniconda installation (conda 23.5.0) and the following packages:

python 3.10.11
pip 23.1.2
numpy 1.24.3
pandas 2.0.2
pysam 0.21.0
pyensembl 2.2.8
r-base 4.2.0
bioconductor-tcgabiolinks 2.25.3
bowtie2 2.5.1
samtools 1.17
nextflow 23.04.1

1.1. Manual environment configuration

To run the workflow three steps must be taken. Firstly, Nextflow must be installed in the conda environment the pipeline will be launched from (e.g. the base environment):

conda install -c bioconda -c conda-forge nextflow==23.04.1

Then workflow-py and workflow-r environments should be created using conda/workflow-py.txt and conda/workflow-r.txt files, respectively:

conda create --name workflow-py --file conda/workflow-py.txt
conda create --name workflow-r  --file conda/workflow-r.txt

Important: params.condaEnvPy and params.condaEnvR in nextflow.config file must indicated the path of the workflow-py and workflow-r, respectively. The default settings are params.condaEnvPy = '/miniconda3/envs/workflow-py' and params.condaEnvR = '/miniconda3/envs/workflow-r', and it is fit for usage in a Docker container. If you use the workflow in another way, please remember to change those to valid paths of existing conda environments.

Finally, the pyensembl package is supposed to be installed using pip into the workflow-py environment:

conda activate workflow-py
pip install pyensembl==2.2.8
conda deactivate

or

<path_to_workflow-py_directory>/bin/pip install pyensembl==2.2.8

1.2. Automatic environment setup with Docker

The following setup was tested using Docker ver. 24.0.2. In the main workflow directory there is a Dockerfile. It allows for creation of a Docker image (based on ubuntu:22.04 image from the Docker library) as well as for a completely automatic setup of the workflow environment. All the workflow files are gathered in /hg-mapping image location. To create such an image (named here workflow-ubuntu:22.04) clone the repository and run the following command in the workflow main directory where the Dockerfile is present:

docker build -t workflow-ubuntu:22.04 ./

Then you can create a container and run it, e.g. interactively like this:

docker run -it workflow-ubuntu:22.04

You can download a ready-to-use workflow-ubuntu:22.04 image here (2.6 GB).

2. Workflow detailed description

2.1. Workflow tree

Below you will find a tree of all workflow files that are provided. When the workflow is launched, the output files will be published in a subdirectory named output.

<workflow_location>/
├── conda/
│   ├── workflow-py.txt
│   └── workflow-r.txt
├── input/
│   ├── library.fa
│   └── TCGA_samples.txt
├── templates/
│   ├── analyse_genes.py
│   ├── analyse_mapping.py
│   └── fetch_matrix.r
├── nextflow.config
└── main.nf

2.2. Workflow configuration details

The workflow is ready to run and perform its tasks using files that are listed in the table below. The file paths are assigned to Nextflow parameters (params) in the nextflow.config file. All output files will be published in the output subdirectory. For more information see comments in the workflow files (main.nf and nextflow.config) as well as the next section.

In order to process other than provided files, change the paths assigned to Nexflow params in the nexflow.config file. Should you need to process input files and save the output files from/to external locations (in respect to the container), consider using bind mount mechanism when running docker run command (the -v/--volume parameter) to bind external locations to the input and output workflow subdirectories inside of a container.

Nextflow parameter	File path	Use
`params.readsFile`	`input/library.fa`	Reads to be mapped delivered in a FASTA or a FASTQ file.
`params.samplesTxtFile`	`input/TCGA_samples.txt`	A list of TCGA sample ids in a text file.
`params.genomeFastaFile`	`Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz` remote (foreign) file from Ensembl FTP location	FASTA file with human genome GRCh38 (release 109) primary assembly reference sequences (not masked) for mapping the reads to.
`params.genomeGtfFile`	`Homo_sapiens.GRCh38.109.gtf.gz` remote (foreign) file from Ensembl FTP location	GTF file providing annotations for the reference sequences.
`params.chromosomeYFile`	`Homo_sapiens.GRCh38.dna.chromosome.Y.fa.gz` remote (foreign) file from Ensembl FTP location	A short and thus quickly processable reference sequence of the human chromosome Y in a FASTA file that is intended to be used for testing purposes (when swapped with `params.genomeFastaFile` in `main.nf` file).

2.3. Workflow design

The workflow consists of the following stages/processes:

No.	Process name	Task description
1.	`buildIndex`	Using `bowtie-build`, builds reference sequence index from sequences in the input `params.genomeFastaFile`. Uses `index/genome` as the index prefix and saves the index to the `output` subdirectory.
2.	`mapReads`	Using `bowtie2`, maps reads from the `params.readsFile` to `params.genomeFastaFile` reference. Saves the results to a gzipped SAM file `output/mapping.sam.gz`.
3.	`filterMapping`	Using `samtools view`, filters the mapping results in respect to MAPQ values (>= 30). Saves the results to a gzipped SAM file `output/mapping_filtered.sam.gz`.
4.	`analyseMapping`	Using `templates/analyse_mapping.py` Python script that utilises Pysam module, analyses filtered mapping results in order to calculate the end positions of mapped reads (based on CIGAR values) and the strand reads were mapped to (based on FLAG values). Saves the results to a gzipped TSV file `mapping_analysis.tsv.gz`. Next to QNAME, FLAG, RNAME, POS, MAPQ, CIGAR columns from the SAM input file (names are converted to lower case: `qname`, `flag`, `rname`, `pos`, `mapq`, `cigar`), renders the `end` (based on CIGAR) and `strand` (based on FLAG) columns that denote respectively the end locations of reads within the reference sequence and the strand of the reference sequence reads were mapped to. The `pos` and `end` are 1-based and both inclusive, which corresponds to GenBank notation.
5.	`analyseGenes`	Using `templates/analyse_genes.py` Python script that utilises PyEnsembl module, obtains information of genes the input reads were mapped within. It uses `params.genomeGtfFile` that indicates the location of the file with annotations for the reference sequences. Saves the results to a gzipped TSV file `gene_analysis.tsv.gz`. The output file contains `qname` column (a read sequence id) next to `gene_names` and `gene_ids` columns that contain respectively gene names and their ids obtained from Ensembl database. If there is more than one gene in the locus where a given read was mapped, names/ids are separated by a semicolon followed by space (`'; '`). The resulting data may be used to check whether the gene name provided in a read sequence id (qname) may be found among names obtained from Ensembl database based on a read location.
6.	`fetchMatrix`	Using `templates/fetch_matrix.r` R script that utilises TCGAbiolinks R Bioconductor module, obtains expression matrices for samples, the name of which are given in the `params.samplesTxtFile`. Saves the results to a gzipped TSV file `gene_matrix.tsv.gz`. The first column of the output file is an index column that contains gene ids (selected during the previous stage), and the remaining columns contain expression data for the samples in the order their ids are provided in the input `params.samplesTxtFile`.

3. Running the workflow

Once the environment is ready and the configuration optionally adjusted to your needs, run the workflow:

nextflow main.nf

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

hg-mapping

Contents

1. Environment setup

1.1. Manual environment configuration

1.2. Automatic environment setup with Docker

2. Workflow detailed description

2.1. Workflow tree

2.2. Workflow configuration details

2.3. Workflow design

3. Running the workflow

About

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
conda		conda
input		input
templates		templates
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
main.nf		main.nf
nextflow.config		nextflow.config

License

michalbukowski/hg-mapping

Folders and files

Latest commit

History

Repository files navigation

hg-mapping

Contents

3. Running the workflow

About

Resources

License

Stars

Watchers

Forks

Languages