Skip to content

A pipeline to call SNPs from RNA-seq data and plot it along chromosomal coordinates

License

MIT, MIT licenses found

Licenses found

MIT
LICENSE
MIT
LICENSE.md
Notifications You must be signed in to change notification settings

BleekerLab/genotype_by_sequencing

Folders and files

NameName
Last commit message
Last commit date

Latest commit

ย 

History

23 Commits
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 

Repository files navigation

Genotyping by sequencing (DNA- or RNA-seq data)

Snakemake
Miniconda GitHub Issues Documentation Status

Description

A Snakemake pipeline that calls SNPs (Single Nucleotide Polymorphisms) from either DNA-seq or mRNA-seq data. It trims and aligns DNA/mRNA-seq fastq to a reference genome, then call SNPs before computing the number of SNPs per genomic window (e.g. per 1Mb). This pipeline can process single or paired-end data and is mostly suited for Illumina sequencing data.

Steps

  1. The raw fastq files will be trimmed for adaptors and quality checked with fastp.
  2. The genome sequence FASTA file will be used for the mapping step of the trimmed reads using either bwa or STAR.
  3. The reference genome will be sliced into bins of a pre-determined size (e.g. 1Mb).
  4. SNPs are called based on the alignment .bam files generating one VCF file per sample.
  5. SNPs are filtered based on a read depth threshold and SNP quality.
  6. The obtained filtered VCF file(s) are converted to the BED format .bed to allow computation.
  7. The number of overlapping SNPs per genome bin is computed using BEDOPS bedmap operator.

Input files

  • DNA or RNA-seq fastq files as listed in the config/samples.tsv file. Specify a sample name (e.g. "Sample_A") in the sample column and the paths to the forward read (fq1) and to the reverse read (fq2). If you have single-end reads, leave the fq2 column empty.
  • A genomic reference in FASTA format. For instance, a fasta file containing the 12 chromosomes of tomato (Solanum lycopersicum).
  • (for RNA-seq datasets) A genome annotation file in the GTF format. You can convert a GFF annotation file format into GTF with the gffread program from Cufflinks: gffread my.gff3 -T -o my.gtf. โš ๏ธ for featureCounts to work, the feature in the GTF file should be exon while the meta-feature has to be transcript_id. This will be converted to a BED file to compute the number of genes per genomic bin.

Below is an example of a GTF file format. โš ๏ธ a real GTF file does not have column names (seqname, source, etc.). Remove all non-data rows.

seqname source feature start end score strand frame attributes
SL4.0ch01 maker_ITAG CDS 279 743 . + 0 transcript_id "Solyc01g004000.1.1"; gene_id "gene:Solyc01g004000.1"; gene_name "Solyc01g004000.1";
SL4.0ch01 maker_ITAG exon 1173 1616 . + . transcript_id "Solyc01g004002.1.1"; gene_id "gene:Solyc01g004002.1"; gene_name "Solyc01g004002.1";
SL4.0ch01 maker_ITAG exon 3793 3971 . + . transcript_id "Solyc01g004002.1.1"; gene_id "gene:Solyc01g004002.1"; gene_name "Solyc01g004002.1";

Output files

  • One .tsv file per sample summarising the number of counts per genome bin called [sample name].counts.tsv. This table is used for plotting.
  • fastp QC reports: one per fastq file.

Prerequisites: what you should know before using this pipeline

  • Some command of the Unix Shell to connect to a remote server where you will execute the pipeline. You can find a good tutorial from the Software Carpentry Foundation here and another one from Berlin Bioinformatics here.
  • Some command of the Unix Shell to transfer datasets to and from a remote server (to transfer sequencing files and retrieve the results/). The Berlin Bioinformatics Unix begginer guide available here) should be sufficient for that (check the wget and scp commands).
  • An understanding of the steps of a canonical RNA-Seq analysis (trimming, alignment, etc.). You can find some info here.

Content of this GitHub repository

  • Snakefile: a master file that contains the desired outputs and the rules to generate them from the input files.
  • config/samples.tsv: a file containing sample names and the paths to the forward and eventually reverse reads (if paired-end). This file has to be adapted to your sample names before running the pipeline.
  • config/config.yaml: the configuration files making the Snakefile adaptable to any input files, genome and parameter for the rules.
  • config/refs/: a folder containing
    • a genomic reference in fasta format. The S_lycopersicum_chromosomes.4.00.chrom1.fa is placed for testing purposes.
    • a GTF annotation file. The ITAG4.0_gene_models.sub.gtf for testing purposes.
  • The environment.yaml is used by the conda package manager to create a working environment (see below).

You have to create the config/fastq/ folder that should contain test fastq files. These test fastq files are subsetted paired-end fastq files used to test locally the pipeline. They are generated using Seqtk:seqtk sample -s100 <inputfile> 250000 > <output file>. Files can be downloaded from Zenodo.

Installation and usage (local machine)

Installation

You will need a local copy of the GitHub genotype_by_sequencing repository on your machine. You can either:

  • use git in the shell: git clone git@github.com:BleekerLab/genotype_by_sequencing.git.
  • click on "Clone or download" and select download.
  • Then navigate inside the genotype_by_sequencing folder using Shell commands.

Usage

Configuration โœ๏ธ

You'll need to change a few things to accomodate this pipeline to your needs.

  1. Make sure you have changed the parameters in the config/config.yaml file that specifies where to find the sample data file, the genomic and transcriptomic reference fasta files to use and the parameters for certains rules etc. in particular, indicate whether your data are DNA-seq or RNA-seq data
    The type of input data has to be either DNA or RNA in the datatype section of the config.yaml.

This file is used so the Snakefile does not need to be changed when locations or parameters need to be changed.

๐Ÿ“ conda

Using the conda package manager, you need to create an environment where core softwares such as Snakemake will be installed.

  1. Install the Miniconda3 distribution (>= Python 3.7 version) for your OS (Windows, Linux or Mac OS X).
  2. Inside a Shell window (command line interface), create a virtual environment named gbs using the environment.yaml file with the following command: conda install -c conda-forge mamba --yes && mamba env create --file environment.yaml. The mamba package manager is faster than conda.
  3. Then, before you run the Snakemake pipeline, activate this virtual environment with conda activate gbs.

Dry run

  • Use the snakemake -np to perform a dry run that prints out the rules and commands.

Real run

With conda: snakemake --cores 10

Installation and usage (HPC cluster)

Installation

You will need a local copy of the GitHub genotype_by_sequencing repository on your machine. On a HPC system, you will have to clone it using the Shell command-line: git clone git@github.com:BleekerLab/genotype_by_sequencing.git.

  • click on "Clone or download" and select download.
  • Then navigate inside the genotype_by_sequencing folder using Shell commands.

SLURM sbatch usage

See the detailed protocol here.

Here is an example script to be saved as my_run.sh and executed with SLURM as sbatch my_run.sh
This will submit the batch script to SLURM sbatch.

#!/bin/bash
#
#SBATCH --job-name=snakemake_gbs        # job name
#SBATCH  --time=24:00:00                # mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mem-per-cpu=8G                # RAM requested per job (in Snakemake, one rule = one job)
#SBATCH --output=parallel_%j.log        # standard output and error log (%j substitutes the JOB ID)


#SBATCH --nodes=1                       # Run all processes on a single node	
#SBATCH --cpus-per-task=30              # Number of CPUs per task

source activate genotype

srun snakemake -j $SLURM_CPUS_PER_TASK

SLURM interactive usage

To start an interactive session, do:

  • Step1: srun --time=24:00:00 --mem-per-cpu=8G --cpus-per-task=10 --pty bash -i
  • Step2: conda activate gbs
  • Step3: snakemake -j 10 (since we specified 10 cpus per task)
  • Step4: exit

This starts an interactive bash with 10 CPUs allocated per task and 8G of RAM per CPU.

Retrieving your results

For instance, download the results but exclude bam files (too big).
rsync -a -v -e ssh --exclude="*bam" mgallan1@omics-h0.science.uva.nl:/zfs/omics/personal/mgallan1/workspace/genotype_by_sequencing/results/ [local directory]

Useful links

Directed Acyclic Graph of jobs

dag

References ๐Ÿ“—

Authors

Pipeline dependencies

Acknowledgments ๐Ÿ‘

Johannes Kรถster; creator of Snakemake.

Citation

If you use this software, please use the following citation:

Bliek T. and Galland M. (2021). Genotyping by sequencing pipeline (version 0.1.0).

About

A pipeline to call SNPs from RNA-seq data and plot it along chromosomal coordinates

Resources

License

MIT, MIT licenses found

Licenses found

MIT
LICENSE
MIT
LICENSE.md

Stars

Watchers

Forks

Packages

No packages published