Skip to content

biocorecrg/allele_specific_RNAseq

Repository files navigation

Allele-specific RNA-seq pipeline

Docker Build Status Nextflow version Singularity version Docker version Build Status

This Nextflow[1] based workflow allows alignment of either single or paired ends reads to a reference transcriptome described as a genome in fasta file plus an annotation in GTF format using STAR[2] aligner considering the variant information provided as a VCF file.

STAR implements the WASP[3] method for filtering of allele specific alignments and reports within the resulting alignments which variant is found and from which allele. A python script based on pysam [4] efficiently separates the alignment generated by STAR into four different files:

  • Allele A
  • Allele B
  • Neither A nor B
  • Ambiguous (found both variants on one or different pairs)

HTseq-count[5] tool is then used for counting tags separated in those three categories mapping to the genes. Finally a report is generated using multiQC[6] that summarize the results of each step together with an initial QC evaluation fo raw reads done using FastQC[7]

Install

You need to install either Docker or Singularity and Nextflow.

Then you can clone the repository:

git clone --depth 1 git@github.com:biocorecrg/allele_specific_RNAseq.git

or

git clone --depth 1 https://github.com/biocorecrg/allele_specific_RNAseq.git 

depending on your GitHub configuration.

Prepare the input data

You need to extract the SNP information from the global VCF file, so first of all you need to download the VCF file with the index:

wget ftp://ftp-mouse.sanger.ac.uk/REL-1505-SNPs_Indels/mgp.v5.merged.snps_all.dbSNP142.vcf.gz
wget ftp://ftp-mouse.sanger.ac.uk/REL-1505-SNPs_Indels/mgp.v5.merged.snps_all.dbSNP142.vcf.gz.tbi

then the reference genome:

wget ftp://ftp-mouse.sanger.ac.uk/ref/GRCm38_68.fa

and the annotation from Ensembl. We used the version Mus_musculus.GRCm38.68 not available in Ensembl archive.

You can use the pipelines specifying between Docker and Singularity containers by using the options

-with-singularity

or

-with-docker

The module makeAnno can be used for generating a VCF file with SNP for the interesting species and a genome with SNP position masked with Ns.

For using the module:

cd makeAnno
NXF_VER=20.01.0 nextflow run make_anno.nf -with-singularity -bg --vcffile mgp.v5.merged.snps_all.dbSNP142.vcf.gz --speciesA CAST_EiJ --speciesB 129S1_SvImJ --genome GRCm38_68.fa --outvcf CAST_EiJ-129S1_SvImJ.vcf > log

Or in case you want to compare against the reference genome:

cd makeAnno
NXF_VER=20.01.0 nextflow run make_anno.nf -with-singularity -bg --vcffile mgp.v5.merged.snps_all.dbSNP142.vcf.gz --speciesA CAST_EiJ --speciesB reference --genome GRCm38_68.fa --outvcf CAST_EiJ-C57BL_6J.vcf > log

This can take some memory for masking the genome. In the output folder named filteredVCF you will find both gzipped masked genome and gzipped vcf annotations.

Run the pipeline

cd allele_specific_RNAseq; 
NXF_VER=20.01.0 nextflow run as_rnaseq.nf -with-singularity -bg > log

optionally you might want to check your pipeline on Nextflow's Tower website. You need to register using an istitutional mail and set the token provided in a variable as described:

export TOWER_ACCESS_TOKEN=<<<<<TOKEN NUMBER>>>>>>
NXF_VER=20.01.0 nextflow run as_rnaseq.nf -with-singularity -bg -with-tower > log 

you can check the status of your pipeline live on the tower website.

The parameters for running the pipeline are defined in the file params.config that can be changed accordingly.

parameter value
reads $baseDir/test/*_{1,2}.fastq.gz
genome $baseDir/test/GRCm38_68_19.masked.fa.gz
annotation $baseDir/test/Mus_musculus.GRCm38.68_19.gtf
strandness reverse
variants $baseDir/test/19_filt.vcf.gz
output $baseDir/output_test
single NO
varcut 1
title Allele specific RNAseq project
subtitle This is my wonderful RNA experiment
PI Luca Cozzuto
User Luca Cozzuto
UCSCgenomeID mm10
email mymail@mydomain.eu

providing a real email address will deliver a mail with the multiqc report when the analysis is finished.

Fastq reads

Fastq paired ends reads can be either plain or gzipped. You need to specify this syntax for capturing the identifier from single end samples:

PATH/*.fastq.gz

and this for paired end ones:

PATH/*_{1,2}.fastq.gz

Genome

Gzipped masked fasta file of the genome obtained running makeAnno

Strandness

It can be either "unstranded" or "forward" or "reverse". It depends on the kind of sequencing.

annotation

GTF file

variants

Gzipped VCF file obtained running makeAnno

single

YES: single end reads. NO: paired ends

varcut

Number of SNP needed for assigning a read to a variant.

Results

The following folder will contain the final outputs:

  • Index: the indexed genome
  • QC: containing fastQC results
  • Alignments: containing sorted BAM files as they were from normal bulk RNAseq
  • Report: A detailed report for the pipeline run, that will be sent via email.
  • Allele_alignments: containing BAM files containing reads with variants marked as: alleleA, alleleB, ref and ambiguous. The count is done considering the strand protocol used and specified as a parameter.
  • cut_N * A folder that is generated for each SNP cut off chosen containing the following sub-folders:
  • Allele_single_counts: containing the count per gene per sample and single alleles. The count is done considering the strand protocol used and specified as a parameter.
  • Allele_merged_Counts: containing the count per gene per allele in a single file for each sample. The count is done considering the strand protocol used and specified as a parameter.
  • Counts: composite counts per gene per sample. No distincion between alleles. Four fields: gene id, tot counts considering the sequencing unstranded, tot counts considering the sequencind done on the forward strand, tot counts considering reverse strand. This information is useful for validating the strand specific protocol used in the sequencing step.
  • Proportions: for each sample the read count per allele are divided for the sum of counts of the variants and multiplied for the composite count. The count is done considering the strand protocol used and specified as a parameter.
propA = alleleA/(alleleA+alleleB) * composite 
propB = alleleB/(alleleA+alleleB) * composite