
JKoesterich/RNAseq-container


RNAseq-container

This pipeline takes in paired-end RNA reads, filters them by quality, removes duplicate reads, aligns the remaining reads to a transcriptome, and calculates the statistical significance of differential expression between cases and controls.

The programs run in this pipeline are:

  • fastp - does quality control filtering on paired-end reads
  • ParDRE - removes artifact duplicates from the paired-end reads
  • kallisto - pseudoaligns the reads to a transcriptome
  • DESeq - does statistical analysis of differential expression between cases and controls
  • Additional scripts I created to generate plots and graphs on the data in the intermediate files

The code is split into a Python script, which runs the command-line programs, and an R script, which runs the R program.

Python script

This script takes in the input files and output folder.

Input files

The input files are provided to the program in a single text file.
Each set of paired-end FASTQ files is given on its own line, in the form Input: path/file_R1 \t path/file_R2.
The full path of the transcriptome file is given on its own line, in the form Kallisto reference: path/to/transcriptome.
The input files are paired-end FASTQ reads. The two files of each pair have the same file name, except that one ends in R1 and the other in R2.
The transcriptome file can be either a transcriptome file containing the transcripts for a genome or a kallisto index file (ending in .idx) if the transcriptome was previously used in another run of kallisto.

More description of supplying the input files can be found in the singularity container section.

Output folder

The script will make directories within the provided output folder to sort and store the output files.

The script will call other Python scripts to create histograms of read lengths before and after fastp filtering, and of the lengths of the transcripts in the transcriptome.
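
As a rough illustration of the read-length histograms, the sketch below counts how many reads of each length a FASTQ file contains. It is not the plotting script shipped in the container: the function name `read_length_counts` is hypothetical, and the code assumes standard four-line FASTQ records (no sequences wrapped over several lines).

```python
import gzip
from collections import Counter


def read_length_counts(fastq_path):
    """Count how many reads of each length a FASTQ file contains."""
    # Accept both plain and gzip-compressed files.
    opener = gzip.open if str(fastq_path).endswith(".gz") else open
    counts = Counter()
    with opener(fastq_path, "rt") as handle:
        for line_number, line in enumerate(handle):
            # Assumption: each record spans exactly four lines,
            # and the sequence is the second line of the record.
            if line_number % 4 == 1:
                counts[len(line.strip())] += 1
    return counts
```

Running this on a file before and after fastp filtering gives two sets of counts that can be plotted side by side with any plotting library.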

The script will also use the transcriptome file to generate a conversion table that maps transcripts to genes, so that the final expression levels can be reported per gene.
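
One way such a conversion table could be built is sketched below. It assumes Ensembl-style cDNA FASTA headers, in which the transcript ID is the first word after ">" and the gene ID appears in a field of the form gene:<ID>. That header layout and the function name `transcript_to_gene` are assumptions for illustration, and the pipeline's own script may work differently.

```python
import gzip


def transcript_to_gene(fasta_path):
    """Map transcript IDs to gene IDs using the FASTA header lines."""
    # Accept both plain and gzip-compressed files.
    opener = gzip.open if str(fasta_path).endswith(".gz") else open
    mapping = {}
    with opener(fasta_path, "rt") as handle:
        for line in handle:
            if not line.startswith(">"):
                continue  # skip sequence lines
            fields = line[1:].split()
            if not fields:
                continue
            transcript_id = fields[0]
            # Assumed layout: one header field looks like "gene:<ID>".
            for field in fields[1:]:
                if field.startswith("gene:"):
                    mapping[transcript_id] = field[len("gene:"):]
                    break
    return mapping
```

Transcripts whose header has no gene field are left out of the mapping, so a caller can detect them by comparing against the full list of transcript IDs.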

R script

The R script is called internally at the end of the Python script to run DESeq.
The script runs DESeq and calculates gene rankings based on the -log(pvalue).
The output of this program is the differential expression results for the genes, as well as a subset containing the top 100 differentially expressed genes.
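
The ranking step can be sketched as follows. The pipeline does this in R; the Python version below only illustrates the idea. The function name `top_genes`, the use of a base-10 logarithm, and the handling of missing or zero p-values are all assumptions, since the text above does not specify them.

```python
import math


def top_genes(pvalues, n=100):
    """Rank genes by -log10(p-value), largest first, and keep the top n."""
    scored = []
    for gene, p in pvalues.items():
        if p is None or math.isnan(p):
            continue  # assumption: genes with a missing p-value are skipped
        # Guard against p == 0, for which the logarithm is undefined.
        score = math.inf if p == 0 else -math.log10(p)
        scored.append((gene, score))
    # Smaller p-values give larger scores, so sort in descending order.
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:n]
```

Because the logarithm is monotonic, the choice of base changes the scores but not the order of the genes.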

Singularity container

These scripts and the programs have been combined into a Singularity container.
The container holds versions of the programs that have been tested to work together, and it sets the paths that the scripts call.
The container also allows the programs to run on any OS that can run Singularity, even if that OS is not otherwise compatible with these versions of the programs.

Creating input read file

A specific format for the input file is required. The example below reads in the test files. Its lines are:

  • Output line (line 1): tells the program where to create the output folder and all the output files. Here the output folder is created in the current folder (.).
  • Input lines (lines 2-4): hold the R1 and R2 files for the sample, followed by the condition. The fields can be separated by spaces or tabs.
  • Kallisto reference line (line 5): the transcriptome that kallisto will use to build its alignment index.
  • Kallisto reference line with an index file (line 6): the same line can instead provide an index file (.idx file) that was previously created from a chosen transcriptome. In the example this line is commented out and will not be read by the code.
  • Conversion file line (line 7): required if and only if the Kallisto reference line was given an index file (.idx). It holds the transcript-to-gene conversions. In the example this line is commented out and will not be read by the code.

output: .
input: /Test/test_files_in/SRR065505_100thou_R1.fastq /Test/test_files_in/SRR065505_100thou_R2.fastq Case
Input: /Test/test_files_in/SRR065506_100thou_R1.fastq /Test/test_files_in/SRR065506_100thou_R2.fastq Case
Input: /Test/test_files_in/SRR065529_100thou_R1.fastq /Test/test_files_in/SRR065529_100thou_R2.fastq Control
kallisto reference: /Test/Homo_sapiens.GRCh37.75.cdna.all.fa.gz
#Kallisto reference: /Test/Homo_sapiens.GRCh37.75.cdna.all.fa_kal_index.idx
#Conversion file: /Test/Homo_sapiens.GRCh37.75.cdna.all.fa_transcript_gene_conversion_table.txt
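
To make the format concrete, here is a minimal sketch of how a file like the one above could be parsed. It is not the parser used by the pipeline's script: the function name `parse_pipeline_input` and the dictionary keys are hypothetical, matching the line labels case-insensitively is an assumption, and paths containing spaces are not handled.

```python
def parse_pipeline_input(text):
    """Parse the labelled lines of a pipeline input file into a dictionary."""
    config = {"output": None, "samples": [], "reference": None, "conversion": None}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # blank or commented-out line
        # Split on the first colon only; the rest of the line is the value.
        key, _, value = line.partition(":")
        key = key.strip().lower()  # assumption: labels are case-insensitive
        value = value.strip()
        if key == "output":
            config["output"] = value
        elif key == "input":
            # Fields may be separated by spaces or tabs.
            r1, r2, condition = value.split()
            config["samples"].append((r1, r2, condition))
        elif key == "kallisto reference":
            config["reference"] = value
        elif key == "conversion file":
            config["conversion"] = value
    return config
```

A caller could then check that a conversion file is present whenever the reference ends in .idx, which is the rule described above.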

Test files

Test files are provided as a tutorial. These files are taken from GEO data generated by the ENCODE project.
The fastq files and the input file are inside the container under the /Test/ folder.

Running the container

Shelling into container

singularity shell DeKal_V0.06.simg
This opens a shell inside the container, allowing interaction between the current directory and the inside of the container.

python /PyRcodes/automated_rna_insing.py /Test/pipeline_input_file.txt
This runs the script, which reads in the test input file located in the /Test directory inside the container.

exit
This takes the user out of the container.

Executing a command from container

singularity exec DeKal_V0.06.simg python /PyRcodes/automated_rna_insing.py /Test/pipeline_input_file.txt
This command runs the script on the test files without the user having to enter and exit the Singularity shell.