This workflow is designed for automated processing of large numbers of samples. Each step below is executed in batch mode.
The following scripts will be executed from the project folder path `.../Diversity_outbred_bulk_RNA-Seq/`.
This step is to be performed twice, once on the raw fastq files (before trimming) and once after trimming.
- Run FASTQC on a local computing environment

  ```bash
  #!/bin/bash
  # make sure FASTQC is installed and available in the $PATH
  bash src/sh/run_fastqc.sh --help
  ```

  ```
  Usage: src/sh/run_fastqc.sh -i target_dir -o out_dir -m run_mode
    -i  Path to the directory with fastq files
    -o  Path to the directory where the outputs will be written
    -m  Value should be one of Trimmed/Untrimmed
  ```
- Run FASTQC as a slurm job (UVA internal on Rivanna)

  ```bash
  #!/bin/bash
  sbatch src/slurm/run_fastqc.slurm
  ```

  If needed, the user should change the values of `target_dir`, `out_dir`, and `run_mode` in the slurm script.
- Run MultiQC on a local computing environment

  ```bash
  #!/bin/bash
  # make sure multiqc is installed and available in the $PATH
  bash src/sh/run_multiqc.sh --help
  ```

  ```
  Usage: src/sh/run_multiqc.sh -i target_dir -o out_dir
    -i  Path to the directory with FASTQC outputs
    -o  Path to the directory where the multiQC report will be written
  ```
- Run MultiQC as a slurm job (UVA internal on Rivanna)

  ```bash
  #!/bin/bash
  sbatch src/slurm/run_multiqc.slurm
  ```

  If needed, the user should change the values of `target_dir` and `out_dir` in the slurm script.
Make sure to identify the sequencing platform and the library preparation protocol; this information helps identify the appropriate adapters and contaminants that must be removed. Generally, the sequencing protocol uses the TruSeq Stranded mRNA Kit, which is supported across several Illumina platforms. Standard TruSeq adapters are provided as `Illumina_TruSeq_adapters.fasta`. For more details on Illumina adapters, visit the official documentation.
- Run Trimmomatic in a local computing environment

  ```bash
  #!/bin/bash
  # make sure Trimmomatic is installed and available in the $PATH
  bash src/sh/trim_fastq.sh --help
  ```

  ```
  Usage: src/sh/trim_fastq.sh -i target_dir -o out_dir -a adapter_file -w window -m min_len
    -i  Path to the directory with fastq files
    -o  Path to the directory where the outputs will be written
    -a  Path to the adapter fasta file
    -w  Sliding window
    -m  Minimum read length
    -t  Trimmomatic alias
  ```

  For default installations of Trimmomatic v0.39, the Trimmomatic alias should be `trimmomatic-0.39.jar`, or it can be the path to `trimmomatic-<version>.jar`. For more details on the parameters, see the Trimmomatic manual.
- Run Trimmomatic as a slurm job (UVA internal on Rivanna)

  ```bash
  #!/bin/bash
  sbatch src/slurm/trim_fastq.slurm
  ```

  If needed, the user should change the values of `target_dir`, `out_dir`, `adapter_file`, `window`, and `min_len` in the slurm script.
Trimmomatic provides several valuable statistics, including the number of input reads, the reads remaining after trimming, and the dropped reads. However, these statistics are only part of the console output. The following script extracts them from the console output saved as text or log files.
- Run as an R script

  ```bash
  #!/bin/bash
  # make sure R is installed and Rscript is available in the $PATH
  Rscript src/R/extract_trimstat.R --help
  ```

  ```
  Options:
    -e ERROR_LOG, --error_log=ERROR_LOG
        Path to the Trimmomatic error log file
    -c CONSOLE_LOG, --console_log=CONSOLE_LOG
        Path to the Trimmomatic console log file
    -o OUT_DIR, --out_dir=OUT_DIR
        A folder where the output will be written
    -h, --help
        Show this help message and exit
  ```
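For reference, the paired-end summary line that Trimmomatic prints to the console looks like the line below (the counts here are fabricated, and the exact format depends on the Trimmomatic version and mode). As a minimal sketch of what the R script automates, the surviving-pair count can be pulled out with `sed`:

```shell
# Example Trimmomatic paired-end console summary (numbers are made up)
line='Input Read Pairs: 1000 Both Surviving: 950 (95.00%) Forward Only Surviving: 30 (3.00%) Reverse Only Surviving: 10 (1.00%) Dropped: 10 (1.00%)'
# Extract the count of read pairs that survived both trimming steps
echo "$line" | sed -n 's/.*Both Surviving: \([0-9]*\).*/\1/p'   # -> 950
```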
- Run as a slurm job (UVA internal on Rivanna)

  ```bash
  #!/bin/bash
  sbatch src/slurm/extract_trimstat.slurm
  ```

  Default values for the arguments are set in the R script. The user may set these variables in the slurm script as needed.
We use Hisat2 for the alignment of the raw reads. The first step in the alignment is to prepare the necessary files, commonly known as a reference genome build or index. This can be performed by the following script. This step requires a mouse reference genome and a list of known SNPs.
```bash
#!/bin/bash
#------- reference genome
# Download mouse reference genome (GRCm38)
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M20/GRCm38.p6.genome.fa.gz
# unzip reference genome
gunzip -d GRCm38.p6.genome.fa.gz
#------ list of SNPs
# Download list of SNPs from UCSC (GRCm38 mapped positions)
wget http://hgdownload.cse.ucsc.edu/goldenPath/mm10/database/snp142.txt.gz
# unzip the list of SNPs
gunzip -d snp142.txt.gz
```
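After decompressing, a quick sanity check is to count the sequence records in the FASTA; a toy two-record file stands in for the genome here:

```shell
# Toy FASTA standing in for the downloaded reference (two records)
printf '>chr1\nACGT\n>chr2\nGGCC\n' > /tmp/demo.fa
# Each sequence starts with a '>' header line, so counting headers counts sequences
grep -c '^>' /tmp/demo.fa   # -> 2
```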
- Prepare genome build in a local computing environment

  ```bash
  #!/bin/bash
  bash src/sh/prepare_genome_build.sh --help
  ```

  ```
  Usage: src/sh/prepare_genome_build.sh -i genome_fasta -s SNP_file -o out_dir -p hisat2_extract_snps_haplotypes_UCSC.py
    -i  Path to the reference genome (to be used for the alignment)
    -s  Path to the SNP file (to be used for the alignment)
    -p  Path to the hisat2_extract_snps_haplotypes_UCSC.py script
    -o  Path to the directory where the outputs will be written
  ```

  Note: `hisat2_extract_snps_haplotypes_UCSC.py` is originally distributed with Hisat2 and has been made available in this repository at `src/Py/hisat2_extract_snps_haplotypes_UCSC.py` under the GNU General Public License. The original script can be found in the Hisat2 repository.
- Prepare genome build through a slurm job (UVA internal on Rivanna)

  ```bash
  #!/bin/bash
  sbatch src/slurm/prepare_genome_build.slurm
  ```

  If needed, the user should change the values of `genome_fasta`, `snp_file`, `python_script`, and `out_dir` in the slurm script.
Alternatively, prebuilt genome indexes can be downloaded from the Hisat2 downloads page. For this purpose, we can use the SNP-aware or the SNP- and transcript-aware genome indexes of the GRCm38 (mm10) reference genome.
```bash
#!/bin/bash
# genome index with SNPs
wget https://cloud.biohpc.swmed.edu/index.php/s/grcm38_snp/download -O Hisat2_prebuilt_GRCM38_snp.tar.gz
tar -xf Hisat2_prebuilt_GRCM38_snp.tar.gz
# genome index with transcripts and SNPs
wget https://cloud.biohpc.swmed.edu/index.php/s/grcm38_snp_tran/download -O Hisat2_prebuilt_GRCM38_transcripts_snp.tar.gz
tar -xf Hisat2_prebuilt_GRCM38_transcripts_snp.tar.gz
# delete the .tar.gz files, keeping only the extracted folders
rm *.tar.gz  # make sure to run this where only the downloaded .tar.gz files are present
```
The above commands download and decompress the tarballs from the Hisat2 download page. The path to the extracted genome index folder must then be provided for the alignment.
- Perform sequence alignment in a local computing environment

  ```bash
  #!/bin/bash
  bash src/sh/align_reads.sh --help
  ```

  ```
  Usage: src/sh/align_reads.sh -i target_dir -x genome_index_path -n genome_index_name -o out_dir
    -i  Path to the target directory where trimmed fastq files are stored
    -x  Path to the Hisat2 genome index
    -n  Name of the Hisat2 genome index
    -o  Path to the directory where the outputs will be written
  ```

  Note: `-x genome_index_path` is equivalent to setting the `HISAT2_INDEXES` environment variable, whereas `-n genome_index_name` should specify the base name of the index files. The base name is the name of any of the index files up to but not including the final `.1.ht2`, etc. `-i target_dir` should be set to the output directory generated by the `src/sh/trim_fastq.sh` script.
- Perform sequence alignment through a slurm job (UVA internal on Rivanna)

  ```bash
  #!/bin/bash
  sbatch src/slurm/align_reads.slurm
  ```

  The user may modify the `target_dir`, `genome_index_path`, `genome_index_name`, and `out_dir` in the slurm script.
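The index base name described in the note above can be derived mechanically: it is whatever precedes `.1.ht2` in any index file name. A small sketch with dummy files (the file names are hypothetical):

```shell
# Create dummy Hisat2 index files (hypothetical names, for illustration only)
mkdir -p /tmp/ht2_demo
touch /tmp/ht2_demo/grcm38_snp.1.ht2 /tmp/ht2_demo/grcm38_snp.2.ht2
# Derive the base name by stripping the ".1.ht2" suffix from the first index file
first=$(ls /tmp/ht2_demo/*.1.ht2 | head -n 1)
base=$(basename "$first" .1.ht2)
echo "$base"   # -> grcm38_snp
```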
- Generate BAM files in a local environment

  ```bash
  #!/bin/bash
  bash src/sh/unsorted_sam_to_sorted_bam.sh --help
  ```

  ```
  Usage: src/sh/unsorted_sam_to_sorted_bam.sh -i target_dir -o out_dir
    -i  Path to the target directory where unsorted sam files are present
    -o  Path to the directory where the outputs will be written
  ```
- Generate BAM files through a slurm job (UVA internal on Rivanna)

  ```bash
  #!/bin/bash
  sbatch src/slurm/unsorted_sam_to_sorted_bam.slurm
  ```

  The user may modify the `target_dir` and `out_dir` in the slurm script as needed.
- Get alignment statistics in a local environment

  ```bash
  #!/bin/bash
  bash src/sh/get_alignment_stat.sh --help
  ```

  ```
  Usage: src/sh/get_alignment_stat.sh -i target_dir -o out_dir
    -i  Path to the target directory where unsorted sam files are present
    -o  Path to the directory where the outputs will be written
  ```
- Get alignment statistics through a slurm job (UVA internal on Rivanna)

  ```bash
  #!/bin/bash
  sbatch src/slurm/get_alignment_stat.slurm
  ```

  The user may modify the `input_dir` and `output_dir` in the slurm script as needed.
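If the per-sample statistics are collected with `samtools flagstat` (an assumption about what `get_alignment_stat.sh` runs), the overall alignment rate can be grepped out of the saved logs. The line below mimics one "mapped" line of flagstat output:

```shell
# A flagstat-style "mapped" line (illustrative values, not real output)
stat_line='38220394 + 0 mapped (95.31% : N/A)'
# Extract the mapped percentage from inside the parentheses
echo "$stat_line" | sed -n 's/.*(\([0-9.]*\)% .*/\1/p'   # -> 95.31
```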
- Summarize alignment statistics with an R script

  ```bash
  #!/bin/bash
  Rscript src/R/summarize_alignment_stats.R --help
  ```

  ```
  Options:
    -i INPUT, --input=INPUT
        Path to the folder where the outputs from 'get_alignment_stat.sh' are stored
    -o OUT_DIR, --out_dir=OUT_DIR
        A folder where the output will be written
    -h, --help
        Show this help message and exit
  ```
- Summarize alignment statistics with a slurm job (UVA internal on Rivanna)

  ```bash
  #!/bin/bash
  sbatch src/slurm/summarize_alignment_stats.slurm
  ```

  The user may modify the `target_dir` and `out_dir` in the slurm script as needed.
- Create sample-level transcript assembly in a local environment

  ```bash
  #!/bin/bash
  bash src/sh/compute_transcript_assembly.sh --help
  ```

  ```
  Usage: src/sh/compute_transcript_assembly.sh -i target_dir -r ref_annotation -o out_dir
    -i  Path to the target directory where the sorted BAM files are present
    -r  Path to the reference genome annotation (GTF)
    -o  Path to the directory where the outputs will be written
  ```
- Create sample-level transcript assembly with a slurm job (UVA internal on Rivanna)

  ```bash
  #!/bin/bash
  sbatch src/slurm/compute_transcript_assembly.slurm
  ```

  The user may modify the `target_dir`, `annotation`, and `out_dir` in the slurm script as needed.
Note:

- This step needs a reference annotation file, which can be downloaded from the UCSC goldenPath/mm10 and other online sources.
- The user must ensure that the chromosome names in the BAM files and the supplied GTF are consistent (i.e. either `chr1`..`chrM` or `1`..`M`).
- When using the prebuilt GRCm38 genome index provided by Hisat2 (for alignment), the user must update the GTF file to get consistent chromosome names (i.e. `chr1` --> `1`):

  ```bash
  #!/bin/bash
  sed 's/\bchr\([0-9XYM]*\)/\1/' mm10.ncbiRefSeq.gtf > mm10.ncbi_RefSeq_clean.gtf
  ```
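A quick way to sanity-check the substitution is to run it on a single fabricated GTF-style record:

```shell
# One made-up GTF record with a UCSC-style chromosome name
printf 'chr1\tncbiRefSeq\texon\t100\t200\t.\t+\t.\tgene_id "Xkr4";\n' \
  | sed 's/\bchr\([0-9XYM]*\)/\1/'   # first field becomes "1"
```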
- Merge sample assemblies in a local environment

  ```bash
  #!/bin/bash
  bash src/sh/merge_transcript_assemblies.sh --help
  ```

  ```
  Usage: src/sh/merge_transcript_assemblies.sh -i target_dir -r ref_annotation -o out_dir
    -i  Path to the target directory where sample-level GTF files are present
    -r  Path to the reference genome annotation (GTF)
    -o  Path to the directory where the outputs will be written
  ```
- Merge sample assemblies with a slurm job (UVA internal on Rivanna)

  ```bash
  #!/bin/bash
  sbatch src/slurm/merge_transcript_assemblies.slurm
  ```

  The user may modify the `target_dir`, `annotation`, and `out_dir` in the slurm script.
Generate the individual sample-level assemblies using the script from section 3.1; the `ref_annotation` in this case is the merged GTF generated in section 3.2. An additional slurm script, `compute_transcript_assembly_consensus.slurm`, has been provided. As in section 3.1, this script uses `src/sh/compute_transcript_assembly.sh`.
- Merge and filter gene level abundances in a local environment

  ```bash
  #!/bin/bash
  bash src/sh/merge_and_filter_gene_abundances.sh --help
  ```

  ```
  Usage: src/sh/merge_and_filter_gene_abundances.sh -i target_dir -o out_dir
    -i  Path to the target directory where the *gene_abundance.tab files are present (results from the previous step)
    -o  Path to the directory where the outputs will be written
  ```
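The name of the filtered output used downstream (`gene_count_filt_0.1TPM_twenty_percent`) suggests the rule "keep genes with at least 0.1 TPM in at least 20% of samples". A toy `awk` sketch of that rule on a fabricated table; the real script's input layout may differ:

```shell
# Fabricated TPM table: gene id, then one TPM column per sample
cat > /tmp/tpm_demo.csv <<'EOF'
gene_id,s1,s2,s3,s4,s5
geneA,0.50,0.20,0.00,0.30,0.10
geneB,0.00,0.05,0.00,0.00,0.00
EOF
# Keep genes with TPM >= 0.1 in at least 20% of the sample columns
awk -F',' 'NR==1 { n = NF - 1; print; next }
           { c = 0
             for (i = 2; i <= NF; i++) if ($i + 0 >= 0.1) c++
             if (c / n >= 0.2) print }' /tmp/tpm_demo.csv
# geneA passes (4/5 samples >= 0.1 TPM); geneB is dropped (0/5)
```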
- Merge and filter gene level abundances with a slurm job (UVA internal on Rivanna)

  Because the above script is not resource intensive, a dedicated slurm script is not provided. Users are encouraged to run it in interactive mode (with the `ijob` command). For more details, see the notes for Rivanna users.

  ```bash
  #!/bin/bash
  # example
  ijob -A <account> -p <partition> -N 1 bash
  bash src/sh/merge_and_filter_gene_abundances.sh -i <target_dir> -o <out_dir>
  ```
- Prepare count matrices in a local environment

  StringTie provides a python script (prepDE.py) that can convert the GTF files into count matrices. Using a custom script, a list of the GTF files is prepared and fed into prepDE.py.

  ```bash
  #!/bin/bash
  bash src/sh/generate_count_matrices.sh --help
  ```

  ```
  Usage: src/sh/generate_count_matrices.sh -i target_dir -o out_dir
    -i  Path to the target directory where the *_assembly.gtf files are present (results from section 3.3)
    -o  Path to the directory where the outputs will be written
  ```
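prepDE.py reads a plain-text list with one `sample_id gtf_path` pair per line. Whether `generate_count_matrices.sh` builds the list exactly this way is an assumption, but the idea can be sketched as:

```shell
# Fabricate a couple of assembly GTFs (hypothetical names, empty files)
mkdir -p /tmp/gtf_demo
touch /tmp/gtf_demo/sampleA_assembly.gtf /tmp/gtf_demo/sampleB_assembly.gtf
# One "sample_id <space> gtf_path" pair per line, as prepDE.py expects
for f in /tmp/gtf_demo/*_assembly.gtf; do
  printf '%s %s\n' "$(basename "$f" _assembly.gtf)" "$f"
done > /tmp/gtf_demo/gtf_list.txt
cat /tmp/gtf_demo/gtf_list.txt
```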
- Prepare count matrices with a slurm job (UVA internal on Rivanna)

  ```bash
  #!/bin/bash
  sbatch src/slurm/generate_count_matrices.slurm
  ```

  The user may modify the `target_dir` and `out_dir` in the slurm script.
- Filter gene level count matrix with an R script

  ```bash
  #!/bin/bash
  Rscript src/R/filter_gene_count_matrix.R --help
  ```

  ```
  Options:
    -i INPUT, --input=INPUT
        Path to the gene_count_matrix.csv file
    -g GENE_LIST, --gene_list=GENE_LIST
        Path to the gene_count_filt_0.1TPM_twenty_percent file
    -o OUT_DIR, --out_dir=OUT_DIR
        A folder where the output will be written
    -h, --help
        Show this help message and exit
  ```
- Filter gene level count matrix with a slurm job (UVA internal on Rivanna)

  Because the above script is not resource intensive, a dedicated slurm script is not provided. Users are encouraged to run it in interactive mode (with the `ijob` command).
- Perform normalization with an R script

  ```bash
  #!/bin/bash
  Rscript src/R/normalize_gene_count_matrix.R --help
  ```

  ```
  Options:
    -i INPUT, --input=INPUT
        Path to the gene_count_matrix.csv file
    -o OUT_DIR, --out_dir=OUT_DIR
        A folder where the output will be written
    -h, --help
        Show this help message and exit
  ```
- Perform normalization with a slurm job (UVA internal on Rivanna)

  Because the above script is not resource intensive, a dedicated slurm script is not provided. Users are encouraged to run it in interactive mode (with the `ijob` command).
- Compute peer factors with a python script

  ```bash
  #!/bin/bash
  python2 src/Py/compute_peer_factors.py --help
  ```

  ```
  Options:
    -h, --help            show this help message and exit
    -i FILE, --input=FILE
                          path to the input file
    -o FOLDER, --output=FOLDER
                          path to the output folder
  ```

  Note: A successful PEER installation provides a `python2.7` instance where the original C++ API can be accessed through the `import peer` command. This script relies on the PEER-associated python interpreter. The input file is `gene_abundance_vst_qnorm_nohead.csv`, generated in the previous step.
- Compute peer factors with a slurm job (UVA internal on Rivanna)

  ```bash
  #!/bin/bash
  sbatch src/slurm/compute_peer_factors.slurm
  ```

  The user may modify the `input` and `out_path` in the slurm script.