School is a collection of genomics analysis workflows that are used for detecting single nucleotide variants (SNVs), insertions/deletions (indels), copy number variants (CNVs) and translocations from RNA and DNA sequencing. These workflows have been validated in a CLIA laboratory at UTSW
This pipeline uses Nextflow, a bioinformatics workflow tool and Singularity, a containerization tool.
Make sure both tools rae installed before running this pipeline. If running on a HPC cluster then load required modules.
module load nextflow/20.01.0 singularity/3.5.3
For most recent published version
git clone -b version_1.0.3 --single-branch https://github.com/bcantarel/school.git
For most recent development version
git clone https://github.com/bcantarel/school.git
The SCHOOL workflow can run many different configurations. The input files must be placed into a fastqs directory along with a design file.
cd school/
mkdir -p ./fastq
This workflow can run either DNA or RNA sequencing. Please determine the desired configuration to achieve the proper analysis run.
The design file must named design.txt and be in tab seperated format for the workflows. This workflow can be run with tumor-only or with tumor and normal pairs. If running tumor-only then do not include the NormalID collumn in the design file.
SampleID | CaseID | TumorID | NormalID | FqR1 | FqR2 |
---|---|---|---|---|---|
Sample1 | Fam1 | Sample1 | Sample2 | Sample1.R1.fastq.gz | Sample1.R2.fastq.gz |
Sample2 | Fam1 | Sample1 | Sample2 | Sample2.R1.fastq.gz | Sample2.R2.fastq.gz |
Sample3 | Fam2 | Sample3 | Sample4 | Sample3.R1.fastq.gz | Sample3.R2.fastq.gz |
Sample4 | Fam2 | Sample3 | Sample4 | Sample4.R1.fastq.gz | Sample4.R2.fastq.gz |
- --input
- directory containing the design file and fastq files
- default is set to '${basedir}/fastq'
- eg: --input '/project/shared/bicf_workflow_ref/workflow_testdata/germline_variants/fastq'
- --output
- directory for the analysis output
- default is set to '${basedir}/analysis'
- eg: --output '${basedir}/output'
- --seqrunid
- Illumina Run ID, used to distrinquished repetative runs
- default set to 'runtest'
- eg: --seqrunid 'run'
- --genome
- directory containing all reference files for the various tools. This includes the genome.fa, dbsnp.vcf.gz, GoldIndels.vcf.gz, ncm.conf, ect.
- default is set for use on UTSW BioHPC.
- eg: --genome '/project/shared/bicf_workflow_ref/human/grch38_cloud/dnaref'
- --virus_genome
- directory containing viral genome reference files
- default is set for use on UTSW BioHPC
- eg: --viral_genome '/project/shared/bicf_workflow_ref/human_virus_genome/clinlab_idt_genomes'
- --pon
- pon file for mutect capture kit bed file
- default is set to not included. If a pon file is included, then mutect will run with reference to the inputted file
- eg: --pon '/project/shared/bicf_workflow_ref/human/grch38_cloud/panels/UTSW_V4_heme/mutect2.pon.vcf.gz'
- --capture
- bed file containing information on gene capture kit
- eg: --capture '/project/shared/bicf_workflow_ref/human/grch38_cloud/panels/UTSW_V4_heme/targetpanel.bed'
- --capturedir
- directory containing capture bed and supporting files
- eg: --capturedir '/project/shared/bicf_workflow_ref/human/grch38_cloud/panels/UTSW_V4_heme'
- --platypus
- select whether to 'skip' platypus algorithm or 'detect' using platypus
- default is set to 'skip'
- eg: --platypus 'detect'
- --markdups
- select the method for marking duplicates from 'picard', 'samtools', 'fgbio_umi', 'picard_umi', or 'none'
- default is set to 'fgbio_umi'
- eg: --markdups 'picard_umi'
- --version
- version of workflow analysis pipeline
- default is set to 'v4'
- For current git version run 'gittag=$(git describe --abbrev=0 --tags)'
- eg: --version $gittag
- --snpeff_vers
- version of reference genome for snpeff tool
- default is set to 'GRCh38.86'
- eg: --snpeff_vers 'GRCh38.86'
nextflow run -w $workdir ${baseDir}/dna.nf --input /project/shared/bicf_workflow_ref/workflow_testdata/germline_variants/fastq --output ${basedir}/output --seqrunid 'SHI1333-27' --pon /project/shared/bicf_workflow_ref/human/grch38_cloud/panels/UTSW_V4_heme/mutect2.pon.vcf.gz --capture /project/shared/bicf_workflow_ref/human/grch38_cloud/panels/UTSW_V4_heme/targetpanel.bed --capturedir /project/shared/bicf_workflow_ref/human/grch38_cloud/panels/UTSW_V4_heme --version $gittag --genome /project/shared/bicf_workflow_ref/human/grch38_cloud/dnaref -resume
The RNA workflow can be run in with the whole genome, or with a specific list of genes of interest.
The design file must named design.txt and be in tab seperated format for the workflows. All RNA workflows can be run usin the same design file format.
SampleID | CaseID | FqR1 | FqR2 |
---|---|---|---|
Sample1 | Fam1 | Sample1.R1.fastq.gz | Sample1.R2.fastq.gz |
Sample2 | Fam1 | Sample2.R1.fastq.gz | Sample2.R2.fastq.gz |
Sample3 | Fam2 | Sample3.R1.fastq.gz | Sample3.R2.fastq.gz |
Sample4 | Fam2 | Sample4.R1.fastq.gz | Sample4.R2.fastq.gz |
- --input
- directory containing the design file and fastq files
- default is set to '${basedir}/fastq'
- eg: --input '/project/shared/bicf_workflow_ref/workflow_testdata/rnaseq/fastq'
- --output
- directory for the analysis output
- default is set to '${basedir}/analysis'
- eg: --output '${basedir}/output'
- --seqrunid
- Illumina Run ID, used to distrinquished repetative runs
- default set to 'runtest'
- eg: --seqrunid 'run'
- --genome
- directory containing all reference files for the various tools. This includes the genome.fa, gencode.gtf, genenames.txt, ect.
- default is set for use on UTSW BioHPC.
- eg: --genome '/project/shared/bicf_workflow_ref/human/grch38_cloud/rnaref'
- --geneinfo
- path to textfile with the geneinfo for the human genome with reference to the genome being used
- default is set for use on UTSW BioHPC
- eg: --geneinfo '/project/shared/bicf_workflow_ref/human/gene_info.human.txt'
- --version
- version of workflow analysis pipeline
- default is set to 'v5'
- For current git version run 'gittag=$(git describe --abbrev=0 --tags)'
- eg: --version $gittag
- --snpeff_vers
- version of reference genome for snpeff tool
- default is set to 'GRCh38.86'
- eg: --snpeff_vers 'GRCh38.86'
- --stranded
- option for -s flag in featurecount used in geneabundance calculations
- default is set to '0'
- eg: --stranded '0'
- --pairs
- select either 'pe' (paired-end) or 'se' (single-end) based on read inputs. Select 'pe' when both R1 and R2 are present. If only R1, then select 'se'.
- default is set to 'pe'
- eg: --pairs 'pe'
- --align
- select the algorithm/tool for alignment from 'hisat' or 'star'
- default is set to 'hisat'
- eg: --align 'hisat'
- --bamct
- choose to either 'detect' or 'skip' bam read count process
- default is set to 'detect'
- eg: --bamct 'detect'
- --glist
- this is an optional parameter. If a genelist is included, then the whole rnaseq is not run but instead is restricted to the list containing genes of interest.
- default is set to run whole rnaseq
- eg: --glist '/project/shared/bicf_workflow_ref/human/grch38_cloud/panels/UTSW_V4_heme/genelist.txt'
- --umi
- This is an optional paramter. If included, then the -u option is added then UMI sequences are in the fastq read names.
- default is set to not include umi
- eg: --umi 'true'
Whole rnaseq example
nextflow run -w $workdir ${baseDir}/rna.nf --input /project/shared/bicf_workflow_ref/workflow_testdata/rnaseq/fastq --output ${basedir}/output --seqrunid 'test' --version $gittag --genome /project/shared/bicf_workflow_ref/human/grch38_cloud/rnaref --geneinfo /project/shared/bicf_workflow_ref/human/gene_info.human.txt -resume
GOI rnaseq example
nextflow run -w $workdir ${baseDir}/rna.nf --input /project/shared/bicf_workflow_ref/workflow_testdata/rnaseq/fastq --output ${basedir}/output --seqrunid 'test' --version $gittag --genome /project/shared/bicf_workflow_ref/human/grch38_cloud/rnaref --geneinfo /project/shared/bicf_workflow_ref/human/gene_info.human.txt --glist /project/shared/bicf_workflow_ref/human/grch38_cloud/panels/UTSW_V4_heme/genelist.txt -resume