All steps are performed on Linux.
This repository provides a step-by-step guide for performing quality control, trimming, alignment, variant calling, and annotation of CRISPR data. The pipeline includes downloading raw sequencing data, processing it, and identifying genetic variants.
Ensure that the following tools are installed on your system:
- wget: Used to download raw sequencing data and reference genomes.
- gunzip: For decompressing .gz files.
- FastQC: For quality control of raw reads.
- fastp: For trimming sequencing reads.
- BWA: For aligning the sequencing reads to a reference genome.
- samtools: For sorting, removing duplicates, and converting file formats.
- GATK: For variant calling.
- picard-tools: Required for file format conversion and preparing BAM files for GATK.
- SnpEff: For variant annotation.
- VEP: Used for annotating variants based on the genome.
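Before starting, it can help to confirm that the required programs are on the PATH. The sketch below is one possible check, not part of the original pipeline. It only covers the executables that the commands in this guide call directly; GATK, SnpSift, and SnpEff are run from jar files here, so it checks for `java` instead. The function name `missing_tools` is hypothetical.

```shell
#!/bin/sh
# Print each command name that cannot be found on the PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Executable names are assumptions; adjust them to match your installation.
missing=$(missing_tools wget gunzip fastqc fastp bwa samtools picard-tools java)
if [ -n "$missing" ]; then
  printf 'Not found on PATH:\n%s\n' "$missing"
else
  echo "All tools found."
fi
```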
Use wget to download the raw FASTQ files from the SRA repository:
wget -c ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR217/020/SRR21763320/SRR21763320_1.fastq.gz
wget -c ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR217/020/SRR21763320/SRR21763320_2.fastq.gz
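To fetch a different run, the URL can be built from the accession. The sketch below assumes the directory layout seen in the two URLs above: the first six characters of the accession, then a zero-padded directory made from the digits after the ninth character. Only the 11-character case is confirmed by the URLs in this guide; the other lengths are an assumption about the same layout. The function name `ena_fastq_url` is hypothetical.

```shell
#!/bin/sh
# Build the FTP URL for one mate (1 or 2) of a paired-end run.
# Assumed layout: vol1/fastq/<first 6 chars>/<padded digits after char 9>/<accession>/
ena_fastq_url() {
  acc=$1
  mate=$2
  prefix=$(printf '%s' "$acc" | cut -c1-6)
  tail=$(printf '%s' "$acc" | cut -c10-)
  # Pad the trailing digits to three characters as plain text, so that
  # values such as "08" are not misread as octal numbers.
  case ${#tail} in
    0) sub="" ;;
    1) sub="00$tail/" ;;
    2) sub="0$tail/" ;;
    *) sub="$tail/" ;;
  esac
  printf 'ftp://ftp.sra.ebi.ac.uk/vol1/fastq/%s/%s%s/%s_%s.fastq.gz\n' \
    "$prefix" "$sub" "$acc" "$acc" "$mate"
}

# Print the URLs for both mates of the run used in this guide.
for mate in 1 2; do
  ena_fastq_url SRR21763320 "$mate"
done
```

Each printed URL can then be passed to `wget -c`.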
Decompress the FASTQ files:
gunzip *.gz
Run quality control using FastQC:
fastqc *.fastq
Check the following parameters:
- Per base sequence quality
- Overrepresented sequences
- Adapter content
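The parameters above can also be checked from the command line instead of opening each HTML report. FastQC writes a `summary.txt` inside every `*_fastqc.zip` archive, with one tab-separated line per module (status, module name, file name). The sketch below prints the modules that were not marked PASS; the function name `flag_modules` is hypothetical.

```shell
#!/bin/sh
# Print the status and name of every module that FastQC did not mark as PASS.
# Input: a FastQC summary.txt, or "-" to read it from standard input.
flag_modules() {
  awk -F '\t' '$1 != "PASS" { printf "%s\t%s\n", $1, $2 }' "$1"
}

# Read summary.txt straight out of each FastQC archive in the current directory.
for zip in *_fastqc.zip; do
  [ -e "$zip" ] || continue
  echo "== $zip"
  unzip -p "$zip" '*/summary.txt' | flag_modules -
done
```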
Trim the reads using fastp:
Create an adapter file first:
touch adapter.fasta
These are the adapter and overrepresented sequences that FastQC screens for:
Illumina Universal Adapter = AGATCGGAAGAG
Illumina Small RNA 3' Adapter = TGGAATTCTCGG
Illumina Small RNA 5' Adapter = GATCGTCGGACT
Nextera Transposase Sequence = CTGTCTCTTATA
PolyA = AAAAAAAAAAAA
PolyG = GGGGGGGGGGGG
Open the newly created adapter.fasta file in a text editor and write:
>H1
AGATCGGAAGAG
Then run the following command, replacing sample.fastq with the name of your FASTQ file:
fastp -i sample.fastq -o trim_sample.fastq --adapter_fasta adapter.fasta
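Because the downloaded run is paired-end, both mates can be trimmed in one fastp call. The sketch below writes the adapter file without an editor and then trims both read files. The helper name `add_adapter` and the `trim_SRR21763320_*` output names are hypothetical. Note that the script overwrites any existing adapter.fasta in the current directory.

```shell
#!/bin/sh
# Append one FASTA record (name, sequence) to an adapter file.
add_adapter() {
  printf '>%s\n%s\n' "$2" "$3" >> "$1"
}

: > adapter.fasta                              # start from an empty file
add_adapter adapter.fasta H1 AGATCGGAAGAG      # Illumina Universal Adapter
# Add further records only if FastQC reports them, for example:
# add_adapter adapter.fasta H2 CTGTCTCTTATA    # Nextera Transposase Sequence

# Paired-end trimming; runs only if fastp and both read files are present.
if command -v fastp >/dev/null 2>&1 \
   && [ -f SRR21763320_1.fastq ] && [ -f SRR21763320_2.fastq ]; then
  fastp -i SRR21763320_1.fastq -I SRR21763320_2.fastq \
        -o trim_SRR21763320_1.fastq -O trim_SRR21763320_2.fastq \
        --adapter_fasta adapter.fasta
fi
```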
Run quality control on the trimmed reads:
fastqc trim_sample.fastq
Download the reference genome:
wget -c https://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr17.fa.gz
gunzip chr17.fa.gz
mv chr17.fa genome.fa
Index the reference genome:
bwa index -a bwtsw genome.fa
Align the reads to the reference genome:
bwa mem -t 2 genome.fa SRR21763320_1.fastq SRR21763320_2.fastq | samtools view -b -o bwa_SRR21763320.bam -
Sort the BAM file using samtools:
samtools sort bwa_SRR21763320.bam > sorted_SRR21763320.bam
Convert the BAM file to SAM:
samtools view sorted_SRR21763320.bam > sorted_SRR21763320.sam
Remove duplicate reads using samtools:
samtools rmdup -sS sorted_SRR21763320.bam rmdup_SRR21763320.bam
Download and unpack GATK:
wget -c https://github.com/broadinstitute/gatk/releases/download/4.3.0.0/gatk-4.3.0.0.zip
unzip gatk-4.3.0.0.zip
Create a sequence dictionary for the reference genome with picard-tools:
picard-tools CreateSequenceDictionary R=genome.fa O=genome.dict
Prepare the BAM file for GATK:
picard-tools AddOrReplaceReadGroups I=rmdup_SRR21763320.bam O=picard_output.bam RGLB=lib1 RGPL=illumina RGPU=run RGSM=SRR21763320 SORT_ORDER=coordinate CREATE_INDEX=true VALIDATION_STRINGENCY=LENIENT
Index the reference genome, then call variants with GATK:
samtools faidx genome.fa
java -jar /path/to/gatk-package-4.3.0.0-local.jar HaplotypeCaller -R genome.fa -I picard_output.bam -O GATK_output.vcf
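HaplotypeCaller expects the FASTA index (genome.fa.fai) and the sequence dictionary (genome.dict) to sit next to the reference, and fails if either is absent. The sketch below is an optional pre-flight check; the function name `missing_ref_files` is hypothetical.

```shell
#!/bin/sh
# List the reference companion files that are expected but absent.
missing_ref_files() {
  ref=$1
  fai="$ref.fai"          # written by: samtools faidx genome.fa
  dict="${ref%.*}.dict"   # written by: picard-tools CreateSequenceDictionary
  for f in "$fai" "$dict"; do
    [ -f "$f" ] || printf '%s\n' "$f"
  done
}

missing=$(missing_ref_files genome.fa)
if [ -n "$missing" ]; then
  printf 'Create these before running HaplotypeCaller:\n%s\n' "$missing"
fi
```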
Filter variants using SnpSift:
cat GATK_output.vcf | java -jar /path/to/SnpSift.jar filter "(( QUAL>=30) & (DP>=10) & (MQ>=30))" > filter.vcf
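To see what the three thresholds do, the same conditions can be approximated with awk. This is an illustrative sketch, not a replacement for SnpSift: it reads QUAL from column 6 and DP and MQ from the INFO column, and it drops records where any of the three values is missing, which may differ from SnpSift's behavior. The function name `filter_vcf` is hypothetical.

```shell
#!/bin/sh
# Keep header lines and records with QUAL>=30, INFO DP>=10 and INFO MQ>=30.
filter_vcf() {
  awk -F '\t' '
    /^#/ { print; next }
    {
      dp = ""; mq = ""
      n = split($8, info, ";")
      for (i = 1; i <= n; i++) {
        if (info[i] ~ /^DP=/) dp = substr(info[i], 4)
        if (info[i] ~ /^MQ=/) mq = substr(info[i], 4)
      }
      if ($6 != "." && $6 + 0 >= 30 &&
          dp != "" && dp + 0 >= 10 &&
          mq != "" && mq + 0 >= 30) print
    }
  ' "$1"
}

# Count the records that pass, if the GATK output is present.
if [ -f GATK_output.vcf ]; then
  filter_vcf GATK_output.vcf | grep -vc '^#'
fi
```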
Annotate variants using SnpEff. The database name must match the reference build, which is hg38 here. To annotate with VEP instead, refer to the VEP documentation.
java -jar /path/to/snpEff.jar hg38 GATK_output.vcf > snpEff_output.vcf
This project is licensed under the MIT License.
This pipeline was inspired by various bioinformatics tools and resources such as FastQC, BWA, GATK, and VEP.