Updated on 2022-10-25
This tutorial use a small example to go through ReDeeM-V pipeline, which starts with mtDNA library fastq files and end with consensus variant calling
- Inputs: R1.fq (150nt), R2.fq (150nt), i5.fq (150nt)
- Key Output: QCplot.png (QC plot assesing mtDNA mapping as well as library complexity, etc)
- Key Output: QualifiedTotalCts, RawGenotypes.Total.StrandBalance, RawGenotypes.VerySensitive.StrandBalance, RawGenotypes.Sensitive.StrandBalance, RawGenotypes.Specific.StrandBalance
git clone https://github.com/chenweng1991/REDEEM-V.git
- python
- bowtie2
- cutadapt (This is cutadapt 3.7 with Python 3.6.9)
- samtools
- bedtools
- sys
- gzip
- pysam
- click
- os
- pickle
- numpy
- pandas
- progress.bar
- collections
- ggplot2
- dplyr
- gridExtra
- plyr
- labeling
REDEEM_V=ThePathToREDEEM-V #The loacation where the REDEEM-V is downloaded to (eg. /lab/solexa_weissman/cweng/Packages/REDEEM-V/)
MyMultiome=$REDEEM_V/MyMultiome/
mitoConsensus=$REDEEM_V/mitoConsensus/
Download hg38 fasta file and mitochondrial blacklist file
mkdir Genome
cd Genome
wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes
wget https://raw.githubusercontent.com/caleblareau/mitoblacklist/master/combinedBlacklist/hg38.full.blacklist.bed
gunzip hg38.fa.gz
Make hg38 mito masked bowtie2 index
bedtools maskfasta -fi hg38.fa -bed hg38.full.blacklist.bed -fo hg38.mitoMask.fa
bowtie2-build hg38.mitoMask.fa hg38.mitoMask ## This takes time and will generate following, if on WI server ln -s /lab/solexa_weissman/cweng/Genomes/GRCH38/GRCH38_Bowtie2_MitoMask/*.bt2 ./
For this tutorial, Link the example fq files
ln -s $REDEEM_V/source/Example.R1.fq.gz ./
ln -s $REDEEM_V/source/Example.R2.fq.gz ./
ln -s $REDEEM_V/source/Example.i5.fq.gz ./
Fastqc and multiqc
mkdir fastqc
fastqc -o fastqc Example.R1.fq.gz Example.i5.fq.gz Example.R2.fq.gz
multiqc fastqc
$MyMultiome/MultiomeATAC_mito.sh -n Tutorial_Mito -1 Example.R1.fq.gz -2 Example.R2.fq.gz -i Example.i5.fq.gz -c 0 -t 8 -m $MyMultiome -b Genome/hg38.mitoMask # add -q for quick, which skip the QC steps
After this step a QCplot is generated
In a ReDeeM experiment, the ATAC and RNA data is analyzed by Cellranger arc. The barcodes.tsv.gz from that will be parsed here. Becasue the barcode in barcodes.tsv.gz is RNA barcode, we need to translate into ATAC barcode (via RNAbc2ATAC.R) to be able to match mtDNA data
ln -s $REDEEM_V/source/barcodes.tsv.gz ./
Rscript $MyMultiome/Helpers/RNAbc2ATAC.R $REDEEM_V barcodes.tsv.gz Tutorial_atac.barcodes.tsv
Threads=24
WD=`pwd`/Out_mitoConsensus
python $mitoConsensus/Preprocess.py -i Tutorial_Mito.uniqmapped.bam -c $Threads -b Tutorial_atac.barcodes.tsv -o $WD -g rCRS -bt BC -sd $mitoConsensus
Run variant calling in parallel
echo $WD
for ((i=1;i<=$Threads;i++))
do
bsub python $mitoConsensus/mitoConsensus.py barcodes.$i $WD BC 30
done
Concat together
$mitoConsensus/Finalize.sh $WD $Threads $mitoConsensus
- Location: The results are saved in $WD/final
- Major 5 files:
- QualifiedTotalCts,
- RawGenotypes.Total.StrandBalance(Least stringent),
- RawGenotypes.VerySensitive.StrandBalance(Less stringent),
- RawGenotypes.Sensitive.StrandBalance(Stringent),
- RawGenotypes.Specific.StrandBalance (Most Stringent)
- QualifiedTotalCts is a table with 6 columnes that show mtDNA coverage per position per cell
Cellbarcode | coordinates on mt genome | # unique frag(total) | # unique frag(less stringent) | # unique frag(stringent) | # unique frag(very stringent) |
---|
- RawGenotypes is a table with 14 columnes that show the consensus variant calling. Each row is a molecule with a potential variant
MoleculeID | CellBC | Pos | Variant | V | Ref | FamSize | V-counts | CSS | DB_Cts | SG_Cts | Is+ | Is- | TotalDepth |
---|
- MoleculeID: Cellbarcode+start+end which is the identifier to define a molecule
- CellBC: Cell barcode
- Pos: The coordinate of the variant
- Variant: A description of the variant
- V: The variant base called on Pos
- Ref: The reference base on Pos
- FamSize: The consensus family size, or the total number of PCR copies for the given molecule
- V-count: Number of PCR copies that support the variant
- CSS: Consensus score, which is the proportion of PCR copies that support the variant
- DB_Cts: Number of double cover copies, which are positions that sequenced by both Read1 and Read2
- SG_Cts: Number of single cover copies, which are positions that sequenced by only Read1 or Read2
- Is+: If the variant is discovered on plus strand
- Is- : If the variant is discovered on minus strand
- TotalDepth: On this given position, total number of unique fragment in the given cell
-
QualifiedTotalCts and RawGenotypes.* are the inputs for REDEEM-R for downstream mutation filtering and phylogenetic tree reconstruction
- First, generate barcodes.tsv.16bp from the Cell Ranger result
zcat ../CellRanger/OBC/outs/filtered_feature_bc_matrix/barcodes.tsv.gz | awk '{print substr($1,1,16)}' > barcodes.tsv.16bp
- Second, Create HashDic.csv like below
## Create a HashDic.csv
#TGTCTTTCCTGCCAG,SBM1037
#CTCCTCTGCAATTAC,SBM1253
#CAGTAGTCACGGTCA,SBM1324
#ATTGACCCGCGTTAG,SBM1218
- Third run the script below, which will generate a plot.pdf and a folder pymulti that contains a result table
python3 $REDEEMV/JohnnyCellHash/RunCellHash.py $REDEEMV OBC OBC_Hash_S1_L001_R1_001.fastq.gz.trim OBC_Hash_S1_L001_R3_001.fastq.gz.trim