Skip to content

Latest commit

 

History

History
168 lines (149 loc) · 6.53 KB

Tutorial_20221025.md

File metadata and controls

168 lines (149 loc) · 6.53 KB

Updated on 2022-10-25

Introduction:

This tutorial use a small example to go through ReDeeM-V pipeline, which starts with mtDNA library fastq files and end with consensus variant calling

  • Inputs: R1.fq (150nt), R2.fq (150nt), i5.fq (150nt)
  • Key Output: QCplot.png (QC plot assesing mtDNA mapping as well as library complexity, etc)
  • Key Output: QualifiedTotalCts, RawGenotypes.Total.StrandBalance, RawGenotypes.VerySensitive.StrandBalance, RawGenotypes.Sensitive.StrandBalance, RawGenotypes.Specific.StrandBalance

Prerequisite

Download the package

git clone https://github.com/chenweng1991/REDEEM-V.git

Software dependency

  • python
  • bowtie2
  • cutadapt (This is cutadapt 3.7 with Python 3.6.9)
  • samtools
  • bedtools

python package dependency

  • sys
  • gzip
  • pysam
  • click
  • os
  • pickle
  • numpy
  • pandas
  • progress.bar
  • collections

R package dependency

  • ggplot2
  • dplyr
  • gridExtra
  • plyr
  • labeling

Main pipeline

Assign paths

REDEEM_V=ThePathToREDEEM-V #The loacation where the REDEEM-V is downloaded to (eg. /lab/solexa_weissman/cweng/Packages/REDEEM-V/)
MyMultiome=$REDEEM_V/MyMultiome/
mitoConsensus=$REDEEM_V/mitoConsensus/

Step 0: Make mitoMask bowtie2 index

Download hg38 fasta file and mitochondrial blacklist file

mkdir Genome
cd Genome
wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz 
wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes
wget https://raw.githubusercontent.com/caleblareau/mitoblacklist/master/combinedBlacklist/hg38.full.blacklist.bed
gunzip hg38.fa.gz

Make hg38 mito masked bowtie2 index

bedtools maskfasta  -fi hg38.fa -bed hg38.full.blacklist.bed -fo hg38.mitoMask.fa
bowtie2-build hg38.mitoMask.fa hg38.mitoMask  ## This takes time and will generate following, if on WI server ln -s /lab/solexa_weissman/cweng/Genomes/GRCH38/GRCH38_Bowtie2_MitoMask/*.bt2 ./

Step1: Get sequencing result and QC

For this tutorial, Link the example fq files

ln -s $REDEEM_V/source/Example.R1.fq.gz ./
ln -s $REDEEM_V/source/Example.R2.fq.gz ./
ln -s $REDEEM_V/source/Example.i5.fq.gz ./

Fastqc and multiqc

mkdir fastqc
fastqc -o fastqc Example.R1.fq.gz Example.i5.fq.gz Example.R2.fq.gz
multiqc fastqc

Step2: mapping and basic QC

$MyMultiome/MultiomeATAC_mito.sh -n Tutorial_Mito -1 Example.R1.fq.gz -2 Example.R2.fq.gz -i Example.i5.fq.gz -c 0 -t 8 -m $MyMultiome -b Genome/hg38.mitoMask  # add -q for quick, which skip the QC steps

After this step a QCplot is generated Old1_HSPC_Mito QCplot

Step3: Parse the Cellranger qualified cell barcode

In a ReDeeM experiment, the ATAC and RNA data is analyzed by Cellranger arc. The barcodes.tsv.gz from that will be parsed here. Becasue the barcode in barcodes.tsv.gz is RNA barcode, we need to translate into ATAC barcode (via RNAbc2ATAC.R) to be able to match mtDNA data

ln -s $REDEEM_V/source/barcodes.tsv.gz ./
Rscript $MyMultiome/Helpers/RNAbc2ATAC.R $REDEEM_V barcodes.tsv.gz Tutorial_atac.barcodes.tsv

Step4, prepare for variant calling

Threads=24
WD=`pwd`/Out_mitoConsensus
python $mitoConsensus/Preprocess.py -i Tutorial_Mito.uniqmapped.bam  -c $Threads -b Tutorial_atac.barcodes.tsv -o $WD -g rCRS -bt BC -sd $mitoConsensus

Step5, run consensus variant calling

Run variant calling in parallel

echo $WD
for ((i=1;i<=$Threads;i++))
do
bsub python $mitoConsensus/mitoConsensus.py  barcodes.$i $WD BC 30
done

Concat together

$mitoConsensus/Finalize.sh $WD $Threads $mitoConsensus

Expected result

  • Location: The results are saved in $WD/final
  • Major 5 files:
    • QualifiedTotalCts,
    • RawGenotypes.Total.StrandBalance(Least stringent),
    • RawGenotypes.VerySensitive.StrandBalance(Less stringent),
    • RawGenotypes.Sensitive.StrandBalance(Stringent),
    • RawGenotypes.Specific.StrandBalance (Most Stringent)
  • QualifiedTotalCts is a table with 6 columnes that show mtDNA coverage per position per cell
Cellbarcode coordinates on mt genome # unique frag(total) # unique frag(less stringent) # unique frag(stringent) # unique frag(very stringent)
  • RawGenotypes is a table with 14 columnes that show the consensus variant calling. Each row is a molecule with a potential variant
MoleculeID CellBC Pos Variant V Ref FamSize V-counts CSS DB_Cts SG_Cts Is+ Is- TotalDepth
  1. MoleculeID: Cellbarcode+start+end which is the identifier to define a molecule
  2. CellBC: Cell barcode
  3. Pos: The coordinate of the variant
  4. Variant: A description of the variant
  5. V: The variant base called on Pos
  6. Ref: The reference base on Pos
  7. FamSize: The consensus family size, or the total number of PCR copies for the given molecule
  8. V-count: Number of PCR copies that support the variant
  9. CSS: Consensus score, which is the proportion of PCR copies that support the variant
  10. DB_Cts: Number of double cover copies, which are positions that sequenced by both Read1 and Read2
  11. SG_Cts: Number of single cover copies, which are positions that sequenced by only Read1 or Read2
  12. Is+: If the variant is discovered on plus strand
  13. Is- : If the variant is discovered on minus strand
  14. TotalDepth: On this given position, total number of unique fragment in the given cell
  • Strand biased variant has been removed StrandBiase

  • QualifiedTotalCts and RawGenotypes.* are the inputs for REDEEM-R for downstream mutation filtering and phylogenetic tree reconstruction

(Optional) Cell Hashing

  • First, generate barcodes.tsv.16bp from the Cell Ranger result
zcat ../CellRanger/OBC/outs/filtered_feature_bc_matrix/barcodes.tsv.gz | awk '{print substr($1,1,16)}' > barcodes.tsv.16bp
  • Second, Create HashDic.csv like below
## Create a HashDic.csv
#TGTCTTTCCTGCCAG,SBM1037
#CTCCTCTGCAATTAC,SBM1253
#CAGTAGTCACGGTCA,SBM1324
#ATTGACCCGCGTTAG,SBM1218
  • Third run the script below, which will generate a plot.pdf and a folder pymulti that contains a result table
python3 $REDEEMV/JohnnyCellHash/RunCellHash.py $REDEEMV OBC OBC_Hash_S1_L001_R1_001.fastq.gz.trim OBC_Hash_S1_L001_R3_001.fastq.gz.trim