Marker Tech

A cost-saving genotyping pattern based on the next-generation sequencing (NGS) technology. We also build an online analysis system which provides several services for using the technology in an easier way. However, in order to meet diverse and highly customized needs, the main source codes are provided for users to modify as needed. The code we provide include primer design function and iBP-seq (improved Bulked-PCR Sequencing) analysis pipeline which can do both genotyping and epigenotyping. The primer design function generates special primer sequences flanking the maker for each template sequence. The NGS data for the sequence analysis pipeline should be generated by using the primer design function.

Preparation

Primer design

Environment
- python>=3.6.0
  - primer3-py>=0.6.1
Data
- a template sequence file in fasta format containing template sequences (header format: >Name|301-304; Sequence length>=601)

iBP-seq (genotyping) analysis

Due to the limitation of some modules, such as pysam, we recommend running these main codes on a Linux operating system when you use them.

Both the genotyping service for iBP-seq and epigenotyping service require these following configurations:

Environment
- python>=3.6.0
  - openpyxl>=3.0.7
  - XlsxWriter>=1.4.0
  - seaborn
  - numpy
  - pandas
  - matplotlib==3.4.3
  - sklearn
  - pysam==0.16.0.1
- fastp>=0.21.0
- bwa>=0.7.17
- samtools>=1.12
Data
- a pair of paired-end sequencing data in fastq format
- a barcode file which order is consistent with the actual order of use (separator='\n')
- a mutation table with header include four columns (CHROM, POS, REF, ALT)
- optional: a reference sequence file in fasta format

iBP-seq (epigenotyping) analysis

Environment
- python>=3.6.0
  - openpyxl>=3.0.7
  - XlsxWriter>=1.4.0
  - seaborn
  - numpy
  - pandas
  - matplotlib==3.4.3
  - sklearn
  - pysam==0.16.0.1
- fastp>=0.21.0
- bsmark>=0.23.1
- samtools>=1.12

Ready to work

Before starting the analysis, if the sequencing file is in .gz compressed format, it also needs to be decompressed.

gzip -dk <filename>.fq.gz

And make sure you have the following empty directories in the working directory. If not, create them first.

Genotyping :

mkdir qc fq umi bam csv xlsx png

Epigenotyping:

mkdir qc fq umi sam bam csv xlsx png

Usage

Primer design

The input parameters of PrimerDesigner.py are: strategy code name (iBP-seq: 3), input template file, file name (including .xlsx suffix), shortest primer length, optimal primer length, longest primer length, shortest amplicon length, optimal amplicon length, longest amplicon length, minimum Tm of primer, optimal Tm of primer, maximum Tm of primer.

An example of the command line is as follows:

python3 PrimerDesigner.py 3 testSeq.fa ./filename.xlsx 18 20 25 280 286 320 50 60 65

iBP-seq (genotyping) analysis

Quality control

Filter out low quality (lower than 20) and sequence length less than 72bp (2*20bp primer + 6bp UMI + 18Bbp bridge + 8bp barcode = 72bp) reads.

fastp -i <filename_1>.fq -I <filename_2>.fq -o qc/clean_1.fq -O qc/clean_2.fq -q 20 --length_required 72 -j qc/<filename>.json -h qc/<filename>.html

Data division

Divide the mixed sequencing data to several files by using the barcode file and build-in bridge sequence (selected by the first parameter). The number of barcodes in the file decides the number of divided samples and the order of barcodes decides the order of each sample. Pair-end sequencing genome sequences are stored as pairs of fastq files in the fq directory and UMI sequences are stored as single fastq files in the umi directory.

python3 SeqPurifier.py iBP barcode/<barcode>.txt qc/clean_1.fq qc/clean_2.fq fq umi

Reference alignment

Align sequence to reference sequence or genome.

for ((f=0; f<$(ls -l fq | grep "_1.fq$" | wc -l); f++));
do
	bwa mem -t 4 -M <Reference>.fa fq/${f}_1.fq fq/${f}_2.fq | samtools view -S -b - | samtools sort -o bam/${f}.sorted.bam -;
done

Build index for sorted bam file.

for b in $(ls bam/*);
do
	samtools index ${b};
done

Marker frequency

Calculate each mutated marker frequency for each sample.

python3.7 AltFraction.py iBP barcode/<barcode>.txt mutation/<mut>.csv bam umi csv

KDE Cluster

Identify different genotypes or methylation types by the minimum point of kernel density estimation (KDE) from the sample population.

for t in $(ls csv/*);
do
	python3.7 Minima.py ${t} xlsx png;
done

iBP-seq (epigenotyping) analysis

This pipeline is an application of iBP-seq (genotyping) method in processing methylation data.

Quality control

fastp -i <filename_1>.fq -I <filename_2>.fq -o qc/clean_1.fq -O qc/clean_2.fq -q 20 --length_required 72 -j qc/<filename>.json -h qc/<filename>.html

Data division

python3 SeqPurifier.py mBP barcode/<barcode>.txt qc/clean_1.fq qc/clean_2.fq fq umi

Reference alignment

Reference split

python3 MethReference.py barcode/<barcode>.txt ref/<ref>.fa
# Build index
for i in $(ls -d ref/*);
do
	bismark_genome_preparation ${i}/ ;
done

Alignment

for ((f=0; f<$(ls -l fq | grep "_1.fq$" | wc -l); f++));
do
	ref_name=$(sed -n $[${f}+1]p barcode/<barcode>.txt | sed "s/.*,//" | sed "s/\r//");
	lineage=$(python3 -c "print('/'.join(sorted('${ref_name}'.split('/'))))" | sed "s/\//_/g");
	bismark -N 0 -L 20 --bowtie2 --non_bs_mm /<abs_path>/ref/"${lineage}"/ --sam -o sam/ --q -1 fq/${f}_1.fq -2 fq/${f}_2.fq;
done

SAM processing

# Text processing
for ((f=0; f<$(ls -l fq | grep "_1.fq$" | wc -l); f++)); 
do
    grep "@" sam/${f}_1_bismark_bt2_pe.sam > sam/${f}-header;
    grep "XA:Z:0" sam/${f}_1_bismark_bt2_pe.sam > sam/${f}-temp1;
    grep "XB:Z:0" sam/${f}_1_bismark_bt2_pe.sam > sam/${f}-temp2;
    awk 'NR==FNR {f1[$1]=$0;next} {idx=$1;if(idx in f1)print$0}' sam/${f}-temp1 sam/${f}-temp2 > sam/${f}-temp4;
    awk 'NR==FNR {f1[$1]=$0;next} {idx=$1;if(idx in f1)print$0}' sam/${f}-temp2 sam/${f}-temp1 > sam/${f}-temp3;
    cat sam/${f}-header sam/${f}-temp3 sam/${f}-temp4 > sam/${f}_1_bismark_bt2_pe_nonmismatch.sam; 
done

# SAM to BAM
for ((f=0; f<$(ls -l fq | grep "_1.fq$" | wc -l); f++));
do
    samtools sort -n -O bam -o bam/${f}_1_bismark_bt2_pe_nonmismatch.bam sam/${f}_1_bismark_bt2_pe_nonmismatch.sam;
    lineage=$(sed -n $[${f}+1]p barcode/barcode.txt | sed "s/.*,//" | sed "s/\r//" | sed "s/\//_/g");
    bismark_methylation_extractor --no_overlap --comprehensive --bedGraph --cytosine_report --CX_context --genome_folder /<abs_path>/ref/${lineage}/ -p bam/${f}_1_bismark_bt2_pe_nonmismatch.bam -o bam/;
    gunzip bam/${f}_1_bismark_bt2_pe_nonmismatch.bismark.cov.gz;
    python3 magic.py bam/${f}_1_bismark_bt2_pe_nonmismatch.bismark.cov bam/${f}_1_bismark_bt2_pe_nonmismatch.CX_report.txt Me-information/${f}_result;done
    samtools sort bam/${f}_1_bismark_bt2_pe_nonmismatch.bam -o bam/${f}_1_bismark_bt2_pe_nonmismatch.sorted.bam;
    samtools index bam/${f}_1_bismark_bt2_pe_nonmismatch.sorted.bam;
done

Methylation activity

for ((f=0; f<$(ls -l fq | grep "_1.fq$" | wc -l); f++));
do
	python3 MethFraction.py umi/${f}.fq bam/${f}_1_bismark_bt2_pe_nonmismatch.sorted.bam Me-information/${f}_result csv/${f}.csv;
done

Histogram plot

for ((f=0; f<$(ls -l fq | grep "_1.fq$" | wc -l); f++));
do
	python3 MethHist.py csv/${f}.csv png;
done

Name		Name	Last commit message	Last commit date
Latest commit History 39 Commits
iBP-seq (epigenotyping)		iBP-seq (epigenotyping)
iBP-seq (genotyping)		iBP-seq (genotyping)
primerDesign		primerDesign
protocols		protocols
.gitignore		.gitignore
README.md		README.md
workflow.md		workflow.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Marker Tech

Preparation

Primer design

iBP-seq (genotyping) analysis

iBP-seq (epigenotyping) analysis

Ready to work

Usage

Primer design

iBP-seq (genotyping) analysis

Quality control

Data division

Reference alignment

Marker frequency

KDE Cluster

iBP-seq (epigenotyping) analysis

Quality control

Data division

Reference alignment

Reference split

Alignment

SAM processing

Methylation activity

Histogram plot

Copyright

About

Releases 4

Packages

Languages

Hanryi/MarkerTech

Folders and files

Latest commit

History

Repository files navigation

Marker Tech

Preparation

Primer design

iBP-seq (genotyping) analysis

iBP-seq (epigenotyping) analysis

Ready to work

Usage

Primer design

iBP-seq (genotyping) analysis

Quality control

Data division

Reference alignment

Marker frequency

KDE Cluster

iBP-seq (epigenotyping) analysis

Quality control

Data division

Reference alignment

Reference split

Alignment

SAM processing

Methylation activity

Histogram plot

Copyright

About

Topics

Resources

Stars

Watchers

Forks

Releases 4

Packages 0

Languages

Packages