A cost-saving genotyping pattern based on the next-generation sequencing (NGS) technology. We also build an online analysis system which provides several services for using the technology in an easier way. However, in order to meet diverse and highly customized needs, the main source codes are provided for users to modify as needed. The code we provide include primer design function and iBP-seq (improved Bulked-PCR Sequencing) analysis pipeline which can do both genotyping and epigenotyping. The primer design function generates special primer sequences flanking the maker for each template sequence. The NGS data for the sequence analysis pipeline should be generated by using the primer design function.
- Environment
- python>=3.6.0
- primer3-py>=0.6.1
- python>=3.6.0
- Data
- a template sequence file in fasta format containing template sequences (header format: >Name|301-304; Sequence length>=601)
Due to the limitation of some modules, such as pysam, we recommend running these main codes on a Linux operating system when you use them.
Both the genotyping service for iBP-seq and epigenotyping service require these following configurations:
- Environment
- python>=3.6.0
- openpyxl>=3.0.7
- XlsxWriter>=1.4.0
- seaborn
- numpy
- pandas
- matplotlib==3.4.3
- sklearn
- pysam==0.16.0.1
- fastp>=0.21.0
- bwa>=0.7.17
- samtools>=1.12
- python>=3.6.0
- Data
- a pair of paired-end sequencing data in fastq format
- a barcode file which order is consistent with the actual order of use (separator='\n')
- a mutation table with header include four columns (CHROM, POS, REF, ALT)
- optional: a reference sequence file in fasta format
- Environment
- python>=3.6.0
- openpyxl>=3.0.7
- XlsxWriter>=1.4.0
- seaborn
- numpy
- pandas
- matplotlib==3.4.3
- sklearn
- pysam==0.16.0.1
- fastp>=0.21.0
- bsmark>=0.23.1
- samtools>=1.12
- python>=3.6.0
Before starting the analysis, if the sequencing file is in .gz compressed format, it also needs to be decompressed.
gzip -dk <filename>.fq.gz
And make sure you have the following empty directories in the working directory. If not, create them first.
Genotyping :
mkdir qc fq umi bam csv xlsx png
Epigenotyping:
mkdir qc fq umi sam bam csv xlsx png
The input parameters of PrimerDesigner.py are: strategy code name (iBP-seq: 3), input template file, file name (including .xlsx suffix), shortest primer length, optimal primer length, longest primer length, shortest amplicon length, optimal amplicon length, longest amplicon length, minimum Tm of primer, optimal Tm of primer, maximum Tm of primer.
An example of the command line is as follows:
python3 PrimerDesigner.py 3 testSeq.fa ./filename.xlsx 18 20 25 280 286 320 50 60 65
Filter out low quality (lower than 20) and sequence length less than 72bp (2*20bp primer + 6bp UMI + 18Bbp bridge + 8bp barcode = 72bp) reads.
fastp -i <filename_1>.fq -I <filename_2>.fq -o qc/clean_1.fq -O qc/clean_2.fq -q 20 --length_required 72 -j qc/<filename>.json -h qc/<filename>.html
Divide the mixed sequencing data to several files by using the barcode file and build-in bridge sequence (selected by the first parameter). The number of barcodes in the file decides the number of divided samples and the order of barcodes decides the order of each sample. Pair-end sequencing genome sequences are stored as pairs of fastq files in the fq directory and UMI sequences are stored as single fastq files in the umi directory.
python3 SeqPurifier.py iBP barcode/<barcode>.txt qc/clean_1.fq qc/clean_2.fq fq umi
Align sequence to reference sequence or genome.
for ((f=0; f<$(ls -l fq | grep "_1.fq$" | wc -l); f++));
do
bwa mem -t 4 -M <Reference>.fa fq/${f}_1.fq fq/${f}_2.fq | samtools view -S -b - | samtools sort -o bam/${f}.sorted.bam -;
done
Build index for sorted bam file.
for b in $(ls bam/*);
do
samtools index ${b};
done
Calculate each mutated marker frequency for each sample.
python3.7 AltFraction.py iBP barcode/<barcode>.txt mutation/<mut>.csv bam umi csv
Identify different genotypes or methylation types by the minimum point of kernel density estimation (KDE) from the sample population.
for t in $(ls csv/*);
do
python3.7 Minima.py ${t} xlsx png;
done
This pipeline is an application of iBP-seq (genotyping) method in processing methylation data.
fastp -i <filename_1>.fq -I <filename_2>.fq -o qc/clean_1.fq -O qc/clean_2.fq -q 20 --length_required 72 -j qc/<filename>.json -h qc/<filename>.html
python3 SeqPurifier.py mBP barcode/<barcode>.txt qc/clean_1.fq qc/clean_2.fq fq umi
python3 MethReference.py barcode/<barcode>.txt ref/<ref>.fa
# Build index
for i in $(ls -d ref/*);
do
bismark_genome_preparation ${i}/ ;
done
for ((f=0; f<$(ls -l fq | grep "_1.fq$" | wc -l); f++));
do
ref_name=$(sed -n $[${f}+1]p barcode/<barcode>.txt | sed "s/.*,//" | sed "s/\r//");
lineage=$(python3 -c "print('/'.join(sorted('${ref_name}'.split('/'))))" | sed "s/\//_/g");
bismark -N 0 -L 20 --bowtie2 --non_bs_mm /<abs_path>/ref/"${lineage}"/ --sam -o sam/ --q -1 fq/${f}_1.fq -2 fq/${f}_2.fq;
done
# Text processing
for ((f=0; f<$(ls -l fq | grep "_1.fq$" | wc -l); f++));
do
grep "@" sam/${f}_1_bismark_bt2_pe.sam > sam/${f}-header;
grep "XA:Z:0" sam/${f}_1_bismark_bt2_pe.sam > sam/${f}-temp1;
grep "XB:Z:0" sam/${f}_1_bismark_bt2_pe.sam > sam/${f}-temp2;
awk 'NR==FNR {f1[$1]=$0;next} {idx=$1;if(idx in f1)print$0}' sam/${f}-temp1 sam/${f}-temp2 > sam/${f}-temp4;
awk 'NR==FNR {f1[$1]=$0;next} {idx=$1;if(idx in f1)print$0}' sam/${f}-temp2 sam/${f}-temp1 > sam/${f}-temp3;
cat sam/${f}-header sam/${f}-temp3 sam/${f}-temp4 > sam/${f}_1_bismark_bt2_pe_nonmismatch.sam;
done
# SAM to BAM
for ((f=0; f<$(ls -l fq | grep "_1.fq$" | wc -l); f++));
do
samtools sort -n -O bam -o bam/${f}_1_bismark_bt2_pe_nonmismatch.bam sam/${f}_1_bismark_bt2_pe_nonmismatch.sam;
lineage=$(sed -n $[${f}+1]p barcode/barcode.txt | sed "s/.*,//" | sed "s/\r//" | sed "s/\//_/g");
bismark_methylation_extractor --no_overlap --comprehensive --bedGraph --cytosine_report --CX_context --genome_folder /<abs_path>/ref/${lineage}/ -p bam/${f}_1_bismark_bt2_pe_nonmismatch.bam -o bam/;
gunzip bam/${f}_1_bismark_bt2_pe_nonmismatch.bismark.cov.gz;
python3 magic.py bam/${f}_1_bismark_bt2_pe_nonmismatch.bismark.cov bam/${f}_1_bismark_bt2_pe_nonmismatch.CX_report.txt Me-information/${f}_result;done
samtools sort bam/${f}_1_bismark_bt2_pe_nonmismatch.bam -o bam/${f}_1_bismark_bt2_pe_nonmismatch.sorted.bam;
samtools index bam/${f}_1_bismark_bt2_pe_nonmismatch.sorted.bam;
done
for ((f=0; f<$(ls -l fq | grep "_1.fq$" | wc -l); f++));
do
python3 MethFraction.py umi/${f}.fq bam/${f}_1_bismark_bt2_pe_nonmismatch.sorted.bam Me-information/${f}_result csv/${f}.csv;
done
for ((f=0; f<$(ls -l fq | grep "_1.fq$" | wc -l); f++));
do
python3 MethHist.py csv/${f}.csv png;
done
copyright (c) Li Lab of HZAU, All rights reserved.