Skip to content

A tool for genome-wide prediction of R-loops using deep learning method

License

Notifications You must be signed in to change notification settings

PEHGP/deepRloopPre

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

51 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

deepRloopPre

Introduction

R-loops are three-stranded nucleic acid structures composed of a DNA:RNA hybrid strand and a single-stranded DNA. R-loops are widespread in different species and participate in a variety of biological processes. The ssDRIP-seq technology developed in our laboratory can efficiently and strand-specifically detect the whole genome R-loops and was widely used in multiple species. Here, we develop a prediction tool deepRloopPre based on deep learning from the ssDRIP-seq data, which could predict the locations and profiles of strand-specific genome-wide R-loops. The deepRloopPre is only dependent on DNA sequences and its performance is better than other R-loop prediction tools.

deepRloopPre contains the following files:

  • deepRloopPredict.py (main executable script for predicting R-loop)
  • deepRloopData.py (main executable script for getting your own training set)
  • deepRloopTrain.py (main executable script for training your own model)
  • deepRloopEval.py (main executable script for evaluating model)

In addition, we also provide a online tool, but only one sequence of 128kb can be predicted at a time.

Installation

Create a new environment:

$ conda create -n deepRloopPre python=3.7

Install deepRloopPre:

$ conda activate deepRloopPre
$ git clone https://github.com/PEHGP/deepRloopPre
$ cd deepRloopPre
$ conda env update -n deepRloopPre -f ml.yaml
$ python setup.py install

Quick start

Just prediction

$ deepRloopPredict.py --h5 Arabidopsis_thaliana.hdf5 --fasta your_genome.fasta --prefix test

You can download the trained model here.

Training your own model step by step

1. Format your training data

Getting the deepRloopData.py input file test_drip.bdg.

--effectiveGenomeSize parameter needs to be set to your own.

$ bamCoverage -v -p 30 -b test_fwd.bam -o test_drip.bdg --binSize 1 --effectiveGenomeSize 2131846805 --normalizeUsing RPGC --outFileFormat bedgraph

Getting the deepRloopData.py input file test_qpois.bdg.

-g parameter needs to be set to your own.

$ macs2 callpeak -f BAMPE --trackline -B -t test_fwd.bam -g 2131846805 -n test_fwd --keep-dup all
$ macs2 bdgcmp -t test_fwd_treat_pileup.bdg -c test_fwd_control_lambda.bdg -m qpois --o-prefix test

Getting the training data.

$ deepRloopData.py --fasta genome.fasta --drip test_drip.bdg --qpois test_qpois.bdg --prefix test

2. Training your model

$ deepRloopTrain.py --npz dataset.npz --prefix test

3. Using your model

$ deepRloopPredict.py --h5 your_model.hdf5 --fasta your_genome.fasta --prefix test

4. Evaluation your model

$ deepRloopEval.py getvalue --truepeak observed_peaks.bed --predpeak test_final_predict.bed --truebw observed_drip.bw --predbw test_predict.bw --prefix test 
$ deepRloopEval.py plotpr --truepeak observed_peaks.bed --predpeak test_all_predict.bed --prefix test

Prediction

$ deepRloopPredict.py -h
usage: $ deepRloopPredict.py --h5 model.hdf5 --fasta genome.fasta --prefix test

A tool for predicting R-loop profiles and R-loop locations.

Required arguments:
  --h5 H5               Saved model (default: None)
  --fasta FASTA         FASTA format file, usually a genome FASTA file. (default: None)
  --prefix PREFIX

Optional arguments:
  --help, -h            show this help message and exit
  --chromsize CHROMESIZE
                        If this file is not provided, all the sequences in FASTA file will be predicted. If this file is provided, only the sequences
                        contained in this file will be predicted. chrom_size.txt file has two columns separated by tabs. The first column is the sequence name
                        and the second column is the sequence length. (default: None)
  --strand {fwd,rev}    fwd:Predict current sequence. rev:Predict the reverse complement sequence. (default: fwd)
  --threshold THRESHOLD
                        The threshold score is used to judge whether an region is an R-loop region, the value range is from 0 to 1. (default: 0.5)
  --mem {high,low}      low option can reduce memory, but the program will run slower. (default: high)
  --version             show program's version number and exit

chrom_size.txt file has two columns separated by tabs. The file format is as follows:

Chr1	30427671
Chr2	19698289
Chr3	23459830
Chr4	18585056
Chr5	26975502

Prefix is set to test and the deepRloopPredict.py output as follows:

  • test_predict.bw R-loop profile prediction results of bigWig format
  • test_predict.bdg R-loop profile prediction results of bedGraph format
  • test_predict.bed original R-loop location prediction results of bed format
  • test_final_predict.bed final R-loop location prediction results of bed format. Merging R-loop region with a distance of less than 300bp and then deleting region with a distance of less than 100bp.

Training your own data

1. Format your training data

$ deepRloopData.py -h
usage: $ deepRloopData.py --fasta genome.fasta --drip test_drip.bdg --qpois test_qpois.bdg --prefix test

format the data for training.

Required arguments:
  --fasta FASTA         FASTA format file, usually a genome FASTA file. (default: None)
  --drip DRIP           R-loop profiles in bdg format obtained from ssDRIP-seq. (default: None)
  --qpois QPOIS         -np.log10(qvalue) in bdg format obtained from macs2. (default: None)
  --prefix PREFIX

Optional arguments:
  --help, -h            show this help message and exit
  --chromsize CHROMESIZE
                        If this file is not provided, all the sequences in FASTA file will be formated. If this file is provided, only the sequences
                        contained in this file will be formated. chrom_size file has two columns separated by tabs. The first column is the sequence name
                        and the second column is the sequence length. (default: None)
  --strand {fwd,rev}    fwd:format current sequence. rev:format the reverse complement sequence. (default: fwd)
  --logq QSCORE         The threshold score (-np.log10(qvalue)) is used to judge whether an region is an R-loop region. (default: 2)
  --chrlist CHRLIST [CHRLIST ...]
                        A list of space-delimited sequence names containing those sequence that should be used as validation set. If this parameter is not
                        provided, 20% of the data in the training set will be randomly selected as the validation set. (default: None)
  --version             show program's version number and exit

The deepRloopData.py input test_drip.bdg can be obtained from BAM file:

$ bamCoverage -v -p 30 -b test_fwd.bam -o test_drip.bdg --binSize 1 --effectiveGenomeSize 2131846805 --normalizeUsing RPGC --outFileFormat bedgraph

--effectiveGenomeSize parameter needs to be set to your own.

test_fwd.bam BAM needs to be obtained by aligning raw data to the genome.If you use ssDRIP-seq, you can refer to this wiki.

Of course, you can also generate test_drip.bdg based on the bedGraph format by yourself.

The deepRloopData.py input test_qpois.bdg can be obtained as follows:

$ macs2 callpeak -f BAMPE --trackline -B -t test_fwd.bam -g 2131846805 -n test_fwd --keep-dup all
$ macs2 bdgcmp -t test_fwd_treat_pileup.bdg -c test_fwd_control_lambda.bdg -m qpois --o-prefix test

-g means genome size and needs to be set to your own.
-f means format of tag file and needs to be set to your own. If it is BAM for paired-end reads, BAMPE is used.

Prefix is set to test and the deepRloopData.py output is test.npz.

2. Training your model

$ deepRloopTrain.py -h
usage: $ deepRloopTrain.py --npz dataset.npz --prefix test

Building your own deepRloopPre model.

Required arguments:
  --npz NPZ        Training set in npz format. (default: None)
  --prefix PREFIX

Optional arguments:
  --help, -h       show this help message and exit
  --epoch EPOCH    Epoch (default: 1000)
  --bs BATCHSIZE   batch size (default: 20)
  --h5 H5          regression h5 model for transfer training of classification (default: None)
  --version        show program's version number and exit

The input dataset.npz can be obtained from deepRloopData.py.
The output is a series of files in hdf5 format, among which the newly generated hdf5 file is the final trained model.

3. Evaluation your model

$ deepRloopEval.py -h
usage: deepRloopEval.py [-h] [--version]  ...

Evaluation of prediction results.This program can output F1-score, Precision, Recall for R-loop location and Spearman for R-loop abundance or draw a precision-recall curve.
A detailed sub-commands help is available by typing:
        deepRloopEval.py getvalue -h
        deepRloopEval.py plotpr -h
	deepRloopEval.py plotpeak -h
	deepRloopEval.py plotgene -h
	deepRloopEval.py plotcorr -h

optional arguments:
  -h, --help  show this help message and exit
  --version   show program's version number and exit

commands:
  
    getvalue  getting the F1-score, Precision, Recall, Jaccard, AP, Spearman.
    plotpr    plot precision-recall curve for many samples.
    plotpeak  plot R-loop peak region metaplot.
    plotgene  plot sense/antisense metaplot or region metaplot.
    plotcorr  plot observed and predicted sense/antisense correlation.

example usage:
        deepRloopEval.py getvalue --truepeak observed_peaks.bed --predpeak predicted_peaks.bed --prefix test
        deepRloopEval.py plotpr --target target --prefix test
	deepRloopEval.py plotpeak --truepeak observed_peaks.bed --predpeak predicted_all.bed --truebw true.bw --predbw pred.bw --prefix test
	deepRloopEval.py plotgene --genebed gene.bed --truefwdbw truefwd.bw --truerevbw truerev.bw --predfwdbw predfwd.bw --predrevbw predrev.bw --prefix test
	deepRloopEval.py plotcorr --genebed gene.bed --truefwdbw truefwd.bw --truerevbw truerev.bw --predfwdbw predfwd.bw --predrevbw predrev.bw --genebody --tssextend 300 --ttsextend 300 --prefix test

The observed_peaks.bed is true R-loop locations.
The predicted_peaks.bed is predicted R-loop locations.
getvalue can calculate the spearman of the R-loop profiles, and you need to use --truebw and --predbw.

Third-party software

Citing this work

Li, K., Wu, Z., Zhou, J., Xu, W., Li, L., Liu, C., Li, W., Zhang, C., & Sun, Q. (2022). R-loopAtlas: An integrated R-loop resource from 254 plant species sustained by a deep-learning-based tool. Molecular plant, S1674-2052(22)00447-6.

About

A tool for genome-wide prediction of R-loops using deep learning method

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages