This repo contains the source code of the algorithm for finding the optimal set of Topologically Associating Domains (TADs) using a combination of Hi-C and ChIP-seq/DNAme data. The algorithm is implemented in Python and is suitable for the identification of optimized TAD partitions across different resolutions in Drosophila and mammalian species.
To successfully run optimalTAD
you will need to install the following dependencies:
- C++11
- Python 3
- MPICH2 or Open MPI (for parallel computing)
- boost (for macOS)
Also, one can set up a conda environment with all required dependencies installed using the following .yml files:
Linux users:
conda env create -f environment_optimaltad_linux64.yml
macOS users:
brew install boost # if boost libraries are not installed yet
conda env create -f environment_optimaltad_osx.yml
Please note that the algorithm was tested on Linux and macOS operating systems only, therefore we can't guarantee that it will work on Windows as well.
Clone this repo and install optimalTAD
using pip:
git clone https://github.com/cosmoskaluga/optimalTAD
cd optimalTAD
./install.sh
To launch the algorithm type the following at the command line:
optimalTAD [-h] [--hic HIC [HIC ...]] [--chipseq CHIPSEQ [CHIPSEQ ...]] [--output OUTPUT] [--np NP] [--resolution RESOLUTION] [--stepsize STEPSIZE] [--gamma_max GAMMA_MAX] [--hic_format HIC_FORMAT] [--empty_row_imputation]
[--truncation] [--log2_hic] [--log2_chip] [--zscore_chip] [--balance | --no-balance] [--mammal] [--window_size_min WINDOW_SIZE_MIN] [--window_size_max WINDOW_SIZE_MAX]
Required and optional arguments:
-h, --help Help message
--hic HIC [HIC ...] Iteratively corrected Hi-C matrices in .hdf5 or .cool format
--chipseq CHIPSEQ [CHIPSEQ ...] Epigenetic data (ChIP-seq in .bedgraph or .bw format)
--output OUTPUT Output directory (='./output')
--np [NP] Number of processors (=1)
--resolution [RESOLUTION] Resolution of Hi-C matrices (=1)
--stepsize [STEPSIZE] Step size to increment gamma parameter in Armatus (=0.05)
--gamma_max [GAMMA_MAX] Max value of the gamma parameter (=4)
--hic_format [HIC_FORMAT] Hi-C matrices input format for armatus (=txt.gz)
--empty_row_imputation Empty line imputation (=False)
--truncation Truncation of a Hi-C-matrix (=False)
--log2_hic log2 transformation of Hi-C matrix (=False)
--log2_chip log2 transformation of ChIP-seq values (=False)
--zscore_chip Z-score transformation of ChIP-seq values(=False)
--balance, --no-balance Hi-C matrix is iteratively normalized (='--balance')
--mammal Input data is derived from mammalian species (=False)
--window_size_min [WINDOW_SIZE_MIN] Minimal window size in insulation score method (for mammals only!)
--window_size_max [WINDOW_SIZE_MAX] Maximal window size in insulation score method (for mammals only!)
All listed arguments can also be specified in the config.ini
configuration file.
Both Hi-C and ChIP-Seq data are required for optimalTAD
running. We strongly recommend you perform iterative correction (Imakaev et al, 2012) on your Hi-C data before running optimalTAD
. ChIP-seq coverage track should be normalized by input and stored in .bedgraph or .bw file. No further preparation of ChIP-seq data is required, the algorithm will binarize coverage with respect to a chosen resolution of the Hi-C map and provide log2 and z-score transformation if needed.
Note
: optimalTAD
utilizes the two well-known TAD calling algorithms to produce all possible TAD sets and the choice of method depends on the species type input data originates from. The default method is Armatus, which is recommended for the analysis of Drosophila's topological domains. If you work with mammalian contact maps (--mammals
argument), optimalTAD
switches to the Insulation Score (IS) method from cooltools package.
First, execute test_data.sh
script:
chmod a+x ./test_data.sh
./test_data.sh
It will create testdata
folder containing Hi-C and ChIP-seq data of Drosophila chromosome 2L. Next, run optimalTAD
:
optimalTAD run
Hi-C data with the obtained optimal TAD set can be visualized using the function below:
optimalTAD visualize [-h] [--samplename SAMPLENAME] [--region REGION] [--resolution RESOLUTION] [--chipseq CHIPSEQ] [--log2_chip] [--zscore_chip] [--rnaseq RNASEQ]
with the following arguments:
-h, --help Help message
--samplename SAMPLENAME Samplename of Hi-C data (for example, LacZ_1)
--region REGION Genome region to plot (for example, chr2L:1,000,000-5,000,000)
--resolution RESOLUTION Resolution of Hi-C matrix (=1)
--chipseq CHIPSEQ Path to epigenetic data
--log2_chip log2 transformation of epigenetic data
--zscore_chip Z-score transformation of epigenetic data
--rnaseq RNASEQ Add additional track to the plot
optimalTAD
documentation is available on readthedocs.
"optimalTAD: annotation of topologically associating domains based on chromatin marks enrichment." Dmitrii N. Smirnov, Anna D. Kononkova, Debra Toiber, Mikhail S. Gelfand, Ekaterina E. Khrameeva bioRxiv 2023.03.06.531254; doi: https://doi.org/10.1101/2023.03.06.531254
Scripts reproducing the main analyses from the manuscript can be found here: https://github.com/cosmoskaluga/optimaltad_manuscript_analysis