A tool for computing a data distribution for phylogenetic inference with site repeats via judicious hypergraph partitioning.
Hypergraphs are an extension of graphs. Each edge is allowed to cover any subset (with at least two elements) from the set of vertices (rather than just connecting two vertices).
In judicious partitioning, like in "classical" graph partitioning, the graph is divided into a given number of disjoint subgraphs (blocks) such that the number of cuts of edges is minimized. In hypergraphs, an edge can be cut more than once and in our instance, each cut is counted separately. In judicious partitioning, the maximum number of hyperedges per block is minimized instead of the number of vertices per block.
- Boost
- Intel's ThreadBuildingBlocks (TBB)
- CMake and Make for building HyperPhylo
All dependencies should be available in your package manager.
The programm can be build using gcc or clang. However, note that the resulting binaries built with clang resulted in a lower runtime in our experiments.
First, adjust TBB's path in the CMakeList.txt in JudicousPartitioning by adjusting the following lines:
include_directories(include <tbb-path>/include/tbb/)
link_directories(<tbb-path>/lib/)
Where <tbb-path>
is the path to your TBB installation.
Then, run the following commands to build HyperPhylo:
cd JudiciousPartitioning
mkdir build
cd build
cmake ..
make JudiciousPartitioning
Optional flags include -DCMAKE_CXX_FLAGS="-DNDEBUG"
to disable assertions and -DOPENMP_ENABLED=FALSE
to disable OpenMP.
The programm can be run as follows:
Usage: ./JudiciousPartitioning repeats_file k1[,k2[,k3...]] [partition_number]
Where repeats_file
is a file describing the site repeats and partition_number
is the number of the partition to be split (defaults to partition 0). A split with the respective number of block whill be computed for each given k.
A repeats file is generated from the partitioned MSA and a phylogenetic tree. The repeats file starts with the number of partitions, a space, and the number of internal nodes of the tree. This is followed by a blocks describing the partitions (partition blocks). A partition block starts with the partition name, a space, and the number of sites in this partition. Then, it contains one line per internal node of the tree. Each element in the line corresponds to a site, and is the repeats class identifier of this site. Elements are separated by spaces. All sites that have the same repeats class identifier belong to the same repeats class.
Example of a repeats file with two partitions and a tree with three internal nodes.
2 3
partition_0 6
0 1 2 3 4 4
0 0 0 1 2 3
0 1 2 3 4 5
partition_1 4
0 1 2 3
0 1 1 2
0 1 2 3
The output describes which CPU receives which site(s) from which partition.
It contains one data block per CPU. A block starts with the CPU name, a space, and the number of partitions that CPU works on. Then, it contains one line per partition. Each line comprises the partition name, and a sequence of site identifiers, separated with spaces.
A site identifier corresponds to the index of the the site in its original input data partition, starting from 0.
Additionally, the runtime will be printed as Runtime: xxxx ms
Example output:
2
CPU1 1
partition_0 5 0 1 2 3 4
CPU2 2
partition_0 1 5
partition_1 4 0 1 2 3
Runtime: 100 ms