Skip to content

Latest commit

 

History

History
181 lines (134 loc) · 6.18 KB

README.md

File metadata and controls

181 lines (134 loc) · 6.18 KB

mope

Mitochondrial Ontogenetic Phylogeny Estimation

Mope is a program for making inferences about the dynamics of mitochondrial heteroplasmy during different life stages, using approaches from population genetics and phylogenetics. See the preprint or the paper.

Requirements

Mope is written in Python and requires Python 2.7+ or Python 3.1+. The following python modules are required:

All of these, except for emcee, future, and lru-dict, are included with the Anaconda Python distribution.

To install these dependencies, you can use pip:

# pip install -U numpy scipy pandas h5py emcee future lru-dict

Mope also requires Cython, the Python development headers and libraries, and the GNU Scientific Library development libraries. Many systems will already have Cython. On Ubuntu, the remaining libraries can be installed with the following command:

# apt-get install python-dev libgsl0-dev

Installation

To install the most up-to-date version of mope, clone this repository and use Python and distutils:

# git clone https://github.com/ammodramus/mope
# cd mope/
# python setup.py install

This may require superuser priveleges; to install in your home directory replace install with install --user

mope can also be installed using pip; however the version on PyPI may not be updated as frequently. To install with pip:

# pip install mope

Again, this command may require superuser priveleges on some systems; in this case, use the command

# pip install --user mope

A successful installation will install the mope library for use in Python and an executable script called mope, which can be used to run data analysis and inference, perform simulations, and execute a number of other utility functionalities. This executable script should be in the user's PATH after installation.

To obtain example files and scripts, clone this repository rather than install by pip. Mope is not supported on Windows.

Obtaining allele frequency transition files

Likelihood calculations with mope require precomputed allele frequency transition distributions. Mope can download these automatically:

mope download-transitions

Transition distributions can also downloaded here (1.1 GB, md5). Note that if you are downloading this file programatically (e.g., using wget or curl), you will need to rename the downloaded file to transitions.tar.gz, due to limitations of our filehosting service.

To generate allele frequency transition distributions locally, run

mope generate-commands

This command will generate many commands to be run in parallel so that the transition distributions can be generated more efficiently.

Usage

For usage, try

# mope run --help

or see examples/.

Format specifications

Data input

For inference with mope, allele frequency data can be provided in two formats, either as allele frequency data or as allele count data. In each case, the data takes the form of a tab-delimited table.

Required columns are the data columns, having the names of the different tissues in the ontogenetic phylogeny (and corresponding to the leaf nodes of the phylogeny) and any age columns for ages corresponding to ontogenetic phylogeny components that accumulate drift and mutation with time.

For allele frequency data, the above data columns contain the allele frequencies. For allele count data, the data columns contain the counts of the focal heteroplasmic allele and additional (required) coverage columns contain the total coverage. Count columns must be named x_n, where x is the name of a data column.

See examples/data/ for an example dataset in each format.

Ontogenetic tree file

Ontogenetic trees are specified in a modified NEWICK format. Each node requires a unique name, and a length. Only alphanumeric characters and underscores are allowed in node names.

Node lengths specify the name of the parameter pair (i.e., the genetic drift and mutation/selection parameters) associated with the branch. Optionally, this parameter name may be multiplied by an age variable, indicating that this parameter is to be interpreted as a branch length that depends on some age. (Note that this age name must be a variable in the data file.)

It is also possible to specify that the genetic drift for a certain parameter is to be modeled as a bottleneck. This done by appending ^ to the parameter name.

These three ways of specifying a node are demonstrated here for a node named mother_blood:

mother_blood:blo               # simple genetic drift, no dependence on age
mother_blood:blo*mother_age    # rate of accumulation of drift, with mother_age
mother_blood:blo^              # mother_blood is a bottleneck

For a complete example, here is the ontogenetic phylogeny used in the original study.

(((mother_blood:blo*mother_age)mother_fixed_blood:fblo,(mother_cheek:buc*mother_age)mother_fixed_cheek:fbuc)somM:som,((((child_blood:blo*child_age)child_fixed_blood:fblo,(child_cheek:buc*child_age)child_fixed_cheek:fbuc)som1:som)loo:loo*mother_birth_age)eoo:eoo)emb;

Parameters file (simulations only)

The parameters file specifies simulation parameters. It is a whitespace-delimited table of parameter names (first column, must match tree file) and their values (second column). See examples/params/ for examples.