Basic goals #1

Open
jeromekelleher opened this issue Mar 29, 2019 · 2 comments

Comments

@jeromekelleher
Member

This issue is to outline and discuss the basic goals and initial structure of the repository. Pinging @hyanwong, @awohns, @leospeidel, @brianzhang01 and @pierpal for input. Please respond in this thread with any thoughts.

Goal

The goal of this repo is to create standardised ways of comparing two (or more) tree sequences. These can be both simple metrics and standardised plots using matplotlib and/or seaborn. Various aspects of the tree sequences, such as topological distances under tree metrics and overall coalescence time distributions, should be considered. Basically, we want an easy-to-use and robust toolkit that has all of the useful ways of comparing tree sequences in one place.

Initial functionality: truth-to-estimate comparisons

  • Compute the weighted KC-distance along the chromosome (as used in the tsinfer paper).
  • Compute a pairwise TMRCA heatmap (like Fig. 2c,d in the Relate paper).
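
To make the first item concrete, here is a minimal pure-Python sketch of a span-weighted, topology-only KC distance (the lambda = 0 case) between two tree sequences. It is only an illustration of the idea, not the implementation used in the tsinfer paper. The data layout is an assumption made for this example: each tree sequence is a list of `(left, right, parent)` tuples, where `parent` maps every node to its parent and the root to `None`. The helper names (`kc_topological`, `span_weighted_kc`) are hypothetical. In practice one would presumably rely on tskit's own `kc_distance` methods rather than code like this.

```python
from itertools import combinations
from math import sqrt


def mrca(parent, a, b):
    """Most recent common ancestor of nodes a and b in a single tree."""
    ancestors = set()
    while a is not None:
        ancestors.add(a)
        a = parent[a]
    while b not in ancestors:
        b = parent[b]
    return b


def depth(parent, node):
    """Number of edges between node and the root."""
    d = 0
    while parent[node] is not None:
        node = parent[node]
        d += 1
    return d


def kc_topological(parent1, parent2, samples):
    """KC distance with lambda = 0: compare root-to-MRCA edge counts per pair."""
    total = 0
    for a, b in combinations(samples, 2):
        d1 = depth(parent1, mrca(parent1, a, b))
        d2 = depth(parent2, mrca(parent2, a, b))
        total += (d1 - d2) ** 2
    return sqrt(total)


def span_weighted_kc(trees1, trees2, samples):
    """Average the per-tree distance over the genome, weighted by overlap span.

    Both inputs must be contiguous and cover the same interval.
    """
    i = j = 0
    total = 0.0
    length = trees1[-1][1] - trees1[0][0]
    while i < len(trees1) and j < len(trees2):
        l1, r1, p1 = trees1[i]
        l2, r2, p2 = trees2[j]
        left, right = max(l1, l2), min(r1, r2)
        total += (right - left) * kc_topological(p1, p2, samples)
        # Advance whichever tree(s) end at the current right breakpoint.
        if r1 == right:
            i += 1
        if r2 == right:
            j += 1
    return total / length


# Toy example: three samples (0, 1, 2), internal nodes 3 and 4 (root).
tree_a = {0: 3, 1: 3, 2: 4, 3: 4, 4: None}  # ((0,1),2)
tree_b = {0: 3, 2: 3, 1: 4, 3: 4, 4: None}  # ((0,2),1)
truth = [(0, 10, tree_a)]
estimate = [(0, 4, tree_a), (4, 10, tree_b)]
weighted_kc = span_weighted_kc(truth, estimate, samples=[0, 1, 2])
```

In the toy example the two tree sequences agree over the first 4 units and differ over the remaining 6, so the result is 0.6 times the distance between the two topologies.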

Repo structure

The repository should be structured as an installable Python package, which we will distribute via PyPI and conda-forge. As such, dependencies should be kept to a minimum (and certainly be packages that are easily installed via pip/conda). We should consider Jupyter notebooks as a first-class user of the module, so that quick analyses of tree sequences can be done in notebooks in a user-friendly way.

@brianzhang01
Member

I'm pretty new to the whole area, but am starting to think about this a bit. Agreed that taking all pairwise TMRCAs and doing all sorts of distributional / summary statistic comparisons is a good idea.
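
As a rough illustration of what one such summary statistic could look like, the sketch below computes all pairwise TMRCAs in a single tree and then the root-mean-square error between a true and an estimated tree. The representation (a `parent` dict with the root mapped to `None`, plus a `time` dict of node ages) and the function names are assumptions made for this example, not an existing API. The same per-pair values could also be arranged into a matrix for a heatmap.

```python
from itertools import combinations
from math import sqrt


def pairwise_tmrca(parent, time, samples):
    """Return {(a, b): TMRCA} for every pair of samples in one tree."""
    result = {}
    for a, b in combinations(samples, 2):
        # Collect all ancestors of a (including a itself).
        ancestors = set()
        node = a
        while node is not None:
            ancestors.add(node)
            node = parent[node]
        # Walk up from b until we hit one of them: that node is the MRCA.
        node = b
        while node not in ancestors:
            node = parent[node]
        result[(a, b)] = time[node]
    return result


def tmrca_rmse(true_tmrca, est_tmrca):
    """Root-mean-square error between two pairwise TMRCA dictionaries."""
    squared = [(true_tmrca[pair] - est_tmrca[pair]) ** 2 for pair in true_tmrca]
    return sqrt(sum(squared) / len(squared))


# Toy example: same topology ((0,1),2), but node 3 is dated too old in the estimate.
parent = {0: 3, 1: 3, 2: 4, 3: 4, 4: None}
true_time = {0: 0.0, 1: 0.0, 2: 0.0, 3: 1.0, 4: 2.0}
est_time = {0: 0.0, 1: 0.0, 2: 0.0, 3: 1.5, 4: 2.0}

true_tmrca = pairwise_tmrca(parent, true_time, [0, 1, 2])
est_tmrca = pairwise_tmrca(parent, est_time, [0, 1, 2])
error = tmrca_rmse(true_tmrca, est_tmrca)
```

Only the (0, 1) pair differs here (1.0 versus 1.5), so the error reflects a single mis-dated node averaged over three pairs.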

I've also been reading a bit about machine learning approaches to tree-like structures. That's gotten me into parse trees for natural language processing, which has its own set of metrics: https://tech.grammarly.com/blog/the-dirty-little-secret-of-constituency-parser-evaluation.

Here are two writeups I found on the genetics side. The TREESPACE package (R) may be worth some study.
https://cran.r-project.org/web/packages/Quartet/vignettes/Tree-distance-metrics.pdf
https://onlinelibrary.wiley.com/doi/full/10.1111/1755-0998.12676

@hyanwong
Member

hyanwong commented Aug 24, 2023

In tskit-dev/tsdate#310, @nspope and @petrelharp have developed a nice way of finding "equivalent" nodes between tree sequences, which could be useful in this repo. It would be an efficient alternative to comparing all pairwise TMRCAs, and less biased by polytomies (see tskit-dev/tsdate#301 (comment)).
