Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updates to two-locus branch stats (python) #3043

Merged
merged 1 commit into from
Nov 5, 2024

Conversation

lkirk
Copy link
Contributor

@lkirk lkirk commented Oct 25, 2024

During the validation of the original branch LD algorithm, we realized that the LD matrix could get "poisoned" with NaN values if we attempted to make an adjustment to a node that did not contain any samples, which occurs with some frequency. Once a NaN enters our running sum, the rest of the values in the LD matrix are NaN. Another issue we're seeing is that this algorithm is rather slow. It's tough to avoid with something that is inherently quadratic, but calling the summary function twice for each parent of the added/removed edge was something we were trying to avoid.

We tore things apart and simplified the algorithm so that we no longer have to do adjustments as we're adding and removing edges. This new version removes the LD contribution from all modified nodes and adds the contribution from all nodes at the end of the routine, once we know the final state of samples under each node.

In addition, we're seeing that we call the summary function ~30% fewer times, which is a nice improvement because that's the most expensive operation. In addition, we no longer need to track child samples from the edge being added. I've done a small, microbenchmark to show this improvement: here. We do see a runtime improvement, but I'm curious to know how this looks in the C implementation,

We do need to iterate over the affected nodes (which need to be stored uniquely), so I've come up with an algorithm for pulling items out of a bit array. The python version contained the get_items function for debugging, but now it'll be used in the iteration over modified nodes. I also did a small shootout (in C) for these types of algorithms and this beats linear search by ~50%.

During the validation of the original algorithm, we realized that the LD
matrix could get "poisoned" with NaN values if we attempted to make an
adjustment to a node that did not contain any samples, which occurs with
some frequency.

We tore things apart and simplified the algorithm so that we no longer
have to do adjustments as we're adding and removing edges. This new
version removes the LD contribution from all modified nodes and adds the
contribution from all nodes at the end of the routine, once we know the
final state of samples under each node.

:param row: Row from the array to list from.
:returns: A generator of integers stored in the array.
"""
lookup = [0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8, 31, 27,
13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9] # fmt: skip
Copy link
Contributor Author

@lkirk lkirk Oct 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a minimal perfect hash to obtain the set bit, given the least significant bit in the chunk. It's on-par with something I tried based on __builtin_ctz, and is much more portable

Copy link

codecov bot commented Oct 25, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 89.85%. Comparing base (9acedd2) to head (a31802c).
Report is 1 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #3043   +/-   ##
=======================================
  Coverage   89.85%   89.85%           
=======================================
  Files          29       29           
  Lines       32128    32128           
  Branches     5763     5763           
=======================================
  Hits        28868    28868           
  Misses       1859     1859           
  Partials     1401     1401           
Flag Coverage Δ
c-tests 86.69% <ø> (ø)
lwt-tests 80.78% <ø> (ø)
python-c-tests 89.05% <ø> (ø)
python-tests 98.98% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updates look pretty straightforward here @lkirk, so seem worth the effort.

I think it's important to accept the limitations of the quadratic nature of the calculation though, and I'd imagine we'd be well into diminishing returns after this batch of optimisations.

@lkirk
Copy link
Contributor Author

lkirk commented Nov 5, 2024

@jeromekelleher Agreed, thanks for taking a look. Commits are squashed and checks are passing. Anything else you need on my end?

@jeromekelleher
Copy link
Member

Nope, happy to merge if you are

@lkirk
Copy link
Contributor Author

lkirk commented Nov 5, 2024

Yes, this is ready for merge

@jeromekelleher jeromekelleher added the AUTOMERGE-REQUESTED Ask Mergify to merge this PR label Nov 5, 2024
@mergify mergify bot merged commit 73ef4cc into tskit-dev:main Nov 5, 2024
21 checks passed
@mergify mergify bot removed the AUTOMERGE-REQUESTED Ask Mergify to merge this PR label Nov 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants