Updates to two-locus branch stats (python) #3043

lkirk · 2024-10-25T09:44:16Z

During validation of the original branch LD algorithm, we realized that the LD matrix could get "poisoned" with NaN values if we attempted to make an adjustment to a node that did not contain any samples, which happens fairly often. Once a NaN enters our running sum, the rest of the values in the LD matrix are NaN. The other issue we're seeing is that this algorithm is rather slow. That's tough to avoid with something that is inherently quadratic, but we wanted to stop calling the summary function twice for each parent of the added/removed edge.

We tore things apart and simplified the algorithm so that we no longer have to do adjustments as we're adding and removing edges. This new version removes the LD contribution from all modified nodes and adds the contribution from all nodes at the end of the routine, once we know the final state of samples under each node.
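As a rough illustration of the deferred update described above, here is a toy sketch for a scalar statistic rather than the full two-locus matrix. It is not the code in this PR: `summary`, `ancestors`, and `update` are hypothetical names, and the per-node contribution is simplified to a function of the sample count only. The old contribution of each modified node is removed once, before its count changes, and the new contributions are added only after all edges have been processed.

```python
def summary(count, n):
    """Hypothetical stand-in for the expensive summary function."""
    return count * (n - count)


def ancestors(parent, u):
    """Yield u and each of its ancestors, following the parent array."""
    while u != -1:
        yield u
        u = parent[u]


def update(total, counts, parent, removed, inserted, n):
    """Apply (parent, child) edge removals and insertions to a running total."""
    modified = set()
    for p, c in removed:
        for u in ancestors(parent, p):
            if u not in modified:
                modified.add(u)
                total -= summary(counts[u], n)  # remove the old contribution once
            counts[u] -= counts[c]
        parent[c] = -1
    for p, c in inserted:
        parent[c] = p
        for u in ancestors(parent, p):
            if u not in modified:
                modified.add(u)
                total -= summary(counts[u], n)
            counts[u] += counts[c]
    # Add contributions back only now that the final sample counts are known,
    # skipping nodes that ended up with no samples beneath them.
    for u in modified:
        if counts[u] > 0:
            total += summary(counts[u], n)
    return total


# Four samples (0-3), internal nodes 4 and 5, root 6; move sample 1 from 4 to 5.
parent = [4, 4, 5, 5, 6, 6, -1]
counts = [1, 1, 1, 1, 2, 2, 4]
total = sum(summary(k, 4) for k in counts)
total = update(total, counts, parent, removed=[(4, 1)], inserted=[(5, 1)], n=4)
```

In this toy the summary function is called at most twice per modified node regardless of how many edges touch it, which is the property the simplification is after.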

We're also seeing that we call the summary function ~30% fewer times, which is a nice improvement because that's the most expensive operation, and we no longer need to track child samples from the edge being added. I've done a small microbenchmark to show this improvement: here. We do see a runtime improvement, but I'm curious to know how this looks in the C implementation.

We do need to iterate over the affected nodes (which need to be stored uniquely), so I've come up with an algorithm for pulling items out of a bit array. The Python version already contained the get_items function for debugging, but now it will be used to iterate over the modified nodes. I also did a small shootout (in C) for these types of algorithms, and this beats linear search by ~50%.
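For readers unfamiliar with the idea, a minimal sketch of storing node IDs uniquely in a bit array and pulling them back out might look like the following. The names `CHUNK_BITS`, `add_item`, and this standalone `get_items` are illustrative assumptions, not the implementation in the test suite, and the bit position is found with Python's `int.bit_length` rather than the lookup table used in this PR.

```python
CHUNK_BITS = 32  # assumed chunk width for this sketch


def add_item(chunks, item):
    """Set the bit for item; adding the same item twice has no extra effect."""
    chunks[item // CHUNK_BITS] |= 1 << (item % CHUNK_BITS)


def get_items(chunks):
    """Yield the stored integers in increasing order."""
    for i, chunk in enumerate(chunks):
        offset = i * CHUNK_BITS
        while chunk:
            lsb = chunk & -chunk  # isolate the least significant set bit
            yield offset + lsb.bit_length() - 1
            chunk ^= lsb  # clear that bit and continue


modified = [0] * 3  # room for node IDs 0..95
for node in (40, 5, 64, 5):
    add_item(modified, node)
items = list(get_items(modified))
```

Only the set bits are visited, so the cost scales with the number of stored items plus the number of chunks, rather than with every possible bit position.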

lkirk · 2024-10-25T09:45:50Z

python/tests/test_ld_matrix.py


        :param row: Row from the array to list from.
        :returns: A generator of integers stored in the array.
        """
+        lookup = [0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8, 31, 27,
+                  13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9]  # fmt: skip


This is a minimal perfect hash to obtain the index of the set bit, given the least significant bit in the chunk. It's on par with something I tried based on __builtin_ctz, and is much more portable.
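A hedged sketch of how a table like this is typically used: the values match the well-known 32-bit de Bruijn lookup, so the sketch assumes the multiplier `0x077CB531` and a shift of 27, neither of which appears in the diff excerpt above. The function names are illustrative only.

```python
# Table copied from the diff above.
LOOKUP = [0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8, 31, 27,
          13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9]
DEBRUIJN = 0x077CB531  # assumed multiplier, not shown in the excerpt


def lowest_set_bit_index(chunk):
    """Index of the least significant set bit of a nonzero 32-bit chunk."""
    lsb = chunk & -chunk  # isolate the least significant set bit
    # Emulate 32-bit unsigned multiplication, then keep the top 5 bits.
    return LOOKUP[((lsb * DEBRUIJN) & 0xFFFFFFFF) >> 27]


def set_bits(chunk):
    """Yield the indexes of all set bits in a 32-bit chunk, lowest first."""
    while chunk:
        yield lowest_set_bit_index(chunk)
        chunk &= chunk - 1  # clear the lowest set bit


bits = list(set_bits(0b101001))
```

Each isolated bit is a power of two, so the multiplication shifts the constant left and the top five bits form a distinct index into the table for each of the 32 positions.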

codecov · 2024-10-25T09:49:22Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 89.85%. Comparing base (9acedd2) to head (a31802c).
Report is 1 commits behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #3043   +/-   ##
=======================================
  Coverage   89.85%   89.85%           
=======================================
  Files          29       29           
  Lines       32128    32128           
  Branches     5763     5763           
=======================================
  Hits        28868    28868           
  Misses       1859     1859           
  Partials     1401     1401

Flag	Coverage Δ
c-tests	`86.69% <ø> (ø)`
lwt-tests	`80.78% <ø> (ø)`
python-c-tests	`89.05% <ø> (ø)`
python-tests	`98.98% <ø> (ø)`

Flags with carried forward coverage won't be shown.

jeromekelleher

Updates look pretty straightforward here @lkirk, so seem worth the effort.

I think it's important to accept the limitations of the quadratic nature of the calculation though, and I'd imagine we'd be well into diminishing returns after this batch of optimisations.

lkirk · 2024-11-05T17:02:39Z

@jeromekelleher Agreed, thanks for taking a look. Commits are squashed and checks are passing. Anything else you need on my end?

jeromekelleher · 2024-11-05T19:34:51Z

Nope, happy to merge if you are

lkirk · 2024-11-05T19:40:55Z

Yes, this is ready for merge

lkirk requested a review from jeromekelleher October 25, 2024 09:55

jeromekelleher reviewed Nov 5, 2024

jeromekelleher approved these changes Nov 5, 2024

jeromekelleher added the AUTOMERGE-REQUESTED Ask Mergify to merge this PR label Nov 5, 2024

mergify bot merged commit 73ef4cc into tskit-dev:main Nov 5, 2024
21 checks passed

mergify bot removed the AUTOMERGE-REQUESTED Ask Mergify to merge this PR label Nov 5, 2024
