-
Notifications
You must be signed in to change notification settings - Fork 73
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Updates to two-locus branch stats (python) #3043
Conversation
During the validation of the original algorithm, we realized that the LD matrix could get "poisoned" with NaN values if we attempted to make an adjustment to a node that did not contain any samples, which occurs with some frequency. We tore things apart and simplified the algorithm so that we no longer have to do adjustments as we're adding and removing edges. This new version removes the LD contribution from all modified nodes and adds the contribution from all nodes at the end of the routine, once we know the final state of samples under each node.
|
||
:param row: Row from the array to list from. | ||
:returns: A generator of integers stored in the array. | ||
""" | ||
lookup = [0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8, 31, 27, | ||
13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9] # fmt: skip |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a minimal perfect hash to obtain the set bit, given the least significant bit in the chunk. It's on-par with something I tried based on __builtin_ctz
, and is much more portable
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #3043 +/- ##
=======================================
Coverage 89.85% 89.85%
=======================================
Files 29 29
Lines 32128 32128
Branches 5763 5763
=======================================
Hits 28868 28868
Misses 1859 1859
Partials 1401 1401
Flags with carried forward coverage won't be shown. Click here to find out more. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updates look pretty straightforward here @lkirk, so seem worth the effort.
I think it's important to accept the limitations of the quadratic nature of the calculation though, and I'd imagine we'd be well into diminishing returns after this batch of optimisations.
@jeromekelleher Agreed, thanks for taking a look. Commits are squashed and checks are passing. Anything else you need on my end? |
Nope, happy to merge if you are |
Yes, this is ready for merge |
During the validation of the original branch LD algorithm, we realized that the LD matrix could get "poisoned" with NaN values if we attempted to make an adjustment to a node that did not contain any samples, which occurs with some frequency. Once a NaN enters our running sum, the rest of the values in the LD matrix are NaN. Another issue we're seeing is that this algorithm is rather slow. It's tough to avoid with something that is inherently quadratic, but calling the summary function twice for each parent of the added/removed edge was something we were trying to avoid.
We tore things apart and simplified the algorithm so that we no longer have to do adjustments as we're adding and removing edges. This new version removes the LD contribution from all modified nodes and adds the contribution from all nodes at the end of the routine, once we know the final state of samples under each node.
In addition, we're seeing that we call the summary function ~30% fewer times, which is a nice improvement because that's the most expensive operation. In addition, we no longer need to track child samples from the edge being added. I've done a small, microbenchmark to show this improvement: here. We do see a runtime improvement, but I'm curious to know how this looks in the C implementation,
We do need to iterate over the affected nodes (which need to be stored uniquely), so I've come up with an algorithm for pulling items out of a bit array. The python version contained the
get_items
function for debugging, but now it'll be used in the iteration over modified nodes. I also did a small shootout (in C) for these types of algorithms and this beats linear search by ~50%.