CRISPR-Cas9 distance correction solver #211
base: master
Conversation
Codecov Report

Patch coverage:

Additional details and impacted files:

```diff
@@            Coverage Diff             @@
##           master     #211      +/-   ##
==========================================
- Coverage   79.05%   78.76%    -0.29%
==========================================
  Files          85       87        +2
  Lines        7080     7210      +130
==========================================
+ Hits         5597     5679       +82
- Misses       1483     1531       +48
==========================================
```

☔ View full report in Codecov by Sentry.
Thank you @sprillo for this fantastic PR! I commend you for lots of great work and appreciate your efforts, as always, to provide a clear and detailed overview of your code.

I left a series of small-ish comments that I think should be resolved before merging this in, mostly around nit-picky line lengths, plus some more substantial comments regarding the estimation of the collision probabilities.

More importantly, I am surprised by the strategy you employed to implement this functionality. I had always thought we would implement it by operating on the character matrix of a tree and producing a corrected `dissimilarity_map`, not unlike the `CassiopeiaSolver.compute_dissimilarity_map` functionality. Do you think it's potentially overkill that we have embedded this functionality so that it is only pertinent to the `CRISPRCas9DistanceCorrectionSolver`? In my eyes, it would be ideal to be able to get transformed distances for all sorts of tasks, solving lineages being one of them. Does that make sense?

Always happy to be convinced of this design if it's necessary, and thanks in advance for your thoughts.
```
Invert an increasing function.

Finds x s.t. f(x) = y. Search for x is limited to the range [lower, upper].
Uses 30 fixed iterations of binary search.
```
Why not make this a parameter?
Sounds good, I made it a parameter `num_binary_search_iters` and exposed it in the public functions that use it (`crispr_cas9_corrected_hamming_distance` and `crispr_cas9_corrected_ternary_hamming_distance`). As a note, 30 iterations gives an accuracy of 2**-31 ≈ 10^-9. I don't think we'll want to set it any higher. The only reason to set it lower would be to speed things up, and we would need to do some benchmarking to figure out if that's worth it. The contribution of the inversion procedure to the solver is O(n^2 * num_binary_search_iters). This is similar to the cost of computing the dissimilarity map, which is O(n^2 * k), and that one needed numba-izing. We could also explore other options to speed up the `_inverse` function, e.g. using some Python library.
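To make the procedure concrete, here is a minimal sketch of a fixed-iteration binary-search inversion in the spirit of the discussion above (the name `_inverse_sketch` and the loop body are illustrative, not the PR's verbatim `_inverse`):

```python
from typing import Callable


def _inverse_sketch(
    f: Callable[[float], float],
    y: float,
    lower: float,
    upper: float,
    num_binary_search_iters: int = 30,
) -> float:
    """Find x in [lower, upper] such that f(x) = y, for increasing f.

    Each iteration halves the search interval, so the error of the
    returned midpoint is (upper - lower) / 2 ** (num_binary_search_iters + 1).
    """
    for _ in range(num_binary_search_iters):
        mid = (lower + upper) / 2.0
        if f(mid) < y:  # f increasing, so the root lies to the right of mid
            lower = mid
        else:  # the root lies at or to the left of mid
            upper = mid
    return (upper + lower) / 2.0
```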
```python
    return (upper + lower) / 2.0


def _hamming_distance(
```
I believe this function might fit better under `cas.solver.dissimilarity_functions`. We already have a `hamming_distance` function that skips missing data but does not scale the final distance. It might be worthwhile to just update that function?
Although there is some duplication, I think in terms of readability it is better to have the simple versions of the Hamming distance and ternary Hamming distance here in the same file as the solver. For example, `weighted_hamming_distance` from `cas.solver.dissimilarity_functions` does more than just the ternary case, allowing for arbitrary `weights`, which are not needed here; the `hamming_distance` in `cas.solver.dissimilarity_functions` has an `ignore_missing_state` option which is not used (and it does not normalize). I agree we could extend the `hamming_distance` in `cas.solver.dissimilarity_functions` to normalize, but again I think it hampers readability. For such small components that are private to the module, I don't think code duplication is really an issue if it improves readability.

Of course, if in the future we want to use `cas.solver.dissimilarity_functions` instead, we can do it easily, since this falls into the implementation-detail category.
```
Here, by `scaled` we mean that the Hamming distance is divided by the total
number of characters considered, thus bringing it into the range [0, 2].

Here, `ternary` means that we score two mutated states that are different
```
It's weird/confusing that we're calling this concept `ternary` now (even if it is correct). Elsewhere we call this `weighted_hamming_distance` or `modified_hamming_distance`. Do you think it's confusing that here we call it "ternary"?
The ternary Hamming distance is a particular case of a weighted Hamming distance with specific weights, no? I know that by default the `weighted_hamming_distance` of `cas.solver.dissimilarity_functions` will use a ternary weighting scheme, but I don't think we want to speak of a "weighted Hamming distance" in the context of the `CRISPRCas9DistanceCorrectionSolver`. It's like using the word "polygons" to talk about "triangles". Am I missing something?
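For concreteness, a minimal sketch of the ternary scoring under discussion (names are illustrative and missing-data handling is omitted; states are assumed to be encoded as 0 = unmutated and positive integers = mutated):

```python
def _ternary_hamming_distance_sketch(s1: list, s2: list) -> float:
    """Scaled ternary Hamming distance between two character vectors.

    A mutated-vs-unmutated pair scores 1; two *different* mutated states
    score 2 (a mutation happened on each lineage). The total is divided
    by the number of characters, giving a value in [0, 2].
    """
    total = 0.0
    for c1, c2 in zip(s1, s2):
        if c1 == c2:
            continue  # identical states contribute 0
        elif c1 == 0 or c2 == 0:
            total += 1.0  # exactly one of the two states is mutated
        else:
            total += 2.0  # two different mutated states
    return total / len(s1)
```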
```
The "corrected" distance is an estimate of true tree distance.

It is assumed that the tree is ultrametric of depth 1.
```
If we keep "ternary" as the name, I would suggest defining it here too.
Added.
```
The estimated collision probability is:
    q = sum_i q_i^2
where q_i is the estimated collision probability for state i, using
(#alleles = i) / (#alleles > 0), i.e. the number of times state i appears
in the character matrix, divided by the total number of mutations.
```
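As a rough sketch of what this default estimation computes (assuming an integer-encoded pandas character matrix where 0 is the unmutated state and negative values denote missing data; the function name is hypothetical, not the PR's implementation):

```python
import pandas as pd


def _default_collision_probability_sketch(
    character_matrix: pd.DataFrame,
) -> float:
    """Estimate q = sum_i q_i^2 from empirical state frequencies.

    q_i is the frequency of mutated state i among all mutated entries
    (entries > 0); unmutated (0) and missing (< 0) entries are excluded.
    """
    states = character_matrix.values.flatten()
    mutated = states[states > 0]
    frequencies = pd.Series(mutated).value_counts(normalize=True)
    return float((frequencies ** 2).sum())
```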
I see below that this is just the fallback in case a user does not provide a collision probability, and that the user has the opportunity to define their own collision probability estimator, but I can't help but feel we can provide more support in case the state probabilities are passed into the solver. Why don't we add an extra optional argument called `priors` that will allow us to compute the more traditional collision probability for the user, and only perform this naive estimation if `priors=None`?
Currently, if the user knows the collision probability, they can just inject it into the `CRISPRCas9DistanceCorrectionSolver` using the `crispr_cas9_hardcoded_collision_probability_estimator`, as follows:

```python
from functools import partial

collision_probability_computed_from_priors = sum([p ** 2 for p in priors])
solver = CRISPRCas9DistanceCorrectionSolver(
    collision_probability_estimator=partial(
        crispr_cas9_hardcoded_collision_probability_estimator,
        collision_probability=collision_probability_computed_from_priors,
    )
)
```

This is exactly how I'm doing it for the experiments on the KP data, and I think it keeps the implementation of `CRISPRCas9DistanceCorrectionSolver` straightforward and clear. I am not a fan of conditional logic, but if you are thinking of something like this (where we use the tree's priors if they exist), I am happy to add it:
```python
class CRISPRCas9DistanceCorrectionSolver(CassiopeiaSolver):
    def __init__(
        self,
        ...,
        use_priors_to_compute_collision_probability: bool = True,
    ):
        ...
        self._use_priors_to_compute_collision_probability = (
            use_priors_to_compute_collision_probability
        )

    def solve(self, ...):
        if (
            self._use_priors_to_compute_collision_probability
            and tree.priors is not None
        ):
            # priors here being derived from tree.priors
            collision_probability = sum([p ** 2 for p in priors])
        else:
            collision_probability = collision_probability_estimator(...)
```
I guess it would do more handholding for the user, at the expense of readability. Another option is to make the `CollisionProbabilityEstimatorType` a function of the `CassiopeiaTree` instead of a function of the character matrix, and then default it to an estimator that uses the priors if they are available in the tree, and otherwise falls back to the current default. In this case, I am not a fan of taking a complex `CassiopeiaTree` object as an input, because it makes it confusing what parts of the tree are being used to do the computation. Indeed, the `CassiopeiaTree` essentially acts like a huge data clump, and therefore saying that a function takes in a `CassiopeiaTree` doesn't clarify what is being used unless you carefully read the docstring (in this case, `tree.priors` would be used). I generally prefer functions that depend on simpler objects; they also require less context to be tested and are more reusable.
```python
    mutation_proportion_estimator: MutationProportionEstimatorType = crispr_cas9_default_mutation_proportion_estimator,  # noqa
    collision_probability_estimator: CollisionProbabilityEstimatorType = crispr_cas9_default_collision_probability_estimator,  # noqa
    distance_corrector_name: str = "crispr_cas9_corrected_ternary_hamming_distance",  # noqa
    distance_solver: DistanceSolver = NeighborJoiningSolver(add_root=True),
```
All of these lines can be formatted to be <80 characters long, I believe.
I had been formatting my code with `isort code.py && black --line-length 80 code.py`, and if I added line breaks to these arguments, `black` formatted them back to one long line each. But now I flanked the code with `# fmt: off` and `# fmt: on` so that `black` won't do this. Now all lines should be <= 80 characters, with the caveat that the `# fmt: off` and `# fmt: on` comments are there.
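For readers unfamiliar with the directive, `# fmt: off` / `# fmt: on` make `black` leave the enclosed region untouched; a minimal illustration with a hypothetical function:

```python
# fmt: off
def make_solver(
    distance_corrector_name: str = "crispr_cas9_corrected_ternary_hamming_distance",  # noqa
):
    # black will not reflow the signature above back onto one line,
    # because it sits between the fmt: off / fmt: on markers.
    ...
# fmt: on
```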
@mattjones315 I think [...]
Hi @sprillo -- apologies for the delay getting back to this. Also apologies for the miscommunication, I had meant [...]. As for the interface that I had imagined, I could see us using compositional programming within the [...].

In fact, currently the [...]. So, while I'm not convinced either of the two approaches above is categorically better / simpler than your proposed implementation, I wonder if you might be able to convince me that my two solutions are non-optimal. Thanks in advance for the discussion!
Implement distance correction scheme for the CRISPR-Cas9 model.

The class implementing this method is `CRISPRCas9DistanceCorrectionSolver` in the `cassiopeia/solver/distance_correction/_crispr_cas9_distance_correction_solver.py` module. (The `distance_correction` subpackage is meant to contain any distance correction methods that might be implemented in the future, possibly for models other than CRISPR-Cas9.)

The solver composes together four steps: (1) mutation proportion estimation, (2) collision probability estimation, (3) distance correction with the estimated mutation proportion and collision probability, and (4) tree topology reconstruction using the corrected distances. The solver is parameterized by these four steps; a usage sketch follows below. Note: due to Numba compilation issues in the `DistanceSolver`, the function performing the third step is not injected into the solver, but rather determined by a string identifier specifying the function name.
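A minimal sketch of how the composed solver might be used (the import path for the solver class and the toy character matrix are assumptions for illustration, following the module layout described above; the constructor defaults follow the signature quoted earlier in this thread):

```python
import pandas as pd

import cassiopeia as cas
from cassiopeia.solver import NeighborJoiningSolver
from cassiopeia.solver.distance_correction import (
    CRISPRCas9DistanceCorrectionSolver,
)

# Toy character matrix: rows are cells, columns are cut sites;
# 0 = unmutated, positive integers = mutated states.
character_matrix = pd.DataFrame(
    [[1, 0, 2], [1, 0, 0], [0, 3, 2], [0, 3, 0]],
    index=["c1", "c2", "c3", "c4"],
)

# Each of the four steps is injectable; these are the defaults.
solver = CRISPRCas9DistanceCorrectionSolver(
    distance_corrector_name="crispr_cas9_corrected_ternary_hamming_distance",
    distance_solver=NeighborJoiningSolver(add_root=True),
)

tree = cas.data.CassiopeiaTree(character_matrix=character_matrix)
solver.solve(tree)  # reconstructs the topology from corrected distances
```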
In the code, I declared some types to improve readability at select places. Underscores are used to denote functions or classes that are internal and not exposed to users, mostly to improve legibility.

Tests should be quite comprehensive. However, some tests are marked as slow, since they require simulation (they take ~30s on my machine). To run the slow tests, use the `--runslow` flag, e.g. `python -m pytest test/solver_tests/distance_correction_tests --runslow`. (In particular, CodeCov complains, but coverage with the slow tests is good.)