CRISPR-Cas9 distance correction solver #211
base: master
Conversation
Codecov Report

Patch coverage:

Additional details and impacted files:

```diff
@@            Coverage Diff             @@
##           master     #211      +/-   ##
==========================================
- Coverage   79.05%   78.76%    -0.29%
==========================================
  Files          85       87        +2
  Lines        7080     7210      +130
==========================================
+ Hits         5597     5679       +82
- Misses       1483     1531       +48
==========================================
```

☔ View full report in Codecov by Sentry.
Thank you @sprillo for this fantastic PR! I commend you for lots of great work and appreciate your efforts, as always, to provide a clear and detailed overview of your code.

I left a series of small-ish comments that I think should be resolved before merging this in, mostly around nit-picky line lengths, plus some more substantial comments regarding the estimation of the collision probabilities.

More importantly, I am surprised by the strategy you employed to implement this functionality. I had always thought we would implement it by operating on the character matrix of a tree and producing a corrected `dissimilarity_map`, not unlike the `CassiopeiaSolver.compute_dissimilarity_map` functionality. Do you think it's potentially overkill that we have embedded this functionality so that it is only pertinent to the `CRISPRCas9DistanceCorrectionSolver`? In my eyes, it would be ideal to be able to get transformed distances for all sorts of tasks, solving lineages being one of them. Does that make sense?

Always happy to be convinced of this design if it's necessary, and thanks in advance for your thoughts.
```
Invert an increasing function.

Finds x s.t. f(x) = y. Search for x is limited to the range [lower, upper].
Uses 30 fixed iterations of binary search.
```
Why not make this a parameter?
Sounds good, I made it a parameter `num_binary_search_iters` and exposed it in the public functions that use it (`crispr_cas9_corrected_hamming_distance` and `crispr_cas9_corrected_ternary_hamming_distance`). As a note, 30 iterations gives an accuracy of 2**-31 ≈ 10^-9. I don't think we'll want to set it any higher. The only reason to set it lower would be to speed things up, and we would need to do some benchmarking to figure out if that's worth it. The contribution of the inversion procedure to the solver is O(n^2 * num_binary_search_iters). This is similar to the cost of computing the dissimilarity map, which is O(n^2 * k), and that one needed numba-izing. We could also explore other options to speed up the `_inverse` function, e.g. using some Python library.
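To make the procedure concrete, here is a minimal sketch of a fixed-iteration binary-search inversion in the spirit of the discussion above (the name `_inverse_sketch` and the loop body are illustrative, not the PR's verbatim `_inverse`):

```python
from typing import Callable


def _inverse_sketch(
    f: Callable[[float], float],
    y: float,
    lower: float,
    upper: float,
    num_binary_search_iters: int = 30,
) -> float:
    """Find x in [lower, upper] such that f(x) = y, for increasing f.

    Each iteration halves the search interval, so the error of the
    returned midpoint is (upper - lower) / 2 ** (num_binary_search_iters + 1).
    """
    for _ in range(num_binary_search_iters):
        mid = (lower + upper) / 2.0
        if f(mid) < y:  # f increasing, so the root lies to the right of mid
            lower = mid
        else:  # the root lies at or to the left of mid
            upper = mid
    return (upper + lower) / 2.0
```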
```python
    return (upper + lower) / 2.0


def _hamming_distance(
```
I believe this function might fit better under `cas.solver.dissimilarity_functions`. We already have a `hamming_distance` function that skips missing data but does not scale the final distance. It might be worthwhile to just update that function?
Although there is some duplication, I think in terms of readability it is better to have the simple versions of the Hamming distance and ternary Hamming distance here in the same file as the solver. For example, `weighted_hamming_distance` from `cas.solver.dissimilarity_functions` does more than just the ternary case, allowing for arbitrary `weights`, which are not needed here; the `hamming_distance` in `cas.solver.dissimilarity_functions` has an `ignore_missing_state` option which is not used (and it does not normalize). I agree we could extend the `hamming_distance` in `cas.solver.dissimilarity_functions` to normalize, but again I think it hampers readability. For such small components that are private to the module, I don't think code duplication is really an issue if it improves readability.

Of course, if in the future we want to use `cas.solver.dissimilarity_functions` instead, we can do it easily, since this falls into the implementation-detail category.
```
Here, by `scaled` we mean that the Hamming distance is divided by the total
number of characters considered, thus bringing it into the range [0, 2].

Here, `ternary` means that we score two mutated states that are different
```
It's weird/confusing that we're calling this concept `ternary` now (even if it is correct). Elsewhere we call this `weighted_hamming_distance` or `modified_hamming_distance`. Do you think it's confusing that here we call it "ternary"?
The ternary Hamming distance is a particular case of a weighted Hamming distance with specific weights, no? I know that by default the `weighted_hamming_distance` of `cas.solver.dissimilarity_functions` will use a ternary weighting scheme, but I don't think we want to speak of a "weighted Hamming distance" in the context of the `CRISPRCas9DistanceCorrectionSolver`. It's like using the word "polygons" to talk about "triangles". Am I missing something?
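For concreteness, a minimal sketch of the ternary scoring under discussion (names are illustrative and missing-data handling is omitted; states are assumed to be encoded as 0 = unmutated and positive integers = mutated):

```python
def _ternary_hamming_distance_sketch(s1: list, s2: list) -> float:
    """Scaled ternary Hamming distance between two character vectors.

    A mutated-vs-unmutated pair scores 1; two *different* mutated states
    score 2 (a mutation happened on each lineage). The total is divided
    by the number of characters, giving a value in [0, 2].
    """
    total = 0.0
    for c1, c2 in zip(s1, s2):
        if c1 == c2:
            continue  # identical states contribute 0
        elif c1 == 0 or c2 == 0:
            total += 1.0  # exactly one of the two states is mutated
        else:
            total += 2.0  # two different mutated states
    return total / len(s1)
```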
```
The "corrected" distance is an estimate of true tree distance.

It is assumed that the tree is ultrametric of depth 1.
```
If we keep "ternary" as the name, I would suggest defining it here too.
Added.
```
The estimated collision probability is:
    q = sum_i q_i^2
where q_i is the estimated collision probability for state i, using
(#alleles = i) / (#alleles > 0), i.e. the number of times state i appears
in the character matrix, divided by the total number of mutations.
```
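As a rough sketch of what this default estimation computes (assuming an integer-encoded pandas character matrix where 0 is the unmutated state and negative values denote missing data; the function name is hypothetical, not the PR's implementation):

```python
import pandas as pd


def _default_collision_probability_sketch(
    character_matrix: pd.DataFrame,
) -> float:
    """Estimate q = sum_i q_i^2 from empirical state frequencies.

    q_i is the frequency of mutated state i among all mutated entries
    (entries > 0); unmutated (0) and missing (< 0) entries are excluded.
    """
    states = character_matrix.values.flatten()
    mutated = states[states > 0]
    frequencies = pd.Series(mutated).value_counts(normalize=True)
    return float((frequencies ** 2).sum())
```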
I see below that this is just the fallback in case a user does not provide a collision probability, and that the user has the opportunity to define their own collision probability estimator, but I can't help but feel we can provide more support in case the state probabilities are passed into the solver. Why don't we add an extra optional argument called `priors` that will allow us to compute the more traditional collision probability for the user, and only perform this naive estimation if `priors=None`?
Currently, if the user knows the collision probability, they can just inject it into the `CRISPRCas9DistanceCorrectionSolver` using the `crispr_cas9_hardcoded_collision_probability_estimator`, as follows:

```python
from functools import partial

collision_probability_computed_from_priors = sum([p ** 2 for p in priors])
solver = CRISPRCas9DistanceCorrectionSolver(
    collision_probability_estimator=partial(
        crispr_cas9_hardcoded_collision_probability_estimator,
        collision_probability=collision_probability_computed_from_priors,
    )
)
```

This is exactly how I'm doing it for the experiments on the KP data, and I think it keeps the implementation of `CRISPRCas9DistanceCorrectionSolver` straightforward and clear. I am not a fan of conditional logic, but if you are thinking of something like this (where we use the tree's priors if they exist), I am happy to add it:
```python
class CRISPRCas9DistanceCorrectionSolver(CassiopeiaSolver):
    def __init__(
        self,
        ...,
        use_priors_to_compute_collision_probability: bool = True,
    ):
        ...
        self._use_priors_to_compute_collision_probability = (
            use_priors_to_compute_collision_probability
        )

    def solve(self, ...):
        if (
            self._use_priors_to_compute_collision_probability
            and tree.priors is not None
        ):
            # priors here being derived from tree.priors
            collision_probability = sum([p ** 2 for p in priors])
        else:
            collision_probability = collision_probability_estimator(...)
```
I guess it would do more handholding for the user, at the expense of readability. Another option is to make the `CollisionProbabilityEstimatorType` a function of the `CassiopeiaTree` instead of a function of the character matrix, and then default it to an estimator that uses the priors if they are available in the tree, and otherwise falls back to the current default. In this case, I am not a fan of taking a complex `CassiopeiaTree` object as an input, because it makes it confusing what parts of the tree are being used to do the computation. Indeed, the `CassiopeiaTree` essentially acts like a huge data clump, and therefore saying that a function takes in a `CassiopeiaTree` doesn't clarify what is being used unless you carefully read the docstring (in this case, `tree.priors` would be used). I generally prefer functions that depend on simpler objects; they also require less context to be tested and are more reusable.
```python
    mutation_proportion_estimator: MutationProportionEstimatorType = crispr_cas9_default_mutation_proportion_estimator,  # noqa
    collision_probability_estimator: CollisionProbabilityEstimatorType = crispr_cas9_default_collision_probability_estimator,  # noqa
    distance_corrector_name: str = "crispr_cas9_corrected_ternary_hamming_distance",  # noqa
    distance_solver: DistanceSolver = NeighborJoiningSolver(add_root=True),
```
All of these lines can be formatted to be <80 characters long, I believe.
I had been formatting my code with `isort code.py && black --line-length 80 code.py`, and if I added line breaks to these arguments, `black` formatted them back to one long line each. But now I flanked the code with `# fmt: off` and `# fmt: on` so that `black` won't do this. Now all lines should be <= 80 characters, with the caveat that the `# fmt: off` and `# fmt: on` comments are there.
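For readers unfamiliar with the directive, `# fmt: off` / `# fmt: on` make `black` leave the enclosed region untouched; a minimal illustration with a hypothetical function:

```python
# fmt: off
def make_solver(
    distance_corrector_name: str = "crispr_cas9_corrected_ternary_hamming_distance",  # noqa
):
    # black will not reflow the signature above back onto one line,
    # because it sits between the fmt: off / fmt: on markers.
    ...
# fmt: on
```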
@mattjones315 I think [...]
Hi @sprillo -- apologies for the delay getting back to this. Also apologies for the miscommunication, I had meant [...]. As for the interface that I had imagined, I could see us using compositional programming within the [...].

In fact, currently the [...]. So, while I'm not convinced either of the two approaches above is categorically better / simpler than your proposed implementation, I wonder if you might be able to convince me that my two solutions are non-optimal. Thanks in advance for the discussion!
Implement distance correction scheme for the CRISPR-Cas9 model.

The class implementing this method is `CRISPRCas9DistanceCorrectionSolver` in the `cassiopeia/solver/distance_correction/_crispr_cas9_distance_correction_solver.py` module. (The `distance_correction` subpackage is meant to contain any distance correction methods that might be implemented in the future, possibly for models other than CRISPR-Cas9.)

The solver composes together four steps: (1) mutation proportion estimation, (2) collision probability estimation, (3) distance correction with the estimated mutation proportion and collision probability, and (4) tree topology reconstruction using the corrected distances. The solver is parameterized by these four steps; a usage sketch follows below. Note: due to Numba compilation issues in the `DistanceSolver`, the function performing the third step is not injected into the solver, but rather determined by a string identifier specifying the function name.
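A minimal sketch of how the composed solver might be used (the import path for the solver class and the toy character matrix are assumptions for illustration, following the module layout described above; the constructor defaults follow the signature quoted earlier in this thread):

```python
import pandas as pd

import cassiopeia as cas
from cassiopeia.solver import NeighborJoiningSolver
from cassiopeia.solver.distance_correction import (
    CRISPRCas9DistanceCorrectionSolver,
)

# Toy character matrix: rows are cells, columns are cut sites;
# 0 = unmutated, positive integers = mutated states.
character_matrix = pd.DataFrame(
    [[1, 0, 2], [1, 0, 0], [0, 3, 2], [0, 3, 0]],
    index=["c1", "c2", "c3", "c4"],
)

# Each of the four steps is injectable; these are the defaults.
solver = CRISPRCas9DistanceCorrectionSolver(
    distance_corrector_name="crispr_cas9_corrected_ternary_hamming_distance",
    distance_solver=NeighborJoiningSolver(add_root=True),
)

tree = cas.data.CassiopeiaTree(character_matrix=character_matrix)
solver.solve(tree)  # reconstructs the topology from corrected distances
```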
In the code, I declared some types to improve readability at select places. Underscores are used to denote functions or classes that are internal and not exposed to users, mostly to improve legibility.

Tests should be quite comprehensive. However, some tests are marked as slow, since they require simulation (they take ~30s on my machine). To run the slow tests, use the `--runslow` flag, e.g. `python -m pytest test/solver_tests/distance_correction_tests --runslow`. (In particular, CodeCov complains, but coverage with the slow tests is good.)