SNV frequency-based substitution matrix construction for CRAM writer #325

Merged
merged 1 commit into from
Oct 22, 2024

Member

@athos athos commented Oct 17, 2024

This PR introduces a new feature to the CRAM writer, allowing it to construct a substitution matrix that takes into account the observed frequencies of SNVs.

The change introduces a substitution matrix builder, which works similarly to the existing tag dictionary builder. The substitution matrix builder provides two main functions:

  • assign-code!: Called during preprocessing for each :subst read feature in the records. It returns a unique base substitution code (BS code) for every pair of ref and alt bases.
  • build-subst-matrix: Called after preprocessing to generate the substitution matrix from the SNV frequency information gathered by assign-code!.

Note that BS codes cannot be assigned until all records within a container have been scanned, so assign-code! initially returns a unique reference (not the actual BS code) for each SNV. When build-subst-matrix is invoked, a BS code is assigned to each SNV and all the references are updated at once. This avoids the overhead of scanning every :subst read feature a second time to assign the final BS codes after the records have been fully processed.
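
The deferred-reference idea can be illustrated with a minimal sketch. It is written in Python for illustration only and is not the cljam implementation: the names `CodeRef` and `SubstMatrixBuilder`, the use of a mutable object as the reference, and the dict-shaped features are all assumptions, and the ranking is simplified to the observed alt bases.

```python
from collections import Counter, defaultdict


class CodeRef:
    """Hypothetical mutable placeholder shared by every feature with the same (ref, alt)."""

    def __init__(self):
        self.code = None  # stays None until build() fills it in


class SubstMatrixBuilder:
    """Hypothetical builder: counts SNVs first, resolves BS codes later."""

    def __init__(self):
        self._counts = defaultdict(Counter)  # ref base -> Counter of alt bases
        self._refs = {}                      # (ref, alt) -> shared CodeRef

    def assign_code(self, ref, alt):
        # Record one observation and hand back the shared placeholder.
        self._counts[ref][alt] += 1
        return self._refs.setdefault((ref, alt), CodeRef())

    def build(self):
        # Rank observed alts per ref base by descending count and update
        # every placeholder in place, so no feature has to be revisited.
        matrix = {}
        for ref, alts in self._counts.items():
            matrix[ref] = {}
            for code, (alt, _count) in enumerate(alts.most_common()):
                self._refs[(ref, alt)].code = code
                matrix[ref][alt] = code
        return matrix


builder = SubstMatrixBuilder()
snvs = [("A", "G"), ("A", "C"), ("A", "G"), ("C", "T")]
features = [{"type": "subst", "code": builder.assign_code(r, a)} for r, a in snvs]

pending = [f["code"].code for f in features]   # codes are not known yet
matrix = builder.build()
resolved = [f["code"].code for f in features]  # same features, now resolved
```

Because both A>G features hold the same placeholder object, a single update in `build` resolves both of them without a second pass over the features.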

BS code assignment follows a simple frequency-based method: the most frequently observed alt base for each ref base is assigned BS code 0, the next most frequent is assigned code 1, and so forth.
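
As a rough illustration of this ranking (in Python, not the project's code), the sketch below assigns codes 0 to 3 to the four possible alt bases of each ref base among A, C, G, T and N. The function name `assign_bs_codes`, the tie-breaking rule, and the handling of unobserved alt bases (they take the remaining codes in fixed base order) are assumptions that this PR description does not specify.

```python
from collections import Counter

BASES = "ACGTN"


def assign_bs_codes(snv_counts):
    """Map each (ref, alt) pair to a BS code, most frequent alt first.

    snv_counts is a Counter keyed by (ref, alt); missing pairs count as 0.
    """
    codes = {}
    for ref in BASES:
        alts = [b for b in BASES if b != ref]
        # Descending count; ties and unobserved alts fall back to BASES order
        # (an assumed rule, chosen only to make the result deterministic).
        alts.sort(key=lambda alt: (-snv_counts[(ref, alt)], BASES.index(alt)))
        for code, alt in enumerate(alts):
            codes[(ref, alt)] = code
    return codes


counts = Counter({("A", "G"): 10, ("A", "C"): 3, ("A", "T"): 3, ("C", "T"): 7})
codes = assign_bs_codes(counts)
```

With these counts, G is the most frequent alt for ref A and so receives code 0, while C and T tie and are ordered by the fallback rule.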

@athos athos self-assigned this Oct 17, 2024
@athos athos requested review from alumi and a team as code owners October 17, 2024 02:14
@athos athos requested review from r6eve and removed request for a team October 17, 2024 02:14

codecov bot commented Oct 17, 2024

Codecov Report

Attention: Patch coverage is 91.30435% with 4 lines in your changes missing coverage. Please review.

Project coverage is 89.96%. Comparing base (943f181) to head (b07d48d).
Report is 2 commits behind head on master.

Files with missing lines | Patch % | Lines
src/cljam/io/cram/encode/subst_matrix.clj | 89.74% | 0 Missing and 4 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #325      +/-   ##
==========================================
- Coverage   89.97%   89.96%   -0.01%     
==========================================
  Files         102      103       +1     
  Lines        9236     9271      +35     
  Branches      481      485       +4     
==========================================
+ Hits         8310     8341      +31     
  Misses        445      445              
- Partials      481      485       +4     

Member

@alumi alumi left a comment

Thank you for implementing the feature! LGTM 👍

@athos athos force-pushed the feature/freq-based-subst-mat branch from b79ae91 to b07d48d Compare October 18, 2024 06:14
Contributor

@r6eve r6eve left a comment

LGTM including the comments.

@r6eve r6eve merged commit fb6193b into master Oct 22, 2024
17 checks passed
@r6eve r6eve deleted the feature/freq-based-subst-mat branch October 22, 2024 06:56
3 participants