SNV frequency-based substitution matrix construction for CRAM writer #325
+193
−71
This PR adds a feature to the CRAM writer that lets it construct a substitution matrix reflecting the observed frequencies of SNVs.
The change introduces a substitution matrix builder, which works similarly to the existing tag dictionary builder. The substitution matrix builder provides two main functions:
- `assign-code!`: This function is called during preprocessing for each `:subst` read feature in the records and returns a unique base substitution code (BS code) for every pair of ref and alt bases
- `build-subst-matrix`: After preprocessing, this function is called to generate the substitution matrix based on the SNV frequency information gathered by `assign-code!`
Note that since BS codes cannot be assigned until all records within a container have been scanned, `assign-code!` initially returns a unique reference (not the actual BS code) for each SNV. When `build-subst-matrix` is invoked, BS codes are assigned for each SNV, and the references are updated accordingly all at once. This implementation is intended to avoid the overhead of scanning every `:subst` read feature a second time to assign the final BS codes after the records are fully processed.

BS code assignment follows a simple frequency-based method: the most frequently observed alt base for each ref base is assigned BS code 0, the next most frequent is assigned code 1, and so forth.
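
The deferred-reference scheme and the frequency ranking can be illustrated with a minimal Python sketch. This is not the Clojure code from this PR: the names `CodeRef`, `SubstMatrixBuilder`, `assign_code`, and `build_subst_matrix` are hypothetical stand-ins, the base alphabet `ACGTN` is assumed, and breaking ties by alphabet order is an assumption rather than something stated in the description.

```python
from collections import Counter

BASES = "ACGTN"  # assumed base alphabet


class CodeRef:
    """Mutable placeholder that receives its final BS code later."""

    def __init__(self):
        self.code = None


class SubstMatrixBuilder:
    def __init__(self):
        self.counts = Counter()  # (ref, alt) -> number of observations
        self.refs = {}           # (ref, alt) -> shared CodeRef

    def assign_code(self, ref, alt):
        """Record one SNV and return a placeholder, not the final code."""
        key = (ref, alt)
        self.counts[key] += 1
        if key not in self.refs:
            self.refs[key] = CodeRef()
        return self.refs[key]

    def build_subst_matrix(self):
        """Rank alt bases per ref base and fill in every placeholder."""
        matrix = {}
        for ref in BASES:
            alts = [b for b in BASES if b != ref]
            # Most frequent alt first; sorted() is stable, so ties keep
            # their order in BASES (an assumption of this sketch).
            ranked = sorted(alts, key=lambda alt: -self.counts[(ref, alt)])
            matrix[ref] = {alt: code for code, alt in enumerate(ranked)}
            for alt, code in matrix[ref].items():
                if (ref, alt) in self.refs:
                    self.refs[(ref, alt)].code = code
        return matrix


builder = SubstMatrixBuilder()
observed = [("A", "G"), ("A", "G"), ("A", "T"), ("C", "T")]
refs = [builder.assign_code(ref, alt) for ref, alt in observed]
matrix = builder.build_subst_matrix()
print([r.code for r in refs])  # [0, 0, 1, 0]
```

Because every occurrence of the same SNV shares one placeholder, a single pass over the distinct (ref, alt) pairs is enough to update all previously returned references, which is the point of avoiding a second scan over the read features.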