Releases: ppillot/biomsalign
v0.3.3
v0.3.2
v0.3.1
Fixes
- Optional arguments were no longer honored because of the new minimization process.
- Kmer-based distances are now computed by taking into account both common Kmers and commonly missing Kmers, in conformity with the Simple Matching Distance (as opposed to the Tanimoto Distance used previously). Kmer-based distances do not seem suitable for comparing unrelated sequences of varying lengths. By also taking into account the Kmers that are commonly absent, these sequences are relatively disfavoured instead of appearing more related because of their size.
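The difference between the two distances can be illustrated with a small Python sketch. It is not taken from the library: the function names, the presence/absence fingerprint over every possible Kmer, and the nucleotide alphabet are all assumptions made for illustration.

```python
from itertools import product


def kmer_fingerprint(seq: str, k: int, alphabet: str = "ACGT") -> list[bool]:
    """Presence/absence vector over every possible k-mer of the alphabet."""
    present = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    return ["".join(p) in present for p in product(alphabet, repeat=k)]


def tanimoto_distance(a: list[bool], b: list[bool]) -> float:
    """1 - |common k-mers| / |k-mers present in at least one sequence|."""
    both = sum(x and y for x, y in zip(a, b))
    either = sum(x or y for x, y in zip(a, b))
    return 1.0 - both / either if either else 0.0


def simple_matching_distance(a: list[bool], b: list[bool]) -> float:
    """1 - (common k-mers + commonly absent k-mers) / (all possible k-mers)."""
    agree = sum(x == y for x, y in zip(a, b))
    return 1.0 - agree / len(a)
```

The only difference is the set of positions that count as a match: the Tanimoto distance ignores Kmers absent from both sequences, whereas the Simple Matching Distance counts them as agreement.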
v0.3.0
What's Changed
- The distance matrix for global multiple sequence alignment, based on Kmer fingerprinting, has been improved. The distance computation uses a Jaccard similarity score and is corrected for sequence length by estimating a background matching probability.
Fixes
- Kmer computation from nucleic sequences in the distance matrix function
Full Changelog: v0.2.0-beta...v0.3.0
Diagonal based alignment improvements
Diagonal filtering improvements
Correctness and overall speed have been improved:
- A special case of longest path was ignored
- A better definition of the optimal path has enabled additional filtering, thus reducing the search space and time complexity
- A better implementation of path storage, using a traceback vector, has reduced the memory footprint
- A garbage-collection-like method has been added to reduce the number of iterations dedicated to maintaining the path collections
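As an illustration of path storage with a traceback vector, here is a minimal Python sketch of chaining colinear diagonals. It is not the library's implementation: the `Diagonal` type, the `best_chain` function, and the scoring (total matched length) are hypothetical, and none of the filtering or garbage collection described above is attempted.

```python
from typing import NamedTuple


class Diagonal(NamedTuple):
    start_a: int  # start position in sequence A
    start_b: int  # start position in sequence B
    length: int   # number of matched residues


def best_chain(diagonals: list[Diagonal]) -> list[Diagonal]:
    """Return the colinear, non-overlapping chain with the largest total length."""
    diags = sorted(diagonals)  # sorted by start_a, so predecessors come first
    n = len(diags)
    if n == 0:
        return []
    score = [d.length for d in diags]  # best chain score ending at each diagonal
    traceback = [-1] * n               # index of the predecessor, -1 if none
    for j in range(n):
        cur = diags[j]
        for i in range(j):
            prev = diags[i]
            compatible = (prev.start_a + prev.length <= cur.start_a
                          and prev.start_b + prev.length <= cur.start_b)
            if compatible and score[i] + cur.length > score[j]:
                score[j] = score[i] + cur.length
                traceback[j] = i
    # Walk the traceback vector from the best end point back to the start.
    end = max(range(n), key=score.__getitem__)
    chain = []
    while end != -1:
        chain.append(diags[end])
        end = traceback[end]
    chain.reverse()
    return chain
```

Storing a single predecessor index per diagonal keeps memory linear in the number of diagonals, instead of keeping a full copy of every candidate path.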
Other fixes
- A common case for stale kmers in the minimizing window was not taken into account
- In the regular Multiple Sequence Alignment procedure, a bug was preventing the detection of gap openings. The formula for computing gap opening/closing penalties has been fixed.
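To show where stale kmers arise in a minimizing window, here is a minimal Python sketch of a sliding-window minimizer. It is an assumption-laden illustration, not the library's code: the `minimizers` function is hypothetical, and kmers are ordered lexicographically for simplicity.

```python
from collections import deque


def minimizers(seq: str, k: int, w: int) -> list[tuple[int, str]]:
    """Return (position, kmer) of the smallest kmer in each window of w kmers."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    window = deque()  # candidate indices; their kmers increase from front to back
    result = []
    for i, kmer in enumerate(kmers):
        # Candidates that are not smaller than the new kmer can never win again.
        while window and kmers[window[-1]] >= kmer:
            window.pop()
        window.append(i)
        # A candidate is stale once it has slid out of the current window.
        while window[0] <= i - w:
            window.popleft()
        if i >= w - 1:
            pos = window[0]
            # Record a minimizer only when it changes between windows.
            if not result or result[-1][0] != pos:
                result.append((pos, kmers[pos]))
    return result
```

If the stale check is skipped, a kmer that has already left the window can still be reported as its minimum.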
Diagonal based alignments fixes
BioMSA v0.1.4
Fixes and improvements to diagonal based alignment heuristic
- Previously, only high-confidence seeds for diagonals were retained, where a seed is a common window between both sequences. For sequences with low homology, this proved insufficient. Now all common kmers are evaluated, and an optimal list of diagonal seeds is built during an extra step.
- Low-quality seeds (kmers that are replicated in both sequences and convey less information) are discarded, which avoids combinatorial explosions
- The diagonal extension mechanism has been fixed in places where some boundary rules were not symmetrical between the two aligned sequences
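The seeding idea above can be sketched in Python as follows. This is a simplified illustration, not the library's implementation: the `diagonal_seeds` function is hypothetical, it uses the stricter rule of keeping only kmers that are unique in each sequence (the library's actual criterion may differ), and it does not attempt the extra step that builds the optimal list of seeds.

```python
from collections import Counter, defaultdict


def diagonal_seeds(a: str, b: str, k: int) -> dict[int, list[tuple[int, int]]]:
    """Group kmers shared by a and b by diagonal (position in b - position in a)."""
    kmers_a = [a[i:i + k] for i in range(len(a) - k + 1)]
    kmers_b = [b[j:j + k] for j in range(len(b) - k + 1)]
    count_a, count_b = Counter(kmers_a), Counter(kmers_b)
    # Repeated kmers are ambiguous seeds: every copy in a pairs with every copy in b.
    pos_b = {kmer: j for j, kmer in enumerate(kmers_b) if count_b[kmer] == 1}
    seeds = defaultdict(list)
    for i, kmer in enumerate(kmers_a):
        if count_a[kmer] == 1 and kmer in pos_b:
            j = pos_b[kmer]
            seeds[j - i].append((i, j))
    return dict(seeds)
```

Seeds that share the same diagonal index are candidates for merging into a longer diagonal, while discarding repeated kmers keeps the number of candidate pairs linear in the sequence length.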
Initial pre-release
BioMSA v0.1.3
Initial release.
The library is functional. Protein and nucleic sequences can be aligned. Alignment of long sequences (>1600 residues) relies on a diagonal finding strategy based on minimizers, which speeds up the process by a factor of 100.
Known issues
In multiple sequence alignments where the diagonal finding method is used, the center-star procedure that merges the pairwise alignments can produce unrealistic results in regions where the center sequence differs notably from its siblings.