
New cluster merging strategy, less aggressive seqlet pruning

Pre-release
@AvantiShri AvantiShri released this 13 Nov 00:13
· 176 commits to master since this release

Corresponds to PR #73

Changes:

  • Seqlet pruning updated (function trim_to_positions_with_min_support in modisco.core). Previously, the trimming limits were determined by making a histogram of the locations to which seqlet centers align, and then trimming away positions that didn't have some minimum support (min_support). But a location may be supported by the flanks of seqlets even if it is not supported by seqlet centers. Updating this to look at the support from any seqlet overlap greatly reduces the number of seqlets that get unnecessarily trimmed away.
  • Cluster merging strategy updated. The previous merging strategy had two components: it looked at both the similarity of motifs, as measured by cross-correlation of their contribution score tracks, and the density of the clusters (clusters that are less tightly packed should be merged more readily). The density was measured using a t-SNE-like strategy, which was a bit ad hoc and produced values that were hard to interpret intuitively. Now, I still retain the cross-correlation-like similarity, but the 'density' notion is quantified by looking at the distribution of within-cluster and between-cluster pairwise seqlet similarities.
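
To illustrate the seqlet pruning change, here is a minimal sketch that contrasts support counted from seqlet centers with support counted from any seqlet overlap. It is not the code of trim_to_positions_with_min_support: the function names, the (start, end) span representation, and the toy coordinates are all assumptions made for illustration.

```python
import numpy as np

def center_support(spans, length):
    """Per-position count of seqlet centers (hypothetical helper)."""
    counts = np.zeros(length, dtype=int)
    for start, end in spans:
        center = (start + end) // 2
        if 0 <= center < length:
            counts[center] += 1
    return counts

def overlap_support(spans, length):
    """Per-position count of seqlets covering that position (hypothetical helper)."""
    counts = np.zeros(length, dtype=int)
    for start, end in spans:
        counts[max(start, 0):min(end, length)] += 1
    return counts

def trim_limits(support, min_support):
    """Half-open (left, right) limits spanning positions with enough support."""
    keep = np.flatnonzero(support >= min_support)
    if keep.size == 0:
        return None
    return int(keep[0]), int(keep[-1]) + 1

# Toy aligned seqlets as half-open (start, end) spans; values are made up.
spans = [(0, 10), (0, 10), (3, 13), (3, 13), (6, 16), (6, 16)]
length = 16

center_limits = trim_limits(center_support(spans, length), min_support=2)
overlap_limits = trim_limits(overlap_support(spans, length), min_support=2)
print(center_limits, overlap_limits)
```

In this toy case the center-based limits cover only the middle of the alignment, while the overlap-based limits keep the flanks too, so fewer seqlets would fall outside the retained region.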

Other small changes:

  • Previously, the aforementioned cross-correlation metric in the pattern merging function was implemented by calling scipy.signal.correlate2d, which doesn't do a normalization (thus, correlation values weren't limited to the range -1 to 1). This was OK because I would normalize each track prior to calling scipy.signal.correlate2d, but as a result the values were scaled according to the number of tracks (e.g. if there were two tasks, each task would generate a contribution score track, and I would have to divide the correlation values by 2 to put them in the -1 to 1 range). Previously, this scaling was all adjusted for under the hood. Now, I have switched away from scipy.signal.correlate2d so that there is no need for all that adjustment.
  • plot_weights_given_ax now has default values specified for many of its arguments, so it is easier to call.
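
The scaling issue described for the cross-correlation metric can be reproduced with a small NumPy-only sketch. It does not use scipy.signal.correlate2d and does not reproduce the code in this release: the helper names are hypothetical, and the joint normalization shown at the end is only one possible way to get values in the -1 to 1 range without a separate adjustment, assumed here for illustration.

```python
import numpy as np

def sliding_dot(a, b):
    """Dot product of b slid along a (axis 0) at every offset with some overlap.

    a and b have shape (length, channels); no normalization is applied here.
    """
    la, lb = a.shape[0], b.shape[0]
    out = []
    for offset in range(-(lb - 1), la):
        a_lo, a_hi = max(offset, 0), min(offset + lb, la)
        b_lo, b_hi = a_lo - offset, a_hi - offset
        out.append(float(np.sum(a[a_lo:a_hi] * b[b_lo:b_hi])))
    return np.array(out)

def unit_norm(x):
    """Scale x to unit L2 norm (left unchanged if it is all zeros)."""
    norm = np.linalg.norm(x)
    return x / norm if norm > 0 else x

def summed_per_track(tracks_a, tracks_b):
    """Normalize each track separately and sum the per-track correlations."""
    return sum(sliding_dot(unit_norm(a), unit_norm(b))
               for a, b in zip(tracks_a, tracks_b))

def jointly_normalized(tracks_a, tracks_b):
    """Assumed alternative: concatenate tracks, then normalize once."""
    a = unit_norm(np.concatenate(tracks_a, axis=1))
    b = unit_norm(np.concatenate(tracks_b, axis=1))
    return sliding_dot(a, b)

rng = np.random.default_rng(0)
# Two tasks, each with one made-up contribution score track of shape (8, 4).
tracks = [rng.normal(size=(8, 4)), rng.normal(size=(8, 4))]

summed_peak = summed_per_track(tracks, tracks).max()
joint_peak = jointly_normalized(tracks, tracks).max()
print(summed_peak, summed_peak / len(tracks), joint_peak)
```

Correlating a pattern with itself, the summed per-track values peak at the number of tracks, which is why a division by 2 was needed in the two-task case, whereas the jointly normalized values peak at 1.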