New cluster merging strategy, less aggressive seqlet pruning
Pre-release
Corresponds to PR #73
Changes:
- Seqlet pruning updated (function trim_to_positions_with_min_support in modisco.core). Previously, the limits of min_support would be determined by making a histogram of the locations to which seqlet centers align, and then trimming away positions that didn't have some minimum support. But a location may be supported by the flanks of seqlets, even if it is not supported by seqlet centers. Updating this to look at the support from any seqlet overlap greatly reduces the number of seqlets that get unnecessarily trimmed away.
- The previous merging strategy had two components: it looked at both the similarity of motifs, as measured by cross-correlation of their contribution score tracks, and the density of the clusters (clusters that are less tightly packed should be merged more readily). The density was measured using a t-SNE-like strategy, which was a bit ad hoc and produced values that were hard to interpret intuitively. Now, I still retain the cross-correlation-like similarity, but the 'density' notion is quantified by looking at the distribution of within-cluster and between-cluster pairwise seqlet similarities.
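To illustrate the pruning change, here is a minimal sketch of the difference between counting support from seqlet centers and counting support from any seqlet overlap. It is not the code in trim_to_positions_with_min_support; the function names, the fixed seqlet length, and the toy coordinates are all assumptions made for the example.

```python
import numpy as np

def support_by_centers(starts, seqlet_len, total_len):
    """Count, per position, how many seqlet centers land there (old idea)."""
    counts = np.zeros(total_len, dtype=int)
    for s in starts:
        counts[s + seqlet_len // 2] += 1
    return counts

def support_by_overlap(starts, seqlet_len, total_len):
    """Count, per position, how many seqlets cover it at all (new idea)."""
    counts = np.zeros(total_len, dtype=int)
    for s in starts:
        counts[s:s + seqlet_len] += 1
    return counts

def supported_span(counts, min_support):
    """Return (start, end) of the span kept after trimming, or None."""
    idx = np.flatnonzero(counts >= min_support)
    if idx.size == 0:
        return None
    return int(idx[0]), int(idx[-1]) + 1

# Three hypothetical seqlets of length 5, aligned at offsets 0, 2 and 4.
starts, seqlet_len, total_len = [0, 2, 4], 5, 9
by_centers = supported_span(
    support_by_centers(starts, seqlet_len, total_len), min_support=2)
by_overlap = supported_span(
    support_by_overlap(starts, seqlet_len, total_len), min_support=2)
print(by_centers, by_overlap)
```

In this toy case no position has two seqlet centers, so center-based support would trim everything, while overlap-based support keeps the span where the flanks of neighbouring seqlets cover each other.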
Other small changes:
- Previously, the aforementioned cross-correlation metric in the pattern merging function was implemented by calling scipy.signal.correlate2d, which doesn't do a normalization (thus, correlation values weren't limited to the range -1 to 1). This was OK because I would normalize each track prior to calling scipy.signal.correlate2d - but as a result, the values were scaled according to the number of tracks (e.g. if there were two tasks, each task would generate a contribution score track, and I would have to divide the correlation values by 2 to put them in the -1 to 1 range). Previously, this scaling was all adjusted for under the hood. Now, I have switched away from scipy.signal.correlate2d so that there is no need for all that adjustment.
- plot_weights_given_ax now has default values specified for many of the arguments, so it is easier to call.
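The scaling issue described for the cross-correlation metric can be shown with a small NumPy-only sketch. This is not the implementation used in the pattern merging function; the function names and the (num_tracks, length, 4) array layout are assumptions for illustration. Each track is scaled to unit norm, the per-track dot products are summed at every offset, and the sum is divided by the number of tracks so that the result stays between -1 and 1.

```python
import numpy as np

def unit_norm(track):
    """Scale one (length, 4) track to unit L2 norm (all-zero tracks are kept)."""
    norm = np.linalg.norm(track)
    return track / norm if norm > 0 else track

def sliding_similarity(tracks_a, tracks_b):
    """Similarity at every relative offset, bounded by -1 and 1.

    tracks_a, tracks_b: arrays of shape (num_tracks, length, 4) with the
    same length. Without the division by num_tracks, a perfect match
    would score num_tracks rather than 1.
    """
    a = np.stack([unit_norm(t) for t in tracks_a])
    b = np.stack([unit_norm(t) for t in tracks_b])
    num_tracks, length, _ = a.shape
    sims = []
    for offset in range(-(length - 1), length):
        # Select the overlapping region of the two patterns at this offset.
        if offset >= 0:
            a_part, b_part = a[:, offset:], b[:, :length - offset]
        else:
            a_part, b_part = a[:, :length + offset], b[:, -offset:]
        sims.append(np.sum(a_part * b_part) / num_tracks)
    return np.array(sims)

# Two hypothetical tasks, so two contribution score tracks per pattern.
rng = np.random.default_rng(0)
pattern = rng.normal(size=(2, 6, 4))
sims = sliding_similarity(pattern, pattern)
print(sims[6 - 1])  # zero offset: a pattern compared with itself, close to 1.0
```

Folding the division by the number of tracks into the similarity itself means callers never have to rescale the values afterwards.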