
Created v0.2.1-alpha tag at the request of people using older version

Pre-release
@AvantiShri AvantiShri released this 22 Mar 17:05
· 598 commits to master since this release
e2c536e

The major changes since this release relate to thresholding. Specifically, some key changes are:

  • At the time of this release, Laplace distribution thresholding was not in use. Rather, thresholding was based on finding a point of high curvature.
  • At the time of this release, the limit of 20K seqlets was applied per task, and only the most important 20K seqlets were kept for each task. Since then, the limit of 20K has been applied per metacluster (as this is most directly related to clustering time), and it is not guaranteed to take the most important seqlets. Rather, the first 20K seqlets are taken from the ordering generated when the SeqletsOverlapResolver OrderedDict gets unrolled, which effectively orders the seqlets by the index of the sequence they originate from, with priority given to the first task specified by the user. I know this aspect is opaque (I only realized it recently myself, because the feature that limits seqlets by metacluster was implemented in an external pull request, and I didn't drill into how the ordering was done when I approved the feature). The reason I have not yet forced ordering by seqlet importance is that I am concerned doing so might under-sample weaker-affinity motifs that may be of interest. My hope is to go straight to scaling up TF-MoDISco with a "subsample, soak & repeat" strategy (that is, we subsample seqlets, find highly represented motifs, "soak up" seqlets from the full set that match these motifs, then repeat with the remaining seqlets).
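
To make the difference between the two seqlet limits concrete, here is a minimal sketch in Python. It is not TF-MoDISco code: the `Seqlet` tuple, both function names, and the exact tie-breaking rule (sequence index first, then the user's task order) are assumptions made for illustration, based only on the description above.

```python
from collections import OrderedDict, namedtuple

# Hypothetical stand-in for a seqlet; not the library's real class.
Seqlet = namedtuple("Seqlet", ["task", "seq_index", "importance"])


def top_per_task(seqlets, limit):
    """Behaviour described for this release: for each task,
    keep the `limit` seqlets with the highest importance."""
    by_task = OrderedDict()
    for s in seqlets:
        by_task.setdefault(s.task, []).append(s)
    return {
        task: sorted(group, key=lambda s: s.importance, reverse=True)[:limit]
        for task, group in by_task.items()
    }


def first_in_unrolled_order(metacluster_seqlets, task_order, limit):
    """Behaviour described for later versions: within one metacluster,
    keep the first `limit` seqlets when ordered by originating sequence
    index, with earlier user-specified tasks taking priority
    (assumed tie-breaking rule). Importance plays no role."""
    rank = {task: i for i, task in enumerate(task_order)}
    ordered = sorted(
        metacluster_seqlets, key=lambda s: (s.seq_index, rank[s.task])
    )
    return ordered[:limit]


# Toy data: a limit of 2 stands in for the real limit of 20K.
seqlets = [
    Seqlet("task0", 0, 0.1),
    Seqlet("task0", 1, 0.9),
    Seqlet("task0", 2, 0.5),
    Seqlet("task1", 0, 0.8),
    Seqlet("task1", 2, 0.2),
]

old_selection = top_per_task(seqlets, limit=2)
new_selection = first_in_unrolled_order(
    seqlets, task_order=["task0", "task1"], limit=2
)
```

In this toy example the per-task limit keeps the seqlet with importance 0.9, while the order-based limit drops it in favour of the two seqlets from sequence 0, one of which has importance 0.1. That is the sense in which the newer limit is not guaranteed to take the most important seqlets.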

Another major difference is that, at the time of this release, the backend was Theano, though this should not alter the results.

@suragnair