Releases: kundajelab/tfmodisco
Make pairwise distances passed to scikit nearest neighbors nonnegative
Minor bugfix release corresponding to pull request #40
@rosaxma received the error ValueError: Negative values in data passed to 'pairwise_distances'. Precomputed distance need to have non-negative values
when scikit's NearestNeighbors functions were called. This fix shifts all the distances upwards so that they are all nonnegative, which appears to eliminate the error without affecting the results. I am not sure why this error wasn't encountered before; it may have to do with the particular version of scikit-learn being used.
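The fix can be sketched as follows. This is an illustrative stand-alone example with made-up distances, not the actual tfmodisco code:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Made-up "distances" derived from affinities; some entries are negative,
# which sklearn rejects for metric="precomputed".
dists = np.array([[ 0.0, -0.5,  1.2],
                  [-0.5,  0.0,  0.8],
                  [ 1.2,  0.8,  0.0]])

# Shift everything upward by the same amount so the minimum is zero.
# A uniform shift preserves the relative ordering of the distances,
# so the nearest-neighbor results are unchanged.
shifted = dists - dists.min()

nn = NearestNeighbors(n_neighbors=2, metric="precomputed").fit(shifted)
neigh_dist, neigh_idx = nn.kneighbors(shifted)
```

Note that the shift makes the diagonal nonzero, but since every entry moves by the same constant, the ranking of neighbors is preserved.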
Created v0.2.1-alpha tag at the request of people using an older version
The major changes since this release concern thresholding. Specifically, the key changes are:
- At the time of this release, Laplace distribution thresholding was not in use. Rather, thresholding was based on finding a point of high curvature
- At the time of this release, the limit of 20K seqlets was applied per task, taking only the most important 20K seqlets for each task. Since then, the limit of 20K has been applied per metacluster (as this is most directly related to clustering time), and it is no longer guaranteed to take the most important seqlets; rather, the first 20K seqlets are taken from the ordering generated when the SeqletsOverlapResolver OrderedDict gets unrolled, which effectively orders the seqlets by the index of the sequence they originate from, with priority given to the first task specified by the user. I know this aspect is opaque (I only realized it recently myself, because the feature that limits seqlets by metacluster was implemented in an external pull request, and I didn't drill into how the ordering was done when I approved the feature). The reason I have not yet forced ordering by importance is that I am concerned doing so might under-sample weaker-affinity motifs that may be of interest. My hope is to go straight to scaling up TF-MoDISco with a "subsample, soak & repeat" strategy (that is, we subsample the seqlets, find highly represented motifs, "soak up" seqlets from the full set that match these motifs, then repeat on the remaining seqlets).
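The "subsample, soak & repeat" idea can be sketched roughly as follows. Here find_motifs and matches are hypothetical placeholders for the motif-discovery and motif-matching steps; none of this corresponds to actual tfmodisco code:

```python
import random

def subsample_soak_repeat(seqlets, find_motifs, matches,
                          subsample_size=20000, rounds=3):
    # Rough sketch of the proposed strategy: discover motifs on a
    # subsample, "soak up" all matching seqlets from the full set,
    # then repeat on whatever remains.
    remaining = list(seqlets)
    all_motifs = []
    for _ in range(rounds):
        if not remaining:
            break
        sample = random.sample(remaining,
                               min(subsample_size, len(remaining)))
        motifs = find_motifs(sample)  # e.g. cluster the subsample
        all_motifs.extend(motifs)
        # remove every seqlet in the full set that matches a new motif
        remaining = [s for s in remaining
                     if not any(matches(s, m) for m in motifs)]
    return all_motifs, remaining
```

The point of soaking against the full set (rather than only the subsample) is that weaker-affinity motifs are not lost to subsampling: anything that matches an already-discovered motif is removed, and the next round runs on what remains.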
Another major difference was that the backend was Theano, though this should not alter the results.
Added support for different manual positive and negative thresholds
Release corresponds to pull request #39
Percentile-based thresholding is triggered if the number of passing windows produced through null-distribution-based thresholding does not fall within min_passing_windows_frac and max_passing_windows_frac. By default, the percentiles are taken w.r.t. the absolute values. This feature adds an argument separate_pos_neg_thresholds which can be set to True when instantiating a TfModiscoWorkflow object to take the percentiles for positive values and negative values separately, as opposed to taking the percentiles w.r.t. the absolute values. The default value of the argument is False, for backward compatibility. A notebook testing out the feature is at https://github.com/kundajelab/tfmodisco/blob/68bef1575ddec5f55e7605f64fd3753d43d2ca5c/test/nb_test/NoRevcompAndSepPosNegThresh.ipynb
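The difference between the two percentile schemes can be illustrated with plain numpy. This is a conceptual sketch, not the tfmodisco implementation, and the 5% passing fraction is made up:

```python
import numpy as np

rng = np.random.RandomState(0)
# deliberately asymmetric scores, so the two schemes differ
scores = rng.laplace(loc=0.5, scale=1.0, size=10000)

frac = 0.05  # made-up fraction of windows to pass on each side

# Default (separate_pos_neg_thresholds=False): one percentile of the
# absolute values, applied symmetrically; an asymmetric distribution
# then yields unbalanced numbers of passing positive/negative windows.
abs_thresh = np.percentile(np.abs(scores), 100 * (1 - frac))
pos_thresh, neg_thresh = abs_thresh, -abs_thresh

# separate_pos_neg_thresholds=True: percentiles taken per sign, so the
# positive and negative thresholds adapt to each tail independently.
sep_pos = np.percentile(scores[scores > 0], 100 * (1 - frac))
sep_neg = np.percentile(scores[scores < 0], 100 * frac)
```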
There were a couple of other very minor changes that can cause differences within numerical precision. The first was that in window_sum_function at line 103 of coordproducers.py, the running window sums are now computed using np.cumsum rather than with a Python loop. The second was that at lines 548 and 549 of coordproducers.py, the criterion for meeting the threshold has been changed to y > pos_threshold and y < neg_threshold, whereas previously it was y >= pos_threshold and y <= neg_threshold.
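The np.cumsum approach can be sketched as follows; this is an illustrative reimplementation, not the actual code in coordproducers.py:

```python
import numpy as np

def window_sums(arr, window_size):
    # Sums over every length-`window_size` window in O(n) using the
    # cumulative-sum trick, replacing an explicit Python loop.
    c = np.concatenate([[0.0], np.cumsum(arr, dtype=float)])
    return c[window_size:] - c[:-window_size]
```

For example, window_sums(np.array([1., 2., 3., 4.]), 2) gives array([3., 5., 7.]).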
Added support for NOT using reverse complements when computing the similarity matrix
Pull request here: #38
To avoid using reverse complements (e.g. if working with splicing motifs), set the argument revcomp=False when calling a TfModiscoWorkflow instance on your data. If reloading a saved TfModisco results object, then you also have to set revcomp=False when calling prep_track_set. Otherwise, the revcomp argument is by default True (for backwards compatibility). Permalink to a notebook demonstrating the functionality is here: https://github.com/kundajelab/tfmodisco/blob/d88a1dba7f59f6dc8f62aa267ac42eb5e53037d4/test/nb_test/NoRevcomp.ipynb
Added min_metacluster_size_frac
Improved null distributions
Key changes:
- Added the ability to have a user-supplied per-position null distribution. An example notebook using gkmexplain scores is provided in https://github.com/kundajelab/tfmodisco/blob/master/examples/H1ESC_Nanog_gkmsvm/TF%20MoDISco%20Nanog.ipynb. It looks pretty good (in the notebook's plots, the orange distribution is the null).
- Previously, for metaclustering, scores for different tasks would be normalized using the cdf of the Laplace distribution that was fit to that task. Because there is not necessarily a Laplace distribution anymore, I now just use the percentile of the magnitude of the score for normalization. The key lines are: https://github.com/kundajelab/tfmodisco/blob/f4f94d6dbb82d7d320068d91ffa30a12e9faadf3/modisco/coordproducers.py#L452-L453
- For the case where the user does want to use the Laplace distribution for the null, I fixed its over-aggressive tendency; previously, the curve for the Laplace distribution would often lie above the true distribution, which is clearly inappropriate. This should be fixed now by looking at percentiles along the entire distribution, computing the corresponding Laplace curve that would best fit each percentile, and then taking the curve with the steepest decrease. Of course, the Laplace distribution may still not be an appropriate fit, but at least it will be less aggressive.
- Previously, I would determine the FDR by looking at the proportion of null values above a particular threshold relative to the proportion of true values above that threshold. One potential drawback of lumping everything above a threshold together is that the FDR for values just barely above the threshold may be considerably worse than the FDR for values well above it. To get around this, I fit an isotonic regression curve to get point estimates of the probability that a seqlet is a true positive given its importance score, and use those point estimates to draw the FDR threshold. That way, the FDR is controlled both for values at the threshold and for values above it.
- The default FDR cutoff is now 0.2 rather than 0.05, as in my experience the 0.05 cutoff will tend to miss low-affinity seqlets even when the null distribution is a good fit.
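The isotonic-regression idea can be sketched as follows. This is a simplified stand-alone illustration with synthetic scores, not the tfmodisco implementation. With equal-sized observed and null samples, if p(s) is the isotonic estimate of the probability that a value with score s came from the observed set, then (1 - p)/p estimates the null-to-observed density ratio at that score, giving a pointwise FDR estimate:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.RandomState(0)
observed = np.abs(rng.laplace(scale=2.0, size=5000))  # synthetic "real" scores
null = np.abs(rng.laplace(scale=1.0, size=5000))      # synthetic null scores

scores = np.concatenate([observed, null])
labels = np.concatenate([np.ones(len(observed)), np.zeros(len(null))])

# Monotone point estimate of P(observed | score)
iso = IsotonicRegression(out_of_bounds="clip").fit(scores, labels)

# Pointwise FDR estimate, and the smallest score with estimated FDR <= 0.2
grid = np.linspace(scores.min(), scores.max(), 1000)
p = np.clip(iso.predict(grid), 1e-6, 1.0)
est_fdr = (1 - p) / p
passing = grid[est_fdr <= 0.2]
threshold = passing.min() if len(passing) else None
```

Because the estimate is pointwise rather than cumulative, values just barely above the chosen threshold still individually satisfy the FDR criterion, which is the property described above.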
As an aside, I am very amused to discover that although the version corresponding to the TF-MoDISco arXiv technical note was 0.4.2.2, and that is clearly the version that the hyperlinks point to, the title and abstract both say 0.4.4.2. I was clearly very sleep deprived when I wrote that up.
First version on pypi
Had to add a MANIFEST.in to make sure the louvain binaries got included. Works on Google Colab, as demonstrated in this notebook where it's used in conjunction with gkmexplain: https://github.com/kundajelab/gkmexplain/blob/6782c7b6dfc077962c59d60b60bc23ddbdf9f61a/lsgkmexplain_NFE2.ipynb
This version can be installed and run on Colaboratory
As demonstrated in this SVM example: https://github.com/kundajelab/ssvmimp/blob/efd6ea383146ef04822da5fc09e230a5722f0b65/lsgkmexplain.ipynb
Reverse-complement seqlet loading bugfix
Corresponding to this pull request: #29
Tensorflow backend that actually works
Fixes a bug in the TensorFlow backend caused by a difference in dimension ordering between Theano and TensorFlow.
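As a general illustration of the kind of mismatch involved (not the specific code change in this fix): Theano's convolution filters use a channels-first layout while TensorFlow's use channels-last, so arrays must be transposed when moving between the two conventions:

```python
import numpy as np

# Theano conv filters: (out_channels, in_channels, rows, cols)
theano_w = np.arange(2 * 3 * 4 * 5).reshape(2, 3, 4, 5)

# TensorFlow conv filters: (rows, cols, in_channels, out_channels)
tf_w = np.transpose(theano_w, (2, 3, 1, 0))
```

Forgetting a transposition like this produces silently wrong results rather than an error, since the reshaped arrays are still valid inputs.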