Improved null distributions

Pre-release
@AvantiShri AvantiShri released this 12 Feb 02:26
· 425 commits to master since this release
f4f94d6

Key changes:

  • Previously, for metaclustering, the scores for each task were normalized using the CDF of the Laplace distribution that was fit to that task. Because the null is no longer necessarily a Laplace distribution, I now just use the percentile of the magnitude of the score for normalization. The key lines are:
    https://github.com/kundajelab/tfmodisco/blob/f4f94d6dbb82d7d320068d91ffa30a12e9faadf3/modisco/coordproducers.py#L452-L453

  • For the case where the user does want to use the Laplace distribution for the null, I fixed its over-aggressive tendency; previously, the curve for the Laplace distribution would often lie above the true distribution, which is clearly inappropriate. This should be fixed now: I look at percentiles along the entire distribution, compute the Laplace curve that would best fit each percentile, and then take the curve with the steepest decrease. Of course, the Laplace distribution may still not be an appropriate fit, but at least it will be less aggressive.

  • Previously, I would determine the FDR by looking at the proportion of null values above a particular threshold relative to the proportion of true values above that threshold. One potential drawback of lumping everything above a threshold together is that the FDR for values that are just barely above the threshold may be considerably worse than the FDR for values that are well above the threshold. To get around this, I fit an isotonic regression curve to get point estimates of the probability that a seqlet is a true positive given its importance score, and use the point estimates to draw the FDR threshold. That way, the FDR is controlled both for values at the threshold and for values above it.

  • The default FDR cutoff is now 0.2 rather than 0.05, as in my experience the 0.05 cutoff tends to miss low-affinity seqlets even when the null distribution is a good fit.
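
To illustrate the percentile-of-magnitude normalization from the first bullet, here is a minimal sketch. It is not the code at the linked lines; the function name `magnitude_percentiles` is hypothetical, and the sketch assumes the sign of the score is dropped, which the note above does not specify.

```python
from bisect import bisect_right


def magnitude_percentiles(scores):
    """Map each score to the fraction of scores whose magnitude is <= its own.

    Hypothetical helper for illustration only; it assumes the sign is
    discarded before ranking.
    """
    sorted_mags = sorted(abs(s) for s in scores)
    n = len(sorted_mags)
    # bisect_right counts how many magnitudes are <= the query magnitude.
    return [bisect_right(sorted_mags, abs(s)) / n for s in scores]


# Two made-up tasks with the same shape but very different scales.
task_a = [0.1, -0.4, 2.0, -3.0]
task_b = [10.0, -40.0, 200.0, -300.0]

print(magnitude_percentiles(task_a))
print(magnitude_percentiles(task_b))
```

Both tasks map to the same normalized values even though their raw scales differ by a factor of 100, and no distributional fit is needed to get there.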
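
The less aggressive Laplace fit from the second bullet can be sketched roughly as follows. This is one reading of the description, not the released implementation: it assumes a zero-centered Laplace with a single scale `b`, so that the tail of the magnitude is P(|x| > t) = exp(-t / b), and it treats "steepest decrease" as the smallest `b`. The function name and the percentile grid are hypothetical.

```python
import math


def steepest_laplace_scale(magnitudes, percentiles=(50, 60, 70, 80, 90, 95)):
    """Return the smallest Laplace scale implied by any of the given percentiles.

    Assumption (not from the release notes): a zero-centered Laplace, for which
    P(|x| > t) = exp(-t / b). Matching the tail at one percentile gives
    b = -t / ln(1 - p), and a smaller b means a faster-decaying curve.
    """
    sorted_mags = sorted(magnitudes)
    n = len(sorted_mags)
    best_scale = None
    for p in percentiles:
        frac = p / 100.0
        # Empirical magnitude at this percentile (simple index lookup).
        t = sorted_mags[min(n - 1, int(frac * n))]
        if t <= 0:
            continue  # a zero magnitude carries no information about the scale
        scale = -t / math.log(1.0 - frac)
        if best_scale is None or scale < best_scale:
            best_scale = scale
    return best_scale


# Made-up magnitudes, for illustration only.
mags = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(steepest_laplace_scale(mags, percentiles=(50, 90)))
```

Taking the minimum over percentiles is what keeps the fitted curve from sitting above the empirical distribution at the percentiles that were checked.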
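
The pointwise FDR idea from the third bullet can be sketched as below. This is a simplified illustration, not the released code: isotonic regression is hand-rolled with pool-adjacent-violators to keep the example dependency-free, scores are assumed to be non-negative magnitudes, ties get no special handling, and the conversion from the fitted value `r` to a local FDR assumes the observed set is a mixture of true positives and null-like values. All function names are hypothetical.

```python
def isotonic_fit(ys):
    """Least-squares non-decreasing fit to ys (pool-adjacent-violators)."""
    blocks = []  # each block is [sum of values, count]
    for y in ys:
        blocks.append([y, 1])
        # Merge backwards while the previous block's mean exceeds the last one's.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)
    return fitted


def pointwise_threshold(observed, null, target_fdr=0.2):
    """Smallest score whose estimated local FDR is at or below target_fdr."""
    # Label observed scores 1 and null scores 0, then order by score.
    labelled = sorted([(s, 1.0) for s in observed] + [(s, 0.0) for s in null])
    fitted = isotonic_fit([label for _, label in labelled])
    size_ratio = len(observed) / len(null)
    for (score, _), r in zip(labelled, fitted):
        if r <= 0:
            continue
        # r estimates P(score came from the observed set). The null-to-observed
        # density ratio is then size_ratio * (1 - r) / r, used as the local FDR.
        local_fdr = min(1.0, size_ratio * (1.0 - r) / r)
        if local_fdr <= target_fdr:
            return score
    return None


# Made-up scores, for illustration only.
observed = [0.15, 0.35, 1.0, 2.0]
null = [0.1, 0.2, 0.3, 0.4]
print(pointwise_threshold(observed, null))
```

Because the fitted curve is non-decreasing in the score, every value above the returned threshold has an estimated local FDR no worse than the value at the threshold, which is the property the cumulative approach lacked.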

As an aside, I am very amused to discover that although the version corresponding to the TF-MoDISco arXiv technical note was 0.4.2.2, and that is clearly the version that the hyperlinks point to, the title and abstract both say 0.4.4.2. I was clearly very sleep deprived when I wrote that up.