Releases: kundajelab/tfmodisco
Small fixes
Corresponds to PR #76
Fix for error when slicing coordinates for revcomp when coordinates go over the edges of the sequence
Also makes a fix for backwards compatibility with numpy version where np.pad requires mode to be provided as an argument
New cluster merging strategy, less aggressive seqlet pruning
Corresponds to PR #73
Changes:
- Seqlet pruning updated (function `trim_to_positions_with_min_support` in `modisco.core`). Previously, the trimming limits for `min_support` would be determined by making a histogram of the locations to which seqlet centers align, and then trimming away positions that didn't have some minimum support. But a location may be supported by the flanks of seqlets, even if it is not supported by seqlet centers. Updating this to look at the support from any seqlet overlap greatly reduces the number of seqlets that get unnecessarily trimmed away.
- The previous merging strategy had two components: it looked at both the similarity of motifs as measured by cross-correlation of their contribution score tracks, and the density of the clusters (clusters that are less tightly packed should be merged more readily). The density was measured using a t-sne-like strategy, which was a bit ad-hoc and produced values that were hard to interpret intuitively. Now, I still retain the cross-correlation-like similarity, but the 'density' notion is quantified by looking at the distribution of within-cluster and between-cluster pairwise seqlet similarities.
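To make the difference between the two support criteria concrete, here is a minimal sketch in plain Python. It is not the code of `trim_to_positions_with_min_support`; the function names are hypothetical, seqlets are reduced to half-open `(start, end)` intervals, and `min_support` is treated as an absolute count purely for illustration.

```python
def overlap_support(intervals, length):
    """Count how many seqlets cover each position (half-open [start, end))."""
    support = [0] * length
    for start, end in intervals:
        for pos in range(max(start, 0), min(end, length)):
            support[pos] += 1
    return support


def center_support(intervals, length):
    """Count how many seqlet centers fall on each position (the older criterion)."""
    support = [0] * length
    for start, end in intervals:
        center = (start + end) // 2
        if 0 <= center < length:
            support[center] += 1
    return support


def trim_bounds(support, min_support):
    """Half-open window spanning all positions with enough support, or None."""
    keep = [pos for pos, count in enumerate(support) if count >= min_support]
    return (keep[0], keep[-1] + 1) if keep else None


seqlets = [(0, 10), (4, 14), (8, 18)]  # toy aligned seqlet coordinates
by_overlap = trim_bounds(overlap_support(seqlets, 20), min_support=2)
by_center = trim_bounds(center_support(seqlets, 20), min_support=2)
```

With these toy coordinates no position is hit by two seqlet centers, so the center-based criterion keeps nothing, while the overlap-based criterion keeps positions 4 through 13.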
Other small changes:
- Previously, the aforementioned cross-correlation metric in the pattern merging function was implemented by calling scipy.signal.correlate2d, which doesn't do a normalization (thus, correlation values weren't limited to the range -1 to 1). This was ok because I would normalize each track prior to calling scipy.signal.correlate2d, but as a result the values were scaled according to the number of tracks (e.g. if there were two tasks, each task would generate a contribution score track, and I would have to divide the correlation values by 2 to put them in the -1 to 1 range). Previously, this scaling was all adjusted for under the hood. Now, I have switched away from scipy.signal.correlate2d so that there is no need for all that adjustment.
- `plot_weights_given_ax` now has default values specified for many of the arguments, so it is easier to call
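As an illustration of why a jointly normalized cross-correlation needs no per-track rescaling, here is a rough sketch. It is an assumption about one way to do this, not the implementation used in the pattern merging function: each pattern is a list of positions, each position holds the values of all tracks concatenated, and the whole pattern is scaled to unit L2 norm once. By the Cauchy-Schwarz inequality every sliding dot product then lies between -1 and 1, however many tracks there are.

```python
import math


def unit_normalize(pattern):
    """Scale a pattern (list of per-position value lists) to unit L2 norm."""
    norm = math.sqrt(sum(v * v for row in pattern for v in row))
    if norm == 0:
        return [list(row) for row in pattern]
    return [[v / norm for v in row] for row in pattern]


def cross_correlate(p, q):
    """Return {offset: similarity}, where offset is the start of q relative to p."""
    p, q = unit_normalize(p), unit_normalize(q)
    scores = {}
    for offset in range(-(len(q) - 1), len(p)):
        total = 0.0
        for j, q_row in enumerate(q):
            i = offset + j
            if 0 <= i < len(p):  # only overlapping positions contribute
                total += sum(x * y for x, y in zip(p[i], q_row))
        scores[offset] = total
    return scores


pattern = [[1.0, 0.0], [0.0, 2.0], [3.0, 0.0]]
flipped = [[-v for v in row] for row in pattern]
self_scores = cross_correlate(pattern, pattern)
flipped_scores = cross_correlate(pattern, flipped)
```

A pattern compared with itself peaks at offset 0 with a value of 1, and compared with its sign-flipped copy reaches -1 at the same offset.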
Bugfix, TF2 compatibility, access to motifs pre final reassignment
Corresponds to PR #70
Description of changes:
- When I refactored to include support for MEME initialization, I left in a stray line that effectively caused the "sign consistency check" to be bypassed. This check discards motifs for which the sign of the overall contribution scores disagrees with what you expect for the metacluster; such motifs can arise because seqlets get recentered during the various intermediate processing steps. As a result, a few extra motifs that seemed to have the wrong sign could have been returned. Related to the error encountered in #66
- Made some minor fixes for tensorflow 2 support
- The final step of tf-modisco is a "reassignment" step where motifs that have a small number of seqlets are disbanded, and an attempt is made to "reassign" their seqlets to the other motifs. If they so desire, users can now access what the tfmodisco motifs are prior to this final reassignment step.
Agkm implementation, ic-based motif centering
Corresponds to PR #63. Should fix some issues where modisco seems to produce very low-IC motifs. The problem arose during motif post-processing, where the motif was previously recentered around the region of highest average importance; this would sometimes go awry because the high average importance may have been driven by only a few seqlets. Now, the motif centering is done based on information content.
There's also support for computing advanced gapped kmer embeddings (which work better than the regular gapped kmer embeddings and also use less memory), but it is still in pure python and I am looking at ways to speed it up.
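The information-content-based centering mentioned above can be sketched as follows. This is a simplified illustration rather than the code from the PR: it assumes the motif is given as per-position base frequencies, uses a uniform background, and the function names and the fixed window width are hypothetical.

```python
import math


def position_ic(freqs, background=(0.25, 0.25, 0.25, 0.25)):
    """Information content (bits) of one position relative to the background."""
    return sum(f * math.log2(f / b) for f, b in zip(freqs, background) if f > 0)


def best_ic_window(ppm, width):
    """Half-open (start, end) of the window with the highest total information content."""
    ic = [position_ic(row) for row in ppm]
    starts = range(len(ppm) - width + 1)
    best_start = max(starts, key=lambda s: sum(ic[s:s + width]))
    return best_start, best_start + width


uniform = [0.25, 0.25, 0.25, 0.25]
ppm = [uniform, uniform, [1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], uniform]
window = best_ic_window(ppm, width=2)
```

Because information content is computed from the aggregated base frequencies, a handful of high-importance seqlets cannot pull the window away from the positions where the motif is actually specific.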
Interactive plots for visualizing heterogeneity within a motif
Corresponds to Pull Request #62. Seqlets comprising a motif are visualized in a tsne plot, and the user can select a subset of the seqlets (by dragging a rectangle around them on the plot) to aggregate and visualize on the fly. Good for dissecting heterogeneity within a motif.
Visualizing a subset of seqlets within the TAL motif from the TAL-GATA toy dataset:
Can have seqlet embeddings based on filter activations
Corresponds to PR #61. Instead of deriving the embedding for coarse-grained similarity using gapped k-mers, one can derive the embedding from a neural network model (e.g. by averaging the conv filter activations). Example notebook in https://github.com/kundajelab/tfmodisco/blob/36972870853e6631b2d32f1e489676a8241b385c/examples/simulated_TAL_GATA_deeplearning/TF_MoDISco_TAL_GATA_With_Filter_Embeddings.ipynb.
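A minimal sketch of the "average the conv filter activations" idea, assuming the activations for one seqlet are already available as a positions-by-filters table. The function names are hypothetical, and how the activations are extracted from a model is not shown; see the linked notebook for the actual workflow.

```python
import math


def mean_filter_embedding(activations):
    """Average each filter's activation over the positions of one seqlet."""
    n_positions = len(activations)
    n_filters = len(activations[0])
    return [sum(pos[f] for pos in activations) / n_positions
            for f in range(n_filters)]


def cosine_similarity(u, v):
    """Coarse-grained similarity between two embeddings."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


seqlet_a = [[1.0, 0.0], [3.0, 2.0]]  # 2 positions x 2 filters (toy values)
seqlet_b = [[2.0, 1.0], [2.0, 1.0]]
emb_a = mean_filter_embedding(seqlet_a)
emb_b = mean_filter_embedding(seqlet_b)
similarity = cosine_similarity(emb_a, emb_b)
```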
Minor fixes, travis tests running successfully
Updated MEME arguments, leiden init, dependency list
- Incorporates changes from PR #60, which added the -revcomp flag to MEME if "revcomp=True" was specified in TfModiscoWorkflow (is true by default), and also switched `-mod` to `zoops` (`zoops` stands for "zero or one occurrences per sequence"; this concords with the default for the web and also seems more appropriate for seqlets than the `anr` mode, which stands for "any number of repetitions")
- Updated the Leiden clustering to take the best of both worlds over the singleton initialization (i.e. what is done without preclustering using MEME) and the MEME initialization.
- Updated dependency list in setup.py to be more complete
- Updated the test suite. Attempted to add a travis build but it looks like installing MEME via travis is nontrivial.
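For orientation, the MEME options mentioned in this release could be assembled into a command line roughly as below. This is a sketch, not the code that tfmodisco runs: `build_meme_command` is a hypothetical helper, and the `-dna`, `-nmotifs`, `-oc` and `-p` options are taken from MEME's own command-line interface rather than from these notes. The list is only built here, not executed.

```python
def build_meme_command(meme_command, fasta_path, outdir, nmotifs, n_jobs,
                       revcomp=True):
    """Assemble an argument list suitable for subprocess.run (hypothetical helper)."""
    args = [
        meme_command, fasta_path,
        "-dna",                     # DNA alphabet
        "-mod", "zoops",            # zero or one occurrences per sequence
        "-nmotifs", str(nmotifs),   # number of motifs to search for
        "-oc", outdir,              # output directory (overwritten if present)
        "-p", str(n_jobs),          # number of parallel processes
    ]
    if revcomp:
        args.append("-revcomp")     # also consider the reverse complement strand
    return args


command = build_meme_command("meme", "seqlets.fa", "meme_out", nmotifs=10, n_jobs=4)
```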
Support for MEME-based initialization, Leiden community detection
Corresponds to PR #57, notes duplicated below:
An initial clustering can be specified using the `initclusterer_factory` argument of `TfModiscoSeqletsToPatternsFactory`. See this notebook for an example. Here's an example for MEME-based initialization (which is what's supported at the time of writing):
initclusterer_factory=modisco.clusterinit.memeinit.MemeInitClustererFactory(
    meme_command="meme", base_outdir="meme_out",
    max_num_seqlets_to_use=10000,
    nmotifs=10,
    n_jobs=4)
Explanation of the arguments:
- `meme_command`: this is just `meme` if the meme executable is in the PATH; if it's not in the PATH, then `meme_command` should specify the full path to the executable, e.g. `/software/meme/5.0.1/bin/meme` on the kundajelab servers.
- `base_outdir`: output directory for writing the meme results (will be relative to the current working directory unless an absolute path is provided). Within this directory, subdirectories will be created for each metacluster.
- `max_num_seqlets_to_use`: to prevent MEME from taking too long, the number of seqlets to use for running MEME will be capped to this.
- `nmotifs`: the number of motifs for MEME to find. Only significant motifs (E-value < 0.05) will be used for the clustering.
- `n_jobs`: specifies the value of the `-p` argument of MEME, and also specifies the number of parallel jobs to launch when doing motif scanning with the MEME PWMs.
The cluster initialization with MEME is achieved as follows: the PWMs produced by MEME are used to scan all the seqlets, and only PWM matches that exceed the Bayes optimal threshold specified by MEME are considered. Seqlets that contain no PWM matches are assigned to their own cluster. The remaining seqlets are each assigned to a cluster corresponding to the PWM for which they had the strongest match by log-odds score.
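The assignment rule described above can be sketched like this. It is an illustration only: the function name is hypothetical, the scores are assumed to be precomputed best log-odds matches per seqlet and PWM, and "their own cluster" is read here as one singleton cluster per unmatched seqlet, which is an assumption.

```python
def assign_initial_clusters(match_scores, thresholds):
    """Assign each seqlet to the PWM with its strongest above-threshold match.

    match_scores[i][k] is the best log-odds score of PWM k on seqlet i and
    thresholds[k] is the cutoff for PWM k. Seqlets with no passing match get
    a fresh singleton cluster index (assumed behaviour).
    """
    next_singleton = len(thresholds)
    assignments = []
    for scores in match_scores:
        passing = [(score, k) for k, score in enumerate(scores)
                   if score > thresholds[k]]
        if passing:
            assignments.append(max(passing)[1])  # PWM with the strongest match
        else:
            assignments.append(next_singleton)
            next_singleton += 1
    return assignments


scores = [[5.0, 1.0], [0.5, 0.2], [2.0, 9.0], [0.1, 0.1]]
clusters = assign_initial_clusters(scores, thresholds=[1.5, 3.0])
```

In this toy input the first and third seqlets go to PWM 0 and PWM 1 respectively, and the two seqlets without a passing match receive the new cluster indices 2 and 3.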
The cluster initialization affects the TF-MoDISco workflow in two places:
- First, the fine-grained similarity is computed not just on the set of nearest-neighbors that have the highest coarse-grained similarity across all seqlets, but also on the set of nearest-neighbors that have the highest coarse-grained-similarity within each initialized cluster.
- Second, it is used to initialize Leiden community detection.
Empirically, this seems to result in TF-MoDISco clusters that get the "best of both worlds" from MEME and TF-MoDISco.
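The first of the two points above amounts to taking, for every seqlet, the union of its top neighbors overall and its top neighbors inside its initial cluster. A small sketch under that reading follows; the function name, the dense similarity matrix and the single `k` used for both neighbor sets are illustrative assumptions.

```python
def neighbors_for_fine_grained(sim, clusters, k):
    """For each seqlet, union of its k nearest neighbors overall and within its cluster."""
    n = len(sim)
    result = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        top_global = sorted(others, key=lambda j: -sim[i][j])[:k]
        same_cluster = [j for j in others if clusters[j] == clusters[i]]
        top_within = sorted(same_cluster, key=lambda j: -sim[i][j])[:k]
        result.append(sorted(set(top_global) | set(top_within)))
    return result


coarse_sim = [
    [1.0, 0.1, 0.9, 0.8],
    [0.1, 1.0, 0.7, 0.6],
    [0.9, 0.7, 1.0, 0.2],
    [0.8, 0.6, 0.2, 1.0],
]
init_clusters = [0, 0, 1, 1]
pairs = neighbors_for_fine_grained(coarse_sim, init_clusters, k=1)
```

Seqlet 0, for example, keeps seqlet 2 (its best coarse-grained match overall) and also seqlet 1 (its best match inside its initial cluster), even though the coarse-grained similarity to seqlet 1 is low.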
Other changes:
- Moved from Louvain -> Leiden for the main community detection step. Note that I am no longer doing consensus clustering with Leiden because it didn't appear to work well (consistent with this discussion on twitter); instead, I am just taking the best modularity over 50 runs of Leiden with different random seeds. To go back to using Louvain for the main community detection step, set the `use_louvain` argument to `True` in `TfModiscoSeqletsToPatternsFactory` - but note that the cluster initialization functionality isn't supported with Louvain.*
- Updated the Nanog notebook to showcase the MEME initialization functionality
- Updated the Nanog notebook to use better normalization (I'm now just doing mean normalization across ACGT at each position, which I think is more intuitive and has a similar effect as the normalization I described in the GkmExplain paper). Also updated the notebook to apply normalization to the importance scores of the dinuc-shuffled null (previously, the scores for the null distribution weren't normalized)
- Added tests for the MEME-based initialization
*The reason I don't support cluster initialization with Louvain is that, when using Louvain, the number of clusters can only decrease from one iteration to the next (with Leiden, the number of clusters can go up because there's a cluster-splitting step). In other words, if initialization were used with Louvain, the number of discovered clusters would be capped at the number of clusters present during initialization, which is undesirable. By the way, Louvain is still used in the "spurious merging detection" step of the post-processing; the reason is that in this step I attempt to split each cluster into two subclusters, and when using Louvain this cap on the number of subclusters can be achieved by initializing Louvain to have only 2 clusters (since the number of clusters in Louvain only decreases with each iteration).
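The "best modularity over 50 runs with different random seeds" selection mentioned in the Leiden note above follows a simple pattern, sketched here in plain Python. The clusterer is a deliberately trivial stand-in (a random two-way labelling), not Leiden; in practice it would be replaced by a real community detection call that accepts a seed. All function names are hypothetical.

```python
import random


def modularity(adj, labels):
    """Newman modularity of a labelling on a symmetric adjacency matrix."""
    degree = [sum(row) for row in adj]
    two_m = sum(degree)
    q = 0.0
    for i in range(len(adj)):
        for j in range(len(adj)):
            if labels[i] == labels[j]:
                q += adj[i][j] - degree[i] * degree[j] / two_m
    return q / two_m


def random_split(adj, seed):
    """Stand-in clusterer (NOT Leiden): random two-way labelling for a given seed."""
    rng = random.Random(seed)
    return [rng.randint(0, 1) for _ in adj]


def best_of_runs(adj, cluster_fn, n_runs=50):
    """Run the clusterer with seeds 0..n_runs-1 and keep the highest-modularity result."""
    best = None
    for seed in range(n_runs):
        labels = cluster_fn(adj, seed)
        q = modularity(adj, labels)
        if best is None or q > best[0]:
            best = (q, labels)
    return best


# Two triangles joined by a single edge.
adj = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
best_q, best_labels = best_of_runs(adj, random_split)
```

Unlike consensus clustering, this keeps a single run's partition unchanged, so the result is always a partition that the algorithm actually produced.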
Dependency fix for leidenalg and tqdm
Corresponds to PR #59. Updated setup.py to include leidenalg and tqdm.