Releases: kundajelab/tfmodisco
Ability to specify a plot save directory
PR: #56; example usage: https://github.com/kundajelab/tfmodisco/blob/682faf6ef2dc40bbf4f3b0fdd57a23000e8737a1/test/test_tfmodisco_workflow.py#L117. This helps avoid seqlet score distribution plots being overwritten when multiple TF-MoDISco jobs are launched from the same directory.
Scikit version compatibility fix + relaxing of numerical precision in assert statements
Changes:
- Compatibility with scikit-learn >= 0.22 from #55 (retaining compatibility with versions < 0.22 as well).
- Relaxing of assert statement numerical precision thresholds requested by @mmtrebuchet (#54 and #53).
Bugfix: relaxed numerical precision threshold for symmetry check
The threshold I was using to check that the coarse-grained affinity matrix is symmetric within numerical precision was too stringent (presumably because the dot product involves summation, so numerical error accumulates); I relaxed the threshold in commit adee311. @atseng95 this should fix the error you messaged Abhi about (I made the numerical threshold much more lax than is probably required - 1e-5 might have been enough - but this is just so that no one gets stuck on that error in the future in some weird edge case).
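As a rough illustration of what a symmetry check within numerical precision looks like, here is a minimal NumPy sketch. The function name is_symmetric and the default tolerance are assumptions made for this example; this is not the check used in the TF-MoDISco codebase, and the tolerance chosen in commit adee311 is not reproduced here.

```python
import numpy as np

def is_symmetric(mat, atol=1e-5):
    """Return True if a 2-D matrix equals its transpose within an absolute tolerance.

    Hypothetical helper for illustration; atol=1e-5 is an assumed default.
    """
    mat = np.asarray(mat)
    if mat.shape[0] != mat.shape[1]:
        return False
    return bool(np.max(np.abs(mat - mat.T)) <= atol)

rng = np.random.RandomState(0)
x = rng.randn(50, 20)
affmat = np.dot(x, x.T)  # symmetric in exact arithmetic

# Add a tiny asymmetric perturbation to mimic accumulated roundoff error.
noisy = affmat + 1e-7 * rng.randn(50, 50)

# An exact-equality check rejects the perturbed matrix,
# while a tolerance-based check accepts it.
exactly_equal = np.array_equal(noisy, noisy.T)
within_tolerance = is_symmetric(noisy, atol=1e-5)
```

A tolerance that is too tight (for example atol=1e-9 in this toy setup) would reject a matrix that is symmetric for all practical purposes, which is the failure mode described above.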
Functionality for just extracting seqlets
Corresponds to pull request #51 - for situations where the user just wants to extract the seqlets. Demo notebook at https://github.com/kundajelab/tfmodisco/blob/master/examples/H1ESC_Nanog_gkmsvm/JustExtractSeqletsNanog.ipynb
Backward compatibility for numpy, minor adjustment to gkmer embedding calc
The bugfix in #47 broke backward compatibility with some earlier versions of numpy. This tagged release incorporates a fix to restore backward compatibility (commit 6be7ea5) and also makes a minor adjustment to the gapped kmer embedding calculation such that forward and reverse-complement versions of a seqlet now give exactly symmetrical embeddings within numerical precision (commit 19461fa).
To elaborate on why the forward and reverse-complement versions of a seqlet did not give perfectly symmetrical embeddings prior to this fix: consider the case of gapped kmers with a word length of 3 and one gap. Previously, I was treating *NN and NN* (e.g. *AA and AA*) as though they were redundant with each other, so I only used one of them when computing the embedding. However, *AA and AA* can produce different results due to the difference in padding; concretely, a seqlet with the sequence AAGGG contains the AA* gapped kmer but does NOT contain the *AA gapped kmer. Thus, when I was including only the AA* and TT* gapped kmers in my embedding and NOT the *AA and *TT gapped kmers, a seqlet with the sequence AAGGG would be recorded as containing the AA* gapped kmer, but its reverse complement CCCTT would NOT be recorded as having any TT-containing gapped kmer; thus, symmetry was broken. With this fix, I now include BOTH AA* and *AA as well as BOTH TT* and *TT as features in the gapped kmer embedding; thus, an AAGGG seqlet is recorded as having a match to AA* while the reverse complement CCCTT is recorded as having a match to *TT, and symmetry is preserved.
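The AAGGG / CCCTT example can be reproduced with a small self-contained sketch. This is not the embedding code from the repository: the helper names (revcomp, contains_gapped_kmer, embed) are made up for illustration, and a gapped kmer is assumed to match only when its whole window, gap included, lies inside the seqlet.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "*": "*"}

def revcomp(seq):
    """Reverse-complement a sequence or a gapped-kmer pattern ('*' is the gap)."""
    return "".join(COMPLEMENT[ch] for ch in reversed(seq))

def contains_gapped_kmer(seq, pattern):
    """True if the pattern matches a window that lies fully inside seq."""
    k = len(pattern)
    for start in range(len(seq) - k + 1):
        window = seq[start:start + k]
        if all(p == "*" or p == b for p, b in zip(pattern, window)):
            return True
    return False

def embed(seq, features):
    """Binary presence/absence vector over the given gapped-kmer features."""
    return {f: int(contains_gapped_kmer(seq, f)) for f in features}

seqlet = "AAGGG"
rc_seqlet = revcomp(seqlet)  # "CCCTT"

# Old feature set: only the trailing-gap versions are included.
old_features = ["AA*", "TT*"]
old_fwd = embed(seqlet, old_features)     # AA* is present
old_rev = embed(rc_seqlet, old_features)  # nothing is present: symmetry is broken

# New feature set: both leading-gap and trailing-gap versions are included.
new_features = ["AA*", "*AA", "TT*", "*TT"]
new_fwd = embed(seqlet, new_features)
new_rev = embed(rc_seqlet, new_features)

# Symmetry: each feature's value on the forward seqlet equals the value of the
# reverse-complemented feature on the reverse-complemented seqlet.
symmetric = all(new_fwd[f] == new_rev[revcomp(f)] for f in new_features)
```

With the old feature set, the reverse complement of AA* (which is *TT) is simply missing from the embedding, so there is nothing for the forward match to pair with.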
Further reduced memory usage and NaN bugfix
Relative to v0.5.4.0, this release incorporates PRs #47 and #50. The first addresses the occurrence of NaN values in modisco.affinitymat.NumpyCosineSimilarity, and the second reduces the memory footprint of graph2binary (thanks @hy395!). (Memory usage still needs to be reduced further in subsequent releases - see #49 for discussion.)
Updated hit scoring notebook
Corresponds to pull request #46
- Updated hit scoring strategy in the demo notebook to showcase the combination of the "masked hypothetical CWM cosine similarity" and the "sum of scores" metrics.
- Added associated functions for computing those scores to modisco.util.
- Put in some functionality for trimming motifs (the "AggregatedSeqlet" class in the codebase) according to the information content, or according to the sum of the absolute value of some score track (e.g. trimming motifs based on the hypothetical contribution scores).
- Did some minor refactoring of the code for computing information-content scaled versions of the position probability matrices.
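To give a sense of what trimming a motif by information content or by an absolute score track can look like, here is a minimal NumPy sketch. It is not the implementation added to modisco.util: the function names, the pseudocount, the uniform background, and the rule of keeping the span between the first and last positions scoring at least a fraction of the maximum are all assumptions made for this example.

```python
import numpy as np

def information_content(ppm, background=None, pseudocount=1e-3):
    """Per-position information content (in bits) of a position probability matrix.

    ppm has shape (length, 4). A uniform background is assumed if none is given.
    """
    ppm = np.asarray(ppm, dtype=float)
    if background is None:
        background = np.full(ppm.shape[1], 1.0 / ppm.shape[1])
    smoothed = ppm + pseudocount
    probs = smoothed / np.sum(smoothed, axis=1, keepdims=True)
    return np.sum(probs * np.log2(probs / background), axis=1)

def trim_bounds(per_position_score, frac=0.3):
    """Return (start, end) spanning the first to last position whose absolute
    score is at least frac * max. The thresholding rule is an assumption."""
    score = np.abs(np.asarray(per_position_score, dtype=float))
    keep = np.nonzero(score >= frac * score.max())[0]
    return int(keep[0]), int(keep[-1]) + 1

# Toy motif: two uninformative flanking positions, a two-base core, one more flank.
ppm = np.array([
    [0.25, 0.25, 0.25, 0.25],
    [0.25, 0.25, 0.25, 0.25],
    [0.97, 0.01, 0.01, 0.01],
    [0.01, 0.97, 0.01, 0.01],
    [0.25, 0.25, 0.25, 0.25],
])
ic_start, ic_end = trim_bounds(information_content(ppm))
trimmed_ppm = ppm[ic_start:ic_end]

# Trimming on a score track instead: sum absolute values across bases per position.
hyp_scores = np.array([
    [0.01, -0.02, 0.00, 0.01],
    [0.90, -0.30, -0.20, -0.40],
    [-0.50, 0.80, -0.10, -0.20],
    [0.02, 0.01, -0.01, 0.00],
])
score_start, score_end = trim_bounds(np.sum(np.abs(hyp_scores), axis=1))
```

The same trim_bounds helper serves both cases because each reduces to a single non-negative score per position.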
Version prior to changing hit scoring strategy in demo nbs
The main reason for creating this version tag is that I'm about to change the hit scoring strategy in the demo notebook so I can send the newer version of the hit scoring to David & Han. The change between version 0.5.3.0 and 0.5.3.1 is that I added an option to skip the fine-grained clustering step (I don't recommend people actually use this option; I had just added it in to see how things behaved without the fine-grained step). I had also added in a version of the demo notebook that runs on Google Colab, which I will also update when I put in the newer hit scoring.
Reduced memory usage
Corresponds to PR #45; some modifications for cutting down on the memory footprint.
Ability to have arbitrary auxiliary tracks for visualization purposes
The auxiliary tracks are not used during the clustering but can be useful for visualization purposes (e.g. if you want to visualize the value of methylation/conservation/DNase footprints at a modisco motif). In the demo notebook at https://github.com/kundajelab/tfmodisco/blob/886f4815c89756a5d010a191c944061d8760c564/test/nb_test/talgata/TF%20MoDISco%20TAL%20GATA%20with%20Activations.ipynb, I use it to visualize the activations of the conv layer for each motif. The extra data tracks are supplied in the call to TfModiscoWorkflow via the other_tracks argument. other_tracks accepts a list of instances of modisco.core.DataTrack.
If the data are such that there is no concept of reverse complements (e.g. RNA-based data), then when instantiating the DataTrack objects, leave the value of rev_tracks as None (and also make sure revcomp=False when calling TfModiscoWorkflow). Otherwise, rev_tracks should be the value that fwd_tracks would have if the reverse complement of the input sequence were provided (e.g. for conv layer activations, you can reverse-complement the original input sequence and recompute the conv layer activations). (At the time of writing, I have not personally tested extensively how TF-MoDISco behaves for RNA-type data, though others have.)
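As a sketch of how the forward and reverse values for a positional track might be prepared, the snippet below reverse-complements one-hot encoded sequences and recomputes a toy activation track on them. Everything here is illustrative: one_hot, revcomp_onehot and toy_conv_activations are hypothetical helpers (the last is a stand-in for a real model's conv layer), and the ACGT column order is an assumption. The resulting arrays are what would be supplied as fwd_tracks and rev_tracks when instantiating modisco.core.DataTrack; the other constructor arguments are not shown.

```python
import numpy as np

def one_hot(seq):
    """One-hot encode a DNA string as a (length, 4) array, columns in ACGT order."""
    idx = np.array(["ACGT".index(base) for base in seq])
    return np.eye(4)[idx]

def revcomp_onehot(onehot):
    """Reverse-complement a (length, 4) one-hot array in ACGT column order.

    Reversing the rows reverses the sequence; reversing the columns swaps
    A<->T and C<->G.
    """
    return onehot[::-1, ::-1]

def toy_conv_activations(onehot, filters):
    """Stand-in for a conv layer: valid cross-correlation followed by ReLU.

    filters has shape (num_filters, width, 4); the result has shape
    (length - width + 1, num_filters).
    """
    width = filters.shape[1]
    n_pos = onehot.shape[0] - width + 1
    acts = np.array([[np.sum(onehot[i:i + width] * f) for f in filters]
                     for i in range(n_pos)])
    return np.maximum(acts, 0.0)

rng = np.random.RandomState(1)
filters = rng.randn(3, 4, 4)  # 3 toy filters of width 4
seqs = ["AAGGGTCA", "GATAAGCC"]
onehots = [one_hot(s) for s in seqs]

# fwd_tracks: the track computed on each input sequence as given.
fwd_tracks = [toy_conv_activations(x, filters) for x in onehots]
# rev_tracks: the same track recomputed on each reverse-complemented sequence.
rev_tracks = [toy_conv_activations(revcomp_onehot(x), filters) for x in onehots]
```

For data with no notion of a reverse complement, the second list would simply not be computed and rev_tracks would be left as None.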
If the data are such that there is no positional axis (e.g. if you want to visualize the activations of the fully-connected layer for each motif), set has_pos_axis to False when instantiating the DataTrack object. Note that I have not tested the functionality with has_pos_axis=False at all.