Releases: kundajelab/tfmodisco

Ability to specify a plot save directory

19 Apr 01:01

PR: #56. Example usage: https://github.com/kundajelab/tfmodisco/blob/682faf6ef2dc40bbf4f3b0fdd57a23000e8737a1/test/test_tfmodisco_workflow.py#L117. This helps avoid seqlet score distribution plots being overwritten when multiple tf-modisco jobs are launched from the same directory.

Scikit version compatibility fix + relaxing of numerical precision in assert statements

21 Feb 23:31
f0a910c

Changes:

  • Compatibility with scikit-learn >= 0.22 from #55 (retaining compatibility with versions < 0.22 as well).
  • Relaxing of assert statement numerical precision thresholds requested by @mmtrebuchet (#54 and #53).

Bugfix for reducing threshold for numerical precision for symmetry check

12 Dec 20:35

The threshold I used to check that the coarse-grained affinity matrix is symmetric within numerical precision was too stringent (presumably because the dot product involves summation, so numerical error accumulates); I relaxed the threshold in commit adee311. @atseng95 this should fix the error you messaged Abhi about. (I made the numerical threshold much more lax than is probably required, since 1e-5 might have been enough, but this is just so that no one gets stuck on that error in the future in some weird edge case.)
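
To illustrate why an exact symmetry check is too strict for floating-point matrices, here is a minimal sketch. It is not the check from commit adee311; the helper names and the toy matrix are made up for illustration, and the 1e-5 tolerance is simply the value mentioned above.

```python
def max_asymmetry(mat):
    """Largest absolute difference between mat[i][j] and mat[j][i]."""
    n = len(mat)
    return max(abs(mat[i][j] - mat[j][i]) for i in range(n) for j in range(n))

def is_symmetric(mat, tol):
    """True if the matrix is symmetric up to an absolute tolerance."""
    return max_asymmetry(mat) <= tol

# Toy matrix: 0.1 + 0.2 is not exactly 0.3 in floating point, so the two
# off-diagonal entries differ by a tiny rounding error.
toy_mat = [[1.0, 0.1 + 0.2],
           [0.3, 1.0]]

exact_ok = is_symmetric(toy_mat, tol=0.0)     # fails on rounding error alone
relaxed_ok = is_symmetric(toy_mat, tol=1e-5)  # passes with a small tolerance
```

The point of the sketch is only that rounding error in sums can break exact equality, which is why the check needs some tolerance.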

Functionality for just extracting seqlets

06 Dec 03:59
38c0bf4

Corresponds to pull request #51 - for situations where the user just wants to extract the seqlets. Demo notebook at https://github.com/kundajelab/tfmodisco/blob/master/examples/H1ESC_Nanog_gkmsvm/JustExtractSeqletsNanog.ipynb

Backward compatibility for numpy, minor adjustment to gkmer embedding calc

05 Dec 07:47

The bugfix in #47 broke backward compatibility with some earlier versions of numpy. This tagged release incorporates a fix to restore backward compatibility (commit 6be7ea5) and also makes a minor adjustment to the gapped kmer embedding calculation such that forward and reverse-complement versions of a seqlet now give exactly symmetrical embeddings within numerical precision (commit 19461fa).

To elaborate on why the forward and reverse versions of a seqlet did not give perfectly symmetrical embeddings prior to this fix: consider the case of gapped kmers with a word length of 3 and one gap. Previously, I was treating *NN and NN* (e.g. *AA and AA*) as though they were redundant with each other, so I only used one of them when computing the embedding. However, *AA and AA* can produce different results due to the difference in padding; concretely, a seqlet with the sequence AAGGG contains the AA* gapped kmer but does NOT contain the *AA gapped kmer. Thus, when I was only including the AA* and TT* gapped kmers in my embedding and was NOT including the *AA and *TT gapped kmers, a seqlet with the sequence AAGGG would be recorded as containing the AA* gapped kmer, but its reverse complement CCCTT would NOT be recorded as having any TT-containing gapped kmer; thus, symmetry was broken.

With this fix, I now include BOTH AA* and *AA as well as BOTH TT* and *TT as features in the gapped kmer embedding; thus, an AAGGG seqlet is recorded as having a match to AA* while its reverse complement CCCTT is recorded as having a match to *TT, and symmetry is preserved.
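
The AAGGG example can be reproduced with a toy sketch. This is not the actual embedding code in tfmodisco: the function names are hypothetical, and the "embedding" here is just the set of gapped kmers that match somewhere in the seqlet without padding.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "*": "*"}

def reverse_complement(seq):
    """Reverse-complement a sequence or gapped kmer ('*' marks the gap)."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))

def contains_gapped_kmer(seq, pattern):
    """True if pattern matches some full-length window of seq (no padding)."""
    k = len(pattern)
    return any(
        all(p == "*" or p == s for p, s in zip(pattern, seq[i:i + k]))
        for i in range(len(seq) - k + 1)
    )

def matched_features(seq, features):
    """Set of gapped kmers from `features` that occur in seq."""
    return {f for f in features if contains_gapped_kmer(seq, f)}

def is_revcomp_symmetric(seq, features):
    """Matches of revcomp(seq) should be the revcomps of the matches of seq."""
    fwd = matched_features(seq, features)
    rev = matched_features(reverse_complement(seq), features)
    return {reverse_complement(f) for f in fwd} == rev

old_features = ["AA*", "TT*"]                # only one of each *NN / NN* pair
new_features = ["AA*", "*AA", "TT*", "*TT"]  # both members of each pair
seqlet = "AAGGG"

symmetric_before = is_revcomp_symmetric(seqlet, old_features)
symmetric_after = is_revcomp_symmetric(seqlet, new_features)
```

With only AA* and TT* as features, AAGGG matches AA* but CCCTT matches nothing, so `symmetric_before` is False; with all four features, CCCTT matches *TT and `symmetric_after` is True.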

Further reduced memory usage and NaN bugfix

23 Nov 20:41

Relative to v0.5.4.0, this release incorporates the PRs #47 and #50. The first addresses the occurrence of NaN values in modisco.affinitymat.NumpyCosineSimilarity, and the second reduces the memory footprint of graph2binary (thanks @hy395!). (Memory usage must be reduced even further in subsequent releases; see #49 for discussion.)

Updated hit scoring notebook

18 Nov 09:36
Pre-release

Corresponds to pull request #46

  • Updated hit scoring strategy in the demo notebook to showcase the combination of the "masked hypothetical CWM cosine similarity" and the "sum of scores" metrics.
  • Added associated functions for computing those scores to modisco.util.
  • Put in some functionality for trimming motifs (the "AggregatedSeqlet" class in the codebase) according to the information content, or according to the sum of the absolute value of some score track (e.g. trimming motifs based on the hypothetical contribution scores).
  • Did some minor refactoring of the code for computing information-content scaled versions of the position probability matrices.

Version prior to changing hit scoring strategy in demo nbs

16 Nov 02:19

The main reason for creating this version tag is that I'm about to change the hit scoring strategy in the demo notebook so I can send the newer version of the hit scoring to David & Han. The change between version 0.5.3.0 and 0.5.3.1 is that I added an option to skip the fine-grained clustering step (I don't recommend people actually use this option; I added it just to see how things behaved without the fine-grained step). I also added a version of the demo notebook that runs on Google Colab, which I will also update when I put in the newer hit scoring.

Reduced memory usage

24 Aug 00:00
d99acc8
Pre-release

Corresponds to PR #45; some modifications for cutting down on the memory footprint.

Ability to have arbitrary auxiliary tracks for visualization purposes

07 Aug 20:17

The auxiliary tracks are not used during the clustering but can be useful for visualization purposes (e.g. if you want to visualize the value of methylation/conservation/dnase footprints at a modisco motif). In the demo notebook at https://github.com/kundajelab/tfmodisco/blob/886f4815c89756a5d010a191c944061d8760c564/test/nb_test/talgata/TF%20MoDISco%20TAL%20GATA%20with%20Activations.ipynb, I use it to visualize the activations of the conv layer for each motif. The extra data tracks are supplied in the call to TfModiscoWorkflow via the other_tracks argument. other_tracks accepts a list of instances of modisco.core.DataTrack.

If the data are such that there is no concept of reverse complements (e.g. RNA-based data), then when instantiating the DataTrack objects, leave rev_tracks as None (and also make sure revcomp=False when calling TfModiscoWorkflow). Otherwise, rev_tracks should be the value that fwd_tracks would have if the reverse complement of the input sequence were provided (e.g. for conv layer activations, you can reverse-complement the original input sequence and recompute the conv layer activations). (At the time of writing, I have not personally tested extensively how TFMoDISco behaves for RNA-type data, though others have.)
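
As a rough sketch of how fwd_tracks and rev_tracks relate, the snippet below recomputes a per-position track on the reverse complement of each input sequence. The `toy_activations` function is a hypothetical stand-in for real conv layer activations, and the sequences are made up; the call to modisco.core.DataTrack itself is omitted because its full signature is not spelled out here.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def reverse_complement(seq):
    """Reverse-complement a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))

def toy_activations(seq, motif="GATA"):
    """Hypothetical stand-in for one conv filter: 1.0 where motif starts, else 0.0."""
    k = len(motif)
    return [1.0 if seq[i:i + k] == motif else 0.0
            for i in range(len(seq) - k + 1)]

seqs = ["CCGATAAC", "TTATCGGA"]

# Track values computed on the sequences as given.
fwd_tracks = [toy_activations(s) for s in seqs]

# Track values the same computation gives on the reverse-complemented input,
# i.e. what fwd_tracks would be if the reverse complement had been provided.
rev_tracks = [toy_activations(reverse_complement(s)) for s in seqs]
```

In practice these would presumably be arrays rather than nested lists, and for data without reverse complements rev_tracks would simply stay None.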

If the data is such that there is no positional axis (e.g. if you want to visualize the activations of the fully-connected layer for each motif), set has_pos_axis to False when instantiating the DataTrack object. Note that I have not tested the functionality with has_pos_axis=False at all.