Update README
amtseng committed Nov 1, 2022
1 parent c1cbf7c commit d266458
Showing 1 changed file with 29 additions and 63 deletions.
92 changes: 29 additions & 63 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,73 +1,39 @@
## TF-MoDISco
## TF-MoDISco: Transcription-Factor Motif Discovery from Importance Scores

[![CircleCI](https://circleci.com/gh/kundajelab/tfmodisco.svg?style=shield)](https://app.circleci.com/pipelines/github/kundajelab/tfmodisco) [![license](https://img.shields.io/github/license/mashape/apistatus.svg?maxAge=2592000)](https://github.com/kundajelab/tfmodisco/blob/master/LICENSE) [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.4728132.svg)](https://doi.org/10.5281/zenodo.4728132)

**NOTE: we are still refining the multi-task version of TF-MoDISco. If you encounter difficulties running TF-MoDISco with multiple tasks, our recommendation is to run it on one task at a time.**
This repository contains the code developed for the associated manuscript, _Distilling consolidated DNA sequence motifs and cooperative motif syntax from neural-network models of in vivo transcription-factor binding profiles_. The analysis scripts and notebooks used to reproduce the results in this manuscript can be found at [this repository](https://github.com/kundajelab/neural_motif_discovery).

**NOTE: although in the GkmExplain paper, TF-MoDISco was only run on importance scores computed on the testing set, this is NOT a requirement of TF-MoDISco; the TF-MoDISco algorithm can just as easily be run on importance scores derived from the full dataset, if the user feels that will increase the sensitivity of the results**
General users should visit the [TF-MoDISco-lite repository](https://github.com/jmschrei/tfmodisco-lite/) for a more efficient, actively maintained, and easier-to-use version of the same algorithm.

Installation:
At the time of writing, the latest version on PyPI is 0.5.13.0, which can be installed using `pip install modisco`. To install from this source code, clone the repo and then run `pip install --editable /path/to/cloned/repo`.
### Structure of TF-MoDISco

The TF-MoDISco algorithm starts with a set of importance scores on genomic sequences, and can perform the following tasks:

1. Identify high-importance windows of the sequences, termed "seqlets"
2. Cluster recurring similar seqlets into motifs
3. Scan through importance scores across the genome to call motif instances (AKA "hit scoring")
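
As a rough illustration of the first step only, the sketch below picks the single fixed-width window with the largest total absolute importance in one sequence. This is a toy stand-in, not the procedure TF-MoDISco uses to call seqlets; the function name `top_window` and the window size are assumptions made for the example.

```python
import numpy as np

def top_window(per_position_scores, window_size):
    """Return (start, end) of the window with the largest summed |importance|."""
    # Sliding-window sum of absolute scores; "valid" keeps only full windows.
    sums = np.convolve(np.abs(per_position_scores),
                       np.ones(window_size), mode="valid")
    start = int(np.argmax(sums))
    return start, start + window_size  # half-open interval [start, end)

# Hypothetical per-position importance for one sequence
# (e.g. contribution scores summed over the 4 bases).
scores = np.array([0.0, 0.1, 0.0, 2.0, 3.0, 2.5, 0.1, 0.0])
start, end = top_window(scores, window_size=3)
```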

### Installing TF-MoDISco

`pip install modisco`

Alternatively, for a specific tagged version or commit, install from source code by cloning this repository, checking out the desired version, and running `pip install -e /path/to/cloned/repo`.

### Required inputs to run the algorithm

To run the TF-MoDISco algorithm, the following data are required as input:

- An N x L x 4 NumPy array of one-hot encoded genomic sequences, where N is the number of sequences and L is the sequence length (the 4 bases are in A, C, G, T order); this denotes the identity of the sequence
- A parallel N x L x 4 NumPy array of contribution scores; each position contains the importance of the base specified in the corresponding one-hot encoded sequence (i.e. each base position should have at most one nonzero entry out of the 4, which measures importance at the base in the sequence)
- An optional parallel N x L x 4 NumPy array of _hypothetical_ contribution scores, which measures the hypothetical contribution of _every_ base (not just the one that is present in the sequence); equivalently, the element-wise product of this array with the one-hot encoded genomic sequences should be identical to the array of contribution scores
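
The relationship between the three arrays can be checked with a few lines of NumPy. This is a minimal sketch using random stand-in data; the variable names (`one_hot`, `hyp_scores`, `contrib_scores`) and the array sizes are assumptions for illustration, and real scores would come from an attribution method rather than a random generator.

```python
import numpy as np

rng = np.random.default_rng(0)
N, L = 3, 10  # hypothetical: 3 sequences of length 10

# One-hot encode random sequences; the last axis is in A, C, G, T order.
base_indices = rng.integers(0, 4, size=(N, L))
one_hot = np.eye(4, dtype=np.float32)[base_indices]  # shape (N, L, 4)

# Stand-in hypothetical contribution scores: one value for every possible base.
hyp_scores = rng.normal(size=(N, L, 4)).astype(np.float32)

# Actual contribution scores keep only the entry for the base that is present.
contrib_scores = hyp_scores * one_hot

# All three arrays are parallel, and each position has at most one nonzero score.
assert one_hot.shape == hyp_scores.shape == contrib_scores.shape == (N, L, 4)
assert np.all((contrib_scores != 0).sum(axis=-1) <= 1)
```

Running the same two assertions on real inputs is a cheap sanity check before starting a long motif-discovery run.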

### Other resources

A technical note describing version 0.5.6.5 is available at [https://arxiv.org/abs/1811.00416](https://arxiv.org/abs/1811.00416).
Video of talk at NeurIPS MLCB 2017: https://www.youtube.com/watch?v=fXPGVJg956E

Please see the following example notebooks:
[Video of talk at NeurIPS MLCB 2017](https://www.youtube.com/watch?v=fXPGVJg956E)

Example notebooks for running the algorithm:
- [TF MoDISco TAL GATA](examples/simulated_TAL_GATA_deeplearning/TF_MoDISco_TAL_GATA.ipynb): a self-contained example notebook that uses pre-computed importance scores (generated by a neural network) as input. Scores were generated using DeepLIFT as illustrated in [this notebook](examples/simulated_TAL_GATA_deeplearning/Generate%20Importance%20Scores.ipynb). If DeepLIFT doesn't work with your architecture, you could alternatively generate scores using DeepSHAP (DeepSHAP is an extension of DeepLIFT that can work with more diverse architectures) as illustrated in [this notebook](https://github.com/AvantiShri/shap/blob/276bb8cae899a79dedab15c294cd440e57d5695e/notebooks/deep_explainer/Tensorflow%20DeepExplainer%20Genomics%20Example%20With%20Hypothetical%20Importance%20Scores.ipynb) (heads-up: that notebook uses a custom branch of the DeepSHAP repository).
- [TF MoDISco Nanog](examples/H1ESC_Nanog_gkmsvm/TF%20MoDISco%20Nanog.ipynb): a self-contained example notebook that uses pre-computed importance scores and an empirically-generated null distribution (generated by a gkm-SVM) as input. Scores were generated using GkmExplain as illustrated in [this notebook](examples/H1ESC_Nanog_gkmsvm/Nanog_GkmExplain_Generate_Data.ipynb). This notebook also illustrates how to use a MEME-based initialization to potentially boost the performance of TF-MoDISco.

TF-MoDISco has been used in the following papers:
- [Deep learning at base-resolution reveals motif syntax of the cis-regulatory code](https://www.biorxiv.org/content/10.1101/737981v1) (Avsec et al.)
- [GkmExplain: fast and accurate interpretation of nonlinear gapped k-mer SVMs](https://academic.oup.com/bioinformatics/article/35/14/i173/5529147) (Shrikumar, Prakash & Kundaje)

Full paper on the way.

## Loading a saved TF-MoDISco HDF5 File

In the example notebooks, you will notice that the output of TF-MoDISco is saved as an HDF5 file. Below is documentation on how to load this output and what the different attributes mean. If you catch something that appears to be out of date or doesn't make sense, please file a GitHub issue to let me know.

The easiest way to load the hdf5 file is to create a TfModiscoResults object via the function `modisco.tfmodisco_workflow.workflow.TfModiscoResults.from_hdf5(...)`. The use of this function is demonstrated in cell 10 of [this notebook](https://github.com/kundajelab/tfmodisco/blob/948d62c5143f4e05469f63610e7c9cf2033f0f76/examples/simulated_TAL_GATA_deeplearning/With%20Hit%20Scoring%20TF%20MoDISco%20TAL%20GATA.ipynb); the only catch is that it requires the data for all the importance score tracks to be provided via a `TrackSet` object (the `TrackSet` object is needed to recreate the seqlets from the data stored in the hdf5 file). Below I have documented the important attributes of the `TfModiscoResults` class and the key subclasses. If for whatever reason you are specifically interested in the hdf5 format, let me know and I can detail where all these attributes wind up in the hdf5 file. Alternatively you can inspect the `save_hdf5` functions of the relevant classes to see how the attributes are stored.

`tfmodisco_workflow.workflow.TfModiscoResults`:
- `.task_names`: list of the task names that the `TfModiscoWorkflow` object was called with
- `.multitask_seqlet_creation_results`: instance of `core.MultitaskSeqletCreationResults`; stores all the information about the seqlets that were identified across all tasks during the seqlet identification step. See below.
- `.metaclustering_results`: instance of `metaclusterers.MetaclusteringResults`, which stores details on the metaclusters obtained from doing metaclustering on the seqlets. See below.
- `.metacluster_idx_to_submetacluster_results`: dictionary that maps the metacluster number to an instance of `tfmodisco_workflow.workflow.SubMetaclusterResults`. `SubMetaclusterResults` stores the results of applying clustering to the seqlets within a metacluster, including the motifs found for the metacluster. See below.

`tfmodisco_workflow.workflow.SubMetaclusterResults`:
- `.metacluster_size`: the number of seqlets in this metacluster
- `.activity_pattern`: the activity pattern of this metacluster. The activity pattern of a metacluster is a vector of length=number-of-tasks, and the entries in the vector are -1, 0 or 1 for each task. The activity pattern indicates how the seqlets in that metacluster contribute to the different tasks.
- `.seqlets`: the seqlets that fell within this metacluster
- `.seqlets_to_patterns_result`: an instance of `tfmodisco_workflow.seqlets_to_patterns.SeqletsToPatternsResults`; this stores information on the motifs ("patterns") identified within the metacluster. See below.

`tfmodisco_workflow.seqlets_to_patterns.SeqletsToPatternsResults`
- `.success`: whether or not the motif discovery for this metacluster terminated successfully
- `.patterns`: a list of instances of `core.AggregatedSeqlet`, which represent the motifs. See below.
- `.total_time_taken`: the total time taken for performing motif discovery for this metacluster.

`core.AggregatedSeqlet`: (this is the class used to represent motifs)
- `.seqlets`: returns a list of seqlets for this motif. Seqlets are instances of the `core.Seqlet` class. See below.
- `[track_name].fwd`: returns the forward strand version of track_name; this is the average value over all seqlets in the motif. (Note: in case my notation is unclear, I mean that you can use the dictionary lookup syntax, i.e. do `motif[track_name].fwd` to get the data). `track_name` is a string.
- `[track_name].rev`: returns the reverse complement of track_name; this is the average value over all seqlets in the motif

`core.Seqlet`:
- `.coor`: returns an instance of `core.SeqletCoordinates`; see below
- `[track_name].fwd`: returns the forward strand version of track_name
- `[track_name].rev`: returns the reverse complement version of track_name

`core.SeqletCoordinates`:
- `.example_idx`: the index of the example from which this seqlet originated. This index corresponds to the data that was provided in the call to TfModiscoWorkflow.
- `.start`: the location of the start of the seqlet within the example
- `.end`: the location of the end of the seqlet within the example
- `.is_revcomp`: whether the seqlet is on the reverse strand (as opposed to the forward strand)
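
To make the role of these coordinates concrete, here is a minimal sketch of how they could be used to slice a seqlet-sized window out of an N x L x 4 track. The `Coords` dataclass and `extract_window` function are hypothetical stand-ins written for this example, not part of the modisco API, and the sketch assumes that `end` is exclusive and that `is_revcomp` is `True` for the reverse strand.

```python
from dataclasses import dataclass

import numpy as np

@dataclass
class Coords:
    """Hypothetical stand-in mirroring the attributes of core.SeqletCoordinates."""
    example_idx: int
    start: int
    end: int          # assumed exclusive
    is_revcomp: bool  # assumed True for the reverse strand

def extract_window(track, coords):
    """Slice one window out of an N x L x 4 track (bases in A, C, G, T order)."""
    window = track[coords.example_idx, coords.start:coords.end]
    if coords.is_revcomp:
        # Reversing both the position axis and the base axis gives the
        # reverse complement when the bases are ordered A, C, G, T.
        window = window[::-1, ::-1]
    return window

# One example sequence, ACGTAA, one-hot encoded.
one_hot = np.eye(4)[np.array([[0, 1, 2, 3, 0, 0]])]
fwd = extract_window(one_hot, Coords(0, 1, 4, False))  # bases C, G, T
rev = extract_window(one_hot, Coords(0, 1, 4, True))   # reverse complement: A, C, G
```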

`core.MultitaskSeqletCreationResults`
- `.multitask_seqlet_creator`: instance of `core.MultitaskSeqletCreator`; stores the information needed to create the seqlets given new data
- `.final_seqlets`: the final list of seqlets produced across tasks
- `.task_name_to_coord_producer_results`: mapping from the task name to an instance of `coordproducers.CoordProducerResults`, which stores information on the seqlet coordinates identified for that particular task, as well as the thresholding cutoffs used.

`metaclusterers.MetaclusteringResults`
- `.metacluster_indices`: a vector where `metacluster_indices[i]` returns the metacluster number for the seqlet at index i. You should find that the ordering matches the ordering of `TfModiscoResults.multitask_seqlet_creation_results.final_seqlets`.
- `.metaclusterer`: an instance of `metaclusterers.AbstractMetaclusterer`, which can be used to assign metaclusters to seqlets obtained on new data.
- `.attribute_vectors`: mostly for debugging purposes; these are the attributes that were extracted from the seqlets and supplied to the `Metaclusterer` for metaclustering; think of them as the seqlet features that were used for metaclustering. Again, the order of the `attribute_vectors` should match the ordering of `TfModiscoResults.multitask_seqlet_creation_results.final_seqlets`.
- `.metacluster_idx_to_activity_pattern`: mapping from the metacluster to the pattern of activity across tasks. The activity pattern of a metacluster is a vector of length=number-of-tasks, and the entries in the vector are -1, 0 or 1 for each task. It indicates how the seqlets in that metacluster contribute to the different tasks.
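
Because `metacluster_indices` is parallel to `final_seqlets`, the two can be zipped together to group seqlets by metacluster. The sketch below uses plain Python and NumPy stand-ins for these attributes (strings in place of seqlet objects, and a made-up two-task activity-pattern mapping); it does not load a real results object.

```python
import numpy as np

# Hypothetical stand-ins for the attributes described above.
metacluster_indices = np.array([0, 1, 0, 2, 1, 0])
final_seqlets = ["s0", "s1", "s2", "s3", "s4", "s5"]  # placeholders for seqlets
metacluster_idx_to_activity_pattern = {
    0: (1, 0),   # contributes positively to task 0 only
    1: (-1, 0),  # contributes negatively to task 0 only
    2: (1, 1),   # contributes positively to both tasks
}

# Group seqlets by metacluster number, relying on the shared ordering.
seqlets_by_metacluster = {}
for seqlet, idx in zip(final_seqlets, metacluster_indices):
    seqlets_by_metacluster.setdefault(int(idx), []).append(seqlet)

# Report each metacluster's activity pattern and size.
for idx, members in sorted(seqlets_by_metacluster.items()):
    print(idx, metacluster_idx_to_activity_pattern[idx], len(members))
```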
