This is not a Resilience project and the code/opinions/recommendations are my personal work and mine alone.
We'll use the Therapeutics Data Commons (TDC) Python package to download open-source (CC BY 4.0) datasets that are meaningful in pharmaceutical research. In this repository, we'll use a dataset called TCR-Epitope Binding Affinity. The code is in the notebook notebooks/tdc-tcr-epitope-binding-affinity-model.ipynb.
We show how to create a deep learning model that predicts whether a T-cell receptor (TCR) and a protein epitope will bind to each other. A model that can predict how well a TCR binds to an epitope can lead to more effective immunotherapy treatments. For example, in anti-cancer therapies it is important for the T-cell receptor to bind to the protein marker on the cancer cell so that the T-cell (actually the T-cell's friends in the immune system) can kill the cancer cell.
Hugging Face provides Python libraries and a model hub that together act as a "one-stop shop" for training and deploying AI models. In this case, we use Hugging Face to get a pre-trained version of Facebook's open-source Evolutionary Scale Model (ESM-2). This model turns a protein sequence into a vector of numbers that the computer can use in a mathematical model. The vector of numbers uniquely encodes (aka embeds) a protein sequence in the same way that the Dewey Decimal System and ISBN uniquely encode a book into a set of numbers (and letters). The space these vectors live in is also referred to as a latent space.
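As a rough illustration of the embedding step, the sketch below loads an ESM-2 checkpoint from the Hugging Face Hub and averages the per-residue vectors into one vector per sequence. The function name `embed_sequences`, the choice of the smallest checkpoint, and the mean pooling are assumptions made for illustration; the notebook may use a different checkpoint or pooling strategy.

```python
# Smallest ESM-2 checkpoint; an assumption, chosen only to keep the example fast.
DEFAULT_CHECKPOINT = "facebook/esm2_t6_8M_UR50D"


def embed_sequences(sequences, checkpoint=DEFAULT_CHECKPOINT):
    """Return one mean-pooled ESM-2 embedding per amino-acid sequence."""
    # Imported here so this file can be loaded without the heavy dependencies.
    import torch
    from transformers import AutoTokenizer, EsmModel

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = EsmModel.from_pretrained(checkpoint)
    model.eval()

    # Pad to the longest sequence so the batch forms a rectangular tensor.
    inputs = tokenizer(list(sequences), return_tensors="pt", padding=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (batch, tokens, hidden_size)

    # Average over non-padding tokens (this includes the special start/end tokens).
    mask = inputs["attention_mask"].unsqueeze(-1).to(hidden.dtype)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)
```

Calling `embed_sequences(["CASSLGQAYEQYF"])` (a made-up example sequence) downloads the checkpoint on first use and returns a tensor with one row per input sequence.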
Then, we'll show how to combine this embedding with a simple neural network to create a binary classifier for TCR-epitope binding prediction (True = they bind, False = they don't bind).
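One way such a classifier could look in PyTorch is sketched below: the TCR and epitope embeddings are concatenated and passed through a small feed-forward network that outputs a single logit. The layer sizes, the concatenation, the 0.5 threshold, and the names `build_binding_classifier` and `predict_binding` are all assumptions for illustration, not the notebook's actual architecture.

```python
def build_binding_classifier(embedding_dim, hidden_dim=128):
    """Small feed-forward network that outputs one binding logit per pair."""
    from torch import nn  # imported lazily; requires PyTorch

    return nn.Sequential(
        nn.Linear(2 * embedding_dim, hidden_dim),  # input: TCR and epitope vectors concatenated
        nn.ReLU(),
        nn.Linear(hidden_dim, 1),  # single logit per TCR-epitope pair
    )


def predict_binding(model, tcr_embeddings, epitope_embeddings, threshold=0.5):
    """Return a boolean tensor: True = predicted to bind, False = predicted not to."""
    import torch

    pairs = torch.cat([tcr_embeddings, epitope_embeddings], dim=-1)
    model.eval()
    with torch.no_grad():
        probabilities = torch.sigmoid(model(pairs)).squeeze(-1)
    return probabilities >= threshold
```

Because the network returns raw logits, a natural training loss for this sketch would be `torch.nn.BCEWithLogitsLoss` applied to the model output and the 0/1 binding labels.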
The Therapeutics Data Commons (TDC) dataset can be downloaded automatically via their open-source Python library. However, computing the Evolutionary Scale Model (ESM-2) embedding vectors takes significant time (hours).
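For reference, downloading the dataset with the TDC package might look like the sketch below. The dataset name `"weber"`, the `data_dir` default, and the wrapper function `load_tcr_epitope_splits` are assumptions based on the TDC documentation as I understand it; check the notebook for the exact call it uses.

```python
def load_tcr_epitope_splits(data_dir="./data"):
    """Download the TCR-epitope dataset with PyTDC and return its data splits."""
    # Imported lazily; requires the PyTDC package.
    from tdc.multi_pred import TCREpitopeBinding

    data = TCREpitopeBinding(name="weber", path=data_dir)
    # get_split() returns a dict of DataFrames keyed by "train", "valid", and "test".
    return data.get_split()
```

The first call downloads the data into `data_dir`; later calls reuse the local copy.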
To save you time, I've uploaded the preprocessed data as Pickle files on Zenodo. If you download those 3 files, then the Python script will skip the embedding step.
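The skip-if-already-computed behaviour can be sketched with a small caching helper like the one below. The helper name `load_or_compute` and the idea of one pickle file per result are assumptions for illustration; the notebook's own file names and logic may differ.

```python
import pickle
from pathlib import Path


def load_or_compute(cache_path, compute):
    """Return the pickled object at cache_path, computing and saving it if absent."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        # Cached result found: skip the slow step entirely.
        with cache_path.open("rb") as handle:
            return pickle.load(handle)

    result = compute()  # the slow step, e.g. computing ESM-2 embeddings
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with cache_path.open("wb") as handle:
        pickle.dump(result, handle)
    return result
```

Note that unpickling can execute arbitrary code, so only load pickle files from a source you trust.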
To install all of the required Python packages, you'll need to create a conda environment. Follow the conda website directions to download and install conda (Anaconda works too). Once you have conda installed, run the command:
conda env create -f environment.yml
Once the environment is successfully created, activate it by running:
conda activate tdc-tcr-epitope-binding-affinity-env
At this point you should be able to run the Jupyter Notebook:
jupyter notebook notebooks/tdc-tcr-epitope-binding-affinity-model.ipynb
If you don't want to install conda, then you can run the Jupyter notebook from within a container.
To build an Apptainer image, run the command:
apptainer build tdc-tcr-epitope-binding-affinity.sif tdc-tcr-epitope-binding-affinity.def
Then, run:
apptainer shell tdc-tcr-epitope-binding-affinity.sif
At this point, you'll be able to run the Jupyter Notebook:
jupyter notebook notebooks/tdc-tcr-epitope-binding-affinity-model.ipynb
To build a Docker image, run the command:
docker build -t tdc-tcr-epitope-binding-affinity .
Now you can run:
docker run tdc-tcr-epitope-binding-affinity
And finally you can run the Jupyter Notebook:
jupyter notebook notebooks/tdc-tcr-epitope-binding-affinity-model.ipynb
- Weber, Anna, Jannis Born, and María Rodriguez Martínez. "TITAN: T-cell receptor specificity prediction with bimodal attention networks." Bioinformatics 37.Supplement_1 (2021): i237-i244.
- Bagaev, Dmitry V., et al. "VDJdb in 2019: database extension, new analysis infrastructure and a T-cell receptor motif compendium." Nucleic Acids Research 48.D1 (2020): D1057-D1062.
- Dines, Jennifer N., et al. "The ImmuneRACE study: A prospective multicohort study of immune response action to COVID-19 events with the ImmuneCODE™ open access database." medRxiv (2020).
- Rives, Alexander, Joshua Meier, Tom Sercu, Siddharth Goyal, Zeming Lin, Jason Liu, Demi Guo, Myle Ott, C. Lawrence Zitnick, Jerry Ma, and Rob Fergus. "Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences." bioRxiv 622803. DOI: https://doi.org/10.1101/622803 https://www.biorxiv.org/content/10.1101/622803v4
- Lin, Zeming, et al. "Evolutionary-scale prediction of atomic-level protein structure with a language model." Science (2023). DOI: 10.1126/science.ade2574 https://www.science.org/doi/10.1126/science.ade2574
https://huggingface.co/facebook/esm2_t36_3B_UR50D
Checkpoint name | Number of layers | Number of parameters |
---|---|---|
esm2_t48_15B_UR50D | 48 | 15B |
esm2_t36_3B_UR50D | 36 | 3B |
esm2_t33_650M_UR50D | 33 | 650M |
esm2_t30_150M_UR50D | 30 | 150M |
esm2_t12_35M_UR50D | 12 | 35M |
esm2_t6_8M_UR50D | 6 | 8M |
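When choosing a checkpoint from the table above, a back-of-the-envelope estimate of the memory needed just to hold the weights is the parameter count times the bytes per parameter. The sketch below assumes 4 bytes per parameter (float32) and uses the nominal parameter counts from the checkpoint names; the helper names are hypothetical, and real memory use is higher once activations and batch size are included.

```python
# Nominal parameter counts, taken from the table of checkpoints.
ESM2_PARAMETERS = {
    "esm2_t48_15B_UR50D": 15e9,
    "esm2_t36_3B_UR50D": 3e9,
    "esm2_t33_650M_UR50D": 650e6,
    "esm2_t30_150M_UR50D": 150e6,
    "esm2_t12_35M_UR50D": 35e6,
    "esm2_t6_8M_UR50D": 8e6,
}


def weights_size_gb(checkpoint, bytes_per_parameter=4):
    """Approximate size of the weights alone, in GB (10**9 bytes)."""
    return ESM2_PARAMETERS[checkpoint] * bytes_per_parameter / 1e9


def largest_checkpoint_within(budget_gb, bytes_per_parameter=4):
    """Return the largest checkpoint whose weights fit in budget_gb, or None."""
    fitting = [
        name
        for name in ESM2_PARAMETERS
        if weights_size_gb(name, bytes_per_parameter) <= budget_gb
    ]
    return max(fitting, key=ESM2_PARAMETERS.get) if fitting else None
```

Under these assumptions, the 3B checkpoint needs roughly 12 GB for its float32 weights, so a machine with an 8 GB budget would fall back to `esm2_t33_650M_UR50D`.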
The TDC dataset is licensed under CC BY 4.0. The ESM-2 model is released under the MIT license.