CATHe

CATHe (short for CATH embeddings) is a deep learning tool designed to detect remote homologues (up to 20% sequence similarity) for superfamilies in the CATH database. CATHe consists of an artificial neural network model trained on sequence embeddings from the ProtT5 protein Language Model (pLM). It achieved an accuracy of 85.6% ± 0.4% and outperformed baseline models derived both from simple machine learning algorithms, such as Logistic Regression, and from homology-based inference using BLAST.

Requirements

pip3 install -r requirements.txt

Data

The dataset used for training, optimizing, and testing CATHe was derived from the CATH database. The datasets, along with the weights for the CATHe artificial neural network, can be downloaded from Zenodo at this link: Dataset.

CATHe Predictions

Folder /src/cathe-predict

Before running the scripts, download the following files from the Zenodo dataset (mentioned above) and place them in the "/src/cathe-predict" folder:

a) CATHe.h5

This is the CATHe neural network model.

b) Y_Train_SF.csv, Y_Test_SF.csv, Y_Val_SF.csv

These are the superfamily label files from the dataset.

Additionally, set the location of the protein FASTA file in the fasta_to_ds.py script.
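
As a quick sanity check before running the prediction script, the presence of the files listed above can be verified with a few lines of Python. This is only a sketch and not part of the CATHe code: the helper name missing_files is hypothetical, and the folder path passed to it is an assumption about where the repository was cloned.

```python
from pathlib import Path

# File names taken from the list above.
REQUIRED_FILES = ["CATHe.h5", "Y_Train_SF.csv", "Y_Test_SF.csv", "Y_Val_SF.csv"]


def missing_files(folder, required=REQUIRED_FILES):
    """Return the required file names that are not present in `folder`."""
    folder = Path(folder)
    return [name for name in required if not (folder / name).is_file()]


# Assumed location, relative to the repository root; adjust as needed.
missing = missing_files("src/cathe-predict")
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All required files are in place.")
```

The check only looks at file names; it does not validate the contents of the model or the label files.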

Run the following command to make predictions using CATHe.

python3 cathe_predictions.py

The CATHe predictions are stored in a file named "results.csv" in the same folder. The file has 4 columns: ['Record', 'Sequence', 'CATHe_Predicted_SFAM', 'CATHe_Prediction_Probability']. The "Record" column holds the identifier of the protein sequence. The "Sequence" column stores the primary sequence of the protein. The "CATHe_Predicted_SFAM" column is the CATH superfamily predicted by CATHe, and the "CATHe_Prediction_Probability" column gives the probability of that prediction.
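
To illustrate how these columns might be used, the sketch below reads results.csv with the Python standard library and keeps only the rows whose probability meets a cutoff. It assumes the file has a header row with the column names given above and that the probability is a number between 0 and 1. The function name confident_predictions and the 0.9 cutoff are hypothetical choices for this example, not values taken from CATHe.

```python
import csv


def confident_predictions(path, min_probability=0.9):
    """Return (record, superfamily, probability) tuples at or above the cutoff.

    Assumes a header row with the column names described in this README.
    The default cutoff of 0.9 is an arbitrary, illustrative value.
    """
    selected = []
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle):
            probability = float(row["CATHe_Prediction_Probability"])
            if probability >= min_probability:
                selected.append(
                    (row["Record"], row["CATHe_Predicted_SFAM"], probability)
                )
    return selected
```

For example, confident_predictions("results.csv", min_probability=0.5) would return every prediction with a probability of at least 0.5. A suitable cutoff depends on the application.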

Pre-Print

If you found this work useful, please consider citing the following article:

@article {CATHe2022,
	author = {Nallapareddy, Vamsi and Bordin, Nicola and Sillitoe, Ian and Heinzinger, Michael and Littmann, Maria and Waman, Vaishali and Sen, Neeladri and Rost, Burkhard and Orengo, Christine},
	title = {CATHe: Detection of remote homologues for CATH superfamilies using embeddings from protein language models},
	elocation-id = {2022.03.10.483805},
	year = {2022},
	doi = {10.1101/2022.03.10.483805},
	publisher = {Cold Spring Harbor Laboratory},
	URL = {https://www.biorxiv.org/content/early/2022/03/13/2022.03.10.483805},
	eprint = {https://www.biorxiv.org/content/early/2022/03/13/2022.03.10.483805.full.pdf},
	journal = {bioRxiv}
}
