Repository containing the data, code, and walkthroughs for the methods in the paper *CodonBERT: large language models for mRNA design and optimization*.
Dependency management is done via poetry:

```
pip install poetry
poetry install
```
Ensure you have CUDA drivers installed if you plan on using a GPU.
The CodonBERT PyTorch model can be downloaded here. The model artifact is distributed under its own license; the code and repository are under a separate software license.
Pretraining and finetuning scripts are under `benchmarks/CodonBERT`.

To extract embeddings from the model, use `extract_embed.py`. The sample dataset `sample.fasta` is included for reference.
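For orientation, here is a minimal sketch of what embedding extraction can look like. It assumes the downloaded checkpoint loads as a standard Hugging Face BERT model with a bundled tokenizer config; the `MODEL_DIR` path, the codon splitting, and the mean pooling are illustrative assumptions, not the exact interface of `extract_embed.py`.

```python
# Minimal sketch of embedding extraction; assumes the downloaded checkpoint
# loads as a standard Hugging Face BERT model and ships a tokenizer config.
# MODEL_DIR, the codon splitting, and the mean pooling are illustrative
# assumptions, not the exact interface of extract_embed.py.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_DIR = "path/to/codonbert"  # hypothetical path to the downloaded artifact

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModel.from_pretrained(MODEL_DIR).eval()

# CodonBERT operates on codons, so split the coding sequence into 3-mers.
seq = "ATGGCGGCGCTGAGCGGT"
codons = " ".join(seq[i:i + 3] for i in range(0, len(seq) - 2, 3))

with torch.no_grad():
    inputs = tokenizer(codons, return_tensors="pt")
    hidden = model(**inputs).last_hidden_state   # (1, tokens, hidden_dim)
    embedding = hidden.mean(dim=1).squeeze(0)    # mean-pooled sequence embedding

print(embedding.shape)
```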
Pre-training data are under `benchmarks/CodonBERT/data/pre-train`. `train_seqs_id_1.csv.zip` and `train_seqs_id_2.csv.zip` list the NCBI IDs for all 10 million pre-training sequences, `train_samples.csv` provides a training sample, and `eval.csv` stores the held-out dataset for evaluation.
To run finetuning, pass the desired downstream task via the `--task` flag. All downstream datasets are under `benchmarks/CodonBERT/data/fine-tune`.
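A hypothetical invocation is shown below; the script name and task value are assumptions, so check the scripts under `benchmarks/CodonBERT` for the exact entry point and accepted task names.

```
python finetune.py --task mRFP_Expression
```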
As part of this release, we are sharing an internal dataset, along with the data from the other published datasets used for benchmarking in the paper:
- `mRFP_Expression.csv` is from *Revealing determinants of translation efficiency via whole-gene codon randomization and machine learning*.
- `Fungal_expression.csv` is from *Kingdom-Wide Analysis of Fungal Protein-Coding and tRNA Genes Reveals Conserved Patterns of Adaptive Evolution*.
- `E.Coli_proteins.csv` is from *MPEPE, a predictive approach to improve protein expression in E. coli based on deep learning*.
- `Tc-Riboswitches.csv` is from *Tuning the Performance of Synthetic Riboswitches using Machine Learning*.
- `mRNA_Stability.csv` is from *iCodon customizes gene expression based on the codon composition*.
- `CoV_Vaccine_Degradation.csv` is from *Deep learning models for predicting RNA degradation via dual crowdsourcing*.
  - The average of the per-nucleotide deg_Mg_50C values is treated as the sequence-level target (see the sketch after this list); deg_Mg_50C has the highest correlation with the other labels, including deg_pH10, deg_Mg_pH10, and deg_50C.
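A sketch of how that sequence-level target can be computed, assuming the per-nucleotide deg_Mg_50C values are stored as a JSON-style list per row (as in the original crowdsourcing data); the storage format and parsing below are assumptions:

```python
import ast

import numpy as np
import pandas as pd

df = pd.read_csv("benchmarks/CodonBERT/data/fine-tune/CoV_Vaccine_Degradation.csv")

# Average the per-nucleotide deg_Mg_50C values into one scalar per sequence.
df["target"] = df["deg_Mg_50C"].apply(lambda v: float(np.mean(ast.literal_eval(v))))
```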
Code for training the TextCNN model is in the `textcnn` directory. Edit `main.py` to point to the desired embeddings, then run `python main.py` to train the model.
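For reference, a minimal Kim (2014)-style TextCNN over precomputed embeddings looks like the sketch below; `textcnn/main.py` is the authoritative implementation, and the hidden size, filter count, and kernel sizes here are assumptions.

```python
# Minimal Kim (2014)-style TextCNN over precomputed token embeddings.
# Hidden size, filter count, and kernel sizes are assumptions; see
# textcnn/main.py for the actual configuration used in the benchmarks.
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    def __init__(self, embed_dim, num_filters=128, kernel_sizes=(3, 4, 5), out_dim=1):
        super().__init__()
        # One 1-D convolution per kernel size, applied along the sequence axis.
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, num_filters, k) for k in kernel_sizes
        )
        self.head = nn.Linear(num_filters * len(kernel_sizes), out_dim)

    def forward(self, x):                        # x: (batch, seq_len, embed_dim)
        x = x.transpose(1, 2)                    # -> (batch, embed_dim, seq_len)
        # Max-pool each feature map over time, then concatenate and project.
        feats = [conv(x).relu().max(dim=2).values for conv in self.convs]
        return self.head(torch.cat(feats, dim=1))

model = TextCNN(embed_dim=768)
print(model(torch.randn(2, 100, 768)).shape)     # torch.Size([2, 1])
```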
The `notebooks` folder contains walkthrough Jupyter notebooks for benchmarking the TFIDF model, as well as the TextCNN model with a pre-trained word2vec embedding representation. These use `datamodel_mRFP` as a test dataset.
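As a rough sketch of the TF-IDF baseline those notebooks cover, one can treat codons as "words" and fit a linear model. The column names below (`sequence`, `label`) are assumptions, and the notebooks' exact preprocessing and model may differ.

```python
# Rough sketch of a TF-IDF baseline: treat codons as "words" and fit a
# linear model. Column names ("sequence", "label") are assumptions.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

df = pd.read_csv("benchmarks/CodonBERT/data/fine-tune/mRFP_Expression.csv")

# Turn each coding sequence into a space-separated string of codons.
codon_docs = df["sequence"].str.upper().map(
    lambda s: " ".join(s[i:i + 3] for i in range(0, len(s) - 2, 3))
)

X = TfidfVectorizer().fit_transform(codon_docs)
X_tr, X_te, y_tr, y_te = train_test_split(X, df["label"], random_state=0)
print(Ridge().fit(X_tr, y_tr).score(X_te, y_te))  # held-out R^2
```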
If you find the model useful in your research, please cite our paper:
```
@article{Li2023.09.09.556981,
  author = {Sizhen Li and Saeed Moayedpour and Ruijiang Li and Michael Bailey and Saleh Riahi and Milad Miladi and Jacob Miner and Dinghai Zheng and Jun Wang and Akshay Balsubramani and Khang Tran and Minnie Zacharia and Monica Wu and Xiaobo Gu and Ryan Clinton and Carla Asquith and Joseph Skalesk and Lianne Boeglin and Sudha Chivukula and Anusha Dias and Fernando Ulloa Montoya and Vikram Agarwal and Ziv Bar-Joseph and Sven Jager},
  title = {CodonBERT: Large Language Models for mRNA design and optimization},
  elocation-id = {2023.09.09.556981},
  year = {2023},
  doi = {10.1101/2023.09.09.556981},
  publisher = {Cold Spring Harbor Laboratory},
  URL = {https://www.biorxiv.org/content/early/2023/09/12/2023.09.09.556981},
  eprint = {https://www.biorxiv.org/content/early/2023/09/12/2023.09.09.556981.full.pdf},
  journal = {bioRxiv}
}
```