A framework for training genomic language models (gLMs) that controls for phylogenetic biases
To set up the environment for PhyloGPN, follow these instructions:
- Create and activate a Python virtual environment, then install the required packages:
python3 -m venv venv && source venv/bin/activate && pip install -r requirements.txt
- Download the PhyloGPN checkpoint:
gdown "https://drive.google.com/uc?id=1MSxLYbZKSnWjbM_w1cHGVFffrwh8j64V" -O ./PhyloGPN/checkpoint.pt
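A download can fail or be interrupted without this being obvious, so it may be worth checking the file before loading it. The following is a minimal sketch, not part of the repository: `check_checkpoint` is a hypothetical helper, and the HTML check rests on the assumption that a failed download from a file-sharing link may save an error page in place of the file.

```python
from pathlib import Path


def check_checkpoint(path="PhyloGPN/checkpoint.pt"):
    """Return a list of problems found with the downloaded file (empty if none)."""
    p = Path(path)
    if not p.is_file():
        return [f"{p} does not exist"]

    problems = []
    if p.stat().st_size == 0:
        problems.append(f"{p} is empty")

    # Assumption: a failed download may leave an HTML page instead of the file.
    with p.open("rb") as fh:
        head = fh.read(256).lstrip().lower()
    if head.startswith((b"<!doctype html", b"<html")):
        problems.append(f"{p} looks like an HTML page, not a checkpoint")

    return problems


if __name__ == "__main__":
    for problem in check_checkpoint():
        print("WARNING:", problem)
```

If the script prints nothing, the file exists, is non-empty, and does not start with HTML markup. This does not verify that the checkpoint contents are valid.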
For a tutorial on obtaining likelihoods, rate parameters, and embeddings from PhyloGPN, refer to notebooks/example.ipynb.
For instructions on how to benchmark PhyloGPN and PhyloGPN-pooled in BEND, refer to notebooks/bend.ipynb.
To reproduce our results for the variant effect prediction evaluation, follow these instructions:
- Download the required raw data:
invoke download-hg38
invoke download-omim
invoke download-latest-clinvar
invoke download-dms-data
invoke download-and-process-gnomad
- Process the raw data:
invoke chunk-hg38
invoke process-clinvar
invoke process-dms-data
- Generate log likelihood ratios:
invoke generate-vep-results --model phylogpn
invoke generate-vep-results --model caduceus_131k
invoke generate-vep-results --model hyenadna_medium_160k
invoke generate-vep-results --model nucleotide_transformer_v2_500m
- Process the results:
invoke merge-clinvar-results
invoke merge-dms-results
invoke merge-omim-results
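The steps above have to run in order, since each stage consumes the output of the previous one. As a convenience, they can be sketched as a small driver script. This is an illustrative sketch rather than part of the repository: `build_commands` and `run_pipeline` are hypothetical helpers, and only the `invoke` task names and `--model` values are taken from the instructions above. It defaults to a dry run that only prints the commands.

```python
import subprocess

# Model names as listed in the instructions above.
MODELS = (
    "phylogpn",
    "caduceus_131k",
    "hyenadna_medium_160k",
    "nucleotide_transformer_v2_500m",
)

DOWNLOAD_TASKS = (
    "download-hg38",
    "download-omim",
    "download-latest-clinvar",
    "download-dms-data",
    "download-and-process-gnomad",
)
PROCESS_TASKS = ("chunk-hg38", "process-clinvar", "process-dms-data")
MERGE_TASKS = ("merge-clinvar-results", "merge-dms-results", "merge-omim-results")


def build_commands(models=MODELS):
    """Return the full ordered list of invoke commands as argument lists."""
    commands = [["invoke", task] for task in DOWNLOAD_TASKS + PROCESS_TASKS]
    commands += [["invoke", "generate-vep-results", "--model", m] for m in models]
    commands += [["invoke", task] for task in MERGE_TASKS]
    return commands


def run_pipeline(models=MODELS, dry_run=True):
    """Print each command; execute it unless dry_run is set."""
    for cmd in build_commands(models):
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_pipeline(dry_run=True)
```

Calling `run_pipeline(dry_run=False)` from the repository root with the virtual environment active would execute the steps. Passing a shorter `models` tuple restricts the likelihood-ratio step to a subset of models.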
The results should be in data/clinvar_eval.csv, data/omim_eval.csv, and data/dms_eval.csv.
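To confirm that all three files were produced, a quick check along the following lines may help. This is a minimal sketch under two assumptions: `summarize_results` is a hypothetical helper, not part of the repository, and each CSV is assumed to start with a header row. The column names are not assumed and are simply reported as found.

```python
import csv
from pathlib import Path

# File names as stated in the instructions above.
EXPECTED = ("clinvar_eval.csv", "omim_eval.csv", "dms_eval.csv")


def summarize_results(data_dir="data"):
    """Map each expected file to (header, row_count), or None if it is missing."""
    summary = {}
    for name in EXPECTED:
        path = Path(data_dir) / name
        if not path.is_file():
            summary[name] = None
            continue
        with path.open(newline="") as fh:
            reader = csv.reader(fh)
            header = next(reader, [])  # assumed header row
            n_rows = sum(1 for _ in reader)
        summary[name] = (header, n_rows)
    return summary


if __name__ == "__main__":
    for name, info in summarize_results().items():
        if info is None:
            print(f"{name}: missing")
        else:
            header, n_rows = info
            print(f"{name}: {n_rows} rows, columns: {', '.join(header)}")
```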