Code associated with the paper "Evaluation of large language models for discovery of gene set function"
conda create -n llm_eval python=3.11.5
conda activate llm_eval
conda env config vars set OPENAI_API_KEY="<your api key>"
conda deactivate # deactivate, then reactivate so the variable takes effect
conda activate llm_eval
echo $OPENAI_API_KEY # confirm the key is set
Then, in Python:
import os
import openai
openai.api_key = os.environ["OPENAI_API_KEY"]
This setup follows the best practices for API key safety described on the OpenAI website.
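If the environment variable is missing, `os.environ["OPENAI_API_KEY"]` raises a bare `KeyError`. The following is a minimal sketch of a more explicit check. The helper name `load_api_key` is hypothetical and is not part of this repository.

```python
import os


def load_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the key stored in env_var, or raise a readable error if it is unset.

    Hypothetical helper for illustration; not part of this repository.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"{env_var} is not set. Run 'conda env config vars set "
            f"{env_var}=<your api key>' and reactivate the environment."
        )
    return key
```

It could then be used as `openai.api_key = load_api_key()`, which keeps the key out of the source code.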
The code was developed using Python 3.11.5.
git clone git@github.com:idekerlab/llm_evaluation_for_gene_set_interpretation.git
cd llm_evaluation_for_gene_set_interpretation
pip install -r requirements.txt
DDOT is required for downloading GO and can be installed in one of two ways:
To install DDOT by downloading the zip file of the source tree:
wget https://github.com/idekerlab/ddot/archive/refs/heads/python3.zip
unzip python3.zip
cd ddot-python3
python setup.py bdist_wheel
pip install dist/ddot*py3*whl
To install DDOT by cloning the repo:
git clone --branch python3 https://github.com/idekerlab/ddot.git
cd ddot
python setup.py bdist_wheel
pip install dist/ddot*py3*whl
The notebooks are numbered according to the evaluation steps
-
Data Preparation (this step can be omitted for testing purposes)
The data is already in the data directory (refer to the README in that directory for detailed information about the data)
If you need to download GO, run the code below:
## download and parse GO_BP terms
outdir='data/GO_BP/'
namespace='biological_process'
python process_the_gene_ontology.py $outdir --namespace $namespace
and the notebook for parsing GO terms
The addition of contamination to the gene sets is implemented in this notebook
If you need to download the omics data, run the notebook. The notebook processes the omics data and saves it to a tab-delimited text file.
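The contamination step mentioned above can be illustrated with a short sketch. This is not the notebook's code. It assumes that contamination means swapping a fraction of a gene set for genes drawn at random from a background pool. The function name `contaminate` and the parameters `fraction` and `seed` are hypothetical.

```python
import random


def contaminate(genes, background, fraction=0.5, seed=0):
    """Replace a fraction of `genes` with random genes from `background`.

    Assumption: replacement genes come from outside the original set, so the
    contaminated set keeps the original size. Hypothetical helper.
    """
    rng = random.Random(seed)
    genes = list(genes)
    n_swap = round(len(genes) * fraction)
    # Candidate replacements: background genes not already in the set.
    pool = sorted(set(background) - set(genes))
    if n_swap > len(pool):
        raise ValueError("background is too small for the requested fraction")
    kept = rng.sample(genes, len(genes) - n_swap)
    return kept + rng.sample(pool, n_swap)
```

With `fraction=1.0` every gene is replaced, which gives a fully random gene set of the same size.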
-
Query GPT-4 for names and supporting analysis and run functional enrichment
GO gene set GPT-4 analysis is stored in Run_LLM_analysis
GO gene set analysis with different models
Batch run 1000 GO terms using slurm job with the parameter file
Omics gene set GPT-4 analysis and omics gene set gProfiler analysis
## example code to process from 1st to 5th terms in the table
# run in the command line
input_file='data/GO_term_analysis/toy_example.csv' # input table path
config='./jsonFiles/GOLLMrun_config.json' # configuration file
set_index='GO' # index of the table
gene_column='Genes' # name of the gene list column
start=0
end=5
out_file='data/GO_term_analysis/LLM_processed_toy_example_gpt_4' # output path prefix

source activate llm_eval
# Run the Python script for the given range
python query_llm_for_analysis.py --config $config \
    --initialize \
    --input $input_file \
    --input_sep ',' \
    --set_index $set_index \
    --gene_column $gene_column \
    --gene_sep ' ' \
    --start $start \
    --end $end \
    --output_file $out_file
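For a batch run, the table has to be split into `--start`/`--end` ranges, one per job. The sketch below assumes the range is half-open (start inclusive, end exclusive), which matches the example above, where `start=0` and `end=5` cover the 1st to 5th terms. The function `batch_ranges` is hypothetical and is not part of this repository.

```python
def batch_ranges(n_terms, batch_size):
    """Return (start, end) pairs covering 0..n_terms in chunks of batch_size.

    Assumption: `end` is exclusive, as in the start=0 / end=5 example.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [
        (start, min(start + batch_size, n_terms))
        for start in range(0, n_terms, batch_size)
    ]


# Each pair could be passed to one job as --start and --end.
for start, end in batch_ranges(12, 5):
    print(f"--start {start} --end {end}")
```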
-
Semantic Similarity evaluation of names
GO gene set analysis evaluation
# get the ranking of similarities from the GO gene set analysis
python rank_GOterm_LLM_sim_rand.py \
    --input_file ./data/GO_term_analysis/LLM_processed_toy_example_w_contamination_gpt_4.tsv \
    --emb_file data/all_go_terms_embeddings_dict.pkl \
    --topn 3 \
    --output_file ./data/GO_term_analysis/simrank_LLM_processed_toy_example.tsv \
    --background_file data/GO_term_analysis/all_go_sim_scores_toy.txt
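The idea behind ranking a name by semantic similarity can be sketched as follows. This is an illustration under assumptions, not the logic of `rank_GOterm_LLM_sim_rand.py`. It assumes each GO term name and the LLM-proposed name are already embedded as numeric vectors, and it uses cosine similarity. The function names and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_true_term(llm_vec, term_embeddings, true_term, topn=3):
    """Rank all terms by similarity to the LLM name embedding.

    Returns the 1-based rank of `true_term` and the `topn` most similar terms.
    """
    sims = {term: cosine(llm_vec, vec) for term, vec in term_embeddings.items()}
    ordered = sorted(sims, key=sims.get, reverse=True)
    return ordered.index(true_term) + 1, ordered[:topn]


# Toy 2-dimensional embeddings, for illustration only.
toy_embeddings = {"term_a": [1.0, 0.0], "term_b": [0.0, 1.0], "term_c": [1.0, 1.0]}
rank, top_terms = rank_true_term([1.0, 0.1], toy_embeddings, "term_c", topn=2)
```

A rank of 1 means the queried term is the closest of all terms to the LLM name. Comparing this rank against a background of all similarity scores gives a percentile-style measure.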
-
Further evaluation of the performance: model comparison evaluation, gene set functional enrichment, and gene set similarity comparison
Evaluation Task 1 related
Model Comparison
Analysis related to Fig. 2A Compare the semantic similarities between models
Analysis related to Fig. 3 Run GO gene set functional enrichment for control
Compare the confidence score between real, contaminated, and random gene sets
Check broader concepts of the LLM names
Analysis for Fig. 2d
Analysis of whether the best matching GO term is a broader concept than the queried term
Evaluation Task 2 related
Count genes supporting LLM name, then calculate LLM name Jaccard Index
Analysis related to Fig. 4
To evaluate whether the LLM name matches any significantly enriched GO term name, use this notebook
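The Jaccard Index mentioned above is the size of the intersection of two sets divided by the size of their union. The sketch below shows the calculation only. Which two gene sets the notebook compares is not stated here, so the example gene lists are made up.

```python
def jaccard_index(set_a, set_b):
    """Return |A ∩ B| / |A ∪ B|, or 0.0 when both sets are empty."""
    a, b = set(set_a), set(set_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Made-up example: genes cited as supporting a name vs. genes in a query set.
supporting_genes = ["GENE1", "GENE2", "GENE3"]
query_genes = ["GENE2", "GENE3", "GENE4"]
score = jaccard_index(supporting_genes, query_genes)  # 2 shared / 4 total
```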
-
Development and assessment of the citation module
-
Quantification of citation module check citation module
-
Visualization of results
Extended Data Fig. 1 + Fig. 2 + Fig. 3
Hu M, Alkhairy S, Lee I, Pillich RT, Bachelder R, Ideker T, Pratt D. Evaluation of large language models for discovery of gene set function. Preprint at https://doi.org/10.48550/arXiv.2309.04019 (2023)