Official Repository for "Evaluating Open-Source Sparse Autoencoders on Disentangling Factual Knowledge in GPT-2 Small"
by Maheep Chaudhary and Atticus Geiger.
Access our paper on ArXiv.
We evaluate open-source Sparse Autoencoders (SAEs) for GPT-2 small released by different organisations, specifically OpenAI, Apollo Research, and Joseph Bloom, on the RAVEL dataset. We compare them against neurons and DAS based on how well they disentangle concepts in neuron or latent space.
The graphs below show the performance:
🔴 NOTE: The run.sh file lists the files to be run and should be edited to run for a particular layer. The arguments in the shell script map to the arguments defined in the code. It is advisable to create a new environment before running any files.
First clone the repository:
git clone https://github.com/MaheepChaudhary/SAE-Ravel.git
To download different SAEs and set up the environment, one can run:
chmod +x setup.sh run.sh eval_run.sh
./setup.sh
We ran the evaluation for 6 SAEs from Apollo Research. Each of them can be downloaded just by changing the wandb id inside the code. The ids of the 6 SAEs are:
- Layer 1 e2e SAE: bst0prdd
- Layer 1 e2e+ds SAE: e26jflpq
- Layer 5 e2e SAE: tvj2owza
- Layer 5 e2e+ds SAE: 2lzle2f0
- Layer 9 e2e SAE: vnfh4vpi
- Layer 9 e2e+ds SAE: u50mksr8
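Since switching between these SAEs means changing the wandb id in the code, it may help to keep the ids in one lookup table. The following is a minimal sketch only: the names `SAE_WANDB_IDS` and `sae_wandb_id` are hypothetical and do not come from this repository, and how the id is consumed by the download code is not shown here.

```python
# Lookup table for the six SAE run ids listed above.
# NOTE: the dictionary and helper names are hypothetical, not taken from the repository.
SAE_WANDB_IDS = {
    (1, "e2e"): "bst0prdd",
    (1, "e2e+ds"): "e26jflpq",
    (5, "e2e"): "tvj2owza",
    (5, "e2e+ds"): "2lzle2f0",
    (9, "e2e"): "vnfh4vpi",
    (9, "e2e+ds"): "u50mksr8",
}


def sae_wandb_id(layer: int, variant: str) -> str:
    """Return the wandb id for a given layer and SAE variant ("e2e" or "e2e+ds")."""
    try:
        return SAE_WANDB_IDS[(layer, variant)]
    except KeyError:
        # List the valid combinations so a typo is easy to spot.
        valid = ", ".join(f"layer {l} {v}" for l, v in sorted(SAE_WANDB_IDS))
        raise ValueError(
            f"No SAE listed for layer {layer} {variant}; choose one of: {valid}"
        ) from None


if __name__ == "__main__":
    print(sae_wandb_id(5, "e2e+ds"))  # prints 2lzle2f0
```

Keeping the ids keyed by layer and variant means an unsupported combination (for example, layer 3) fails early with a readable error instead of a failed download.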
For training the mask for models or DAS, one can run the command:
./run.sh
To evaluate the quality of the SAEs in terms of loss and accuracy, run the command:
./eval_run.sh
Starting with the folders, the ./data/ folder contains all the prepared data and the .py files used to prepare it. The ./figure/ folder contains all the related images. The ./saved_models/ folder is a placeholder where models are stored when they are saved.
The individual files have the following meaning:
- imports.py: contains all the libraries and modules to be imported.
- models.py: contains all the code for preparing the models on which the intervention is performed. It also contains the code for evaluating the SAEs.
- main.py: runs the code in models.py to train the mask for every model and DAS while performing the intervention.
- eval_sae.py: contains the code for running the evaluation function in models.py.
- visualisation.py: contains the code for creating graphs.
- setup.sh: contains the code to set up the environment and download the needed SAEs.
- run.sh: contains the code to run the training files.
- eval_run.sh: contains the code to run the SAE evaluation files.
If you find this repository useful in your research, please consider citing our paper:
@misc{chaudhary2024evaluatingopensourcesparseautoencodersongpt2small,
title={Evaluating Open-Source Sparse Autoencoders on Disentangling Factual Knowledge in GPT-2 Small},
author={Maheep Chaudhary and Atticus Geiger},
year={2024},
eprint={},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={},
}