We develop two modes of evaluation for natural language explanations that claim individual neurons represent a concept in a text input. We apply our framework to the GPT-4-generated explanations of GPT-2 XL neurons from Bills et al. (2023) and show that even the most confident explanations have high error rates and little to no causal efficacy.
explanare/eval-neuron-explanation: a framework for evaluating auto-interp pipelines, i.e., natural language explanations of neurons.