We develop two modes of evaluation for natural language explanations that claim individual neurons represent a concept in a text input. We apply our framework to the GPT-4-generated explanations of GPT-2 XL neurons from Bills et al. (2023) and show that even the most confident explanations have high error rates and little to no causal efficacy.
explanare/eval-neuron-explanation: a framework for evaluating auto-interp pipelines, i.e., natural language explanations of neurons.