Experiments done during SERI MATS (Summer 2023)

Relation to research writeups

`/refusal`

Activation steering with a "refusal vector" to make the llama-2-chat model stop refusing to answer harmful questions.

Red-teaming language models via activation engineering

`/sycophancy`

Activation steering to modulate sycophancy in the llama-2-chat and llama-2 base models.

`/steering`

Activation addition experiments (pure act-adds from single forward passes)
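As a rough illustration of the idea, the sketch below runs activation addition on a toy residual stack in plain Python. It assumes that "single forward passes" means the steering vector is the scaled difference between the activations of two contrast inputs, each recorded in one pass. The toy layers, the input vectors, and the names `forward` and `scaled_diff` are hypothetical and are not taken from this repository; the actual experiments operate on llama-2 activations.

```python
from typing import Callable, List, Optional, Tuple

Vector = List[float]
Layer = Callable[[Vector], Vector]


def add(a: Vector, b: Vector) -> Vector:
    """Element-wise sum of two vectors."""
    return [x + y for x, y in zip(a, b)]


def scaled_diff(a: Vector, b: Vector, coeff: float) -> Vector:
    """Steering vector: coeff * (a - b)."""
    return [coeff * (x - y) for x, y in zip(a, b)]


def forward(
    layers: List[Layer],
    x: Vector,
    steer_at: Optional[int] = None,
    steer_vec: Optional[Vector] = None,
) -> Tuple[Vector, List[Vector]]:
    """Run a toy residual stack, h <- h + layer(h).

    If steer_vec is given, it is added to the residual stream just before
    layer `steer_at`. Returns the final residual and the input to each layer.
    """
    h = list(x)
    layer_inputs: List[Vector] = []
    for i, layer in enumerate(layers):
        if steer_vec is not None and i == steer_at:
            h = add(h, steer_vec)
        layer_inputs.append(list(h))
        h = add(h, layer(h))
    return h, layer_inputs


# Hypothetical two-layer toy model standing in for a transformer.
toy_layers: List[Layer] = [
    lambda h: [0.0, 2.0 * h[0]],
    lambda h: [h[1], 0.0],
]

# One forward pass per contrast input; record the residual entering layer 1.
_, pos_acts = forward(toy_layers, [1.0, 0.0])
_, neg_acts = forward(toy_layers, [0.0, 1.0])
steering_vector = scaled_diff(pos_acts[1], neg_acts[1], coeff=2.0)

# Compare an unsteered run with a run that has the vector added at layer 1.
unsteered, _ = forward(toy_layers, [1.0, 1.0])
steered, _ = forward(toy_layers, [1.0, 1.0], steer_at=1, steer_vec=steering_vector)
```

In a real model the same pattern would typically be implemented with forward hooks on a chosen transformer block, and the coefficient and layer index would be tuned empirically.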

`/intermediate_decoding`

Logit-lens experiments (directly decoding intermediate activations by passing them through the unembedding layer).

Decoding intermediate activations in llama-2-7b
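The logit-lens idea can be sketched on toy data as follows: each layer's hidden state is multiplied by the unembedding matrix and the most likely token is read off. The vocabulary, matrix, and hidden states here are made-up values, and the function names are hypothetical rather than taken from this repository. This minimal version also skips the final normalization layer, which implementations on real models often apply before unembedding.

```python
import math
from typing import List, Tuple

Vector = List[float]


def unembed(hidden: Vector, unembedding: List[Vector]) -> Vector:
    """Project a hidden state onto the vocabulary (one row per token)."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in unembedding]


def softmax(logits: Vector) -> Vector:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logit_lens(
    hidden_by_layer: List[Vector],
    unembedding: List[Vector],
    vocab: List[str],
) -> List[Tuple[str, float]]:
    """Return the top token and its probability for every layer's hidden state."""
    decoded = []
    for hidden in hidden_by_layer:
        probs = softmax(unembed(hidden, unembedding))
        best = max(range(len(vocab)), key=lambda i: probs[i])
        decoded.append((vocab[best], probs[best]))
    return decoded


# Toy values (assumptions for illustration only).
vocab = ["cat", "dog", "fish"]
unembedding = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
hidden_by_layer = [[0.2, 0.1], [0.1, 1.5], [-2.0, -2.0]]

per_layer = logit_lens(hidden_by_layer, unembedding, vocab)
for layer_index, (token, prob) in enumerate(per_layer):
    print(f"layer {layer_index}: {token} (p={prob:.2f})")
```

Reading the top token layer by layer gives a rough picture of when the model's prediction takes shape across depth.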

Other directories

`/data_generation`

Code for generating datasets with LLMs through the GPT-4, GPT-3.5, and Claude APIs.

`/probability_calibration`

Early-stage experiments measuring whether LLMs are aware of their own uncertainty about a prediction.
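One common way to quantify this kind of awareness is expected calibration error (ECE), which compares stated confidence with observed accuracy. The sketch below is an assumption about how such a measurement could look, not a description of the code in this directory; the function name and the sample data are hypothetical.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Weighted average gap between mean confidence and accuracy per bin."""
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Confidence 1.0 falls into the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_conf = sum(conf for conf, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(mean_conf - accuracy)
    return ece


# Made-up example: the model is overconfident on its 0.95 answers.
sample_confidences = [0.95, 0.95, 0.95, 0.95, 0.55, 0.55]
sample_correct = [True, True, True, False, True, False]
sample_ece = expected_calibration_error(sample_confidences, sample_correct)
```

A value near zero means the stated confidences track accuracy closely; larger values indicate over- or under-confidence.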

`/unlearning`

Early-stage attempt at Google's Machine Unlearning Challenge.