This repository tracks the latest research on representation engineering (RepE), originally introduced by Zou et al. (2023). The goal is to offer a comprehensive list of papers and resources relevant to the topic. Work that falls under the umbrella of representation engineering is also included.
> [!NOTE]
> If you believe your paper on representation engineering (or a related topic) is missing, or if you find a mistake, typo, or outdated information, please open an issue and I will address it as soon as possible.
> If you want to add a new paper, feel free to open an issue or create a pull request.

> [!IMPORTANT]
> Representation engineering is a relatively new framework, so the categorization below reflects my subjective understanding of the techniques. The first list includes work that explicitly uses the term "representation engineering"; other closely related work is grouped in the later lists.
> If you disagree with the categorization or have suggestions for improvement, please let me know by opening an issue.
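As context for the lists below, here is a minimal sketch of the core idea shared by much of this work: extract a "concept direction" from contrastive activations and add it to a model's hidden states to steer behavior. This is a toy illustration only; the NumPy arrays stand in for real transformer activations, and the shapes, strength `alpha`, and mean-difference extraction are illustrative assumptions, not any specific paper's method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for hidden activations at one layer: (tokens, hidden_dim).
hidden = rng.normal(size=(4, 8))

# A common recipe: collect activations on contrastive prompt sets
# (e.g. "honest" vs. "dishonest" completions) and take the mean difference
# as the steering vector. Here both sets are synthetic.
pos_acts = rng.normal(loc=1.0, size=(16, 8))   # activations on positive prompts
neg_acts = rng.normal(loc=-1.0, size=(16, 8))  # activations on negative prompts
steering_vec = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

# Steering: add the scaled direction to every token's activation.
alpha = 0.5                                    # steering strength (assumed)
steered = hidden + alpha * steering_vec        # broadcasts over tokens

assert steered.shape == hidden.shape
```

In practice this addition is done inside a forward hook at a chosen layer, and `alpha` trades off steering strength against fluency; several papers below study exactly that tradeoff.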
- Enhancing Multiple Dimensions of Trustworthiness in LLMs via Sparse Activation Control
- Author(s): Yuxin Xiao, Chaoqun Wan, Yonggang Zhang, Wenxiao Wang, Binbin Lin, Xiaofei He, Xu Shen, Jieping Ye
- Date: 2024-11
- Venue: NeurIPS 2024
- Code: -
- Towards Reliable Evaluation of Behavior Steering Interventions in LLMs
- Author(s): Itamar Pres, Laura Ruis, Ekdeep Singh Lubana, David Krueger
- Date: 2024-10
- Venue: NeurIPS 2024 Workshop on Foundation Model Interventions
- Code: -
- Steering Knowledge Selection Behaviours in LLMs via SAE-Based Representation Engineering
- A Timeline and Analysis for Representation Plasticity in Large Language Models
- Gradient-based Jailbreak Images for Multimodal Fusion Models
- Towards Inference-time Category-wise Safety Steering for Large Language Models
- Author(s): Amrita Bhattacharjee, Shaona Ghosh, Traian Rebedea, Christopher Parisien
- Date: 2024-10
- Venue: -
- Code: -
- Words in Motion: Representation Engineering for Motion Forecasting
- Legend: Leveraging Representation Engineering to Annotate Safety Margin for Preference Datasets
- PaCE: Parsimonious Concept Engineering for Large Language Models
- Improving Alignment and Robustness with Circuit Breakers
- ConTrans: Weak-to-Strong Alignment Engineering via Concept Transplantation
- Author(s): Weilong Dong, Xinwei Wu, Renren Jin, Shaoyang Xu, Deyi Xiong
- Date: 2024-05
- Venue: -
- Code: -
- Towards General Conceptual Model Editing via Adversarial Representation Engineering
- Towards Uncovering How Large Language Model Works: An Explainability Perspective
- Author(s): Haiyan Zhao, Fan Yang, Bo Shen, Himabindu Lakkaraju, Mengnan Du
- Date: 2024-02
- Venue: -
- Code: -
- Tradeoffs Between Alignment and Helpfulness in Language Models
- Author(s): Yotam Wolf, Noam Wies, Dorin Shteyman, Binyamin Rothberg, Yoav Levine, Amnon Shashua
- Date: 2024-01
- Venue: -
- Code: -
- Open the Pandora's Box of LLMs: Jailbreaking LLMs through Representation Engineering
- Author(s): Tianlong Li, Shihan Dou, Wenhao Liu, Muling Wu, Changze Lv, Xiaoqing Zheng, Xuanjing Huang
- Date: 2024-01
- Venue: -
- Code: -
- Aligning Large Language Models with Human Preferences through Representation Engineering
- Author(s): Wenhao Liu, Xiaohua Wang, Muling Wu, Tianlong Li, Changze Lv, Zixuan Ling, Jianhao Zhu, Cenyuan Zhang, Xiaoqing Zheng, Xuanjing Huang
- Date: 2023-12
- Venue: -
- Code: -
- Representation Engineering: A Top-Down Approach to AI Transparency
- Author(s): Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, Shashwat Goel, Nathaniel Li, Michael J. Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson, J. Zico Kolter, Dan Hendrycks
- Date: 2023-10
- Venue: -
- Code:
- Can sparse autoencoders be used to decompose and interpret steering vectors?
- Extracting Unlearned Information from LLMs with Activation Steering
- Author(s): Atakan Seyitoğlu, Aleksei Kuvshinov, Leo Schwinn, Stephan Günnemann
- Date: 2024-11
- Venue: NeurIPS 2024 Workshop on Safe Generative AI
- Code: -
- Improving Steering Vectors by Targeting Sparse Autoencoder Features
- Improving Instruction-Following in Language Models through Activation Steering
- Author(s): Alessandro Stolfo, Vidhisha Balachandran, Safoora Yousefi, Eric Horvitz, Besmira Nushi
- Date: 2024-10
- Venue: -
- Code: -
- Semantics-Adaptive Activation Intervention for LLMs via Dynamic Steering Vectors
- Activation Scaling for Steering and Interpreting Language Models
- Uncovering Latent Chain of Thought Vectors in Language Models
- Author(s): Jason Zhang, Scott Viteri
- Date: 2024-09
- Venue: ICLR 2024 Tiny Paper
- Code: -
- Householder Pseudo-Rotation: A Novel Approach to Activation Editing in LLMs with Direction-Magnitude Perspective
- Author(s): Van-Cuong Pham, Thien Huu Nguyen
- Date: 2024-09
- Venue: -
- Code: -
- Analyzing the Generalization and Reliability of Steering Vectors
- Author(s): Daniel Tan, David Chanin, Aengus Lynch, Dimitrios Kanoulas, Brooks Paige, Adria Garriga-Alonso, Robert Kirk
- Date: 2024-07
- Venue: -
- Code: -
- Future Events as Backdoor Triggers: Investigating Temporal Vulnerabilities in LLMs
- Author(s): Sara Price, Arjun Panickssery, Sam Bowman, Asa Cooper Stickland
- Date: 2024-07
- Venue: -
- Code: -
- Steering Without Side Effects: Improving Post-Deployment Control of Language Models
- Who's asking? User personas and the mechanics of latent misalignment
- Author(s): Asma Ghandeharioun, Ann Yuan, Marius Guerard, Emily Reif, Michael A. Lepori, Lucas Dixon
- Date: 2024-06
- Venue: -
- Code: -
- Controlling Large Language Model Agents with Entropic Activation Steering
- Author(s): Nate Rahn, Pierluca D'Oro, Marc G. Bellemare
- Date: 2024-06
- Venue: -
- Code: -
- Adaptive Activation Steering: A Tuning-Free LLM Truthfulness Improvement Method for Diverse Hallucinations Categories
- Author(s): Tianlong Wang, Xianfeng Jiao, Yifan He, Zhongzhi Chen, Yinghao Zhu, Xu Chu, Junyi Gao, Yasha Wang, Liantao Ma
- Date: 2024-06
- Venue: -
- Code: -
- Activation Steering for Robust Type Prediction in CodeLLMs
- Author(s): Francesca Lucchetti, Arjun Guha
- Date: 2024-04
- Venue: -
- Code: -
- Extending Activation Steering to Broad Skills and Multiple Behaviours
- Towards Tracing Trustworthiness Dynamics: Revisiting Pre-training Period of Large Language Models
- MiMiC: Minimally Modified Counterfactuals in the Representation Space
- Author(s): Shashwat Singh, Shauli Ravfogel, Jonathan Herzig, Roee Aharoni, Ryan Cotterell, Ponnurangam Kumaraguru
- Date: 2024-02
- Venue: -
- Code: -
- Investigating Bias Representations in Llama 2 Chat via Activation Steering
- Author(s): Dawn Lu, Nina Rimsky
- Date: 2024-02
- Venue: -
- Code: -
- InferAligner: Inference-Time Alignment for Harmlessness through Cross-Model Guidance
- Steering Llama 2 via Contrastive Activation Addition
- Improving Activation Steering in Language Models with Mean-Centring
- Author(s): Ole Jorgensen, Dylan Cope, Nandi Schoots, Murray Shanahan
- Date: 2023-12
- Venue: -
- Code: -
- Backdoor Activation Attack: Attack Large Language Models using Activation Steering for Safety-Alignment
- The Linear Representation Hypothesis and the Geometry of Large Language Models
- Activation Addition: Steering Language Models Without Optimization
- Extracting Latent Steering Vectors from Pretrained Language Models
- Visual-TCAV: Concept-based Attribution and Saliency Maps for Post-hoc Explainability in Image Classification
- Author(s): Antonio De Santis, Riccardo Campi, Matteo Bianchi, Marco Brambilla
- Date: 2024-11
- Venue: -
- Code: -
- Decision Trees for Interpretable Clusters in Mixture Models and Deep Representations
- Author(s): Maximilian Fleissner, Maedeh Zarvandi, Debarghya Ghoshdastidar
- Date: 2024-11
- Venue: -
- Code: -
- Exploiting Text-Image Latent Spaces for the Description of Visual Concepts
- Author(s): Laines Schmalwasser, Jakob Gawlikowski, Joachim Denzler, Julia Niebling
- Date: 2024-10
- Venue: -
- Code: -
- KTCR: Improving Implicit Hate Detection with Knowledge Transfer driven Concept Refinement
- Author(s): Samarth Garg, Vivek Hruday Kavuri, Gargi Shroff, Rahul Mishra
- Date: 2024-10
- Venue: -
- Code: -
- LG-CAV: Train Any Concept Activation Vector with Language Guidance
- Looking into Concept Explanation Methods for Diabetic Retinopathy Classification
- Author(s): Andrea M. Storås, Josefine V. Sundgaard
- Date: 2024-10
- Venue: -
- Code: -
- EQ-CBM: A Probabilistic Concept Bottleneck with Energy-based Models and Quantized Vectors
- Author(s): Sangwon Kim, Dasom Ahn, Byoung Chul Ko, In-su Jang, Kwang-Ju Kim
- Date: 2024-09
- Venue: ACCV 2024
- Code: -
- TextCAVs: Debugging vision models using text
- Author(s): Angus Nicolson, Yarin Gal, J. Alison Noble
- Date: 2024-08
- Venue: -
- Code: -
- Uncovering Safety Risks in Open-source LLMs through Concept Activation Vector
- Author(s): Zhihao Xu, Ruixuan Huang, Xiting Wang, Fangzhao Wu, Jing Yao, Xing Xie
- Date: 2024-04
- Venue: -
- Code: -
- Explaining Explainability: Understanding Concept Activation Vectors
- Author(s): Angus Nicolson, Lisa Schut, J. Alison Noble, Yarin Gal
- Date: 2024-04
- Venue: -
- Code: -
- Demystifying Embedding Spaces using Large Language Models
- Author(s): Guy Tennenholtz, Yinlam Chow, Chih-Wei Hsu, Jihwan Jeong, Lior Shani, Azamat Tulepbergenov, Deepak Ramachandran, Martin Mladenov, Craig Boutilier
- Date: 2023-10
- Venue: ICLR 2024
- Code: -
Other relevant posts: