This repository tracks the latest research on representation engineering (RepE), originally introduced by Zou et al. (2023). The goal is to offer a comprehensive list of papers and resources relevant to the topic. Work that falls under the umbrella of representation engineering is also included.
> [!NOTE]
> If you believe your paper on representation engineering (or a related topic) is missing, or if you find a mistake, typo, or outdated information, please open an issue and I will address it as soon as possible.
> If you want to add a new paper, feel free to open an issue or create a pull request.

> [!IMPORTANT]
> Representation engineering is a relatively new framework, so the categorization below reflects my subjective understanding of the techniques. The first list includes work that explicitly uses the term "representation engineering"; other closely related work is grouped in the later lists.
> If you disagree with the categorization or have suggestions for improvement, please let me know by opening an issue.
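As context for the lists below, here is a minimal sketch of the core idea shared by much of this work: extract a "concept direction" from contrastive activations and add it to a model's hidden states to steer behavior. This is a toy illustration only; the NumPy arrays stand in for real transformer activations, and the shapes, strength `alpha`, and mean-difference extraction are illustrative assumptions, not any specific paper's method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for hidden activations at one layer: (tokens, hidden_dim).
hidden = rng.normal(size=(4, 8))

# A common recipe: collect activations on contrastive prompt sets
# (e.g. "honest" vs. "dishonest" completions) and take the mean difference
# as the steering vector. Here both sets are synthetic.
pos_acts = rng.normal(loc=1.0, size=(16, 8))   # activations on positive prompts
neg_acts = rng.normal(loc=-1.0, size=(16, 8))  # activations on negative prompts
steering_vec = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

# Steering: add the scaled direction to every token's activation.
alpha = 0.5                                    # steering strength (assumed)
steered = hidden + alpha * steering_vec        # broadcasts over tokens

assert steered.shape == hidden.shape
```

In practice this addition is done inside a forward hook at a chosen layer, and `alpha` trades off steering strength against fluency; several papers below study exactly that tradeoff.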
- Enhancing Multiple Dimensions of Trustworthiness in LLMs via Sparse Activation Control
- Author(s): Yuxin Xiao, Chaoqun Wan, Yonggang Zhang, Wenxiao Wang, Binbin Lin, Xiaofei He, Xu Shen, Jieping Ye
- Date: 2024-11
- Venue: NeurIPS 2024
- Code: -
- Towards Reliable Evaluation of Behavior Steering Interventions in LLMs
- Author(s): Itamar Pres, Laura Ruis, Ekdeep Singh Lubana, David Krueger
- Date: 2024-10
- Venue: NeurIPS 2024 Workshop on Foundation Model Interventions
- Code: -
- Steering Knowledge Selection Behaviours in LLMs via SAE-Based Representation Engineering
- A Timeline and Analysis for Representation Plasticity in Large Language Models
- Gradient-based Jailbreak Images for Multimodal Fusion Models
- Towards Inference-time Category-wise Safety Steering for Large Language Models
- Author(s): Amrita Bhattacharjee, Shaona Ghosh, Traian Rebedea, Christopher Parisien
- Date: 2024-10
- Venue: -
- Code: -
- Words in Motion: Representation Engineering for Motion Forecasting
- Legend: Leveraging Representation Engineering to Annotate Safety Margin for Preference Datasets
- PaCE: Parsimonious Concept Engineering for Large Language Models
- Improving Alignment and Robustness with Circuit Breakers
- ConTrans: Weak-to-Strong Alignment Engineering via Concept Transplantation
- Author(s): Weilong Dong, Xinwei Wu, Renren Jin, Shaoyang Xu, Deyi Xiong
- Date: 2024-05
- Venue: -
- Code: -
- Towards General Conceptual Model Editing via Adversarial Representation Engineering
- Towards Uncovering How Large Language Model Works: An Explainability Perspective
- Author(s): Haiyan Zhao, Fan Yang, Bo Shen, Himabindu Lakkaraju, Mengnan Du
- Date: 2024-02
- Venue: -
- Code: -
- Tradeoffs Between Alignment and Helpfulness in Language Models
- Author(s): Yotam Wolf, Noam Wies, Dorin Shteyman, Binyamin Rothberg, Yoav Levine, Amnon Shashua
- Date: 2024-01
- Venue: -
- Code: -
- Open the Pandora's Box of LLMs: Jailbreaking LLMs through Representation Engineering
- Author(s): Tianlong Li, Shihan Dou, Wenhao Liu, Muling Wu, Changze Lv, Xiaoqing Zheng, Xuanjing Huang
- Date: 2024-01
- Venue: -
- Code: -
- Aligning Large Language Models with Human Preferences through Representation Engineering
- Author(s): Wenhao Liu, Xiaohua Wang, Muling Wu, Tianlong Li, Changze Lv, Zixuan Ling, Jianhao Zhu, Cenyuan Zhang, Xiaoqing Zheng, Xuanjing Huang
- Date: 2023-12
- Venue: -
- Code: -
- Representation Engineering: A Top-Down Approach to AI Transparency
- Author(s): Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, Shashwat Goel, Nathaniel Li, Michael J. Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson, J. Zico Kolter, Dan Hendrycks
- Date: 2023-10
- Venue: -
- Code:
- Can sparse autoencoders be used to decompose and interpret steering vectors?
- Extracting Unlearned Information from LLMs with Activation Steering
- Author(s): Atakan Seyitoğlu, Aleksei Kuvshinov, Leo Schwinn, Stephan Günnemann
- Date: 2024-11
- Venue: NeurIPS 2024 Workshop on Safe Generative AI
- Code: -
- Improving Steering Vectors by Targeting Sparse Autoencoder Features
- Improving Instruction-Following in Language Models through Activation Steering
- Author(s): Alessandro Stolfo, Vidhisha Balachandran, Safoora Yousefi, Eric Horvitz, Besmira Nushi
- Date: 2024-10
- Venue: -
- Code: -
- Semantics-Adaptive Activation Intervention for LLMs via Dynamic Steering Vectors
- Activation Scaling for Steering and Interpreting Language Models
- Uncovering Latent Chain of Thought Vectors in Language Models
- Author(s): Jason Zhang, Scott Viteri
- Date: 2024-09
- Venue: ICLR 2024 Tiny Paper
- Code: -
- Householder Pseudo-Rotation: A Novel Approach to Activation Editing in LLMs with Direction-Magnitude Perspective
- Author(s): Van-Cuong Pham, Thien Huu Nguyen
- Date: 2024-09
- Venue: -
- Code: -
- Analyzing the Generalization and Reliability of Steering Vectors
- Author(s): Daniel Tan, David Chanin, Aengus Lynch, Dimitrios Kanoulas, Brooks Paige, Adria Garriga-Alonso, Robert Kirk
- Date: 2024-07
- Venue: -
- Code: -
- Future Events as Backdoor Triggers: Investigating Temporal Vulnerabilities in LLMs
- Author(s): Sara Price, Arjun Panickssery, Sam Bowman, Asa Cooper Stickland
- Date: 2024-07
- Venue: -
- Code: -
- Steering Without Side Effects: Improving Post-Deployment Control of Language Models
- Who's asking? User personas and the mechanics of latent misalignment
- Author(s): Asma Ghandeharioun, Ann Yuan, Marius Guerard, Emily Reif, Michael A. Lepori, Lucas Dixon
- Date: 2024-06
- Venue: -
- Code: -
- Controlling Large Language Model Agents with Entropic Activation Steering
- Author(s): Nate Rahn, Pierluca D'Oro, Marc G. Bellemare
- Date: 2024-06
- Venue: -
- Code: -
- Adaptive Activation Steering: A Tuning-Free LLM Truthfulness Improvement Method for Diverse Hallucinations Categories
- Author(s): Tianlong Wang, Xianfeng Jiao, Yifan He, Zhongzhi Chen, Yinghao Zhu, Xu Chu, Junyi Gao, Yasha Wang, Liantao Ma
- Date: 2024-06
- Venue: -
- Code: -
- Activation Steering for Robust Type Prediction in CodeLLMs
- Author(s): Francesca Lucchetti, Arjun Guha
- Date: 2024-04
- Venue: -
- Code: -
- Extending Activation Steering to Broad Skills and Multiple Behaviours
- Towards Tracing Trustworthiness Dynamics: Revisiting Pre-training Period of Large Language Models
- MiMiC: Minimally Modified Counterfactuals in the Representation Space
- Author(s): Shashwat Singh, Shauli Ravfogel, Jonathan Herzig, Roee Aharoni, Ryan Cotterell, Ponnurangam Kumaraguru
- Date: 2024-02
- Venue: -
- Code: -
- Investigating Bias Representations in Llama 2 Chat via Activation Steering
- Author(s): Dawn Lu, Nina Rimsky
- Date: 2024-02
- Venue: -
- Code: -
- InferAligner: Inference-Time Alignment for Harmlessness through Cross-Model Guidance
- Steering Llama 2 via Contrastive Activation Addition
- Improving Activation Steering in Language Models with Mean-Centring
- Author(s): Ole Jorgensen, Dylan Cope, Nandi Schoots, Murray Shanahan
- Date: 2023-12
- Venue: -
- Code: -
- Backdoor Activation Attack: Attack Large Language Models using Activation Steering for Safety-Alignment
- The Linear Representation Hypothesis and the Geometry of Large Language Models
- Activation Addition: Steering Language Models Without Optimization
- Extracting Latent Steering Vectors from Pretrained Language Models
- Visual-TCAV: Concept-based Attribution and Saliency Maps for Post-hoc Explainability in Image Classification
- Author(s): Antonio De Santis, Riccardo Campi, Matteo Bianchi, Marco Brambilla
- Date: 2024-11
- Venue: -
- Code: -
- Decision Trees for Interpretable Clusters in Mixture Models and Deep Representations
- Author(s): Maximilian Fleissner, Maedeh Zarvandi, Debarghya Ghoshdastidar
- Date: 2024-11
- Venue: -
- Code: -
- Exploiting Text-Image Latent Spaces for the Description of Visual Concepts
- Author(s): Laines Schmalwasser, Jakob Gawlikowski, Joachim Denzler, Julia Niebling
- Date: 2024-10
- Venue: -
- Code: -
- KTCR: Improving Implicit Hate Detection with Knowledge Transfer driven Concept Refinement
- Author(s): Samarth Garg, Vivek Hruday Kavuri, Gargi Shroff, Rahul Mishra
- Date: 2024-10
- Venue: -
- Code: -
- LG-CAV: Train Any Concept Activation Vector with Language Guidance
- Looking into Concept Explanation Methods for Diabetic Retinopathy Classification
- Author(s): Andrea M. Storås, Josefine V. Sundgaard
- Date: 2024-10
- Venue: -
- Code: -
- EQ-CBM: A Probabilistic Concept Bottleneck with Energy-based Models and Quantized Vectors
- Author(s): Sangwon Kim, Dasom Ahn, Byoung Chul Ko, In-su Jang, Kwang-Ju Kim
- Date: 2024-09
- Venue: ACCV 2024
- Code: -
- TextCAVs: Debugging vision models using text
- Author(s): Angus Nicolson, Yarin Gal, J. Alison Noble
- Date: 2024-08
- Venue: -
- Code: -
- Uncovering Safety Risks in Open-source LLMs through Concept Activation Vector
- Author(s): Zhihao Xu, Ruixuan Huang, Xiting Wang, Fangzhao Wu, Jing Yao, Xing Xie
- Date: 2024-04
- Venue: -
- Code: -
- Explaining Explainability: Understanding Concept Activation Vectors
- Author(s): Angus Nicolson, Lisa Schut, J. Alison Noble, Yarin Gal
- Date: 2024-04
- Venue: -
- Code: -
- Demystifying Embedding Spaces using Large Language Models
- Author(s): Guy Tennenholtz, Yinlam Chow, Chih-Wei Hsu, Jihwan Jeong, Lior Shani, Azamat Tulepbergenov, Deepak Ramachandran, Martin Mladenov, Craig Boutilier
- Date: 2023-10
- Venue: ICLR 2024
- Code: -
Other relevant posts: