- Principled RLHF from Heterogeneous Feedback via Personalization and Preference Aggregation
- Social Choice for AI Alignment: Dealing with Diverse Human Feedback
- AI Alignment and Social Choice: Fundamental Limitations and Policy Implications
- Mapping Social Choice Theory to RLHF
- Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
  - See in particular sections 3.2.1 and 4.1
- A Roadmap to Pluralistic Alignment
- Axioms for AI Alignment from Human Feedback
- Collective Constitutional AI: Aligning a Language Model with Public Input
  - Derived from Constitutional AI: Harmlessness from AI Feedback
- Generative Social Choice
- Distributional Preference Learning: Understanding and Accounting for Hidden Context in RLHF
- Crowd-PrefRL: Preference-Based Reward Learning from Crowds
- Principled Reinforcement Learning with Human Feedback from Pairwise or K-wise Comparisons
- Soft Condorcet Optimization
- KTO: Model Alignment as Prospect Theoretic Optimization
- Nash Learning from Human Feedback
- Preference Ranking Optimization for Human Alignment
- RLHF and IIA: Perverse Incentives
- A Minimaximalist Approach to Reinforcement Learning from Human Feedback
- Fine-Grained Human Feedback Gives Better Rewards for Language Model Training
- MaxMin-RLHF: Towards Equitable Alignment of Large Language Models with Diverse Human Preferences
- Provable Multi-Party Reinforcement Learning with Diverse Human Feedback
- Moral Machine or Tyranny of the Majority?
- Which Examples Should be Multiply Annotated? Active Learning When Annotators May Disagree
- Batch Active Preference-Based Learning of Reward Functions
- On Releasing Annotator-Level Labels and Information in Datasets
- The PRISM Alignment Project: What Participatory, Representative and Individualised Human Feedback Reveals About the Subjective and Multicultural Alignment of Large Language Models
- DICES Dataset: Diversity in Conversational AI Evaluation for Safety
- Whose Opinions Do Language Models Reflect? [data]
- Towards Measuring the Representation of Subjective Global Opinions in Language Models [data]
- HH-RLHF (Anthropic's Helpful and Harmless human preference dataset)