Reinforcement Learning from Human Feedback (RLHF) significantly improves language models by aligning their outputs with human preferences. Conventionally, stronger reward models, i.e. those with higher accuracy, are expected to yield stronger language models. However, our research presents a counterintuitive finding: language models guided by moderately accurate reward models often outperform those trained with highly accurate ones.
This study evaluates relevance, factuality, and completeness on the QA-FEEDBACK dataset, using Longformer-based reward models. Through extensive experimentation, we show that highly accurate reward models can lead to overfitting or poor generalization, while moderate accuracy yields better performance. This raises a critical question: how should reward model accuracy be balanced to optimize language model outputs in RLHF?
In RLHF, reward models evaluate the outputs of language models against specific criteria such as relevance or factuality. A common assumption is that higher reward model accuracy should always lead to better LM performance, since more accurate models provide better feedback. However, our findings indicate that moderately accurate reward models strike a more effective balance between guiding model training and preventing overfitting.
We introduce a framework that explores the relationship between reward model accuracy and language model performance; a minimal sketch of the training loop it reasons about follows the list below. The key factors include:
- Task Alignment: Moderately accurate reward models tend to offer feedback that is more aligned with the overall task, preventing LMs from overfitting to overly specific or narrow criteria.
- Training Stability: Reward models of moderate accuracy foster a more stable and generalizable training process, particularly in tasks requiring complex reasoning, such as QA and long-form answer generation.
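To make the training dynamic concrete, here is a minimal, self-contained sketch of reward-model-guided policy optimization (REINFORCE with a KL penalty to a frozen reference policy), the core mechanism this framework reasons about. The toy policy, the toy reward function, and all hyperparameters are illustrative assumptions, not the paper's actual setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, HIDDEN, MAX_LEN, BETA = 32, 64, 8, 0.1

class TinyPolicy(nn.Module):
    """A toy autoregressive 'language model' over a tiny vocabulary."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, HIDDEN)
        self.rnn = nn.GRU(HIDDEN, HIDDEN, batch_first=True)
        self.head = nn.Linear(HIDDEN, VOCAB)

    def forward(self, tokens):                  # tokens: (batch, seq)
        hidden, _ = self.rnn(self.embed(tokens))
        return self.head(hidden)                # logits: (batch, seq, vocab)

def toy_reward(tokens):
    """Stand-in for a learned reward model: prefers even-valued tokens."""
    return (tokens % 2 == 0).float().mean(dim=1)        # (batch,)

policy = TinyPolicy()
reference = TinyPolicy()                        # frozen reference (e.g. the SFT model)
reference.load_state_dict(policy.state_dict())
for p in reference.parameters():
    p.requires_grad_(False)

optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

for step in range(50):
    # 1) Sample rollouts from the current policy.
    with torch.no_grad():
        tokens = torch.zeros(16, 1, dtype=torch.long)    # BOS token = 0
        for _ in range(MAX_LEN):
            next_logits = policy(tokens)[:, -1]
            next_tok = torch.multinomial(F.softmax(next_logits, dim=-1), 1)
            tokens = torch.cat([tokens, next_tok], dim=1)

    # 2) Score the rollouts with the (frozen) reward model.
    reward = toy_reward(tokens[:, 1:])

    # 3) Log-probs under policy and reference; the KL penalty keeps the
    #    policy from drifting too far from the reference, as is standard in RLHF.
    logp = F.log_softmax(policy(tokens[:, :-1]), dim=-1)
    ref_logp = F.log_softmax(reference(tokens[:, :-1]), dim=-1)
    act_logp = logp.gather(-1, tokens[:, 1:, None]).squeeze(-1)      # (batch, seq)
    act_ref = ref_logp.gather(-1, tokens[:, 1:, None]).squeeze(-1)
    kl = (act_logp - act_ref).sum(dim=1)

    # 4) REINFORCE update on the KL-shaped reward.
    shaped_reward = reward - BETA * kl.detach()
    loss = -(shaped_reward * act_logp.sum(dim=1)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The paper's setup uses T5 policies and Longformer-based reward models rather than this toy, but the shaped-reward update above is where reward model accuracy enters the optimization.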
We conducted experiments using models from the T5 family, including T5-small, T5-base, and T5-large, trained with Longformer-based reward models for tasks focusing on factuality, relevance, and completeness.
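For context, a hedged sketch of what this pipeline looks like with off-the-shelf checkpoints: a T5 policy generates an answer and a Longformer-based classifier scores it as a scalar reward. The public checkpoint names and the untrained scalar head below are placeholders, not the paper's fine-tuned reward models.

```python
import torch
from transformers import (AutoTokenizer, T5ForConditionalGeneration,
                          LongformerForSequenceClassification)

# Policy: a T5 model that generates long-form answers.
policy_tok = AutoTokenizer.from_pretrained("t5-base")
policy = T5ForConditionalGeneration.from_pretrained("t5-base")

# Reward model: a Longformer encoder with a scalar head (randomly initialized
# here; in practice it would be fine-tuned on preference/feedback data).
rm_tok = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")
reward_model = LongformerForSequenceClassification.from_pretrained(
    "allenai/longformer-base-4096", num_labels=1)

question = "Who won the first FIFA World Cup?"
inputs = policy_tok("question: " + question, return_tensors="pt")
answer_ids = policy.generate(**inputs, max_new_tokens=64)
answer = policy_tok.decode(answer_ids[0], skip_special_tokens=True)

# The reward model reads the question/answer pair and emits a scalar score.
rm_inputs = rm_tok(question, answer, return_tensors="pt", truncation=True)
with torch.no_grad():
    reward = reward_model(**rm_inputs).logits.squeeze(-1)
print(answer, float(reward))
```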
The QA-FEEDBACK dataset, derived from the ASQA dataset, focuses on generating long-form answers to ambiguous, open-domain factual questions. It is split into training, validation, and test sets, and the task requires models to synthesize detailed responses from multiple knowledge sources.
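As a rough illustration of the task format, the sketch below shows a hypothetical QA-FEEDBACK-style record and one way to flatten it into a seq2seq input; the field names and the formatting function are assumptions for illustration, not the dataset's exact schema.

```python
# A hypothetical long-form QA record: an ambiguous question, several
# knowledge passages, and a detailed reference answer.
example = {
    "question": "When was the first iPhone released?",
    "passages": [
        "The iPhone was announced by Steve Jobs on January 9, 2007.",
        "The original iPhone went on sale in the US on June 29, 2007.",
    ],
    "long_form_answer": (
        "Apple announced the first iPhone on January 9, 2007, and it went "
        "on sale in the United States on June 29, 2007."
    ),
}

def to_seq2seq_input(ex):
    """Concatenate the question with its knowledge passages, T5-style."""
    context = " ".join(f"[{i}] {p}" for i, p in enumerate(ex["passages"], 1))
    return f"question: {ex['question']} context: {context}"

print(to_seq2seq_input(example))
```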
Our experiments reveal a consistent trend: models trained with moderately accurate reward models tend to outperform those trained with highly accurate ones across a broad range of tasks, both in aggregate and in individual cases.
This study challenges the prevailing assumption that higher reward model accuracy always leads to better language model performance in RLHF. Our findings show that moderate accuracy in reward models can improve task alignment and training stability, leading to better outcomes across relevance, factuality, and completeness tasks. Future research should explore how to fine-tune reward models to achieve the optimal balance between accuracy and generalization, particularly in complex NLP tasks.
@article{chen2024accuracy,
  title={The Accuracy Paradox in RLHF: When Better Reward Models Don't Yield Better Language Models},
  author={Chen, Yanjun and Zhu, Dawei and Sun, Yirong and Chen, Xinghao and Zhang, Wei and Shen, Xiaoyu},
  journal={arXiv preprint arXiv:2410.06554},
  year={2024}
}
For questions or collaborations, please contact us at yan-jun.chen@connect.polyu.hk.