
Question about reward normalization in evaluation #16

Open
cheryyunl opened this issue Jan 11, 2025 · 1 comment

@cheryyunl

Hi! Thanks for this amazing work and for open-sourcing the code!
Quick question: we're evaluating our SFT models and noticed that the reward values we obtain are quite different from those in Figure 4(a) of your paper. Since the figure caption mentions "normalized rewards", could you share what normalization method was used?

Also, did you use the same dataset combination as we did (helpful-base, helpful-online, and harmless-base from hh-rlhf, without the red-team data)? https://huggingface.co/datasets/Anthropic/hh-rlhf

Thanks in advance! 🙏

@YangRui2015
Owner

Thank you for your interest in our work! Yes, we normalize rewards using statistics computed from the training dataset. For the harmless (R1) / helpful (R2) experiments, I used the following (mean, std) pairs: R1: (-0.94732502, 1.92034349), R2: (-0.01500361, 1.40736504).
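
For reference, here is a minimal sketch of applying that normalization, assuming it is standard z-score normalization with the (mean, std) pairs above; the names STATS and normalize are illustrative, not taken from the repository:

```python
import numpy as np

# (mean, std) statistics quoted above, computed on the hh-rlhf training set.
STATS = {
    "harmless_R1": (-0.94732502, 1.92034349),
    "helpful_R2": (-0.01500361, 1.40736504),
}

def normalize(rewards, key):
    """Z-score normalize raw reward-model scores: (r - mean) / std."""
    mean, std = STATS[key]
    return (np.asarray(rewards) - mean) / std

# Example: normalize a small batch of raw harmless (R1) scores.
print(normalize([0.3, -1.2, 0.05], "harmless_R1"))
```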

For the training set, we use the entire hh-rlhf train split: load_dataset(hhrlhf_dataset_path, split='train'). For the eval set, we use the entire test split: load_dataset(hhrlhf_dataset_path, split='test'). You can check whether the red-team data is included when using this loading strategy.
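
As a quick sanity check, you can inspect what those two calls return (split sizes and column names) to confirm which data ended up in each split; this is just an inspection sketch, not code from the repository:

```python
from datasets import load_dataset

hhrlhf_dataset_path = "Anthropic/hh-rlhf"

# Load the full train and test splits exactly as described above.
train_set = load_dataset(hhrlhf_dataset_path, split="train")
eval_set = load_dataset(hhrlhf_dataset_path, split="test")

# Print sizes and column names to verify what each split contains.
print(len(train_set), train_set.column_names)
print(len(eval_set), eval_set.column_names)
```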
