
Question about reward normalization in evaluation #16

Open
cheryyunl opened this issue Jan 11, 2025 · 1 comment

@cheryyunl

Hi! Thanks for this amazing work and for open-sourcing the code!
Quick question: we're evaluating our SFT models and noticed that the reward values we obtain are quite different from those in Figure 4(a) of your paper. Since the figure caption mentions "normalized rewards", could you share what normalization method was used?

Also, did you use the same dataset combination as we did (helpful-base, helpful-online, and harmless-base from hh-rlhf, without the red-team data)? https://huggingface.co/datasets/Anthropic/hh-rlhf

Thanks in advance! 🙏

@YangRui2015
Owner

Thank you for your interest in our work! Yes, we normalize rewards using statistics computed from the training dataset. For the harmless (R1) / helpful (R2) experiments, I used the following (mean, std) pairs: R1: (-0.94732502, 1.92034349), R2: (-0.01500361, 1.40736504).
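
For reference, here is a minimal sketch of applying that normalization, assuming it is standard z-score normalization with the (mean, std) pairs above; the names STATS and normalize are illustrative, not taken from the repository:

```python
import numpy as np

# (mean, std) statistics quoted above, computed on the hh-rlhf training set.
STATS = {
    "harmless_R1": (-0.94732502, 1.92034349),
    "helpful_R2": (-0.01500361, 1.40736504),
}

def normalize(rewards, key):
    """Z-score normalize raw reward-model scores: (r - mean) / std."""
    mean, std = STATS[key]
    return (np.asarray(rewards) - mean) / std

# Example: normalize a small batch of raw harmless (R1) scores.
print(normalize([0.3, -1.2, 0.05], "harmless_R1"))
```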

For the training set, we use the entire hh-rlhf train split: load_dataset(hhrlhf_dataset_path, split='train'). For the eval set, we use the entire test split: load_dataset(hhrlhf_dataset_path, split='test'). You can check whether the red-team data is included when using this loading strategy.
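
As a quick sanity check, you can inspect what those two calls return (split sizes and column names) to confirm which data ended up in each split; this is just an inspection sketch, not code from the repository:

```python
from datasets import load_dataset

hhrlhf_dataset_path = "Anthropic/hh-rlhf"

# Load the full train and test splits exactly as described above.
train_set = load_dataset(hhrlhf_dataset_path, split="train")
eval_set = load_dataset(hhrlhf_dataset_path, split="test")

# Print sizes and column names to verify what each split contains.
print(len(train_set), train_set.column_names)
print(len(eval_set), eval_set.column_names)
```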
