Replies: 3 comments 2 replies
-
Here's what I've done; I'd love to hear some feedback. I have a short snippet of code to extract the JSON examples from the PR, and I ask GPT-4 to critique them based on the eval criteria from the documentation (a rough sketch of the snippet is at the end of this comment). It'd be cool if we could converge on a consensus prompt, and maybe even the repo maintainers could chime in. Note this prompt so far assumes 'match' and probably needs to be customized for other scenarios; modelgraded might be trickier. Don't read too much into the example PR I'm using, and hopefully the author doesn't mind. FWIW, GPT-4 seems to like it.

Can you critique this model eval based on just the sample of questions and responses given? Assume that the system content is the system prompt guiding GPT-4, the user content is the question, and 'ideal' is the answer used for an exact match. Do not critique such things as diversity, sample size, or metrics, and do not ask for more information to be included. Keep the critique specific to the example provided below. Here are some things to think about: The eval should be thematically consistent. We'd like to see a number of prompts all revolving around the same use case, subject domain, failure mode, etc.

[{'input': [{'role': 'system', 'content': 'You will be presented with one sentence of the Chinese Tang poetry to read. Answer who is the author, and strictly only contain a single word and in the same language'}, {'role': 'user', 'content': '慈母手中线,游子身上衣。'}], 'ideal': '孟郊'},
{'input': [{'role': 'system', 'content': 'You will be presented with one sentence of the Chinese Tang poetry to read. Answer who is the author, and strictly only contain a single word and in the same language'}, {'role': 'user', 'content': '烟笼寒水月笼沙,夜泊秦淮近酒家。'}], 'ideal': '杜牧'},
{'input': [{'role': 'system', 'content': 'You will be presented with one sentence of the Chinese Tang poetry to read. Answer who is the author, and strictly only contain a single word and in the same language'}, {'role': 'user', 'content': '白日依山尽,黄河入海流。'}], 'ideal': '王之涣'},
{'input': [{'role': 'system', 'content': 'You will be presented with one sentence of the Chinese Tang poetry to read. Answer who is the author, and strictly only contain a single word and in the same language'}, {'role': 'user', 'content': '寂寞天宝后,园庐但蒿藜。'}], 'ideal': '杜甫'},
{'input': [{'role': 'system', 'content': 'You will be presented with one sentence of the Chinese Tang poetry to read. Answer who is the author, and strictly only contain a single word and in the same language'}, {'role': 'user', 'content': '春城无处不飞花,寒食东风御柳斜。'}], 'ideal': '韩翊'}]

Note that the above is just a very rough first-draft prompt meant to elicit feedback. We probably want to add things like verifying the JSON format, and add more language to get much more specific analysis. Ideally we'd end up with a good prompt that everyone could run to verify that their eval is roughly in the right ballpark.

One thing I'd like to add is that the answer needs to be unambiguously correct. My feeling is that some of the evals fall down here, and it's a missed opportunity for everyone: not just OAI or the eval submitter, but also potential future external users of these evals. I'd like to make sure that eval creators get proper feedback to ensure that "data should include good signal around what is the right behavior", which I believe is by far the most critical attribute.

Another thing, harder to do for sure, is that it'd be great to see if there's a way to actually verify that the evals have correct answers. Not sure how to do that.
I know that when I submitted my two evals, I spent an inordinate amount of time re-checking each answer to make sure I didn't make a mistake. Evals with incorrect ideals would likely have negative value for this purpose.
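Here's roughly the shape of the snippet, as a minimal sketch rather than exactly what I ran: it assumes the PR's samples are checked out locally as a JSONL file and uses the current openai Python client; the file path, model name, and prompt wording are all placeholders to adapt.

```python
# Rough sketch: read a few samples from an eval PR's JSONL file and ask
# GPT-4 to critique them against the build-eval criteria.
# The path, model name, and prompt text below are placeholders.
import json
from openai import OpenAI

CRITIQUE_PROMPT = (
    "Can you critique this model eval based on just the sample of questions "
    "and responses given? Assume the system content is the system prompt, "
    "the user content is the question, and 'ideal' is the expected exact match. "
    "Keep the critique specific to the examples provided. The eval should be "
    "thematically consistent, and every ideal answer should be unambiguously correct."
)

def load_samples(jsonl_path, limit=5):
    """Read the first few samples from an eval's samples.jsonl file."""
    samples = []
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            samples.append(json.loads(line))
            if len(samples) >= limit:
                break
    return samples

def critique(jsonl_path):
    samples = load_samples(jsonl_path)
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": CRITIQUE_PROMPT},
            {"role": "user", "content": json.dumps(samples, ensure_ascii=False, indent=2)},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    # "my_eval" is a made-up example; point this at the PR's actual data file.
    print(critique("evals/registry/data/my_eval/samples.jsonl"))
```

The same idea should work straight off the PR diff if you'd rather not check the branch out locally.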
-
Something also to keep in mind: it's possible that the repo maintainers have only a small quota of accounts they're allowed to hand out, which is why they're only merging a small number of PRs at a time, and it's not necessarily the quality of the evals holding them back. What's great about this, though, is that these evals are all MIT licensed, so anyone can use them for their own project. That's kind of awesome of OAI to do for the community when you think about it.
-
This is an awesome use case for a new project, which could be a bot that uses these concepts to automatically label the PRs. One big concern I have, though, is trying to keep it as free as possible from biases. But I think it's manageable.
-
There was a good discussion about this here - #873 - but it was a bit of a hijack, for which I take some responsibility.
There's some info here - https://github.com/openai/evals/blob/main/docs/build-eval.md - but sometimes you have to read between the lines.
I downloaded all the merged PRs and asked GPT-4 to summarize their common characteristics. In short: the merged evals cover a wide range of topics and skills, and they assess various capabilities of the model, including language understanding, subject-matter knowledge, problem-solving skills, spatial understanding, and emotional intelligence.
As also suggested by @SkyaTura, I'm going to try querying the git repo for information about the PRs and what is and isn't getting merged (rough sketch below). I'll put the code into a new repo. This should actually be a handy tool for anyone analyzing a repo before opening a PR.
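For the querying part, something like this is what I have in mind, a minimal sketch against the public GitHub REST API; the repo name and page count are just defaults, and you'll want a GITHUB_TOKEN set to avoid the anonymous rate limit.

```python
# Rough sketch: list closed PRs on openai/evals via the GitHub REST API and
# note which ones were actually merged (merged_at is non-null only for merges).
# Unauthenticated requests are heavily rate-limited; set GITHUB_TOKEN for more headroom.
import os
import requests

def fetch_closed_prs(repo="openai/evals", pages=3):
    headers = {"Accept": "application/vnd.github+json"}
    token = os.environ.get("GITHUB_TOKEN")
    if token:
        headers["Authorization"] = f"Bearer {token}"
    prs = []
    for page in range(1, pages + 1):
        resp = requests.get(
            f"https://api.github.com/repos/{repo}/pulls",
            params={"state": "closed", "per_page": 100, "page": page},
            headers=headers,
            timeout=30,
        )
        resp.raise_for_status()
        prs.extend(resp.json())
    return prs

if __name__ == "__main__":
    for pr in fetch_closed_prs():
        status = "merged" if pr["merged_at"] else "closed without merge"
        print(f"#{pr['number']}: {pr['title']} [{status}]")
```

From there it should be straightforward to join the merged/unmerged flag with each PR's files and description and look for patterns in what gets accepted.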
It's worth mentioning that it's possible there are some extrinsic factors as well that can't be surfaced from looking at the PRs themselves.
It's also worth mentioning that custom-code PRs, despite what the documentation says, were not being accepted as of these issues: #520 #715
The custom-code documentation is also out of date due to the refactor, which could very well be why they weren't accepting PRs for it: they didn't want to deal with legacy code. It may be that once they update the documentation, we'll be able to submit custom-code evals.