Replies: 3 comments 2 replies
-
Here's what I've done; I'd love to hear some feedback. I have a short snippet of code to extract the JSON examples from the PR, and I ask GPT-4 to critique them based on the eval criteria from the documentation (a rough sketch of the snippet is at the end of this comment). It'd be cool if we could converge on a consensus prompt, and maybe even the repo maintainers could chime in. Note this prompt so far assumes 'match' and probably needs to be customized for other scenarios; modelgraded might be trickier. Don't read too much into the example PR I'm using, and hopefully the author doesn't mind. FWIW, GPT-4 seems to like it.

Can you critique this model eval based on just the sample of questions and responses given? Assume that the system content is the system prompt guiding GPT-4, the user content is the question, and 'ideal' is the answer used for an exact match. Do not critique such things as diversity, sample size, or metrics, and do not ask for more information to be included. Keep the critique specific to the example provided below. Here are some things to think about: The eval should be thematically consistent. We'd like to see a number of prompts all revolving around the same use case, subject domain, failure mode, etc.

[{'input': [{'role': 'system', 'content': 'You will be presented with one sentence of the Chinese Tang poetry to read. Answer who is the author, and strictly only contain a single word and in the same language'}, {'role': 'user', 'content': '慈母手中线,游子身上衣。'}], 'ideal': '孟郊'},
{'input': [{'role': 'system', 'content': 'You will be presented with one sentence of the Chinese Tang poetry to read. Answer who is the author, and strictly only contain a single word and in the same language'}, {'role': 'user', 'content': '烟笼寒水月笼沙,夜泊秦淮近酒家。'}], 'ideal': '杜牧'},
{'input': [{'role': 'system', 'content': 'You will be presented with one sentence of the Chinese Tang poetry to read. Answer who is the author, and strictly only contain a single word and in the same language'}, {'role': 'user', 'content': '白日依山尽,黄河入海流。'}], 'ideal': '王之涣'},
{'input': [{'role': 'system', 'content': 'You will be presented with one sentence of the Chinese Tang poetry to read. Answer who is the author, and strictly only contain a single word and in the same language'}, {'role': 'user', 'content': '寂寞天宝后,园庐但蒿藜。'}], 'ideal': '杜甫'},
{'input': [{'role': 'system', 'content': 'You will be presented with one sentence of the Chinese Tang poetry to read. Answer who is the author, and strictly only contain a single word and in the same language'}, {'role': 'user', 'content': '春城无处不飞花,寒食东风御柳斜。'}], 'ideal': '韩翊'}]

Note that the above is just a very rough first-draft prompt meant to elicit feedback. We probably want to add things like verifying the JSON format, and add more language to get much more specific analysis. Ideally we'd end up with a good prompt that everyone could run to verify that their eval is roughly in the right ballpark.

One thing I'd like to add is that the answer needs to be unambiguously correct. My feeling is that some of the evals fall down here, and it's a missed opportunity for everyone: not just OAI or the eval submitter, but also potential future external users of these evals. I'd like to make sure that eval creators get proper feedback to ensure that "data should include good signal around what is the right behavior", which I believe is by far the most critical attribute.

Another thing, harder to do for sure, is that it'd be great to see if there's a way to actually verify that the evals have correct answers. Not sure how to do that.
I know that when I submitted my two evals, I spent an inordinate amount of time re-checking each answer to make sure I didn't make a mistake. Evals with incorrect ideals would likely have negative value for this purpose.
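Here's roughly the shape of the snippet, as a minimal sketch rather than exactly what I ran: it assumes the PR's samples are checked out locally as a JSONL file and uses the current openai Python client; the file path, model name, and prompt wording are all placeholders to adapt.

```python
# Rough sketch: read a few samples from an eval PR's JSONL file and ask
# GPT-4 to critique them against the build-eval criteria.
# The path, model name, and prompt text below are placeholders.
import json
from openai import OpenAI

CRITIQUE_PROMPT = (
    "Can you critique this model eval based on just the sample of questions "
    "and responses given? Assume the system content is the system prompt, "
    "the user content is the question, and 'ideal' is the expected exact match. "
    "Keep the critique specific to the examples provided. The eval should be "
    "thematically consistent, and every ideal answer should be unambiguously correct."
)

def load_samples(jsonl_path, limit=5):
    """Read the first few samples from an eval's samples.jsonl file."""
    samples = []
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            samples.append(json.loads(line))
            if len(samples) >= limit:
                break
    return samples

def critique(jsonl_path):
    samples = load_samples(jsonl_path)
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": CRITIQUE_PROMPT},
            {"role": "user", "content": json.dumps(samples, ensure_ascii=False, indent=2)},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    # "my_eval" is a made-up example; point this at the PR's actual data file.
    print(critique("evals/registry/data/my_eval/samples.jsonl"))
```

The same idea should work straight off the PR diff if you'd rather not check the branch out locally.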
-
Something also to keep in mind: it's possible that the repo maintainers have only a small quota of accounts they're allowed to hand out, which is why they're only merging a small number of PRs at a time, and it's not necessarily the quality of the evals holding them back. What's great about this, though, is that these evals are all MIT licensed, so anyone can use them for their own project. That's kind of awesome of OAI to do for the community when you think about it.
-
This is an awesome use case for a new project, which could be a bot that uses these concepts to automatically label the PRs. One big concern I have, though, is trying to keep it as free as possible from biases. But I think it's manageable.
-
There was a good discussion about this here - #873 - but it was a bit of a hijack, for which I take some responsibility.
There's some info here - https://github.com/openai/evals/blob/main/docs/build-eval.md - but sometimes you have to read between the lines.
I downloaded all the merged PRs and asked GPT-4 to summarize their common characteristics. In short: the merged evals cover a wide range of topics and skills, and they assess various capabilities of the model, including language understanding, subject-matter knowledge, problem-solving skills, spatial understanding, and emotional intelligence.
As also suggested by @SkyaTura, I'm going to try querying the git repo for information about the PRs and what is and isn't getting merged (rough sketch below). I'll put the code into a new repo. This should actually be a handy tool for anyone analyzing a repo before opening a PR.
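For the querying part, something like this is what I have in mind, a minimal sketch against the public GitHub REST API; the repo name and page count are just defaults, and you'll want a GITHUB_TOKEN set to avoid the anonymous rate limit.

```python
# Rough sketch: list closed PRs on openai/evals via the GitHub REST API and
# note which ones were actually merged (merged_at is non-null only for merges).
# Unauthenticated requests are heavily rate-limited; set GITHUB_TOKEN for more headroom.
import os
import requests

def fetch_closed_prs(repo="openai/evals", pages=3):
    headers = {"Accept": "application/vnd.github+json"}
    token = os.environ.get("GITHUB_TOKEN")
    if token:
        headers["Authorization"] = f"Bearer {token}"
    prs = []
    for page in range(1, pages + 1):
        resp = requests.get(
            f"https://api.github.com/repos/{repo}/pulls",
            params={"state": "closed", "per_page": 100, "page": page},
            headers=headers,
            timeout=30,
        )
        resp.raise_for_status()
        prs.extend(resp.json())
    return prs

if __name__ == "__main__":
    for pr in fetch_closed_prs():
        status = "merged" if pr["merged_at"] else "closed without merge"
        print(f"#{pr['number']}: {pr['title']} [{status}]")
```

From there it should be straightforward to join the merged/unmerged flag with each PR's files and description and look for patterns in what gets accepted.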
It's worth mentioning that it's possible there are some extrinsic factors as well that can't be surfaced from looking at the PRs themselves.
It's also worth mentioning that custom-code PRs, despite what the documentation says, were not being accepted as of these issues: #520 #715
The custom-code documentation is also out of date due to the refactor, which could very well be why they weren't accepting PRs for it: they didn't want to deal with legacy code. It may be that once they update the documentation, we'll be able to submit custom-code evals.