
Generate data for model evaluation using the MMLU benchmark #180

Merged: 2 commits into instructlab:main on Jul 23, 2024

Conversation

@derekhiggins (Contributor) commented Jul 20, 2024

commit d280df9
Generate mmlubench data

Rebased from
https://github.com/aakankshaduggal/sdg/pull/15 and
https://github.com/aakankshaduggal/sdg/pull/16

Changed by me (without this change, the synth data wasn't in the correct format with merlinite-7b-lab):
src/instructlab/sdg/pipelines/simple/mmlu_bench.yaml
-        temperature: 0
+        temperature: 0.7

Refactor MMLU bench code into eval_data.py

The MMLU bench pipeline is different from the training samples
generation pipeline: instead, it generates a dataset that
can be used to evaluate model performance.

Let's assume there could be multiple eval sdg pipelines in the future,
and that they could be used with any of the training data pipelines,
so put mmlu_bench.yaml in instructlab.sdg.pipelines.eval.

Also encapsulate all the relevant code into a new sdg.eval_data
Python module whose main interface is:

```
mmlu_bench_pipe = eval_data.mmlubench_pipe_init(ctx)

eval_data.generate_eval_task_data(mmlu_bench_pipe, task_name, samples, ...)
```

Closes: #170

Co-authored-by: shiv <shivchander.s30@gmail.com>
Co-authored-by: abhi1092 <abhi1092@gmail.com>
Co-authored-by: Aakanksha Duggal <aduggal@redhat.com>
Co-authored-by: Mark McLoughlin <markmc@redhat.com>

Signed-off-by: Derek Higgins <derekh@redhat.com>

commit 54c113a (HEAD -> main, me/mmlubench)
sdg_init() refactor - take a PipelineContext rather than returning one

This makes more sense if we want to use PipelineContext to separately
initialize the mmlu-bench pipeline.

Suggestion from @bbrowning in #163.

Signed-off-by: Mark McLoughlin <markmc@redhat.com>

@mergify mergify bot added the needs-rebase label Jul 20, 2024
mergify bot commented Jul 20, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. @derekhiggins please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@markmc (Contributor) left a comment

How does it work out if most of this goes into a new instructlab.sdg.eval_data module?

eval_data.py:

def _create_mmlu_evaluation_dataset(...):
    ...

def _create_mmlu_evaluation_task(...):
    ...

def generate_eval_task_data(mmlubench_pipe, task_name, samples):
    mmlubench_data = mmlubench_pipe.generate(samples)
    ...
    _create_mmlu_evaluation_dataset(mmlubench_data)
    ...
    _create_mmlu_evaluation_task(task_name, ...)

def mmlubench_pipe_init(ctx):
    with resources.as_file(
        resources.files(EVAL_PIPELINES_PKG).joinpath("mmlu_bench.yaml")
    ) as yaml_path:
        return Pipeline.from_file(ctx, yaml_path)

Review threads (outdated, resolved) on:
src/instructlab/sdg/pipelines/simple/mmlu_bench.yaml
src/instructlab/sdg/generate_data.py
src/instructlab/sdg/utils/parse_and_convert.py
@markmc (Contributor) commented Jul 22, 2024

See also instructlab/eval#35

@mergify mergify bot added the testing Relates to testing label Jul 22, 2024
@markmc markmc changed the title Generate mmlubench data Generate data for model evaluation using the MMLU benchmark Jul 22, 2024
@markmc (Contributor) commented Jul 22, 2024

Ok, I've done a bit of refactoring to match my review comments. I haven't done any testing, however.

task_yaml = {
    "task": task_name,
    "dataset_kwargs": {"data_files": {"test": eval_data_file_path}},
    "include": "_default_mmlu_pr_template_yaml",
Contributor

I'm not entirely clear where this _default_mmlu_pr_template_yaml lives - the eval library?

I'm definitely nervous about tight coupling between the libraries - this sort of thing can make it difficult to make changes

Might it make sense to put _default_mmlu_pr_template_yaml in the same codebase that generates this task?

Contributor Author

Contributor

Oh, sorry - I thought I linked to that. Yes, that's what it looks like

Thinking about it afterwards ... and looking at https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/task_guide.md#including-a-base-yaml

Why not just include all that task config in the file we write out? There's no huge benefit to referencing another file and all the coordination that comes with that?
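For illustration, a minimal sketch of what inlining the base-template fields into the generated task config could look like. This is an assumption, not the actual change: the lm-eval field values are taken from the prompt-template discussion later in this thread and the linked task guide, and the placeholder variables at the top are hypothetical.

```python
# Hypothetical sketch: drop the "include" reference and write the base-template
# fields directly into the task config that sdg emits.
task_name = "mmlu_pr_example_task"                    # placeholder
eval_data_file_path = "node_datasets/example.jsonl"   # placeholder

task_yaml = {
    "task": task_name,
    "dataset_kwargs": {"data_files": {"test": eval_data_file_path}},
    "output_type": "multiple_choice",
    "doc_to_text": (
        "{{question.strip()}}\n"
        "A. {{choices[0]}}\n"
        "B. {{choices[1]}}\n"
        "C. {{choices[2]}}\n"
        "D. {{choices[3]}}\n"
        "Answer:"
    ),
    "doc_to_choice": ["A", "B", "C", "D"],
    "doc_to_target": "answer",
}
```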

Contributor Author

Makes sense. Looks like part of it is included here already, may as well have it all.

@derekhiggins (Contributor Author)

Ok, I've done a bit of refactoring to match my review comments. I haven't done any testing, however.

The code as-is at least runs and produces a jsonl file

@derekhiggins (Contributor Author)

See also instructlab/eval#35

Looking at the example mentioned in instructlab/eval#35
https://github.com/nathan-weinberg/eval/blob/test/tests/testdata/sdg/tonsil_data.jsonl

Some differences between what this PR produces and the example are:

this version includes a number of key/value pairs that aren't in the example (and I assume aren't required), e.g. mmlubench_question, mmlubench_answer, output, icl_*, task_description
are they required?

the example has "content" where we have "document"
I guess we need to rename it?

The example has keys that we don't have: origin_branch_name, pull_request, input, targets, row_idx, path
should we be producing these?

@nathan-weinberg here is an example of what the PR produces currently (I've split the individual jsonl lines across multiple lines for readability)
https://goodsquishy.com/upload/d7bc7c5bc21b5b2d2cbc

@markmc (Contributor) commented Jul 23, 2024

See also instructlab/eval#35

Looking at the example mentioned in instructlab/eval#35 https://github.com/nathan-weinberg/eval/blob/test/tests/testdata/sdg/tonsil_data.jsonl

Some differences between what this PR produces and the example are
...

Reading https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#writing-a-prompt-template and looking at examples

doc_to_text: "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:"
doc_to_choice: ["A", "B", "C", "D"]
doc_to_target: answer

That suggests that only question, choices, and answer are needed.

I would imagine any differences in the extra columns won't cause a problem. Personally, I'd prefer to trim the final dataset to the minimum that's needed for lm-eval, but that can be a follow-up after we merge the PR IMO. I've filed #183 for that
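As a concrete illustration of that minimum, here is a hedged sketch of a single jsonl record carrying only those three features. The values are made up, and whether `answer` is stored as an index or as a letter is an assumption here, not something the PR specifies.

```python
import json

# Hypothetical sample record: only the columns the task template reads.
record = {
    "question": "Which organ is primarily affected by tonsillitis?",
    "choices": ["The tonsils", "The liver", "The kidneys", "The heart"],
    "answer": 0,  # index into choices; rendered as "A" via doc_to_choice
}

# Each dataset row is one json object per line in the jsonl file.
with open("example_eval_data.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")
```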

@markmc (Contributor) commented Jul 23, 2024

Ok, I've done a bit of refactoring to match my review comments. I haven't done any testing, however.

The code as-is at least runs and produces a jsonl file

Ok, if you merge the _default_template_yaml contents into the task yaml, I think we're ready to merge

@markmc (Contributor) commented Jul 23, 2024

Oh, and update the create_mmlu_evaluation_yaml() docstring to include some of the details we've learned (see the sketch after this list):

  • The task will be executed by the eval library using lm-eval
  • The features required are question, choices, and answer
  • A link to the relevant lm-eval docs
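A rough sketch of what such a docstring could look like; the function signature and wording here are placeholders, not the merged code.

```python
def create_mmlu_evaluation_yaml(task_name, eval_data_file_path, yaml_file_path):
    """Write the lm-eval task yaml describing the generated MMLU-style benchmark data.

    Points to document (sketch wording only):
      * The task is executed downstream with lm-eval; see the task guide at
        https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/task_guide.md
      * The dataset features the task relies on are `question`, `choices`,
        and `answer`; extra columns are ignored.
    """
```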

@derekhiggins (Contributor Author)

Ok, if you merge the _default_template_yaml contents into the task yaml, I think we're ready to merge

Oh, and update the create_mmlu_evaluation_yaml() docstring to include some of the details we've learned:

will do,

There is still an outstanding question about the format of the output mmlubench file (see #180 (comment)). I think we need to get an answer on that first, as some of it needs to change. Or should we just follow up with any required changes?

@nathan-weinberg (Member)

@derekhiggins @markmc WRT discussion around Eval compatibility - @khaledsulayman and @alinaryan are going to run a quick test with the sample data you provided here https://goodsquishy.com/upload/d7bc7c5bc21b5b2d2cbc with MMLUBranch to ensure compatibility - if there's no issues with that we should be good from an Eval POV, cc @alimaredia @danmcp

mergify bot commented Jul 23, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. @derekhiggins please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jul 23, 2024
derekhiggins and others added 2 commits July 23, 2024 11:19
@mergify mergify bot removed the needs-rebase label Jul 23, 2024
@derekhiggins derekhiggins marked this pull request as ready for review July 23, 2024 15:25
@markmc markmc merged commit 785d1a8 into instructlab:main Jul 23, 2024
11 checks passed
@markmc (Contributor) commented Jul 23, 2024

@derekhiggins @markmc WRT discussion around Eval compatibility - @khaledsulayman and @alinaryan are going to run a quick test with the sample data you provided here https://goodsquishy.com/upload/d7bc7c5bc21b5b2d2cbc with MMLUBranch to ensure compatibility - if there's no issues with that we should be good from an Eval POV, cc @alimaredia @danmcp

Sorry, I hit merge and remembered this is what I was waiting for - let us know if you hit any issues and @derekhiggins or I will resolve them quickly 👍

@markmc markmc added this to the 0.2.1 milestone Jul 23, 2024
@nathan-weinberg (Member)

@derekhiggins @markmc WRT discussion around Eval compatibility - @khaledsulayman and @alinaryan are going to run a quick test with the sample data you provided here https://goodsquishy.com/upload/d7bc7c5bc21b5b2d2cbc with MMLUBranch to ensure compatibility - if there's no issues with that we should be good from an Eval POV, cc @alimaredia @danmcp

Sorry, I hit merge and remembered this is what I was waiting for - let us know if you hit any issues and @derekhiggins or I will resolve them quickly 👍

Np @markmc, we will follow up if any changes are needed 👍

@khaledsulayman (Member)

@markmc @derekhiggins it seems we are missing the necessary yaml files to put in the tasks directory in order to test the data @nathan-weinberg linked. Is this just the _default_template_yaml or are we missing something?

@markmc (Contributor) commented Jul 23, 2024

@markmc @derekhiggins it seems we are missing the necessary yaml files to put in the tasks directory in order to test the data @nathan-weinberg linked. Is this just the _default_template_yaml or are we missing something?

See #180 (comment)

_default_template_yaml should not be needed anymore; its contents are included directly in the task yaml

But do let us know if something specific was missed

@khaledsulayman (Member)

@markmc I'm sorry, I seem to be missing the task yaml you're referring to. Is it linked here or is it somewhere in the repository? The goodsquishy link in Nathan's comment appears to be jsonl.

@markmc (Contributor) commented Jul 23, 2024

@markmc I'm sorry, I seem to be missing the task yaml you're referring to. Is it linked here or is it somewhere in the repository? The goodsquishy link in Nathan's comment appears to be jsonl.

Here's where it gets generated by ilab data generate:

yaml_file_path = f"{output_dir}/node_datasets_{date_suffix}/{task_name}_{date_suffix}_{task_name}_task.yaml"
logger.info(f"Saving MMLU Task yaml {yaml_file_path}")
_create_mmlu_evaluation_task(
    task_name=task_name,
    eval_data_file_path=eval_data_file_path,
    yaml_file_path=yaml_file_path,
)
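To tie that call site back to the earlier task-config discussion, here is a hedged sketch of what _create_mmlu_evaluation_task plausibly does with those arguments. The body below is an assumption for illustration, not the merged implementation.

```python
import os

import yaml  # PyYAML


def _create_mmlu_evaluation_task(task_name, eval_data_file_path, yaml_file_path):
    # Sketch only: assemble the lm-eval task config and dump it to the path
    # that generate_data logs above.
    task_yaml = {
        "task": task_name,
        "dataset_kwargs": {"data_files": {"test": eval_data_file_path}},
        "doc_to_choice": ["A", "B", "C", "D"],
        "doc_to_target": "answer",
        # ... plus doc_to_text and the other fields formerly in _default_template_yaml
    }
    os.makedirs(os.path.dirname(yaml_file_path) or ".", exist_ok=True)
    with open(yaml_file_path, "w", encoding="utf-8") as f:
        yaml.safe_dump(task_yaml, f)
```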
