
Generate data for model evaluation using the MMLU benchmark #180

Merged: 2 commits into instructlab:main on Jul 23, 2024

Conversation

@derekhiggins (Contributor) commented Jul 20, 2024

commit d280df9
Generate mmlubench data

Rebased from
https://github.com/aakankshaduggal/sdg/pull/15 and
https://github.com/aakankshaduggal/sdg/pull/16

Changed by me (without this change, the synth data wasn't in the correct format with merlinite-7b-lab):
src/instructlab/sdg/pipelines/simple/mmlu_bench.yaml
-        temperature: 0
+        temperature: 0.7

Refactor MMLU bench code into eval_data.py

The MMLU bench pipeline is different from the training samples
generation pipeline: instead, it generates a dataset that
can be used to evaluate model performance.

Let's assume there could be multiple eval sdg pipelines in the future,
and that they could be used with any of the training data pipelines,
so put mmlu_bench.yaml in instructlab.sdg.pipelines.eval.

Also encapsulate all the relevant code into a new sdg.eval_data
Python module whose main interface is:

```
mmlu_bench_pipe = eval_data.mmlubench_pipe_init(ctx)

eval_data.generate_eval_task_data(mmlu_bench_pipe, task_name, samples, ...)
```

Closes: #170

Co-authored-by: shiv <shivchander.s30@gmail.com>
Co-authored-by: abhi1092 <abhi1092@gmail.com>
Co-authored-by: Aakanksha Duggal <aduggal@redhat.com>
Co-authored-by: Mark McLoughlin <markmc@redhat.com>

Signed-off-by: Derek Higgins <derekh@redhat.com>

commit 54c113a (HEAD -> main, me/mmlubench)
sdg_init() refactor - take a PipelineContext rather than returning one

This makes more sense if we want to use PipelineContext to separately
initialize the mmlu-bench pipeline.

Suggestion from @bbrowning in #163.

Signed-off-by: Mark McLoughlin <markmc@redhat.com>

@mergify mergify bot added the needs-rebase label Jul 20, 2024
mergify bot commented Jul 20, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. @derekhiggins please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@markmc (Contributor) left a comment

How does it work out if most of this goes into a new instructlab.sdg.eval_data module?

eval_data.py:

def _create_mmlu_evaluation_dataset(...):
    ...

def _create_mmlu_evaluation_task(...):
    ...

def generate_eval_task_data(mmlubench_pipe, task_name, samples):
    mmlubench_data = mmlubench_pipe.generate(samples)
    ...
    _create_mmlu_evaluation_dataset(mmlubench_data)
    ...
    _create_mmlu_evaluation_task(task_name, ...)

def mmlubench_pipe_init(ctx):
    with resources.as_file(
        resources.files(EVAL_PIPELINES_PKG).joinpath("mmlu_bench.yaml")
    ) as yaml_path:
        return Pipeline.from_file(ctx, yaml_path)

Review threads (outdated, resolved) on:
src/instructlab/sdg/pipelines/simple/mmlu_bench.yaml
src/instructlab/sdg/generate_data.py
src/instructlab/sdg/utils/parse_and_convert.py
@markmc (Contributor) commented Jul 22, 2024

See also instructlab/eval#35

@mergify mergify bot added the testing Relates to testing label Jul 22, 2024
@markmc markmc changed the title Generate mmlubench data Generate data for model evaluation using the MMLU benchmark Jul 22, 2024
@markmc (Contributor) commented Jul 22, 2024

Ok, I've done a bit of refactoring to match my review comments. I haven't done any testing, however.

task_yaml = {
    "task": task_name,
    "dataset_kwargs": {"data_files": {"test": eval_data_file_path}},
    "include": "_default_mmlu_pr_template_yaml",
Contributor

I'm not entirely clear where this _default_mmlu_pr_template_yaml lives - the eval library?

I'm definitely nervous about tight coupling between the libraries - this sort of thing can make it difficult to make changes

Might it make sense to put _default_mmlu_pr_template_yaml in the same codebase that generates this task?

Contributor Author

Contributor

Oh, sorry - I thought I linked to that. Yes, that's what it looks like

Thinking about it afterwards ... and looking at https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/task_guide.md#including-a-base-yaml

Why not just include all that task config in the file we write out? There's no huge benefit to referencing another file and all the coordination that comes with that?
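For illustration, a minimal sketch of what inlining the base-template fields into the generated task config could look like. This is an assumption, not the actual change: the lm-eval field values are taken from the prompt-template discussion later in this thread and the linked task guide, and the placeholder variables at the top are hypothetical.

```python
# Hypothetical sketch: drop the "include" reference and write the base-template
# fields directly into the task config that sdg emits.
task_name = "mmlu_pr_example_task"                    # placeholder
eval_data_file_path = "node_datasets/example.jsonl"   # placeholder

task_yaml = {
    "task": task_name,
    "dataset_kwargs": {"data_files": {"test": eval_data_file_path}},
    "output_type": "multiple_choice",
    "doc_to_text": (
        "{{question.strip()}}\n"
        "A. {{choices[0]}}\n"
        "B. {{choices[1]}}\n"
        "C. {{choices[2]}}\n"
        "D. {{choices[3]}}\n"
        "Answer:"
    ),
    "doc_to_choice": ["A", "B", "C", "D"],
    "doc_to_target": "answer",
}
```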

Contributor Author

Makes sense. Looks like part of it is included here already, may as well have it all.

@derekhiggins (Contributor Author)

Ok, I've done a bit of refactoring to match my review comments. I haven't done any testing, however.

The code as-is at least runs and produces a jsonl file

@derekhiggins (Contributor Author)

See also instructlab/eval#35

Looking at the example mentioned in instructlab/eval#35
https://github.com/nathan-weinberg/eval/blob/test/tests/testdata/sdg/tonsil_data.jsonl

Some differences between what this PR produces and the example are:

this version includes a number of key/value pairs that aren't in the example (and I assume aren't required), e.g. mmlubench_question, mmlubench_answer, output, icl_*, task_description
are they required?

the example has "content" where we have "document"
I guess we need to rename it?

The example has keys that we don't have: origin_branch_name, pull_request, input, targets, row_idx, path
should we be producing these?

@nathan-weinberg here is an example of what the PR produces currently (I've split the individual jsonl lines across multiple lines for readability)
https://goodsquishy.com/upload/d7bc7c5bc21b5b2d2cbc

@markmc (Contributor) commented Jul 23, 2024

See also instructlab/eval#35

Looking at the example mentioned in instructlab/eval#35 https://github.com/nathan-weinberg/eval/blob/test/tests/testdata/sdg/tonsil_data.jsonl

Some differences between what this PR produces and the example are
...

Reading https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#writing-a-prompt-template and looking at examples

doc_to_text: "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:"
doc_to_choice: ["A", "B", "C", "D"]
doc_to_target: answer

That suggests that only question, choices, and answer are needed.

I would imagine any differences in the extra columns won't cause a problem. Personally, I'd prefer to trim the final dataset to the minimum that's needed for lm-eval, but that can be a follow-up after we merge the PR IMO. I've filed #183 for that
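As a concrete illustration of that minimum, here is a hedged sketch of a single jsonl record carrying only those three features. The values are made up, and whether `answer` is stored as an index or as a letter is an assumption here, not something the PR specifies.

```python
import json

# Hypothetical sample record: only the columns the task template reads.
record = {
    "question": "Which organ is primarily affected by tonsillitis?",
    "choices": ["The tonsils", "The liver", "The kidneys", "The heart"],
    "answer": 0,  # index into choices; rendered as "A" via doc_to_choice
}

# Each dataset row is one json object per line in the jsonl file.
with open("example_eval_data.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")
```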

@markmc (Contributor) commented Jul 23, 2024

Ok, I've done a bit of refactoring to match my review comments. I haven't done any testing, however.

The code as-is at least runs and produces a jsonl file

Ok, if you merge the _default_template_yaml contents into the task yaml, I think we're ready to merge

@markmc (Contributor) commented Jul 23, 2024

Oh, and update the create_mmlu_evaluation_yaml() docstring to include some of the details we've learned (see the sketch after this list):

  • The task will be executed by the eval library using lm-eval
  • The features required are question, choices, and answer
  • A link to the relevant lm-eval docs
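A rough sketch of what such a docstring could look like; the function signature and wording here are placeholders, not the merged code.

```python
def create_mmlu_evaluation_yaml(task_name, eval_data_file_path, yaml_file_path):
    """Write the lm-eval task yaml describing the generated MMLU-style benchmark data.

    Points to document (sketch wording only):
      * The task is executed downstream with lm-eval; see the task guide at
        https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/task_guide.md
      * The dataset features the task relies on are `question`, `choices`,
        and `answer`; extra columns are ignored.
    """
```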

@derekhiggins (Contributor Author)

Ok, if you merge the _default_template_yaml contents into the task yaml, I think we're ready to merge

Oh, and update the create_mmlu_evaluation_yaml() docstring to include some of the details we've learned:

will do,

There is still an outstanding question about the format of the output mmlubench file (see #180 (comment)). I think we need to get an answer on that first, as some of it needs to change. Or should we just follow up with any required changes?

@nathan-weinberg (Member)

@derekhiggins @markmc WRT discussion around Eval compatibility - @khaledsulayman and @alinaryan are going to run a quick test with the sample data you provided here https://goodsquishy.com/upload/d7bc7c5bc21b5b2d2cbc with MMLUBranch to ensure compatibility - if there's no issues with that we should be good from an Eval POV, cc @alimaredia @danmcp

mergify bot commented Jul 23, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. @derekhiggins please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jul 23, 2024
derekhiggins and others added 2 commits July 23, 2024 11:19
@mergify mergify bot removed the needs-rebase label Jul 23, 2024
@derekhiggins derekhiggins marked this pull request as ready for review July 23, 2024 15:25
@markmc markmc merged commit 785d1a8 into instructlab:main Jul 23, 2024
11 checks passed
@markmc (Contributor) commented Jul 23, 2024

@derekhiggins @markmc WRT discussion around Eval compatibility - @khaledsulayman and @alinaryan are going to run a quick test with the sample data you provided here https://goodsquishy.com/upload/d7bc7c5bc21b5b2d2cbc with MMLUBranch to ensure compatibility - if there's no issues with that we should be good from an Eval POV, cc @alimaredia @danmcp

Sorry, I hit merge and remembered this is what I was waiting for - let us know if you hit any issues and @derekhiggins or I will resolve them quickly 👍

@markmc markmc added this to the 0.2.1 milestone Jul 23, 2024
@nathan-weinberg (Member)

@derekhiggins @markmc WRT discussion around Eval compatibility - @khaledsulayman and @alinaryan are going to run a quick test with the sample data you provided here https://goodsquishy.com/upload/d7bc7c5bc21b5b2d2cbc with MMLUBranch to ensure compatibility - if there's no issues with that we should be good from an Eval POV, cc @alimaredia @danmcp

Sorry, I hit merge and remembered this is what I was waiting for - let us know if you hit any issues and @derekhiggins or I will resolve them quickly 👍

Np @markmc, we will follow up if any changes are needed 👍

@khaledsulayman (Member)

@markmc @derekhiggins it seems we are missing the necessary yaml files to put in the tasks directory in order to test the data @nathan-weinberg linked. Is this just the _default_template_yaml or are we missing something?

@markmc (Contributor) commented Jul 23, 2024

@markmc @derekhiggins it seems we are missing the necessary yaml files to put in the tasks directory in order to test the data @nathan-weinberg linked. Is this just the _default_template_yaml or are we missing something?

See #180 (comment)

_default_template_yaml should not be needed anymore; its contents are included directly in the task yaml

But do let us know if something specific was missed

@khaledsulayman (Member)

@markmc I'm sorry, I seem to be missing the task yaml you're referring to. Is it linked here or is it somewhere in the repository? The goodsquishy link in Nathan's comment appears to be jsonl.

@markmc (Contributor) commented Jul 23, 2024

@markmc I'm sorry, I seem to be missing the task yaml you're referring to. Is it linked here or is it somewhere in the repository? The goodsquishy link in Nathan's comment appears to be jsonl.

Here's where it gets generated by ilab data generate:

yaml_file_path = f"{output_dir}/node_datasets_{date_suffix}/{task_name}_{date_suffix}_{task_name}_task.yaml"
logger.info(f"Saving MMLU Task yaml {yaml_file_path}")
_create_mmlu_evaluation_task(
    task_name=task_name,
    eval_data_file_path=eval_data_file_path,
    yaml_file_path=yaml_file_path,
)
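To tie that call site back to the earlier task-config discussion, here is a hedged sketch of what _create_mmlu_evaluation_task plausibly does with those arguments. The body below is an assumption for illustration, not the merged implementation.

```python
import os

import yaml  # PyYAML


def _create_mmlu_evaluation_task(task_name, eval_data_file_path, yaml_file_path):
    # Sketch only: assemble the lm-eval task config and dump it to the path
    # that generate_data logs above.
    task_yaml = {
        "task": task_name,
        "dataset_kwargs": {"data_files": {"test": eval_data_file_path}},
        "doc_to_choice": ["A", "B", "C", "D"],
        "doc_to_target": "answer",
        # ... plus doc_to_text and the other fields formerly in _default_template_yaml
    }
    os.makedirs(os.path.dirname(yaml_file_path) or ".", exist_ok=True)
    with open(yaml_file_path, "w", encoding="utf-8") as f:
        yaml.safe_dump(task_yaml, f)
```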
