Introduce data mixing recipe yaml files #203

Merged 1 commit on Jul 26, 2024

Commits on Jul 25, 2024

  1. Introduce data mixing recipe yaml files

    This introduces Recipe yaml files, which are used both as an input
    into the data mixing process and as an output of the process.
    
    As an input, we have some default recipe files that specify any
    precomputed datasets that should be mixed with data from new skills
    when generating the overall mix of samples that will be sent to the
    training process.
    
    If a downstream user/packager wants to add default recipes (and
    datasets), they should install them to a path like
    `/usr/share/instructlab/sdg`. The exact path varies by platform: it
    is resolved with Python's `platformdirs.PlatformDirs` to respect
    platform conventions.
    
    Recipes should be in sdg/default_data_recipes/{knowledge,skills}.yaml.
    
    Datasets should be in sdg/datasets, but this location is not enforced.
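
The lookup described above could be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the helper names `candidate_data_dirs` and `find_default_recipe` are hypothetical, the `appname` value and the assumption that `default_data_recipes` sits directly under each site data directory are inferred from the paths in the commit message, and the fallback path is only there so the sketch runs without `platformdirs` installed.

```python
import os


def candidate_data_dirs():
    """Return the site data directories to search, in priority order."""
    try:
        # platformdirs is a third-party package. With multipath=True,
        # site_data_dir may contain several directories joined by os.pathsep.
        from platformdirs import PlatformDirs

        dirs = PlatformDirs(
            appname=os.path.join("instructlab", "sdg"), multipath=True
        )
        return dirs.site_data_dir.split(os.pathsep)
    except ImportError:
        # Fallback for this sketch only; the path is the example
        # given in the commit message.
        return ["/usr/share/instructlab/sdg"]


def find_default_recipe(kind, data_dirs):
    """Return the first existing default recipe for `kind`, or None.

    `kind` is "knowledge" or "skills". The directory layout is an
    assumption based on the commit message.
    """
    for data_dir in data_dirs:
        path = os.path.join(data_dir, "default_data_recipes", f"{kind}.yaml")
        if os.path.isfile(path):
            return path
    return None
```

A caller would use it as `find_default_recipe("skills", candidate_data_dirs())` and treat `None` as "no default recipe installed", which matches the current upstream state where none are shipped.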
    
    Currently we are not shipping any default recipe files upstream, but
    there is a unit test in place to ensure that loading default recipes
    from disk works once we decide how we want to ship a precomputed
    dataset to our upstream users.
    
    As an output of the data generation process, we write recipe yamls to
    document which datasets were mixed together and in what proportions,
    along with the system prompt that was used during generation. Here's
    an example of a recipe yaml put into the output directory after
    running data generation:
    
    ```yaml
    datasets:
    - path: node_datasets_2024-07-25T17_49_46/knowledge_tonsils_overview_e2e-tonsils_p10.jsonl
      sampling_size: 1.0
    metadata:
      sys_prompt: "I am, Red Hat\xAE Instruct Model based on Granite 7B, an AI language\
        \ model developed by Red Hat and IBM Research, based on the Granite-7b-base language\
        \ model. My primary function is to be a chat assistant."
    ```
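
As a rough illustration of how a `sampling_size` value might translate into a number of samples, here is a minimal sketch. It assumes, and the commit message does not confirm, that a float is a fraction of the dataset and an int is an absolute count. The function name `sample_dataset` is hypothetical, and capping at the dataset size is a simplification made for this sketch.

```python
import random


def sample_dataset(rows, sampling_size, seed=0):
    """Pick rows from one dataset according to its sampling_size.

    Assumption for this sketch: a float is a fraction of the dataset,
    an int is an absolute number of samples.
    """
    if isinstance(sampling_size, float):
        count = int(len(rows) * sampling_size)
    else:
        count = int(sampling_size)
    # Never ask for more rows than exist (no upsampling in this sketch).
    count = min(count, len(rows))
    rng = random.Random(seed)  # fixed seed keeps the example reproducible
    return rng.sample(rows, count)
```

Under these assumptions, the `sampling_size: 1.0` entry in the example recipe would select every row of the listed dataset.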
    
    Datasets may be referenced by absolute filesystem paths or by relative
    paths, which are resolved relative to the recipe's own directory.
    
    Anything written out under the metadata section (currently just
    sys_prompt) is purely informational for the user and ignored when
    loading recipes.
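
The loading rules above (relative paths resolved against the recipe's directory, `metadata` ignored) can be sketched as below. This is an illustrative helper, not the implementation from this PR: `datasets_from_recipe` is a hypothetical name, and it takes an already parsed recipe (for example the dict a yaml parser would return) so that the sketch needs only the standard library.

```python
import os


def datasets_from_recipe(recipe, recipe_path):
    """Return (path, sampling_size) pairs from a parsed recipe.

    `recipe` is the dict obtained by parsing the yaml file located
    at `recipe_path`.
    """
    recipe_dir = os.path.dirname(os.path.abspath(recipe_path))
    resolved = []
    # Only the "datasets" key is read; "metadata" is deliberately ignored.
    for entry in recipe.get("datasets") or []:
        path = entry["path"]
        if not os.path.isabs(path):
            # Relative paths are relative to the recipe's own directory.
            path = os.path.join(recipe_dir, path)
        resolved.append((os.path.normpath(path), entry["sampling_size"]))
    return resolved
```

Resolving against the recipe's directory rather than the current working directory means an output directory containing a recipe and its datasets can be moved as a unit without editing the recipe.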
    
    Parts of this are extracted and rebased from aakankshaduggal#4 and
    aakankshaduggal#20.
    
    Refs instructlab#162, instructlab#171, instructlab#185, instructlab#201.
    
    Co-authored-by: shivchander <shivchander.s30@gmail.com>
    Co-authored-by: Khaled Sulayman <khaled@thesulaymans.com>
    Co-authored-by: abhi1092 <abhi1092@gmail.com>
    Co-authored-by: Aakanksha Duggal <aduggal@redhat.com>
    Co-authored-by: Mark McLoughlin <markmc@redhat.com>
    Signed-off-by: Ben Browning <bbrownin@redhat.com>
    6 people committed Jul 25, 2024
    Commit fcfacfc