We decided not to move forward with the Hugging Face approach. However, to get better results with SDG + train, we need a precomputed dataset that the newly generated data will be mixed with.
The only way to specify a default dataset today is to supply a default recipe YAML file for knowledge and/or skills. These would reside at a path like /usr/share/instructlab/sdg/default_data_recipes/skills.yaml, ~/.local/share/instructlab/sdg/default_data_recipes/skills.yaml, etc. (the exact path is system-dependent and comes from platformdirs.PlatformDirs). So, a user could do this today by hand-writing a default recipe at the correct path. Or, something like ilab data download could download that dataset from Hugging Face, place it into an appropriate path, and then write out a default recipe that references it.
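To make the "write out a default recipe" step concrete, here is a minimal sketch using only the standard library. The helper names (default_recipe_path, write_default_recipe) are hypothetical, the data directory is passed in by the caller rather than resolved through platformdirs, and the recipe layout (a datasets list with path and sampling_size keys) is an assumption about the schema that should be checked against the SDG data mixing code.

```python
from pathlib import Path


def default_recipe_path(data_dir: Path, kind: str) -> Path:
    """Return <data_dir>/default_data_recipes/<kind>.yaml.

    data_dir is something like ~/.local/share/instructlab/sdg; the caller
    is expected to resolve it (for example via platformdirs).
    """
    if kind not in ("knowledge", "skills"):
        raise ValueError(f"unknown recipe kind: {kind!r}")
    return Path(data_dir) / "default_data_recipes" / f"{kind}.yaml"


def write_default_recipe(
    data_dir: Path, kind: str, dataset_path: str, sampling_size: float = 1.0
) -> Path:
    """Write a one-dataset default recipe and return its path.

    ASSUMPTION: the recipe schema is a top-level `datasets` list whose
    entries carry `path` and `sampling_size`.
    """
    recipe_path = default_recipe_path(data_dir, kind)
    recipe_path.parent.mkdir(parents=True, exist_ok=True)
    recipe_path.write_text(
        "datasets:\n"
        f"  - path: {dataset_path}\n"
        f"    sampling_size: {sampling_size}\n",
        encoding="utf-8",
    )
    return recipe_path
```

A hypothetical download command could call write_default_recipe once per dataset kind after placing the dataset file on disk, so the user never has to hand-write the YAML.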
Once the default recipe file gets in the right place, the rest of the existing data generation code should automatically pick up and use that recipe for mixing.
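As a rough illustration of what "automatically pick up" could mean, the sketch below walks a list of candidate data directories and returns the first default recipe it finds. The function name find_default_recipe and the first-match-wins search order are assumptions for illustration, not a description of the actual lookup in the data generation code.

```python
from pathlib import Path
from typing import Iterable, Optional


def find_default_recipe(data_dirs: Iterable[Path], kind: str) -> Optional[Path]:
    """Return the first existing <dir>/default_data_recipes/<kind>.yaml.

    ASSUMPTION: directories are searched in the order given and the first
    match wins; returns None when no default recipe is installed.
    """
    for data_dir in data_dirs:
        candidate = Path(data_dir) / "default_data_recipes" / f"{kind}.yaml"
        if candidate.is_file():
            return candidate
    return None
```

With this shape, a missing recipe simply yields None, which matches the idea that mixing with a precomputed dataset only happens once a recipe has been put in place.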
Thinking more from a user's point of view, is downloading one or more precomputed datasets a different task from creating a recipe that uses those datasets? Would I want to ilab data download <some other HF dataset>, just like I can download different models? Where do those datasets get stored on disk when I do so? Once I've downloaded them, how do I generate a recipe to use them? How do I pass my custom dataset and/or recipe into ilab data generate?
And, all of this is only relevant for users with big hardware doing the full data generation pipeline and phased training, right? Does the precomputed dataset impact the output at all for any user doing legacy training, simple pipeline, or non-phased training?
Related to #201
A couple of approaches that we could take:
ilab data download
- This will pull the data from InstructLab's Hugging Face organization.