Checkpoint files make iterating on a taxonomy awkward #245

Open
bbrowning opened this issue Aug 16, 2024 · 2 comments
Labels
UX Affects the User Experience

Comments

@bbrowning
Contributor

When we create a checkpoint file, it has no notion of the version of the qna.yaml that it came from. We just assume a qna.yaml is unchanging and write a checkpoint file that gets used for that leaf node even if later on the qna.yaml is updated or changed entirely.

When writing a new knowledge or skill, a common workflow for me is to make a first attempt at the qna.yaml, run data generation, and see how the generated data looks. Then I may tweak my qna.yaml (change context, questions and answers, adjust the actual knowledge docs themselves) and do this multiple more times until I'm getting good data generation results. With the addition of checkpoint files, I now have to remember to manually remove all the checkpoints for this taxonomy leaf node every time before I re-run the data generation step. Otherwise, it just picks up the old checkpoint file even though I've since changed the qna.yaml to something that now makes that old checkpoint no longer valid.

It would be great if the checkpoints were somehow tied to the qna.yaml itself, so that if I change the qna.yaml in any way it knows to regenerate data there instead of reusing the checkpoint. Perhaps something as simple as calculating a hash of the qna.yaml file and embedding that in the name, directory, or content of the checkpoints would suffice?
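As a rough illustration of the hashing idea, here is a minimal sketch that derives a short fingerprint from the bytes of a qna.yaml and uses it as a directory component for that leaf node's checkpoints. The function names and the `<root>/<leaf node>/<hash>/` layout are hypothetical, not how the project currently stores checkpoints, and the sketch only covers the qna.yaml file itself.

```python
import hashlib
from pathlib import Path


def qna_fingerprint(qna_path: Path, length: int = 12) -> str:
    """Return a short hex digest of the raw bytes of a qna.yaml file."""
    digest = hashlib.sha256(qna_path.read_bytes()).hexdigest()
    return digest[:length]


def checkpoint_dir_for(checkpoint_root: Path, leaf_node: str, qna_path: Path) -> Path:
    """Build a checkpoint directory keyed by the qna.yaml fingerprint.

    Hypothetical layout: <checkpoint_root>/<leaf_node>/<fingerprint>/
    Any edit to the qna.yaml yields a different directory, so stale
    checkpoints are simply never found instead of being silently reused.
    """
    return checkpoint_root / leaf_node / qna_fingerprint(qna_path)
```

With a layout like this, an edited qna.yaml would miss the old checkpoints and trigger regeneration, and the old directories could be cleaned up separately.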

@derekhiggins
Contributor

> It would be great if the checkpoints were somehow tied to the qna.yaml itself, so that if I change the qna.yaml in any way it knows to regenerate data there instead of reusing the checkpoint. Perhaps something as simple as calculating a hash of the qna.yaml file and embedding that in the name, directory, or content of the checkpoints would suffice?

In addition to the qna.yaml file, it's probably also worth including the model being used and the contents of the pipeline in the hash, as the user may also be switching models or iterating on development of a new custom pipeline.
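A combined key along those lines might look like the sketch below. It assumes the pipeline is defined by a set of files on disk and that the model is identified by a name string. Both are assumptions for illustration, and `checkpoint_key` is a hypothetical helper rather than an existing function in the project.

```python
import hashlib
from pathlib import Path
from typing import Iterable


def checkpoint_key(qna_path: Path, model_name: str, pipeline_files: Iterable[Path]) -> str:
    """Hash the qna.yaml, the model name, and the pipeline file contents together."""
    hasher = hashlib.sha256()

    def add(label: str, data: bytes) -> None:
        # Length-prefix the label and the payload so that different inputs
        # cannot produce the same byte stream by shifting field boundaries.
        for chunk in (label.encode("utf-8"), data):
            hasher.update(len(chunk).to_bytes(8, "big"))
            hasher.update(chunk)

    add("qna", qna_path.read_bytes())
    add("model", model_name.encode("utf-8"))
    # Sort so the key does not depend on the order the files were listed in.
    for path in sorted(pipeline_files):
        add("pipeline:" + path.name, path.read_bytes())
    return hasher.hexdigest()
```

Changing any one of the three inputs produces a different key, so a checkpoint written under one model or pipeline would not be picked up by a run that uses another.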

@bbrowning
Contributor Author

@derekhiggins Nice foresight there. Some testers just hit the case you described: they changed the pipeline from one run to the next, and the second run picked up old data from the previous pipeline. That caused issues, and they had to blow away the generated datasets to get going again.

@nathan-weinberg nathan-weinberg added the UX Affects the User Experience label Aug 20, 2024