Introduce v3 schema #39
5723d66 to 7be2c44
lgtm
There is an example that works with the proposed schema here: https://github.com/russellb/taxonomy/tree/v3-example/knowledge/tonsils
We also need to include the compositional skills JSON. The taxonomy repo uses the latest schema version, so a missing compositional skills JSON will prevent contributing compositional skills.
It seems the main change here is to wrap some context around seed q/a. This now requires 5 examples with 3 q/a pairs. So now a total of 15 q/a pairs.
Done! I wasn't sure if I had to duplicate it even though it didn't change. That's fine though. I added it.
That's right.
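To make the arithmetic above concrete, here is a minimal sketch of what the regrouped seed data could look like and how the pair count adds up. The helper name `count_qa_pairs` and the sample passages are hypothetical, and the figure of 5 examples comes from the comment above rather than from the part of the schema shown in this PR.

```python
# Hypothetical helper, not part of the schema package: counts q/a pairs
# in v3-style seed examples and checks the per-example minimum of 3.
def count_qa_pairs(seed_examples, min_pairs=3):
    total = 0
    for index, example in enumerate(seed_examples):
        pairs = example["questions_and_answers"]
        if len(pairs) < min_pairs:
            raise ValueError(
                f"seed example {index} has {len(pairs)} q/a pairs; "
                f"need at least {min_pairs}"
            )
        total += len(pairs)
    return total


# Placeholder data: 5 seed examples, each with a context and 3 q/a pairs.
seed_examples = [
    {
        "context": f"Passage {n} copied from the knowledge document.",
        "questions_and_answers": [
            {"question": f"Question {n}.{k}?", "answer": f"Answer {n}.{k}."}
            for k in range(1, 4)
        ],
    }
    for n in range(1, 6)
]

print(count_qa_pairs(seed_examples))  # 5 examples x 3 pairs = 15
```

Real validation would still go through the JSON schema itself; this only illustrates the shape and the count.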
}
},
"document_outline": {
"description": "An outline of the document.",
If the document_outline is used anywhere in the prompt to the teacher model for SDG, we must be more precise in the description about how it is used and what people should put here. Otherwise people will treat it as just a commentary field where they put something in to shut up the YAML validation.
See task_description in compositional skills.
done, thanks!
Closes instructlab#38

v3 includes some backwards incompatible changes to the knowledge schema format. Here is a diff against v2. The changes are:

- Q&A pairs now have an associated context blob from the knowledge document.
- There is new `document_outline` field.
- drop `task_description`

```diff
--- src/instructlab/schema/v2/knowledge.json 2024-07-17 12:56:37
+++ src/instructlab/schema/v3/knowledge.json 2024-07-18 13:21:59
@@ -6,9 +6,9 @@
   "required": [
     "created_by",
     "domain",
-    "task_description",
     "seed_examples",
-    "document"
+    "document",
+    "document_outline"
   ],
   "unevaluatedProperties": false,
   "properties": {
@@ -27,15 +27,6 @@
         "Pop culture"
       ]
     },
-    "task_description": {
-      "description": "A description of the task which is used in prompts to the teacher model during synthetic data generation. The description should be detailed and prescriptive to improve the teacher model's responses.",
-      "type": "string",
-      "minLength": 1,
-      "examples": [
-        "To teach a language model about softball history",
-        "To teach a language model about tabby cats"
-      ]
-    },
     "seed_examples": {
       "description": "An array of seed examples for synthetic data generation.",
       "type": "array",
@@ -44,20 +35,39 @@
       "items": {
         "type": "object",
         "required": [
-          "question",
-          "answer"
+          "context",
+          "questions_and_answers"
         ],
         "unevaluatedProperties": false,
         "properties": {
-          "question": {
-            "description": "A question used for synthetic data generation.",
+          "context": {
+            "description": "Context from the document associated with this set of sample q&a pairs.",
             "type": "string",
             "minLength": 1
           },
-          "answer": {
-            "description": "The desired response for the question.",
-            "type": "string",
-            "minLength": 1
+          "questions_and_answers": {
+            "type": "array",
+            "minItems": 3,
+            "uniqueItems": true,
+            "items": {
+              "type": "object",
+              "required": [
+                "question",
+                "answer"
+              ],
+              "properties": {
+                "question": {
+                  "description": "A question used for synthetic data generation.",
+                  "type": "string",
+                  "minLength": 1
+                },
+                "answer": {
+                  "description": "The desired response for the question.",
+                  "type": "string",
+                  "minLength": 1
+                }
+              }
+            }
           }
         }
       }
@@ -104,6 +114,14 @@
           }
         }
       }
+    },
+    "document_outline": {
+      "description": "A brief summary of the document.",
+      "type": "string",
+      "minLength": 1,
+      "examples": [
+        "Overview of Human tonsils, describing their types, locations, structure, function, and clinical significance, with a specific focus on their role in the immune system and related health issues."
+      ]
     }
   }
 }
```

Signed-off-by: Russell Bryant <rbryant@redhat.com>
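For anyone holding v2 knowledge data, the changes listed in the commit message above could be applied roughly as in the following sketch. The function `migrate_knowledge_v2_to_v3` is hypothetical and not part of this PR. It assumes the author supplies the context passages and the outline by hand, and that the flat v2 q/a pairs are already ordered so that each consecutive group of three belongs to one context. The `document` value is a placeholder that is carried over unchanged.

```python
import copy


def migrate_knowledge_v2_to_v3(v2, contexts, document_outline, pairs_per_context=3):
    """Hypothetical helper: regroup flat v2 q/a pairs under author-supplied contexts."""
    pairs = v2["seed_examples"]
    if len(pairs) != len(contexts) * pairs_per_context:
        raise ValueError(
            f"expected {len(contexts) * pairs_per_context} q/a pairs, got {len(pairs)}"
        )

    # Carry over everything except the dropped and restructured fields.
    v3 = {
        key: copy.deepcopy(value)
        for key, value in v2.items()
        if key not in ("task_description", "seed_examples")
    }
    v3["document_outline"] = document_outline
    v3["seed_examples"] = [
        {
            "context": context,
            "questions_and_answers": [
                {"question": pair["question"], "answer": pair["answer"]}
                for pair in pairs[i * pairs_per_context:(i + 1) * pairs_per_context]
            ],
        }
        for i, context in enumerate(contexts)
    ]
    return v3


# Placeholder v2 data; the field values are illustrative only.
v2_doc = {
    "created_by": "someuser",
    "domain": "Pop culture",
    "task_description": "To teach a language model about tabby cats",
    "seed_examples": [
        {"question": f"Question {k}?", "answer": f"Answer {k}."} for k in range(1, 4)
    ],
    "document": {"placeholder": "carried over unchanged by this sketch"},
}

v3_doc = migrate_knowledge_v2_to_v3(
    v2_doc,
    contexts=["A passage about tabby cats taken from the knowledge document."],
    document_outline="Overview of tabby cats.",
)
print(sorted(v3_doc))
```

The context text and outline cannot be derived mechanically from v2 data, which is why the sketch takes them as arguments.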
We use a GitHub action:

https://github.com/hynek/build-and-inspect-python-package?tab=readme-ov-file

which runs a script to lint the Python wheel it builds:

https://pypi.org/project/check-wheel-contents/

This tool is warning about duplicate files between v2 and v3. The skills schema intentionally did not change, and we still want a copy in both places.

Signed-off-by: Russell Bryant <rbryant@redhat.com>
I think we need to expand/improve the descriptions, but we can do this in later PRs.
Thank you. Yes, agreed and understood. Doc updates are going to be needed in various repos too, probably. I've got that on the todo list in instructlab/sdg#160
3e808f1 Introduce v3 schema
6acb4aa ci: Ignore duplicate files when linting package
commit 3e808f1
Author: Russell Bryant rbryant@redhat.com
Date: Wed Jul 17 13:16:18 2024 -0400
commit 6acb4aa
Author: Russell Bryant rbryant@redhat.com
Date: Wed Jul 17 16:49:05 2024 -0400