Introduce v3 schema #39

russellb · 2024-07-17T17:21:47Z

3e808f1 Introduce v3 schema
6acb4aa ci: Ignore duplicate files when linting package

commit 3e808f1
Author: Russell Bryant <rbryant@redhat.com>
Date: Wed Jul 17 13:16:18 2024 -0400

Introduce v3 schema

Closes #38

v3 includes some backwards-incompatible changes to the knowledge
schema format. Here is a diff against v2. The changes are:

- Q&A pairs now have an associated context blob from the knowledge
  document.

- There is a new `document_outline` field.

- The `task_description` field is dropped.

```diff
--- src/instructlab/schema/v2/knowledge.json    2024-07-17 12:56:37
+++ src/instructlab/schema/v3/knowledge.json    2024-07-18 13:21:59
@@ -6,9 +6,9 @@
     "required": [
         "created_by",
         "domain",
-        "task_description",
         "seed_examples",
-        "document"
+        "document",
+        "document_outline"
     ],
     "unevaluatedProperties": false,
     "properties": {
@@ -27,15 +27,6 @@
                 "Pop culture"
             ]
         },
-        "task_description": {
-            "description": "A description of the task which is used in prompts to the teacher model during synthetic data generation. The description should be detailed and prescriptive to improve the teacher model's responses.",
-            "type": "string",
-            "minLength": 1,
-            "examples": [
-                "To teach a language model about softball history",
-                "To teach a language model about tabby cats"
-            ]
-        },
         "seed_examples": {
             "description": "An array of seed examples for synthetic data generation.",
             "type": "array",
@@ -44,20 +35,39 @@
             "items": {
                 "type": "object",
                 "required": [
-                    "question",
-                    "answer"
+                    "context",
+                    "questions_and_answers"
                 ],
                 "unevaluatedProperties": false,
                 "properties": {
-                    "question": {
-                        "description": "A question used for synthetic data generation.",
+                    "context": {
+                        "description": "Context from the document associated with this set of sample q&a pairs.",
                         "type": "string",
                         "minLength": 1
                     },
-                    "answer": {
-                        "description": "The desired response for the question.",
-                        "type": "string",
-                        "minLength": 1
+                    "questions_and_answers": {
+                        "type": "array",
+                        "minItems": 3,
+                        "uniqueItems": true,
+                        "items": {
+                            "type": "object",
+                            "required": [
+                                "question",
+                                "answer"
+                            ],
+                            "properties": {
+                                "question": {
+                                    "description": "A question used for synthetic data generation.",
+                                    "type": "string",
+                                    "minLength": 1
+                                },
+                                "answer": {
+                                    "description": "The desired response for the question.",
+                                    "type": "string",
+                                    "minLength": 1
+                                }
+                            }
+                        }
                     }
                 }
             }
@@ -104,6 +114,14 @@
                     }
                 }
             }
+        },
+        "document_outline": {
+            "description": "A brief summary of the document.",
+            "type": "string",
+            "minLength": 1,
+            "examples": [
+                "Overview of Human tonsils, describing their types, locations, structure, function, and clinical significance, with a specific focus on their role in the immune system and related health issues."
+            ]
         }
     }
 }
```

Signed-off-by: Russell Bryant <rbryant@redhat.com>
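
The v3 shape described in the diff above can be illustrated with a minimal sketch. This is not the project's validator and does not load the real `knowledge.json`. It hand-checks only the rules visible in the diff, using the standard library. The function name `check_v3_knowledge`, the `EXAMPLE_V3` document, and all placeholder values are assumptions made for illustration. The `uniqueItems` rule is approximated by comparing `(question, answer)` tuples.

```python
from typing import Any

# Top-level fields listed under "required" in the v3 diff.
REQUIRED_TOP_LEVEL = ("created_by", "domain", "seed_examples", "document", "document_outline")


def _non_empty_str(value: Any) -> bool:
    """Mirror of {"type": "string", "minLength": 1}."""
    return isinstance(value, str) and len(value) >= 1


def check_v3_knowledge(doc: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the checked rules pass."""
    problems = [f"missing required field: {key}" for key in REQUIRED_TOP_LEVEL if key not in doc]
    # v3 removes task_description and the object disallows unevaluated properties.
    if "task_description" in doc:
        problems.append("task_description was dropped in v3")
    if "document_outline" in doc and not _non_empty_str(doc["document_outline"]):
        problems.append("document_outline must be a non-empty string")
    for i, example in enumerate(doc.get("seed_examples", [])):
        if not _non_empty_str(example.get("context")):
            problems.append(f"seed_examples[{i}]: context must be a non-empty string")
        pairs = example.get("questions_and_answers", [])
        if len(pairs) < 3:
            problems.append(f"seed_examples[{i}]: need at least 3 questions_and_answers")
        # Approximation of uniqueItems: no two pairs share the same question and answer.
        if len({(p.get("question"), p.get("answer")) for p in pairs}) != len(pairs):
            problems.append(f"seed_examples[{i}]: duplicate question/answer pair")
        for j, pair in enumerate(pairs):
            for key in ("question", "answer"):
                if not _non_empty_str(pair.get(key)):
                    problems.append(f"seed_examples[{i}][{j}]: {key} must be a non-empty string")
    return problems


# Hypothetical document with placeholder values. The real schema constrains more
# than this sketch checks (for example the contents of "document").
EXAMPLE_V3: dict[str, Any] = {
    "created_by": "placeholder-user",
    "domain": "placeholder_domain",
    "document": {},
    "document_outline": "Placeholder outline of the document.",
    "seed_examples": [
        {
            "context": "Placeholder passage copied from the knowledge document.",
            "questions_and_answers": [
                {"question": f"Placeholder question {n}?", "answer": f"Placeholder answer {n}."}
                for n in range(1, 4)
            ],
        }
    ],
}

if __name__ == "__main__":
    print(check_v3_knowledge(EXAMPLE_V3))
```

Returning a list of problems rather than raising on the first one makes it easier to see every place where a v2-style document would need to change.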

commit 6acb4aa
Author: Russell Bryant <rbryant@redhat.com>
Date: Wed Jul 17 16:49:05 2024 -0400

ci: Ignore duplicate files when linting package

We use a GitHub Action:

  https://github.com/hynek/build-and-inspect-python-package?tab=readme-ov-file

which runs a script to lint the Python wheel it builds:

  https://pypi.org/project/check-wheel-contents/

This tool warns about duplicate files between v2 and v3. The skills
schema is intentionally unchanged, and we still want a copy in both
places.

Signed-off-by: Russell Bryant <rbryant@redhat.com>
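
The condition the linter warns about (identical files under both `v2` and `v3`) can be reproduced with a small sketch. This is not how check-wheel-contents is implemented. It only groups files by content hash using the standard library. The function names and the file names written by `demo()` are hypothetical.

```python
import hashlib
import tempfile
from collections import defaultdict
from pathlib import Path


def find_duplicate_files(root: Path) -> list[list[str]]:
    """Group files under root by SHA-256 of their bytes; return groups of two or more."""
    by_digest: dict[str, list[str]] = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_digest[digest].append(path.relative_to(root).as_posix())
    return sorted(group for group in by_digest.values() if len(group) > 1)


def demo() -> list[list[str]]:
    """Build a throwaway tree where the skills schema is copied unchanged into v3."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for version in ("v2", "v3"):
            (root / version).mkdir()
            # Same bytes in both versions: this is the duplicate the linter reports.
            (root / version / "compositional_skills.json").write_text('{"skills": true}')
            # The knowledge schema differs between versions, so it is not a duplicate.
            (root / version / "knowledge.json").write_text('{"version": "%s"}' % version)
        return find_duplicate_files(root)


if __name__ == "__main__":
    print(demo())
```

In this throwaway tree only the skills schema pair is reported, which matches the situation the commit describes: the duplicate is intentional, so the lint warning is ignored rather than fixed.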

src/instructlab/schema/v3/version.json

src/instructlab/schema/v3/knowledge.json

aakankshaduggal

lgtm

src/instructlab/schema/v3/knowledge.json

russellb · 2024-07-17T19:45:18Z

There is an example that works with the proposed schema here: https://github.com/russellb/taxonomy/tree/v3-example/knowledge/tonsils

❯ ilab taxonomy diff
knowledge/tonsils/qna.yaml
Taxonomy in /Users/rbryant/Library/Application Support/instructlab/taxonomy is valid :)

bjhargrave

We also need to include the compositional skills JSON. The taxonomy repo uses the latest schema version, so a missing compositional skills JSON will prevent contributing compositional skills.

It seems the main change here is to wrap some context around the seed q/a. This now requires 5 examples with 3 q/a pairs each, so a total of at least 15 q/a pairs.

src/instructlab/schema/v3/knowledge.json

russellb · 2024-07-17T20:58:56Z

> We also need to include the compositional skills JSON. The taxonomy repo uses the latest schema version, so a missing compositional skills JSON will prevent contributing compositional skills.

Done! I wasn't sure if I had to duplicate it even though it didn't change. That's fine though. I added it.

> It seems the main change here is to wrap some context around the seed q/a. This now requires 5 examples with 3 q/a pairs each, so a total of at least 15 q/a pairs.

That's right.
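
The arithmetic confirmed in this exchange can be written out as a short sketch. The constant and function names are hypothetical. The minimum of 3 pairs per example comes from `minItems` in the diff, while the minimum of 5 seed examples is taken from the review comment and is not visible in the diff itself.

```python
from typing import Any

MIN_SEED_EXAMPLES = 5   # per the review comment; assumed, not shown in the diff
MIN_QA_PER_EXAMPLE = 3  # "minItems": 3 on questions_and_answers in the diff


def min_total_qa_pairs() -> int:
    """Smallest number of q/a pairs a v3 knowledge file can contain."""
    return MIN_SEED_EXAMPLES * MIN_QA_PER_EXAMPLE


def count_qa_pairs(doc: dict[str, Any]) -> int:
    """Count q/a pairs across all seed examples of a v3-shaped document."""
    return sum(len(ex.get("questions_and_answers", [])) for ex in doc.get("seed_examples", []))


if __name__ == "__main__":
    print(min_total_qa_pairs())
```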

bjhargrave · 2024-07-18T10:14:22Z

src/instructlab/schema/v3/knowledge.json

+            }
+        },
+        "document_outline": {
+            "description": "An outline of the document.",


If the document_outline is used anywhere in the prompt to the teacher model for SDG, we must be more precise in the description about how it is used and what people should put here. Otherwise people will treat it as just a commentary field that they fill with anything to shut up the YAML validation.

See task_description in compositional skills.

russellb · 2024-07-18

done, thanks!

bjhargrave

I think we need to expand/improve the descriptions, but we can do this in later PRs.

russellb · 2024-07-22T21:02:55Z

> I think we need to expand/improve the descriptions, but we can do this in later PRs.

Thank you. Yes, agreed and understood. Doc updates are going to be needed in various repos too, probably. I've got that on the todo list in instructlab/sdg#160

russellb mentioned this pull request Jul 17, 2024

[Epic] Support for v3 schema of knowledge taxonomy additions instructlab/sdg#160

Closed

27 tasks

russellb requested a review from bjhargrave July 17, 2024 17:22

russellb commented Jul 17, 2024

russellb commented Jul 17, 2024

russellb force-pushed the v3 branch 2 times, most recently from 5723d66 to 7be2c44 Compare July 17, 2024 19:14

aakankshaduggal reviewed Jul 17, 2024

russellb commented Jul 17, 2024

russellb force-pushed the v3 branch from 7be2c44 to 2be6938 Compare July 17, 2024 19:53

bjhargrave reviewed Jul 17, 2024

russellb force-pushed the v3 branch from 892b60a to 9865f4a Compare July 17, 2024 20:51

russellb requested a review from bjhargrave July 17, 2024 21:07

russellb added this to the v0.3.0 milestone Jul 17, 2024

bjhargrave reviewed Jul 18, 2024

russellb added 2 commits July 18, 2024 13:21

russellb force-pushed the v3 branch from 0e89c5e to 6acb4aa Compare July 18, 2024 17:22

russellb requested a review from bjhargrave July 18, 2024 17:23

russellb mentioned this pull request Jul 19, 2024

Add v3 knowledge schema support instructlab/sdg#161

Merged

bjhargrave approved these changes Jul 22, 2024

russellb merged commit fb82e0d into instructlab:main Jul 22, 2024
13 checks passed
