Introduce v3 schema #39

Merged
merged 2 commits into from
Jul 22, 2024

russellb
Member

@russellb russellb commented Jul 17, 2024

3e808f1 Introduce v3 schema
6acb4aa ci: Ignore duplicate files when linting package

commit 3e808f1
Author: Russell Bryant rbryant@redhat.com
Date: Wed Jul 17 13:16:18 2024 -0400

Introduce v3 schema

Closes #38

v3 includes some backwards incompatible changes to the knowledge
schema format. Here is a diff against v2. The changes are:

- Q&A pairs now have an associated context blob from the knowledge
  document.

- There is a new `document_outline` field.

- The `task_description` field is dropped.

```diff
--- src/instructlab/schema/v2/knowledge.json    2024-07-17 12:56:37
+++ src/instructlab/schema/v3/knowledge.json    2024-07-18 13:21:59
@@ -6,9 +6,9 @@
     "required": [
         "created_by",
         "domain",
-        "task_description",
         "seed_examples",
-        "document"
+        "document",
+        "document_outline"
     ],
     "unevaluatedProperties": false,
     "properties": {
@@ -27,15 +27,6 @@
                 "Pop culture"
             ]
         },
-        "task_description": {
-            "description": "A description of the task which is used in prompts to the teacher model during synthetic data generation. The description should be detailed and prescriptive to improve the teacher model's responses.",
-            "type": "string",
-            "minLength": 1,
-            "examples": [
-                "To teach a language model about softball history",
-                "To teach a language model about tabby cats"
-            ]
-        },
         "seed_examples": {
             "description": "An array of seed examples for synthetic data generation.",
             "type": "array",
@@ -44,20 +35,39 @@
             "items": {
                 "type": "object",
                 "required": [
-                    "question",
-                    "answer"
+                    "context",
+                    "questions_and_answers"
                 ],
                 "unevaluatedProperties": false,
                 "properties": {
-                    "question": {
-                        "description": "A question used for synthetic data generation.",
+                    "context": {
+                        "description": "Context from the document associated with this set of sample q&a pairs.",
                         "type": "string",
                         "minLength": 1
                     },
-                    "answer": {
-                        "description": "The desired response for the question.",
-                        "type": "string",
-                        "minLength": 1
+                    "questions_and_answers": {
+                        "type": "array",
+                        "minItems": 3,
+                        "uniqueItems": true,
+                        "items": {
+                            "type": "object",
+                            "required": [
+                                "question",
+                                "answer"
+                            ],
+                            "properties": {
+                                "question": {
+                                    "description": "A question used for synthetic data generation.",
+                                    "type": "string",
+                                    "minLength": 1
+                                },
+                                "answer": {
+                                    "description": "The desired response for the question.",
+                                    "type": "string",
+                                    "minLength": 1
+                                }
+                            }
+                        }
                     }
                 }
             }
@@ -104,6 +114,14 @@
                     }
                 }
             }
+        },
+        "document_outline": {
+            "description": "A brief summary of the document.",
+            "type": "string",
+            "minLength": 1,
+            "examples": [
+                "Overview of Human tonsils, describing their types, locations, structure, function, and clinical significance, with a specific focus on their role in the immune system and related health issues."
+            ]
         }
     }
 }
```

Signed-off-by: Russell Bryant <rbryant@redhat.com>
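
To make the new `seed_examples` shape concrete, here is a minimal sketch that checks one seed example against the constraints visible in the diff above (`context`, `questions_and_answers`, `minItems: 3`, `uniqueItems`, no extra keys on the seed example). It is a hand-rolled illustration using only the standard library, not the validator that `ilab` actually uses; the function name `check_seed_example` and the sample data are hypothetical.

```python
import json


def check_seed_example(example):
    """Return a list of problems for one v3 knowledge seed example.

    Hypothetical helper: it mirrors only the constraints shown in the
    diff and is not a substitute for real JSON Schema validation.
    """
    problems = []

    # "unevaluatedProperties": false on the seed example object.
    extra = sorted(set(example) - {"context", "questions_and_answers"})
    if extra:
        problems.append(f"unexpected keys: {extra}")

    # "context" is required and must be a non-empty string.
    context = example.get("context")
    if not isinstance(context, str) or len(context) < 1:
        problems.append("context must be a non-empty string")

    # "questions_and_answers" is required and must be an array.
    qna = example.get("questions_and_answers")
    if not isinstance(qna, list):
        problems.append("questions_and_answers must be an array")
        return problems

    # "minItems": 3
    if len(qna) < 3:
        problems.append("questions_and_answers needs at least 3 items")

    # "uniqueItems": true (approximated by comparing canonical JSON text).
    serialized = [json.dumps(pair, sort_keys=True) for pair in qna]
    if len(set(serialized)) != len(serialized):
        problems.append("questions_and_answers items must be unique")

    # Each item requires a non-empty "question" and "answer".
    for i, pair in enumerate(qna):
        for key in ("question", "answer"):
            value = pair.get(key) if isinstance(pair, dict) else None
            if not isinstance(value, str) or len(value) < 1:
                problems.append(f"item {i}: {key} must be a non-empty string")

    return problems


# A v3-style seed example with placeholder text.
good_example = {
    "context": "A passage copied from the knowledge document.",
    "questions_and_answers": [
        {"question": "First question?", "answer": "First answer."},
        {"question": "Second question?", "answer": "Second answer."},
        {"question": "Third question?", "answer": "Third answer."},
    ],
}

# A v2-style seed example (flat question/answer), which v3 rejects.
v2_style_example = {"question": "A question?", "answer": "An answer."}

print(check_seed_example(good_example))
print(check_seed_example(v2_style_example))
```

The second call shows why the change is backwards incompatible: a flat v2 question/answer pair has neither `context` nor `questions_and_answers`, and its `question` and `answer` keys are no longer allowed at that level.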

commit 6acb4aa
Author: Russell Bryant rbryant@redhat.com
Date: Wed Jul 17 16:49:05 2024 -0400

ci: Ignore duplicate files when linting package

We use a github action:

  https://github.com/hynek/build-and-inspect-python-package?tab=readme-ov-file

which runs a script to lint the python wheel it builds:

  https://pypi.org/project/check-wheel-contents/

This tool is warning about duplicate files between v2 and v3. The
skills schema is intentionally unchanged, and we still want a copy in
both places.

Signed-off-by: Russell Bryant <rbryant@redhat.com>
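
For readers unfamiliar with this kind of lint, the following sketch shows one way duplicate files can be detected: hash each file's contents and group paths that share a digest. It is only an illustration of the idea and makes no claim about how check-wheel-contents is implemented; the function name `find_duplicate_files`, the file names, and the file contents are all hypothetical.

```python
import hashlib
import tempfile
from collections import defaultdict
from pathlib import Path


def find_duplicate_files(root):
    """Group files under root that have byte-identical contents."""
    root = Path(root)
    by_digest = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_digest[digest].append(path.relative_to(root).as_posix())
    # Keep only digests shared by more than one file.
    return [paths for paths in by_digest.values() if len(paths) > 1]


# Build a throwaway tree where the skills schema is identical in v2 and v3
# but the knowledge schema differs (placeholder contents, not real schemas).
with tempfile.TemporaryDirectory() as tmp:
    tree = Path(tmp)
    for version in ("v2", "v3"):
        (tree / version).mkdir()
        (tree / version / "compositional_skills.json").write_text('{"same": true}')
    (tree / "v2" / "knowledge.json").write_text('{"version": 2}')
    (tree / "v3" / "knowledge.json").write_text('{"version": 3}')
    duplicates = find_duplicate_files(tree)

print(duplicates)
```

In this toy tree only the two skills files are reported, which matches the situation described above: the copies are intentional, so the warning is ignored rather than fixed.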

Member

@aakankshaduggal aakankshaduggal left a comment

lgtm

@russellb
Member Author

russellb commented Jul 17, 2024

There is an example that works with the proposed schema here: https://github.com/russellb/taxonomy/tree/v3-example/knowledge/tonsils

❯ ilab taxonomy diff
knowledge/tonsils/qna.yaml
Taxonomy in /Users/rbryant/Library/Application Support/instructlab/taxonomy is valid :)

Contributor

@bjhargrave bjhargrave left a comment

We also need to include the compositional skills JSON. The taxonomy repo uses the latest schema version, so a missing compositional skills JSON will prevent contributing compositional skills.

It seems the main change here is to wrap some context around seed q/a. This now requires 5 examples with 3 q/a pairs. So now a total of 15 q/a pairs.

src/instructlab/schema/v3/knowledge.json
@russellb
Member Author

We also need to include the compositional skills JSON. The taxonomy repo uses the latest schema version, so a missing compositional skills JSON will prevent contributing compositional skills.

Done! I wasn't sure if I had to duplicate it even though it didn't change. That's fine though. I added it.

It seems the main change here is to wrap some context around seed q/a. This now requires 5 examples with 3 q/a pairs. So now a total of 15 q/a pairs.

That's right.
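
The arithmetic in this exchange can be sketched as follows. The minimum of 3 pairs per example comes from `"minItems": 3` in the diff; the minimum of 5 seed examples is taken from the review comment, since that constraint lies outside the diff hunks shown. The constant and function names are hypothetical.

```python
MIN_SEED_EXAMPLES = 5    # per the review comment; not visible in the diff hunks
MIN_QNA_PER_EXAMPLE = 3  # "minItems": 3 on questions_and_answers


def count_qna_pairs(seed_examples):
    """Total number of q/a pairs across all seed examples."""
    return sum(len(ex["questions_and_answers"]) for ex in seed_examples)


# Build the smallest contribution that satisfies both minimums,
# using placeholder text for every field.
minimal_seed_examples = [
    {
        "context": f"context {i}",
        "questions_and_answers": [
            {"question": f"question {i}.{j}", "answer": f"answer {i}.{j}"}
            for j in range(MIN_QNA_PER_EXAMPLE)
        ],
    }
    for i in range(MIN_SEED_EXAMPLES)
]

print(count_qna_pairs(minimal_seed_examples))  # 15
```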

@russellb russellb requested a review from bjhargrave July 17, 2024 21:07
@russellb russellb added this to the v0.3.0 milestone Jul 17, 2024
}
},
"document_outline": {
"description": "An outline of the document.",
Contributor

If the document_outline is used anywhere in the prompt to the teacher model for SDG, we must be more precise in the description about how it is used and what people should put here. Otherwise people will treat it as just a commentary field where they put something in to shut up the YAML validation.

See task_description in compositional skills.

Member Author

done, thanks!

Contributor

@bjhargrave bjhargrave left a comment

I think we need to expand/improve the descriptions, but we can do this in later PRs.

@russellb
Member Author

I think we need to expand/improve the descriptions, but we can do this in later PRs.

Thank you. Yes, agreed and understood. Doc updates are probably going to be needed in various repos too. I've got that on the todo list in instructlab/sdg#160

@russellb russellb merged commit fb82e0d into instructlab:main Jul 22, 2024
13 checks passed