Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add dtype_numpy with medium-effort validation #315

Closed
wants to merge 2 commits into from

Conversation

danielballan
Copy link
Member

Add dtype_numpy to Event Descriptor.

Supersedes #215.

@danielballan
Copy link
Member Author

Running the schema generation seems to remove NX_class and not add dtype_numpy. I'm confused.

$ git diff
diff --git a/event_model/schemas/event_descriptor.json b/event_model/schemas/event_descriptor.json
index 3549af5..5958023 100644
--- a/event_model/schemas/event_descriptor.json
+++ b/event_model/schemas/event_descriptor.json
@@ -112,12 +112,6 @@
             "description": "The 'interesting' data keys for this device.",
             "type": "object",
             "properties": {
-                "NX_class": {
-                    "title": "Nx Class",
-                    "description": "The NeXus class definition for this device.",
-                    "type": "string",
-                    "pattern": "^NX[A-Za-z_]+$"
-                },
                 "fields": {
                     "title": "Fields",
                     "description": "The 'interesting' data keys for this device.",

@evalott100
Copy link
Contributor

evalott100 commented Sep 23, 2024

Running the schema generation seems to remove NX_class and not add dtype_numpy. I'm confused.

I'll take a look!

@danielballan
Copy link
Member Author

Thank you, @evalott100!

@evalott100
Copy link
Contributor

This is very confusing - everything is working as it should on my end (copying that dtype_numpy field directly):

diff --git a/event_model/documents/event_descriptor.py b/event_model/documents/event_descriptor.py
index 3fa799a..8e565d4 100644
--- a/event_model/documents/event_descriptor.py
+++ b/event_model/documents/event_descriptor.py
@@ -1,4 +1,4 @@
-from typing import Any, Dict, List, Optional
+from typing import Any, Dict, List, Optional, Tuple, Union
 
 from typing_extensions import Annotated, Literal, NotRequired, TypedDict
 
@@ -23,6 +23,19 @@ class DataKey(TypedDict):
         Dtype,
         Field(description="The type of the data in the event."),
     ]
+    dtype_numpy: NotRequired[
+        Annotated[
+            Union[
+                str,  # e.g. "<u4",
+                List[Tuple[str, str]],  # e.g. [("a", "<u4"), ("b", "<f8")]
+            ],
+            Field(
+                description="The type of the data in the event, given as a numpy dtype string (or, for structured dtypes, array)",
+                pattern="[|<>][tbiufcmMOSUV][0-9]+",
+            ),
+            # TODO Enforce the regex pattern.
+        ]
+    ]
     external: NotRequired[
         Annotated[
             str,
diff --git a/event_model/schemas/datum_page.json b/event_model/schemas/datum_page.json
index a09dc59..7d34f98 100644
--- a/event_model/schemas/datum_page.json
+++ b/event_model/schemas/datum_page.json
@@ -14,11 +14,7 @@
     "properties": {
         "datum_id": {
             "description": "Array unique identifiers for each Datum (akin to 'uid' for other Document types), typically formatted as '<resource>/<integer>'",
-            "allOf": [
-                {
-                    "$ref": "#/$defs/DataFrameForDatumPage"
-                }
-            ]
+            "$ref": "#/$defs/DataFrameForDatumPage"
         },
         "datum_kwargs": {
             "title": "Datum Kwargs",
diff --git a/event_model/schemas/event_descriptor.json b/event_model/schemas/event_descriptor.json
index 3549af5..5195a23 100644
--- a/event_model/schemas/event_descriptor.json
+++ b/event_model/schemas/event_descriptor.json
@@ -52,6 +52,31 @@
                         "integer"
                     ]
                 },
+                "dtype_numpy": {
+                    "title": "Dtype Numpy",
+                    "description": "The type of the data in the event, given as a numpy dtype string (or, for structured dtypes, array)",
+                    "anyOf": [
+                        {
+                            "type": "string"
+                        },
+                        {
+                            "items": {
+                                "maxItems": 2,
+                                "minItems": 2,
+                                "prefixItems": [
+                                    {
+                                        "type": "string"
+                                    },
+                                    {
+                                        "type": "string"
+                                    }
+                                ],
+                                "type": "array"
+                            },
+                            "type": "array"
+                        }
+                    ]
+                },
                 "external": {
                     "title": "External",
                     "description": "Where the data is stored if it is stored external to the events",

... FURTHER REMOVAL OF UNNECESSARY allOf

@evalott100
Copy link
Contributor

And I get the same correct output as above on your branch directly (with python 3.10/3.11) $ python event_model/documents/generate 🤔

Were you using an older venv?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the regex work? If would be interesting to see a couple of examples (just string and structured dtype) that fail validation

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to guess the regex does nothing, as it doesn't appear in the schema. @evalott100 can you think of a way round this? We want it to be Annotated[str, something_that_checks_for_regex] | List[Tuple[str, Annotated[str, something_that_checks_for_regex], but I don't know if you can do that with pydantic...

Copy link
Contributor

@evalott100 evalott100 Sep 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In pydantic 2 they want to use Field instead of the constr wrapper in the type itself... which is annoying. The following should work (I'll try tomorrow).

_ConstrainedDtype = Annotated[str, Field(pattern="[|<>][tbiufcmMOSUV][0-9]+")]

numpy_dtype: Union[
    _ConstrainedDtype,
    List[Tuple[str, _ConstrainedDtype]]
]

Copy link
Contributor

@evalott100 evalott100 Sep 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, constr just does this directly:

https://github.com/pydantic/pydantic/blob/7cedbfb03df82ac55c844c97e6f975359cb51bb9/pydantic/types.py#L810-L821

    return Annotated[  # pyright: ignore[reportReturnType]
        str,
        StringConstraints(
            strip_whitespace=strip_whitespace,
            to_upper=to_upper,
            to_lower=to_lower,
            strict=strict,
            min_length=min_length,
            max_length=max_length,
            pattern=pattern,
        ),
    ]

so I'd be surprised if the above doesn't work (perhaps swapping Field to StringConstraints from pydantic.types.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup this works:

_ConstrainedDtype = Annotated[str, Field(pattern="[|<>][tbiufcmMOSUV][0-9]+")]

class MyModel(BaseModel):
    numpy_dtype: Union[
        _ConstrainedDtype,
        List[Tuple[str, _ConstrainedDtype]]
    ]

errors = []
for dtype in (
    "<u4", # Passes
    ">u4", # Passes
    ">S1000", # Passes
    ">uS4", # Fails
    "u4", # Fails
    "u", # Fails
    ">4u", # Fails
    [("hi", "<U4"), ("hi", "<U4")], # Passes
    [("hi", "<U4"), ("hi", ">U4")], # Passes
    [("hi", "<U4"), ("hi", ">S1000")], # Passes
    [("hi", "<U4"), ("hi", ">uS4")], # Fails
    [("hi", "<U4"), ("hi", "u4")], # Fails
    [("hi", "<U4"), ("hi", "u")], # Fails
    [("hi", "<U4"), ("hi", ">4u")], # Fails
):
    try:
        model = MyModel(numpy_dtype=dtype)
        print(model)
    except ValueError as e:
        errors.append(e)
[2 validation errors for MyModel
numpy_dtype.constrained-str
  String should match pattern '[|<>][tbiufcmMOSUV][0-9]+' [type=string_pattern_mismatch, input_value='>uS4', input_type=str]
    For further information visit https://errors.pydantic.dev/2.5/v/string_pattern_mismatch
numpy_dtype.list[tuple[str, constrained-str]]
  Input should be a valid list [type=list_type, input_value='>uS4', input_type=str]
    For further information visit https://errors.pydantic.dev/2.5/v/list_type,
 2 validation errors for MyModel
numpy_dtype.constrained-str
  String should match pattern '[|<>][tbiufcmMOSUV][0-9]+' [type=string_pattern_mismatch, input_value='u4', input_type=str]
    For further information visit https://errors.pydantic.dev/2.5/v/string_pattern_mismatch
numpy_dtype.list[tuple[str, constrained-str]]
  Input should be a valid list [type=list_type, input_value='u4', input_type=str]
    For further information visit https://errors.pydantic.dev/2.5/v/list_type,
 2 validation errors for MyModel
numpy_dtype.constrained-str
  String should match pattern '[|<>][tbiufcmMOSUV][0-9]+' [type=string_pattern_mismatch, input_value='u', input_type=str]
    For further information visit https://errors.pydantic.dev/2.5/v/string_pattern_mismatch
numpy_dtype.list[tuple[str, constrained-str]]
  Input should be a valid list [type=list_type, input_value='u', input_type=str]
    For further information visit https://errors.pydantic.dev/2.5/v/list_type,
 2 validation errors for MyModel
numpy_dtype.constrained-str
  String should match pattern '[|<>][tbiufcmMOSUV][0-9]+' [type=string_pattern_mismatch, input_value='>4u', input_type=str]
    For further information visit https://errors.pydantic.dev/2.5/v/string_pattern_mismatch
numpy_dtype.list[tuple[str, constrained-str]]
  Input should be a valid list [type=list_type, input_value='>4u', input_type=str]
    For further information visit https://errors.pydantic.dev/2.5/v/list_type,
 2 validation errors for MyModel
numpy_dtype.constrained-str
  Input should be a valid string [type=string_type, input_value=[('hi', '<U4'), ('hi', '>uS4')], input_type=list]
    For further information visit https://errors.pydantic.dev/2.5/v/string_type
numpy_dtype.list[tuple[str, constrained-str]].1.1
  String should match pattern '[|<>][tbiufcmMOSUV][0-9]+' [type=string_pattern_mismatch, input_value='>uS4', input_type=str]
    For further information visit https://errors.pydantic.dev/2.5/v/string_pattern_mismatch,
 2 validation errors for MyModel
numpy_dtype.constrained-str
  Input should be a valid string [type=string_type, input_value=[('hi', '<U4'), ('hi', 'u4')], input_type=list]
    For further information visit https://errors.pydantic.dev/2.5/v/string_type
numpy_dtype.list[tuple[str, constrained-str]].1.1
  String should match pattern '[|<>][tbiufcmMOSUV][0-9]+' [type=string_pattern_mismatch, input_value='u4', input_type=str]
    For further information visit https://errors.pydantic.dev/2.5/v/string_pattern_mismatch,
 2 validation errors for MyModel
numpy_dtype.constrained-str
  Input should be a valid string [type=string_type, input_value=[('hi', '<U4'), ('hi', 'u')], input_type=list]
    For further information visit https://errors.pydantic.dev/2.5/v/string_type
numpy_dtype.list[tuple[str, constrained-str]].1.1
  String should match pattern '[|<>][tbiufcmMOSUV][0-9]+' [type=string_pattern_mismatch, input_value='u', input_type=str]
    For further information visit https://errors.pydantic.dev/2.5/v/string_pattern_mismatch,
 2 validation errors for MyModel
numpy_dtype.constrained-str
  Input should be a valid string [type=string_type, input_value=[('hi', '<U4'), ('hi', '>4u')], input_type=list]
    For further information visit https://errors.pydantic.dev/2.5/v/string_type
numpy_dtype.list[tuple[str, constrained-str]].1.1
  String should match pattern '[|<>][tbiufcmMOSUV][0-9]+' [type=string_pattern_mismatch, input_value='>4u', input_type=str]
    For further information visit https://errors.pydantic.dev/2.5/v/string_pattern_mismatch]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@evalott100 do you have time to turn that into a test and finish off this PR? I'm guessing @danielballan is busy at NOBUGS...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, can do now. I'm curious why Dan's schema generation wasn't working.

@evalott100 evalott100 mentioned this pull request Sep 30, 2024
@evalott100 evalott100 closed this Sep 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants