Proposal for sch:view element, and Schematron View Vocabulary, for embedded notations, branches, richer iteration (updated @view-as proposal) #76

Open
rjelliffe opened this issue Jun 16, 2024 · 0 comments

rjelliffe commented Jun 16, 2024

(Updated)
This is a more worked-out version of the @view-as idea I suggested for issue "Checking for substrings of the node values #75". I am putting it here so that #75 can be considered on its own merits.

Use-Cases

  • It is ubiquitous that there are "embedded notations" in XML: for example, CSS, JavaScript, JSON fragments, URLs, mail addresses, email addresses, SKUs, etc. In Schematron, data in "embedded notations" relative to some @context location can be transduced (parsed or taken apart in some way, producing a parse tree of some description) into variables (of rules); these transducer results can in turn be used to construct other variables; and these can be iterated over by the @test of assertions which access the variables. But this causes two problems:
    -- first, we lose track in the assertions of the location information related to the original context, so the SVRL (or whatever) can only report in terms of the broader @context; this means poor diagnostics and communication with the human and automated targets of the validation;
    -- second, it makes it much more difficult to iterate over multiple values.

  • There are cases where we want to restrict our validation to branches or other ranges in the XML, for semantic, workflow or performance reasons. But this can complicate the XPaths and cause unnecessary tree traversals to check we are in the correct context.

  • The tree model of XDM allows iterating over e.g. children while still allowing access to shared context information (i.e., in ancestors) and to previous and subsequent items. However, when iterating over a list (e.g. a list of string values) it is difficult or impossible to do so. So there would be an advantage in a way to present lists (e.g. of strings) for validation (in a synthetic view) as items in a tree, to take fuller advantage of the expressive power of XPath.

  • Extensions such as Schematron Quick-Fix need to know exact locations in strings. It may be that such tools can take advantage where the string has been parsed more thoroughly to give exact positions.

Proposal
The proposal is to extend the sch:rule element with a child element sch:view-as. This allows information in the document locatable from the sch:rule/@context to be transformed into one or more new, independent XML "sub-documents": these then provide various "sub-contexts" for a group of assertions or reports over their "sub-document". This has two uses:

  • Validation after an XML-to-XML transformation (i.e. converting parts of document into some more tractable form)
  • Validation after a string-to-XML transformation (i.e. parsing/transducing an "embedded notation")

As part of this, some standard "wrapper functions", "wrapper elements" and names will be defined. These are called the Schematron View Vocabulary (SVV).

Elements
The content model for sch:rule (and sch:rule-set if adopted) is augmented from (something like)

     sch:rule = @context, ..., sch:title?, sch:para*,  
                     (sch:let | sch:assert |  sch:report |  %foreign;)*

to (something like)

    sch:rule = @context, @as, ... sch:title?, sch:para*, 
                    (sch:let | sch:assert |  sch:report | sch:view-as  |%foreign;)*
    sch:view-as  = @function, @context, @as, ... sch:title?, sch:para*,  
                    (sch:let | sch:assert |  sch:report |  %foreign;)*

for example:

   <sch:rule context="/html//style"  as="xs:string" >
      <sch:view-as  
                           using="parse-css(.)" 
                          new-context="/css/rule" 
                          as="element(rule)*">
           <sch:assert test="starts-with(@name, 'eg-corp')">
                            A CSS style name should start with 'eg-corp'</sch:assert>
            ...
       </sch:view-as>
   </sch:rule>

The reason for introducing an element here (rather than an @view-as attribute) is 1) so that there is no danger that a schema using this would be interpreted ignoring the @view-as by an existing system, 2) so that the grouping is established and visually distinct, 3) so that the fact that we are establishing a new context is explicit, 4) so that we can add whatever attributes are needed to make it work, and 5) so that the same rule context can also have other asserts and reports under it as usual.

The operation is that, for each node matched by sch:rule/@context, that context is processed by the function specified by @function (written @using in the example above), which produces an XDM document (the "new view" or "sub-document"). The @using can be any location XPath (absolute or relative to the current sch:rule/@context) but terminates with a function that produces "locatable XML" (definition given below). This is treated as a new XML document, and sch:view-as/@new-context specifies an XPath expression that is iterated over like normal rules: a rule or variable inside the sch:view-as that has an absolute path (such as "/") is in terms of the view document.

If the assertions of a view need to interrogate the outside context, they have access to the variables that are in scope at the sch:view-as element.

Schematron View Vocabulary
If a function does not produce an XDM document, the implementation must attempt to wrap the results in standard elements that allow it to be used. In particular, it should provide some mechanism where new standard functions sch:get-rule-location(element()) and get-starts-at(element()) are provided, which then make the relevant information available for location diagnostics. This allows a variety of possible implementations:

  • maintain a separate list of location mappings for the view to the original, e.g. keyed by generate-id().
  • insert processing instructions into the view giving the appropriate information
  • insert attributes into the view giving the appropriate information
  • have a parallel tree visible to the function with the mapping.

To explain, we first define some wrapper elements for a new SVV (Schematron View Vocabulary) namespace.

The vocabulary is:

  • svv:view - element for top-level wrapping
  • svv:item - element for other wrapping item
  • svv:get-starts-at(xs:string | node()) returns xs:string - a function that gets the string index of the current node in the view XML that corresponds to the position in the rule context. This index would be either a number (the character number starting with 1) or a cursor position (e.g. "1:0" for the start of the first line, "2:4" for the position between the fourth and fifth characters on the second line). If the context is not a string, then this will be an XPath.
  • svv:get-ends-at(xs:string | node()) returns xs:string - similar to the previous function, but for the end of the string.
  • svv:get-rule-location(node()) returns xs:string - a function that gets the XPath of the current rule location (i.e. the @location as provided by SVRL)

For example, if an implementation wants to allow the use of Xpath tokenize(), such as:

    <sch:rule context="some-tokens">
        <sch:view-as using="tokenize(., ' ')" new-context="//svv:item">
            <sch:report test=". = 'fred'">A token called fred was found
                     at character position <sch:value-of select="svv:get-starts-at(.)"/></sch:report>
        </sch:view-as>
    </sch:rule>

and the input

    <x><some-tokens>fred wilma ginger frog</some-tokens></x>

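then one way to implement it would be to wrap each token in an element and record its position (for example with embedded Processing Instructions, shown further below). Before any position information is added, such an implementation would create the view XML as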

   <svv:view>
        <svv:item>fred</svv:item>
        <svv:item>wilma</svv:item>
        <svv:item>ginger</svv:item>
        <svv:item>frog</svv:item> 
  </svv:view>

In the case of the function analyze-string(), the result is already an XML tree, so there is no need to add any wrappers. (If analyze-string() does not produce a document node, the implementation would need to add that at the top.)

Predefined Names
The implementation needs to provide a way to map from nodes found in the sub-document to nodes (and positions) in the original document. Otherwise, the user or using application will not get very good or specific diagnostics and hints, as e.g. QuickFix might require.

The implementation can work out the index of a token by summing the string-lengths of the preceding tokens and delimiters (plus one, for a 1-based index). To support such an implementation the standard will define the following names (for use in PIs or attributes):

  • svv:starts-at giving either an index or position pair ("1:0") into a string of the start character or position (if the function expects a string or text() node as input) or a constructed XPath (if the function works on nodes).
  • svv:ends-at giving either an index or position pair ("1:0") into a string of the end character or end position (if the function expects a string or text() node as input) or a constructed XPath (if the function works on nodes).
  • svv:rule-location giving a constructed XPath for the location of the current rule's context

I propose using PIs here. However, because some people have a phobia of Processing Instructions even where they simplify life, this proposal could be adopted using attributes instead, e.g. @svv:rule-location and @svv:starts-at. However, madness is everywhere, why fight it? Using the functions svv:get-starts-at() and svv:get-rule-location() to access the information makes it an implementation decision whether to use embedded markup (attributes or PIs) or external markup (generate-id() mapping lists, parallel trees). However, I don't think the external markup is feasible, because it is not straightforward to integrate custom-defined xsl:function definitions with it.

The view implemented using PIs (probably my preferred option, all things considered) would look like:

   <svv:view><?svv rule-location="/x[1]/some-tokens[1]"?>
        <svv:item><?svv starts-at="1"?>fred</svv:item>
        <svv:item><?svv starts-at="6"?>wilma</svv:item>
        <svv:item><?svv starts-at="12"?>ginger</svv:item>
        <svv:item><?svv starts-at="19"?>frog</svv:item> 
  </svv:view>
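
The starts-at bookkeeping for a tokenized string can be sketched outside XSLT. The following Python fragment is only an illustration and not part of the proposal: the helper name token_offsets is hypothetical, and it assumes a literal, non-empty delimiter rather than the regular expression that XPath tokenize() accepts.

```python
def token_offsets(text, delimiter=" "):
    """Return (token, starts_at) pairs using 1-based character indexes.

    Hypothetical helper: assumes a literal (non-regex), non-empty delimiter.
    """
    offsets = []
    position = 1  # 1-based index of the next token's first character
    for token in text.split(delimiter):
        offsets.append((token, position))
        # Step over this token and the delimiter that follows it.
        position += len(token) + len(delimiter)
    return offsets


for token, starts_at in token_offsets("fred wilma ginger frog"):
    print(token, starts_at)
```

For the sample input, the three tokens and three spaces before "frog" occupy 18 characters, so "frog" starts at character 19.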

In the case of analyze-string(), the function returns XML so no wrapper elements are needed. However, it does not provide starts-at values, so the implementation needs to sum the lengths of the terminal text nodes under the "match" and "non-match" elements (excluding whitespace under "group" elements) that precede each node.
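
As a rough illustration of that counting, here is a Python sketch that produces match and non-match segments together with their 1-based start positions. It is an assumption-laden stand-in and not the proposed implementation: the helper name analyze_with_offsets is hypothetical, it uses Python's re syntax rather than XPath regular expressions, it computes positions from the regex matches directly instead of walking a result tree, and it ignores groups.

```python
import re


def analyze_with_offsets(text, pattern):
    """Split text into match / non-match segments with 1-based start indexes.

    Hypothetical helper loosely mirroring the shape of an analyze-string
    result; regex groups are ignored in this sketch.
    """
    segments = []
    cursor = 0  # 0-based offset of the first character not yet emitted
    for m in re.finditer(pattern, text):
        if m.start() == m.end():
            continue  # skip zero-length matches
        if m.start() > cursor:
            # Text between the previous match and this one.
            segments.append(("non-match", text[cursor:m.start()], cursor + 1))
        segments.append(("match", m.group(0), m.start() + 1))
        cursor = m.end()
    if cursor < len(text):
        # Trailing text after the last match.
        segments.append(("non-match", text[cursor:], cursor + 1))
    return segments


for kind, value, starts_at in analyze_with_offsets("ab12cd345", r"\d+"):
    print(kind, value, starts_at)
```

Each segment's start position equals one plus the total length of the segments before it, which is the same sum an implementation would compute over the preceding text nodes.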

QLB
So it can be seen that in some cases (such as tokenize() and analyze-string()) it is straightforward to support a function. However, some functions may be difficult or have modes that present a challenge.

Consequently, the standard should specify whether:

  1. The QLB (query language binding) specifies the functions allowed to be used in that query language (or, better, provides wrapped versions of these that wrap the output appropriately and add positional information)

  2. The QLB specifies the functions required to be supported (or provides wrapped versions), where the use of other functions will result in a missing or bogus @starts-at, or an @starts-at expressed in other terms (such as the /svv:view/svv:item/position())

  3. Leave it all open for future specification

  4. Require an implementation to support 1. as the default, but allow the user to select 2. (or 3.) This is my preferred option, as the safest but not preventing the user from degrading their results if they choose.

  5. The method of implementation (mapping table, PI, attribute, parallel tree) should not be specified in the standard, however, the standard PI/attribute names should be defined to support developers.

Function Wrappers or Detection
Wrapper-functions: The best approach is that the QLB would define the names of wrapper functions for standard functions that can be used. For example svv:tokenize(), svv:analyse-string(), svv:json-to-xml() and so on. These functions would act the same as the versions in the query language, and so would not need special documentation, but would produce XML documents as output, doing the wrapping and starts-at/ends-at augmentation. Under this method, there would be no need for inspecting the @using attribute for regex signatures.

This would look like:

    <sch:rule context="some-tokens">
        <sch:view-as using="svv:tokenize(., ' ')" new-context="//svv:item">
            <sch:report test=". = 'fred'">A token called fred was found
                     at character position <sch:value-of select="svv:get-starts-at(.)"/></sch:report>
        </sch:view-as>
    </sch:rule>

Registry: An alternate approach would be that the implementation has a registered list of signature regexes. If a regex matches, it knows to use that function and then to apply whatever fix-ups are necessary to provide the SVV information. E.g. the regex [^A-Za-z]tokenize\s*\( would determine that the tokenize() function was in use. It would probably be better to adopt some rule of thumb that the first-positioned function signature that matches is the one used, so that a nested call such as tokenize(replace(., '.', '_')) is treated as tokenize().
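
A minimal Python sketch of such a registry follows, to make the "first-positioned signature wins" rule concrete. Everything in it is an assumption for illustration: the SIGNATURES table and the detect_function name are hypothetical, and the regexes use a lookbehind instead of the leading [^A-Za-z] so that a call at the very start of the @using value is also detected.

```python
import re

# Hypothetical registry: function name -> regex for its call signature.
# The lookbehind rejects names that merely end in the function name,
# such as my-tokenize(.
SIGNATURES = {
    "tokenize": re.compile(r"(?<![\w-])tokenize\s*\("),
    "analyze-string": re.compile(r"(?<![\w-])analyze-string\s*\("),
    "json-to-xml": re.compile(r"(?<![\w-])json-to-xml\s*\("),
}


def detect_function(using):
    """Return the registered function whose signature occurs first, or None."""
    best = None  # (name, start offset) of the earliest match so far
    for name, regex in SIGNATURES.items():
        m = regex.search(using)
        if m and (best is None or m.start() < best[1]):
            best = (name, m.start())
    return best[0] if best else None


print(detect_function("tokenize(replace(., '.', '_'))"))
print(detect_function("string(.)"))
```

Choosing the earliest match means the outermost call of a simple nested expression decides which fix-ups are applied; an expression that matches nothing would fall through to whatever default the QLB specifies.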

Support functions
The functions svv:get-starts-at(), svv:get-ends-at() and svv:get-rule-location() have been mentioned above. I believe a good starting list for the XPath 3 QLB would be svv:analyse-string(), svv:tokenize() and svv:json-to-xml(). (Are there others?)

It may be that some features in other proposals for Schematron 2025 could be implemented using this mechanism.

For example, I made various proposals for efficiency to restrict items to branches. If we had a function that takes the current context (element) and makes a new document starting there (let's call this function sch:view-branch-as-document()) then we could have

  <sch:rule context="chapter" as="element()">
       <sch:view-as function="sch:view-branch-as-document(.)"  context="//footnote-ref">
             <sch:assert test="not(preceding::footnote-ref/@ref = @ref)">There should not be more than one reference 
              to a footnote in a chapter</sch:assert>
     </sch:view-as>
  </sch:rule>

In this example, we trim the document down to just the single chapter element, which then simplifies the Xpaths for the @test which do not need to worry about overstepping the chapter bounds. And it does not extend the Schematron language, but instead uses the QLB.

So if this proposal is adopted (mutatis mutandis) then part of the exercise would be to see if other proposals can be implemented in some useful part by defining some simple tree transformation. This would also provide a way to enhance the value of xsl:function definitions: the sch:view-as/@function function could be such a custom function, and if the function produces nodes in the appropriate form (i.e. with svv:starts-at and svv:rule-location PIs/attributes) then they would play well and be insulated from the output form (e.g. SVRL and whatever the interface with OxygenXML and QuickFixes uses).

Turning off
The implementation could provide an option to turn off validation of views. In any case, the @function attribute can always reference an enabling variable or parameter, such as function="tokenize(self::node()[$ENABLE-SUBPARSING], '|')"

SVRL
The output of the SVRL should be enhanced as needed, to include information that a view has been made (or that an empty view was made).

Locatable XPath
This is just an XPath fragment that can be simply appended to the svrl:fired-rule/@location to get the location of the sch:view-as/@sub-context.

The issue of how to specify ranges of text in SVRL might need to be addressed. Unless my knowledge is stale, XPath does not have a good method to specify text subranges. (I gather that XPath 4.0 may improve things, if it gets "slices". But I may have my wires crossed.)
