Skip to content

Pattern matching

Mwexim edited this page Dec 29, 2021 · 6 revisions

This is a detailed explanation of the approach to Skript parsing used in skript-parser's code. This document is meant as an introduction to the inner mechanics of skript-parser, hopefully encouraging contribution.

This document was made by Syst3ms and is shared on the wiki to provide further information to potential contributors.

Terminology used in this document :

  • the parsing chain : registration time, parse time and runtime
  • registration time : when syntaxes are registered
  • parse time : when scripts are parsed, after registration time
  • runtime : when scripts are ran, after parse time
  • Skript : the Skript Spigot plugin
  • skript-parser : this project
  • pattern parsing : converting a pattern string into a pattern object
  • application : a piece of text written inside of a script that is meant to match a registered syntax
  • literal : parse time-constant, whose value can always be directly obtained from the corresponding string.
  • variable : similar to other languages' variables ; they are of the form {...}, but their precise description is not the aim of this document.
  • conditional expression : a boolean expression specifically made to be used as a condition. They are unlike Skript's condition in the sense that they can be used as expressions.

Registration time

The parsing chain starts with the registration of a syntax. Skript doesn't do anything notable here, and simply adds the string to its list of registered syntaxes. This is where the first optimization starts. Instead of simply registering the raw string and having to deal with all the quirks of string matching at parse time, skript-parser first converts the given string into a code-usable object, which I'll refer to as the pattern object.

The aim of pattern parsing is obviously to make parse time more efficient, but it also gives a chance to skript-parser to check the validity of registered syntaxes in an in-depth way before parse time.

A pattern object can be one out of multiple pattern elements :

  1. Text element: the most basic pattern element, simply a string of text.
  2. Compound element: a simple succession of other pattern elements.
  3. Optional group : indicated by square brackets (ex: [optional]). The other pattern element contained inside an optional group can be omitted when writing an application. It usually serves as a way to give more variety to the syntax, though it can be used to add optional parameters.
  4. Choice group: indicated by two parentheses, and each choice separated by a pipe | (ex: (choice|option)). When writing an application, either choice can be written. It also usually serves as a way to give more variety to the syntax. However, when the choice group is supposed to matter, one can use parse marks. Parse marks are indicated by an integer placed before a single choice, separated using a colon :. In the example (0:choice|1:option), the choices choice and option have respectively the parse marks 0 and 1. A parse mark can be omitted, to which case the choice's parse mark will be 0 by default.
  5. Regex group: indicated by angled brackets (ex: <regex>). Simply represents a regex pattern that will be tested at parse time. This is useful when the other pattern elements can't represent the desired syntax properly (usually when the desired result can only be modelled by regex quantifiers). This should only be used if it must be, as skript-parser handles the other pattern elements much more efficiently.
  6. Expression element: indicated by a pair of per cent signs (ex: %type%). Represents an expression of a given type that will have to be inputted when writing an application. The possible types are also only registered at registration time (the types guaranteed by skript-parser can be found further in the document). Yet, additional restrictions can be applied to expression elements :
    • %*type%: the application must use a literal in place of the element.
    • %~type%: the application must not use a literal in place of the element.
    • %^type%: the application must use a variable in place of the element.
    • %=boolean%: the application can use a conditional expression in place of this element.
    • %type?%: if this element is omitted in the application (due to an optional group for example), skript-parser will use the default expression provided by the given type.
    • %type1/type2% : allows the application to use an expression of either type. This can be combined with the previous restrictions, which would apply to all types indiscriminately.

Now that the different pattern elements have been presented, we can present the precise way they're used at parse time.

Parse time

After registration is over, the script must be parsed. This is when the object structure of patterns becomes very useful to make the matching process easy to code and to understand.

Each possible pattern element has its own semantics when being parsed :

  1. Text element: text elements are meant to be case- and whitespace-insensitive, meaning that two strings with extraneous whitespace and varying case should still match.
  2. Compound element: a compound element matches a string if all of its successive components match the string as well. If a single component fails to match, the entire compound element fails to match as well. Obviously, indices are handled such that each component is matched against the correct part of the string; this goes for every pattern element.
  3. Optional group: if the component of the optional group matches the string, indices are updated accordingly, and if it doesn't, the match doesn't fail and instead the indices simply don't update.
  4. Choice group: each choice is tested successively until one matches. The parse mark associated with the choice is XORed with a global parse mark (global in the sense if multiple parse marks are present anywhere in the pattern, they will be XORed with one another properly). If no choice matches, the choice group fails to match entirely.

For the next two pattern elements, a concept of "next possible input" is required. The next possible inputs are simply every pattern element that can possibly come directly after the current pattern element. If nothing comes after the element, the parser will know to look until the end of the string. Text elements with only whitespace are ignored when looking for next possible inputs. With the example of %type% [option] [(one|two)], the next possible inputs are the texts option, one, two or the end of the line. The next possible inputs that are text elements are used to give a bound on the amount of text the parser can possibly try to parse. When regex groups or expressions elements are encountered during this kind of search, they are simply added to the list of possible elements as-is.

  1. Regex group: if the next possible input is a text element, the regex will be matched against the substring from the current index to the beginning of the possible input, if it occurs. The regex group cannot currently handle the case of next possible inputs being expression elements. If the next possible input is a regex group, it is not currently handled to have the next possible inputs after it not be text elements.
  2. Expression element : for each text element possible input, the expression element will try to find if it occurs, and if it does, try to parse the substring from the current index to the beginning of the possible input. If a possible input is a regex group, the parser will look for a match of the regex group and try to parse the substring from the current index to the beginning of the match (obviously it will try all possible matches of the regex group). If the next element is an expression element, the parser will try to look at all strings that can be obtained by splitting at spaces. For example, given the text a tricky expression element in that context, the parser would try to parse a, a tricky and a tricky expression.

Examples using the "is set" condition %~objects% (is|are)[1:( not|n't)] set.

  1. The parser is trying to match against {variable::*} are set :
    1. The next possible inputs are is and are.
    2. The parser looks for an occurrence of is and doesn't find it.
    3. The parser looks for an occurrence of are, finds it, and tries to parse the substring {variable::*}. The parsing succeeds and thus the match as well.
  2. The parser is trying to match against ({var} is 1) is set :
    1. The next possible inputs are is and are.
    2. The parser finds a first occurrence of is, and tries to parse the substring ({var}. The match fails because of the parenthesis.
    3. The parser finds a later occurrence of is, and tries to parse the substring ({var} is 1). This time, the parsing and thus the match both succeed.
Clone this wiki locally