A Library/Ecosystem/Proof-of-Concept for working with Weighted Attribute Grammars. WAGon is designed for researchers and language designers experimenting with WAGs, so that they do not have to waste time creating a DSL and parser and can start experimenting with the format itself immediately.
You can read the paper at https://dulfer.be/wagon/paper.pdf, which explains in more detail what WAGs are, what WAGon is, and why you should care.
Simply clone this repository and make sure you have `rust`, `cargo` and `clippy` installed.
WAGon is split into several different crates, all of which are relevant for different purposes.
These crates are intended for use by the "end user": people who just want a finalized AST of their WAG that they can then use however they want.
`wagon-parser` holds the code for the main parser of the WAGon DSL. It can parse an input string into a valid AST, as well as perform some rewrite operations on it (for example, factoring out EBNF operators). Every node it parses is annotated with span information, so that proper error messages can be produced later.
`wagon-codegen` holds code that is generically useful for turning a WAGon AST (as created by the parser) into valid Rust code. More specifically, it provides code that automatically turns valid WAGon evaluation expressions into valid Rust code, which can then be used anywhere the user wants.
These crates are mostly used by other, more user-oriented crates. They can be used directly, but usually they can be ignored.
`wagon-lexer` is the lexer for the WAGon DSL. It takes the raw input string and converts it into a stream of tokens that the parser can use. It automatically performs mode-switching based on the context (whether we are lexing metadata, production rules, or mathematical evaluations).
`wagon-value` provides a generic enum for pseudo-dynamic typing. The end result of the code generated by `wagon-codegen` is of the type provided by `wagon-value`. It provides generic functions for the most basic use cases, but if more intricate types or computations are needed, you will probably have to read through this crate.
`wagon-ident` is a central repository for the identifier data structure. Identifiers are used throughout WAGon for various purposes, but most of the time you can just import the crate and be done with it.
`wagon-utils` simply provides various utility functions used throughout WAGon.
`wagon-macros` is a catch-all crate for any procedural macros needed anywhere in WAGon. This crate only exists because of limitations in Rust's procedural macro system.
These crates demonstrate how WAGon can be used, as discussed in my (not yet written) thesis. They provide an implementation of a GLL parser that uses a WAG for disambiguation.
`wagon-gll` is a generic GLL library written in Rust. It is used by any parsers generated by `wagon-gll-codegen`.
Technically, this library can be used completely removed from the WAGon ecosystem; it is a fully functioning GLL library. However, it does expect attributes and weights to exist. As such, it can fundamentally not be as efficient as a dedicated GLL library which does not have to consider these things.
`wagon-gll-codegen` is the crate that takes a valid WAGon AST and generates valid Rust code for a GLL parser. It is very complex and very specific to its use case, so most people can ignore this crate. However, the implementation in `nodes/assignment.rs` may prove a good example of how `wagon-codegen` can be used.
It also provides a binary which takes an input file and generates code for a fully functioning GLL parser. This binary can be interesting to look at in order to see how to convert an input file into a valid WAG, as well as how to nicely present error messages to your users.
WAGon (Weighted Attribute Grammar Oriented Notation (totally not a backronym)) is both an ecosystem and a DSL. In fact, the ecosystem only exists to support the DSL.
```
type: analytical;
=========================================
S -> A | B;
A -> {$did_a = true;} E<$did_a> B | ;
B -> {$did_a = false;} E<$did_a> A | ;
E<*did_a> -> [*did_a * 2 + 1] C
          |  [(!*did_a) * 3 - 4] D
          ;
C -> [0.3] F
   | [0.7] G
   ;
D -> [0.7] F
   | [0.3] G
   ;
F -> "1";
G -> "1";
```
The full DSL is described below, but reading it may be a bit much. While I have tried to make it intuitive, a human-readable explanation may be useful.
Every grammar file can optionally start with a metadata section. Inside of the metadata section, `key: value` pairs may be written (the use of which lies with the grammar designer). Additionally, includes can be written inside of the metadata section to specify that external files should be included in this grammar. However, as of writing, includes are not yet supported. The metadata section is delineated with 3 or more `=` signs.
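As a sketch, a minimal metadata section might look like this (the keys here are made up; any `key: value` pairs are allowed):

```
author: "me";
type: analytical;
=========================================
S -> "a";
```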
Production rules follow the pattern `Identifier -> Rule`. Alternative rules can be written either by using `|` or by using the same `Identifier` again. The whole rule should be closed with a `;`.
If any inherited or synthesized attributes are used in the rule, they should be declared on the left-hand side of the rule, as if they were type parameters (so between `<` and `>`, separated by `,`).
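For example, the following two (illustrative) fragments define the same grammar, once using `|` and once by repeating the identifier:

```
A -> "a" | "b";
```

```
A -> "a";
A -> "b";
```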
In the literature on Attribute Grammars, there are two types of attributes: inherited and synthesized. In WAGon, we extend these with two more: local and unknown.

| name | prefix | explanation |
|---|---|---|
| Inherited | `*` | Value is defined earlier in the grammar. Any changes stay only in the local scope. |
| Synthesized | `&` | Value is defined earlier in the grammar. Changes are passed "upward". |
| Local | `$` | Is defined inside of this scope. |
| Unknown | | Must be inferred from the rest of the grammar. |
The type of each attribute is defined inside of a rule. Inherited and Synthesized attributes should be specified on the left-hand side, while local and unknown attributes can be defined anywhere.
Any time attributes are passed to a non-terminal, they are specified as if we were calling a function with type parameters.
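For example (an illustrative rule), here `A` is "called" with an inherited and a local attribute:

```
S<*a> -> {$b = 0;} A<*a, $b>;
```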
The `Unknown` identifier is currently reserved for non-terminals. Theoretically, the `FirstPass` in `wagon-parser` could run through the "call stack" of the grammar and try to determine the type of each attribute automatically. An `Unknown` identifier would then be rewritten into any of the other types. This is currently not supported, however.
Note that the type of an attribute on the "calling" side may differ from that on the "definition" side. For example, we may have `S<*a, *b> -> A<*a, *b>;` and `A<&b, &c> -> ...;`.
In this case, for the "scope" of `S`, the attributes `*a` and `*b` are inherited, and any changes are thus not passed upwards. However, for the scope of `A`, `&b` and `&c` are synthesized, and changes are thus passed upwards. So if `&b` were to change, `*a` would also change inside of `S`'s scope.
Of course, this is an implementation detail. If you are working with a raw AST, you can define these things differently.
The DSL supports regex representations of terminals (for example, you could write `Digit -> /[0-9]/;`). A regex must be enclosed in `/` delimiters to differentiate it from a normal non-terminal.
The parser will parse this regex and construct a DFA from it using `regex_automata`. This DFA can then be used either directly, or compiled to bytes and loaded later on in some other Rust program. An example of the "compiled" approach can be found in `wagon-gll-codegen`.
At the time of writing, the DSL looks as follows (written itself in the WAGon DSL):
```
Wag -> Metadata? Rule*;

// Metadata Section
Metadata -> Meta* MetaDelim;
MetaDelim -> "==" "="+;
Meta -> Include | Config;
Include -> Identifier? "::" Identifier Include?;
Config -> Identifier ":" Expression ";";

// Production Rules
Rule -> Identifier NTArgs? "->" Rhs;
Rhs -> Weight? Chunk* "|" Rhs
     | Weight? Chunk* ";"
     ;
Weight -> "[" Expression "]";
Chunk -> ChunkP EbnfType?;
EbnfType -> "+" | "*" | "?";
ChunkP -> Symbol
        | "(" Chunk* ")"
        ;
Symbol -> NonTerminal
        | Terminal
        | Assignment
        | // This is an empty rule, aka ε aka epsilon.
        ;
NonTerminal -> Identifier NTArgs?;
NTArgs -> "<" AttrIdentifierList ">";
AttrIdentifierList -> AttrIdentifier "," AttrIdentifierList
                    | AttrIdentifier
                    ;
Terminal -> "/" /[^/]*/ "/" // Regex
          | String
          ;
Assignment -> "{" (AttrIdentifier "=" Expression ";")* "}";

// Attribute Evaluation
Expression -> SubProc
            | If
            | Disjunct
            ;
SubProc -> "$(" /[^)]*/ ")";
If -> "if" Disjunct "then" Disjunct ("else" Expression)?;
Disjunct -> Conjunct ("||" Disjunct)?;
Conjunct -> Inverse ("&&" Conjunct)?;
Inverse -> "!"? Comparison;
Comparison -> Sum (CompOp Sum)?;
CompOp -> "<" | "<=" | ">" | ">=" | "==" | "!=" | "in";
Sum -> Term SumP?;
SumP -> SumOp Term SumP?;
SumOp -> "+" | "-";
Term -> Factor TermP?;
TermP -> TermOp Factor TermP?;
TermOp -> "*" | "//" | "/" | "%";
Factor -> Atom ("**" Factor)?;
Atom -> AttrIdentifier
      | Dictionary
      | Bool
      | Num
      | Float
      | String
      | "(" Expression ")"
      ;
Identifier -> /[a-zA-Z][a-zA-Z0-9_]*/;
AttrIdentifier -> AttrSpec? Identifier;
AttrSpec -> "$" | "*" | "&";
Dictionary -> AttrIdentifier "[" Expression "]";
Bool -> "true" | "false";
Num -> /[0-9]+/;
Float -> /[0-9]+\.[0-9]+/;
String -> '"' /[^"]*/ '"'
        | "'" /[^']*/ "'"
        ;
```
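As an illustration of the expression language, a weight can be any expression; for instance (a made-up rule, assuming `*n` holds a number):

```
A<*n> -> [if *n > 3 then 0.8 else 0.2] "a"
       | [1 - (*n / 10)] "b"
       ;
```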
Currently, the following features are not supported, but they were planned. As such, skeletons for implementations exist throughout the library.
At the moment, each rule uses the `->` arrow. However, the following arrows were also designed (and have support in the parser).
A `=>` rule means that this rule is "generative". It differs from "analytic" rules in that it is intended specifically to generate data, as opposed to parsing it.
WAGon was designed with Module Grammars in mind. As such, the following arrow types exist (as defined in the paper):

- `<-` - Import by reference
- `<=` - Import by clone
- `<<` - Recursive import-by-clone

Additionally, in order to specify rules that should not be imported, one can use the `<\` arrow, as well as a series of non-terminals separated by `&`.
The attribute evaluation language is quite limited. To get around this limitation, it was decided to allow grammar designers to shell out to arbitrary shell scripts. These scripts should print out JSON data, which can then be parsed and put into a dictionary. A skeleton implementation exists in `wagon-codegen`, but it was never tested or completed.
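Going by the `SubProc` rule in the grammar above (`"$(" /[^)]*/ ")"`), such a call would presumably have looked something like this (the script name is made up, and the semantics were never finalized):

```
A -> {$data = $(./lookup.sh);} B<$data>;
```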
There is no type checking functionality. For now, it is assumed that grammars are properly written (they are not, but we assume it anyway). As a result, value errors should be properly handled inside of any generated code.
As discussed in the section on Attribute Identifiers, it was intended that the type of an attribute could be inferred automatically. This is not currently supported.