WAGon

A Library/Ecosystem/Proof-of-Concept for working with Weighted Attribute Grammars (WAGs). WAGon is designed for researchers and language designers experimenting with WAGs, so that they do not have to waste time creating a DSL and parser and can start experimenting with the format itself right away.

You can read the paper at https://dulfer.be/wagon/paper.pdf, which explains in more detail what WAGs are, what WAGon is, and why you should care.

Installation

Simply clone this repository and make sure you have Rust, Cargo and Clippy installed.

Structure of the ecosystem

WAGon is split into several different crates, all of which are relevant for different purposes.

Researcher-Oriented

These crates are intended for use by the "end user": people who just want a finalized AST of their WAG that they can then do with whatever they want.

wagon-parser

wagon-parser holds the code for the main parser of the WAGon DSL. It can parse an input string into a valid AST, as well as perform some rewrite operations on it (for example, factoring out EBNF operators). Every node it parses is annotated with span information, so that proper error messages can be produced later.

wagon-codegen

wagon-codegen holds code that is generically useful for turning a WAGon AST (as created by the parser) into valid Rust code.

More specifically, it provides code that automatically turns valid WAGon evaluation expressions into valid Rust code, which can then be used anywhere the user wants.

Backend

These are crates that are mostly used by other, more user-oriented crates. They can be used directly, but usually they can be ignored.

wagon-lexer

wagon-lexer is the lexer for the WAGon DSL. It takes the raw input string and converts it into a stream of tokens that the parser can use. It automatically performs mode-switching based on the context (whether we are lexing metadata, production rules, or mathematical evaluations).

wagon-value

wagon-value provides a generic enum for pseudo-dynamic typing. The end result of the code generated by wagon-codegen is of the type provided by wagon-value. It provides generic functions for the most basic use cases, but if more intricate types or computations are needed, you will probably have to read through this crate.

wagon-ident

wagon-ident is a central crate providing the data structure for identifiers.

Identifiers are used all throughout WAGon for various purposes, but most of the time you can just import the crate and be done with it.

wagon-utils

wagon-utils simply provides various utility functions used throughout WAGon.

wagon-macros

wagon-macros is a catch-all crate for any procedural macros needed anywhere in WAGon. This crate only exists because of limitations in Rust's procedural macro system.

Proof-of-Concept

These are crates that demonstrate how WAGon can be used, as discussed in my (not yet written) thesis. They provide an implementation of a GLL parser that uses a WAG for disambiguation.

wagon-gll

wagon-gll is a generic GLL library written in Rust. It is used by any parsers generated by wagon-togll.

Technically, this library can be used completely separately from the WAGon ecosystem; it is a fully functioning GLL library. However, it does expect attributes and weights to exist. As such, it fundamentally cannot be as efficient as a dedicated GLL library which does not have to consider these things.

wagon-togll

wagon-togll is the crate that takes a valid WAGon AST and generates valid Rust code for a GLL parser. It is very complex and very specific to its use case, so most people can ignore this crate. However, the implementation in nodes/assignment.rs may prove a good example of how wagon-codegen can be used.

It also provides a binary which takes an input file and generates code for a fully functioning GLL parser. It can be interesting to look at in order to see how to convert an input file into a valid WAG, as well as how to present error messages nicely to your users.

The WAGon DSL

WAGon (Weighted Attribute Grammar Oriented Notation (totally not a backronym)) is both an ecosystem and a DSL. In fact, the ecosystem only exists to support the DSL.

Example Grammar

type: analytical;
=========================================

S -> A | B;
A -> {$did_a = true;} E<$did_a> B | ;
B -> {$did_a = false;} E<$did_a> A | ;
E<*did_a> -> [*did_a * 2 + 1] C 
  | [(!*did_a) * 3 - 4] D
  ;
C -> [0.3] F 
  | [0.7] G
  ;
D -> [0.7] F
  | [0.3] G
  ;
F -> "1";
G -> "1";

Human Explanation

The full DSL is described below, but reading it in full may be a bit much. While I have tried to make it intuitive, a human-readable explanation may be useful.

Metadata

Every grammar file can optionally start with a metadata section. Inside this section, key: value pairs may be written (how they are used is up to the grammar designer). Additionally, includes can be written inside the metadata section to specify that external files should be included in this grammar. However, as of writing, includes are not yet supported.

The metadata section is delineated with 3 or more = signs.
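
For illustration, a minimal metadata section might look like the sketch below. The keys shown here are made up; apart from type (used in the example grammar above), their meaning is entirely up to the grammar designer.

type: analytical;
author: 'example';
====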

Production Rules

Production Rules follow the pattern Identifier -> Rule. Alternate rules can be written using either | or by using the same Identifier again. The whole rule should be closed with a ;.

If any inherited or synthesized attributes are used in the rule, they should be declared on the left-hand side of the rule, as if they were type parameters (so between <>, separated by ,).
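
As a made-up sketch, the first two rules below show both ways of writing alternatives for S (using | and repeating the identifier). A declares the inherited attribute *n on its left-hand side, and S passes its local $n along when calling A.

S -> {$n = 0;} A<$n> | B;
S -> C;
A<*n> -> "a";
B -> "b";
C -> "c";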

Attribute Identifiers

In the literature on Attribute Grammars, there are 2 types of attributes: inherited and synthesized. In WAGon, we extend these with 2 more: local and unknown:

Name        | Prefix | Explanation
Inherited   | *      | Value is defined earlier in the grammar. Any changes stay only in the local scope.
Synthesized | &      | Value is defined earlier in the grammar. Changes are passed "upward".
Local       | $      | Value is defined inside of this scope.
Unknown     | (none) | Must be inferred from the rest of the grammar.

The type of each attribute is defined inside of a rule. Inherited and Synthesized attributes should be specified on the left-hand side, while local and unknown attributes can be defined anywhere.

Any time attributes are passed to a non-terminal, they are specified as if we are calling a function with type parameters.
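
A small made-up sketch tying this together: A declares the inherited attribute *inh on its left-hand side, defines the local attribute $loc in an assignment block, and then passes both on to B as if they were type parameters.

A<*inh> -> {$loc = *inh + 1;} B<*inh, $loc>;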

The Unknown identifier is currently reserved for non-terminals. Theoretically, the FirstPass in wagon-parser could run through the "call stack" of the grammar and try to automatically determine the type of each attribute. An Unknown identifier would then be rewritten into any of the other types. This is currently not supported, however.

Scope

Note that the type of attribute on the "calling" side may differ from that of the "definition" side. For example, we may have S<*a, *b> -> A<*a, *b>; and A<&b, &c> -> ...;. In this case, for the "scope" of S, the attributes *a and *b are inherited, and any changes are thus not passed upwards. However, for the scope of A, &b and &c are synthesized and changes are thus passed upwards. So if &b were to change, *a would also change inside of S's scope.
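
Written out as a fragment (the body of A is left open in the text above, so the terminal "a" below is only a placeholder):

S<*a, *b> -> A<*a, *b>; // within S's scope: inherited
A<&b, &c> -> "a";       // within A's scope: synthesized; changes to &b and &c flow back to *a and *b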

Of course, this is an implementation detail. If you are working with a raw AST, you can define these things differently.

Regexes

The DSL supports regex representations of terminals (for example, you could write Digit -> /[0-9]/;). A regex must be enclosed by // to differentiate it from a normal non-terminal.
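
As another made-up example, a regex terminal can be mixed freely with string terminals and non-terminals:

Assign -> Name "=" /[0-9]+/ ";";
Name -> /[a-z_][a-z0-9_]*/;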

The parser will parse this regex and construct a DFA from it using regex_automata. This DFA can then be used either directly, or compiled to bytes and loaded later on in some other Rust program. An example of the "compiled" approach can be found in wagon-togll.

Full (Supported) DSL

At the time of writing, the DSL looks as follows (written itself in the WAGon DSL):

Wag                -> Metadata? Rule*;

// Metadata Section
Metadata           -> Meta* MetaDelim;
MetaDelim          -> "==" "="+;
Meta               -> Include | Config;
Include            -> Identifier? "::" Identifier Include?;
Config             -> Identifier ":" Expression ";";

// Production Rules
Rule               -> Identifier NTArgs? "->" Rhs;
Rhs                -> Weight? Chunk* "|" Rhs
	               |  Weight? Chunk* ";"
	               ;
Weight             -> "[" Expression "]";
Chunk              -> ChunkP EbnfType?;
EbnfType           -> "+" | "*" | "?";
ChunkP             -> Symbol
 	               |  "(" Chunk* ")"
 	               ;
Symbol             -> NonTerminal
                   |  Terminal
                   |  Assignment
                   |  // This is an empty rule, aka ε aka epsilon.
                   ;

NonTerminal        -> Identifier NTArgs?;
NTArgs             -> "<" AttrIdentifierList ">";
AttrIdentifierList -> AttrIdentifier "," AttrIdentifierList
                   |  AttrIdentifier
                   ;
Terminal           -> "/" /[^/]*/ "/" // Regex
                   |  String
                   ;
Assignment         -> "{" (AttrIdentifier "=" Expression ";")* "}";

// Attribute Evaluation
Expression         -> SubProc
                   |  If
                   |  Disjunct
                   ;
SubProc            -> "$(" /[^)]*/ ")";
If                 -> "if" Disjunct "then" Disjunct ("else" Expression)?;
Disjunct           -> Conjunct ("&&" Disjunct)?;
Conjunct           -> Inverse ("||" Conjunct)?;
Inverse            -> "!"? Comparison;
Comparison         -> Sum (CompOp Sum)?;
CompOp             -> "<" | "<=" | ">" | ">=" | "==" | "!=" | "in";
Sum                -> Term SumP?;
SumP               -> SumOp Term SumP?;
SumOp              -> "+" | "-";
Term               -> Factor TermP?;
TermP              -> TermOp Factor TermP?;
TermOp             -> "*" | "//" | "/" | "%";
Factor             -> Atom ("**" Factor)?;
Atom               -> AttrIdentifier
                   |  Dictionary
                   |  Bool
                   |  Num
                   |  Float
                   |  String
                   |  "(" Expression ")"
                   ;

Identifier         -> /[a-zA-Z][a-zA-Z0-9_]*/;
AttrIdentifier     -> AttrSpec? Identifier;
AttrSpec           -> "$" | "*" | "&";

Dictionary         -> AttrIdentifier "[" Expression "]";
Bool               -> "true" | "false";
Num                -> /[0-9]+/;
Float              -> /[0-9]+\.[0-9]+/;
String             -> '"' /[^"]*/ '"'
                   |  "'" /[^']*/ "'"
                   ;

Limitations

Currently, the following features are not supported, even though they were planned. As such, skeleton implementations exist throughout the library.

Different arrow types

At the moment, each rule uses the -> arrow. However, the following arrows were also designed (and have support in the parser).

Generative

A => rule means that this rule is "generative". It differs from "analytic" rules in that it is intended specifically to generate data, as opposed to parsing it.
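
A hypothetical sketch of what such a rule would look like (the parser accepts the arrow, but nothing is generated from it yet):

Greeting => "hello" | "hi";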

Imports

WAGon was designed with Module Grammars in mind. As such, the following arrow types exist (as defined in the paper):

  • <- - Import by reference
  • <= - Import by clone
  • << - Recursive import-by-clone

Additionally, in order to specify rules that should not be imported, one can use the <\ arrow, followed by a series of non-terminals separated by &.
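
Purely as a hypothetical sketch of the intended notation (the actual semantics are defined in the paper, and none of this is implemented), the arrows would be used in rule position like this:

A <- B;     // import by reference
A <= B;     // import by clone
A << B;     // recursive import-by-clone
A <\ C & D; // do not import C and D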

Subshells

The attribute evaluation language is quite limited. To get around this limitation, it was decided to allow grammar designers to shell out to arbitrary shell scripts. These scripts should print JSON data, which can then be parsed and put into a dictionary. A skeleton implementation exists in wagon-codegen, but it was never tested or completed.
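
Going by the SubProc rule in the full DSL above, such a call would sit wherever an expression is expected. A made-up, untested sketch (the command shown is arbitrary, as long as it prints JSON):

A -> {$conf = $(cat config.json);} B<$conf>;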

Type checking

There is no type checking functionality. For now, it is assumed that grammars are properly written (they are not, but we assume it anyway). As a result, value errors should be handled properly inside any generated code.

Attribute Type Inference

As discussed in the section on Attribute Identifiers, it was intended that the type of an attribute could be inferred automatically. This is not currently supported.