Design
The parser is consistently a "push" design: the network layer or script sends buffers to the tokenizer, which sends tokens to the tree builder, which sends tree construction ops to script. The "sending" here is just a method call, but the tree op method will do an actual message send in the case of off-thread parsing. The use of traits allows swapping in other consumers of tokens and tree ops, e.g. a test harness.
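The trait-based push pipeline can be sketched as follows. This is an illustrative simplification, not the crate's actual API: `Token`, `TokenSink`, `Tokenizer`, and `Recorder` are stand-in names, and the real token and sink types are far richer. The point is that any `TokenSink` implementation, such as a test harness, can stand in for the tree builder.

```rust
// Simplified stand-in types; not the real API.
#[derive(Debug, PartialEq)]
enum Token {
    Char(char),
}

// A consumer of tokens. Swapping in another implementation is how a
// test harness can replace the real tree builder.
trait TokenSink {
    fn process_token(&mut self, token: Token);
}

struct Tokenizer<S: TokenSink> {
    sink: S,
}

impl<S: TokenSink> Tokenizer<S> {
    // "Push" design: the caller hands in a buffer, and tokens flow to
    // the sink by direct method call.
    fn feed(&mut self, input: &str) {
        for c in input.chars() {
            self.sink.process_token(Token::Char(c));
        }
    }
}

// Test-harness sink that just records what it receives.
struct Recorder {
    tokens: Vec<Token>,
}

impl TokenSink for Recorder {
    fn process_token(&mut self, token: Token) {
        self.tokens.push(token);
    }
}

fn tokenize(input: &str) -> Vec<Token> {
    let mut t = Tokenizer { sink: Recorder { tokens: Vec::new() } };
    t.feed(input);
    t.sink.tokens
}

fn main() {
    assert_eq!(tokenize("hi"), vec![Token::Char('h'), Token::Char('i')]);
    println!("ok");
}
```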
The tokenizer holds a queue of uniquely-owned buffers. Using this rather than a single buffer allows us to avoid intermediate copies. The points at which input is broken into discrete buffers should have no effect on the output; this is tested by the tokenizer test runner. At any time the tokenizer may get stuck waiting on additional input, so all of its state lives in a struct that persists across method calls.
This would be much cleaner using tasks as coroutines, but that would impose extra requirements on the library consumer.
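The buffer queue and persistent state can be sketched like this. Again a hedged simplification with invented names: the real tokenizer has dozens of states and a much larger struct. What it shows is the two properties described above: buffers are moved into a queue rather than copied into one big string, and if input ends mid-token the state simply survives until the next `feed()` call, so buffer boundaries never change the output.

```rust
use std::collections::VecDeque;

// Minimal two-state machine: text characters vs. a tag name.
enum State {
    Data,
    TagOpen,
}

struct Tokenizer {
    buffers: VecDeque<String>, // queue of uniquely-owned buffers
    state: State,              // persists across feed() calls
    tag_name: String,          // partial tag name, also persistent
    out: Vec<String>,          // stand-in for emitted tokens
}

impl Tokenizer {
    fn new() -> Tokenizer {
        Tokenizer {
            buffers: VecDeque::new(),
            state: State::Data,
            tag_name: String::new(),
            out: Vec::new(),
        }
    }

    // Ownership of the buffer moves in; no intermediate copy is made.
    fn feed(&mut self, buf: String) {
        self.buffers.push_back(buf);
        self.run();
    }

    fn run(&mut self) {
        while let Some(buf) = self.buffers.pop_front() {
            for c in buf.chars() {
                match self.state {
                    State::Data if c == '<' => self.state = State::TagOpen,
                    State::Data => self.out.push(format!("char {}", c)),
                    State::TagOpen if c == '>' => {
                        self.out.push(format!("tag {}", self.tag_name));
                        self.tag_name.clear();
                        self.state = State::Data;
                    }
                    State::TagOpen => self.tag_name.push(c),
                }
            }
        }
        // If input ran out mid-tag, we just return; the partial state
        // waits here until the next feed() supplies more characters.
    }
}

fn main() {
    let mut t = Tokenizer::new();
    t.feed("<b".to_string());   // ends mid-tag
    t.feed("r>x".to_string());  // resumes where we left off
    assert_eq!(t.out, vec!["tag br".to_string(), "char x".to_string()]);
    println!("ok");
}
```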
Buffers will be tagged with the IDs of running scripts, so that document.write can insert characters in the right place.

Buffers currently use Rust's native UTF-8 ~str type, but may change to UCS-2. I'm avoiding writing code that would make this switch difficult. There's more discussion on the mailing list.
Input encoding detection is not part of this codebase yet. It seems pretty orthogonal and will probably happen after the new parser lands in Servo.
The tokenizer is coded as a very direct translation of the state machine in the spec, using macros to condense the common state machine actions. It will probably grow special cases to handle hot paths (e.g. long runs in Data state) without re-dispatching on state after every character.
Character references have their own state machine within the tokenizer. It uses rust-phf to build a static map covering the several thousand character reference names and all their prefixes.
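To make the idea concrete, here is a std-only stand-in for that static map; the real code uses a phf-generated compile-time hash map with thousands of entries. Storing the prefixes too (as the paragraph above notes) lets the tokenizer's character-reference state machine detect, one character at a time, when no longer match is possible. The entries below are genuine HTML named references, but the table shape and `lookup` function are illustrative only.

```rust
// Tiny stand-in for the phf-built static map. The real table covers
// several thousand names plus all their prefixes; prefix entries let
// the state machine stop as soon as no completion exists.
static NAMED_REFS: &[(&str, &str)] = &[
    ("amp;", "&"),
    ("gt;", ">"),
    ("lt;", "<"),
    ("nbsp;", "\u{a0}"),
];

fn lookup(name: &str) -> Option<&'static str> {
    NAMED_REFS
        .iter()
        .find(|&&(n, _)| n == name)
        .map(|&(_, v)| v)
}

// True if some entry begins with `prefix`, i.e. it is worth reading
// more characters before giving up on a named reference.
fn has_prefix(prefix: &str) -> bool {
    NAMED_REFS.iter().any(|&(n, _)| n.starts_with(prefix))
}

fn main() {
    assert_eq!(lookup("amp;"), Some("&"));
    assert_eq!(lookup("bogus;"), None);
    assert!(has_prefix("nb"));   // "nbsp;" could still match
    assert!(!has_prefix("zz"));  // dead end: stop consuming input
    println!("ok");
}
```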