Keegan McAllister edited this page Apr 1, 2014 · 2 revisions

The parser is consistently a "push" design. Net or script sends buffers to the tokenizer, which sends tokens to the tree builder, which sends tree construction ops to script. The "sending" here is just a method call, but the tree op method will do an actual message send in the case of off-thread parsing. The use of traits allows swapping in other consumers of tokens and tree ops, e.g. a test harness.
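As a rough illustration of the push design, here is a minimal sketch in which each stage is a trait and "sending" is a method call. All names (`Token`, `TokenSink`, `TreeOp`, `TreeSink`, `RecordingSink`, `ChannelSink`, `TreeBuilder`) are made up for this example and are not the parser's actual API; the token and tree op types are drastically simplified.

```rust
use std::sync::mpsc::Sender;

/// A token pushed from the tokenizer to the tree builder (simplified).
#[derive(Debug, Clone, PartialEq)]
pub enum Token {
    StartTag(String),
    EndTag(String),
    Characters(String),
}

/// Anything that can receive tokens pushed by the tokenizer.
pub trait TokenSink {
    fn process_token(&mut self, token: Token);
}

/// A tree construction op pushed from the tree builder to script (simplified).
#[derive(Debug, Clone, PartialEq)]
pub enum TreeOp {
    AppendElement(String),
    AppendText(String),
    Pop,
}

/// Anything that can receive tree ops pushed by the tree builder.
pub trait TreeSink {
    fn process_op(&mut self, op: TreeOp);
}

/// Test-harness consumer: just records every op it is handed.
#[derive(Default)]
pub struct RecordingSink {
    pub ops: Vec<TreeOp>,
}

impl TreeSink for RecordingSink {
    fn process_op(&mut self, op: TreeOp) {
        self.ops.push(op);
    }
}

/// Off-thread consumer: here the method call becomes a real message send.
pub struct ChannelSink {
    pub tx: Sender<TreeOp>,
}

impl TreeSink for ChannelSink {
    fn process_op(&mut self, op: TreeOp) {
        // Ignore the error if the receiving side has gone away.
        let _ = self.tx.send(op);
    }
}

/// Toy tree builder: consumes tokens and pushes tree ops into its sink.
pub struct TreeBuilder<S: TreeSink> {
    pub sink: S,
}

impl<S: TreeSink> TokenSink for TreeBuilder<S> {
    fn process_token(&mut self, token: Token) {
        let op = match token {
            Token::StartTag(name) => TreeOp::AppendElement(name),
            Token::EndTag(_) => TreeOp::Pop,
            Token::Characters(text) => TreeOp::AppendText(text),
        };
        self.sink.process_op(op);
    }
}
```

Because `TreeBuilder` is generic over its sink, the same builder code can feed a `RecordingSink` in tests or a `ChannelSink` when parsing off-thread.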

The tokenizer holds a queue of uniquely-owned buffers. Using this rather than a single buffer allows us to avoid intermediate copies. The points at which input is broken into discrete buffers should have no effect on the output; this is tested by the tokenizer test runner. At any time the tokenizer may get stuck waiting on additional input, so all of its state lives in a struct that persists across method calls.
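To make the buffer queue and the persistent state concrete, here is a toy sketch. It is not the real tokenizer: the `Tokenizer`, `Token`, `State`, and `tokenize_chunks` names are hypothetical, the grammar only knows `<name>` tags and text, and for simplicity it copies characters into a `pending` string, which the real design tries to avoid. The point is only that all state lives in the struct, so running out of input is just a return from `feed`.

```rust
use std::collections::VecDeque;

#[derive(Debug, Clone, PartialEq)]
pub enum Token {
    Tag(String),
    Text(String),
}

#[derive(Clone, Copy)]
enum State {
    Data,
    TagName,
}

pub struct Tokenizer {
    queue: VecDeque<String>, // uniquely-owned input buffers
    pos: usize,              // byte offset into the front buffer
    state: State,            // survives across calls to `feed`
    pending: String,         // partial token carried across buffers
    pub tokens: Vec<Token>,
}

impl Tokenizer {
    pub fn new() -> Self {
        Tokenizer {
            queue: VecDeque::new(),
            pos: 0,
            state: State::Data,
            pending: String::new(),
            tokens: Vec::new(),
        }
    }

    /// Hand over ownership of one more buffer and run until input runs out.
    pub fn feed(&mut self, buf: String) {
        self.queue.push_back(buf);
        while let Some(c) = self.next_char() {
            match (self.state, c) {
                (State::Data, '<') => {
                    self.flush_text();
                    self.state = State::TagName;
                }
                (State::TagName, '>') => {
                    let name = std::mem::take(&mut self.pending);
                    self.tokens.push(Token::Tag(name));
                    self.state = State::Data;
                }
                (_, c) => self.pending.push(c),
            }
        }
        // Stuck waiting for input: nothing to unwind, the state is in `self`.
    }

    /// Signal end of input and flush any trailing text.
    pub fn end(&mut self) {
        if let State::Data = self.state {
            self.flush_text();
        }
    }

    /// Pop the next character, moving on to the next buffer when one is used up.
    fn next_char(&mut self) -> Option<char> {
        loop {
            let front = self.queue.front()?;
            let next = front[self.pos..].chars().next();
            match next {
                Some(c) => {
                    self.pos += c.len_utf8();
                    return Some(c);
                }
                None => {
                    self.queue.pop_front();
                    self.pos = 0;
                }
            }
        }
    }

    fn flush_text(&mut self) {
        if !self.pending.is_empty() {
            let text = std::mem::take(&mut self.pending);
            self.tokens.push(Token::Text(text));
        }
    }
}

/// Tokenize input that arrives as the given sequence of chunks.
pub fn tokenize_chunks(chunks: &[&str]) -> Vec<Token> {
    let mut t = Tokenizer::new();
    for chunk in chunks {
        t.feed(chunk.to_string());
    }
    t.end();
    t.tokens
}
```

A test in the spirit of the tokenizer test runner would split the same input at every possible point and check that `tokenize_chunks` returns identical tokens each time.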

Keeping that state in an explicit struct is less clean than running the tokenizer as a coroutine on its own task, but using tasks would impose extra requirements on the library consumer.

Buffers will be tagged with IDs of running scripts, so that document.write can insert characters in the right place.

Buffers currently use Rust's native UTF-8 ~str type, but may change to UCS-2. I'm avoiding writing code that would make this switch difficult. There's more discussion on the mailing list.

Input encoding detection is not part of this codebase yet. It seems pretty orthogonal and will probably happen after the new parser lands in Servo.

The tokenizer is coded as a very direct translation of the state machine in the spec, using macros to condense the common state machine actions. It will probably grow special cases to handle hot paths (e.g. long runs in Data state) without re-dispatching on state after every character.
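The following sketch shows the general idea of condensing state machine actions with a macro, plus a fast path for long runs in the Data state. It is a toy with two states, not a translation of the spec; the `go!` action vocabulary (`emit`, `push_name`, `emit_tag`, `to`) and the `Machine`, `Event`, and `State` types are assumptions made for this example.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum Event {
    Text(String),
    Tag(String),
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum State {
    Data,
    TagName,
}

pub struct Machine {
    pub state: State,
    pub name: String,
    pub out: Vec<Event>,
}

// Each rule handles one action, then recurses on the remaining actions.
macro_rules! go {
    ($me:ident :) => {};
    ($me:ident : emit $c:expr ; $($rest:tt)*) => {{
        $me.emit_char($c);
        go!($me: $($rest)*);
    }};
    ($me:ident : push_name $c:expr ; $($rest:tt)*) => {{
        $me.name.push($c);
        go!($me: $($rest)*);
    }};
    ($me:ident : emit_tag ; $($rest:tt)*) => {{
        let name = std::mem::take(&mut $me.name);
        $me.out.push(Event::Tag(name));
        go!($me: $($rest)*);
    }};
    ($me:ident : to $s:ident ; $($rest:tt)*) => {{
        $me.state = State::$s;
        go!($me: $($rest)*);
    }};
}

impl Machine {
    pub fn new() -> Self {
        Machine { state: State::Data, name: String::new(), out: Vec::new() }
    }

    /// Append text, merging with a preceding text event if there is one.
    fn emit_text(&mut self, s: &str) {
        match self.out.last_mut() {
            Some(Event::Text(t)) => t.push_str(s),
            _ => self.out.push(Event::Text(s.to_string())),
        }
    }

    fn emit_char(&mut self, c: char) {
        let mut buf = [0u8; 4];
        self.emit_text(c.encode_utf8(&mut buf));
    }

    /// One state machine step: dispatch on (state, character).
    pub fn step(&mut self, c: char) {
        match (self.state, c) {
            (State::Data, '<') => go!(self: to TagName;),
            (State::Data, c) => go!(self: emit c;),
            (State::TagName, '>') => go!(self: emit_tag; to Data;),
            (State::TagName, c) => go!(self: push_name c;),
        }
    }

    /// Reference path: re-dispatch on state after every character.
    pub fn feed_slow(&mut self, input: &str) {
        for c in input.chars() {
            self.step(c);
        }
    }

    /// Same output, but long runs in the Data state are consumed in one go.
    pub fn feed(&mut self, input: &str) {
        let mut rest = input;
        while !rest.is_empty() {
            if self.state == State::Data {
                let run = rest.find('<').unwrap_or(rest.len());
                if run > 0 {
                    self.emit_text(&rest[..run]);
                    rest = &rest[run..];
                    continue;
                }
            }
            let c = rest.chars().next().unwrap();
            self.step(c);
            rest = &rest[c.len_utf8()..];
        }
    }
}
```

Keeping `feed_slow` around as the plain translation gives a simple check that a special case such as `feed` does not change the output.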

Character references have their own state machine within the tokenizer. It uses rust-phf to build a static map for the several thousand character names and all their prefixes.
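A sketch of why the map contains every prefix as well as every name: the matcher can then look up the characters seen so far after each new character, stop as soon as the lookup fails, and remember the longest full name it passed. This example swaps in a runtime `std::collections::HashMap` for the static rust-phf map so that it stays self-contained, uses only a handful of entries, and the `build_map` and `match_named` helpers and the `Option<char>` value type are assumptions for illustration.

```rust
use std::collections::HashMap;

/// Build a map holding every name and every proper prefix of a name.
/// Full names map to `Some(char)`; prefixes that are not names map to `None`.
pub fn build_map(entities: &[(&str, char)]) -> HashMap<String, Option<char>> {
    let mut map = HashMap::new();
    for &(name, c) in entities {
        // Proper, non-empty prefixes: do not overwrite an existing full name.
        for (i, _) in name.char_indices().skip(1) {
            map.entry(name[..i].to_string()).or_insert(None);
        }
        map.insert(name.to_string(), Some(c));
    }
    map
}

/// Match the text following `&`. Returns the character for the longest
/// matching name and the number of bytes that name occupies.
pub fn match_named(
    map: &HashMap<String, Option<char>>,
    input: &str,
) -> Option<(char, usize)> {
    let mut best = None;
    let mut name = String::new();
    for c in input.chars() {
        name.push(c);
        match map.get(&name) {
            // Not a prefix of any name: no longer match is possible.
            None => break,
            // A prefix, but not a name by itself: keep reading.
            Some(None) => {}
            // A complete name: remember it, a longer one may still follow.
            Some(Some(ch)) => best = Some((*ch, name.len())),
        }
    }
    best
}

/// A few entries from the named character reference table.
pub fn sample_map() -> HashMap<String, Option<char>> {
    build_map(&[
        ("amp;", '&'),
        ("amp", '&'),
        ("lt;", '<'),
        ("lt", '<'),
        ("not", '¬'),
        ("not;", '¬'),
        ("notin;", '∉'),
    ])
}
```

With this map, the input `notit;` matches `not` and leaves `it;` as ordinary text, while `notin;` matches the longer name. The cost of storing all prefixes is a larger table; the benefit is that each step of the state machine is a single lookup.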
