-
The std::io::BufRead trait exposes two low-level methods, fill_buf and consume. The std::io::BufReader struct can be used to wrap any type that implements Read, adding buffering on top of it.
It sounds like the std::io::Cursor struct allows seeking to an arbitrary location in an in-memory buffer. I can see this potentially being useful at some point, but the relevance isn't immediately obvious.
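To make these pieces concrete, here is a small sketch of how they fit together (hypothetical usage, not code from any of the projects discussed here):

```rust
use std::io::{BufRead, BufReader, Cursor, Read, Seek, SeekFrom};

fn main() -> std::io::Result<()> {
    // BufReader implements BufRead on top of any Read implementation.
    let mut reader = BufReader::new(Cursor::new(b"<html><body>".to_vec()));

    // The two required BufRead methods: fill_buf exposes the internal
    // buffer without copying; consume reports how much of it we used.
    let buf = reader.fill_buf()?;
    println!("peeked {} bytes", buf.len());
    reader.consume(6); // pretend we tokenized "<html>"

    // Cursor adds Seek on top of an in-memory buffer, so a later stage
    // could jump back to an arbitrary position.
    let mut cursor = reader.into_inner();
    cursor.seek(SeekFrom::Start(0))?;
    let mut text = String::new();
    cursor.read_to_string(&mut text)?;
    println!("{text}");
    Ok(())
}
```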
-
Is this relevant to a tokenization pipeline? Maybe.
-
The Servo tokenization code largely delegates to html5ever for the lower-level operations. By the time we get to the Servo code, there is little left to be done. Possible next steps:
I would like to explore the other approaches before looking more closely at either of these possible next steps.
-
In the Gecko HTML5 parser, UTF-8 and UTF-16 are both used internally. It seems that UTF-8 is used by the streaming parser, and UTF-16 is used for inner-loop code, although this is only an impression. One buffer class that is passed around appears to be the nsHtml5UTF16Buffer class. This call site makes me wonder whether UTF-16 is the internal charset used by the Gecko HTML5 parsing code once you get past the streaming parser. By contrast, this method uses a UTF-8 buffer. This line, where a UTF-8 buffer passed as an argument is being read into a UTF-16 buffer, provides additional reason to think that the internal buffer used for HTML parsing is indeed working with UTF-16.

This is one of the places where the encoding is selected. It seems that UTF-8 is an option, at least at the start. There are various possible sources of the charset to be used, including the "channel," and heuristics are used to determine which charset to go with.

There is no token builder class, as far as I can tell. Instead, the way the tokenizer works is to tokenize a buffer, ultimately from an array of pending buffers, and then to call methods directly on the nsHtml5TreeBuilder.

One thing I noticed about the Gecko code base is that on classes with a lot of member fields, such as the HTML parser, the fields are set one by one in a long method that initializes everything. One thought here is that Gecko might not be using enough helper classes, so there are a large number of member fields and classes have a lot of responsibilities. Another thought is that it might be good to use the builder pattern in this situation, with a builder whose setters you chain before constructing the object (see the sketch below).

In the streaming parser, a lot of buffer handling can be seen in calling code, which makes me think the class abstraction being used is too low-level. Because of this, the streaming parser code is a little hard to read.
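A minimal sketch of that builder idea, with hypothetical names (ParserConfig is not a real Gecko or Gosub type):

```rust
// Setters are chained instead of assigning many member fields in
// one long initialization method.
#[derive(Debug, Default)]
struct ParserConfig {
    charset: Option<String>,
    scripting_enabled: bool,
}

#[derive(Default)]
struct ParserConfigBuilder {
    config: ParserConfig,
}

impl ParserConfigBuilder {
    fn charset(mut self, charset: &str) -> Self {
        self.config.charset = Some(charset.to_string());
        self
    }

    fn scripting_enabled(mut self, on: bool) -> Self {
        self.config.scripting_enabled = on;
        self
    }

    fn build(self) -> ParserConfig {
        self.config
    }
}

fn main() {
    let config = ParserConfigBuilder::default()
        .charset("UTF-8")
        .scripting_enabled(true)
        .build();
    println!("{config:?}");
}
```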
Possible next steps:
I probably won't be looking at any of these next steps until everything else has been reviewed.
-
In html5ever, the step method has a lot of the tokenization state machine. Some of the states are differentiated using enum variants instead of getting their own top-level state. So with a small set of methods, the tokenizer calls methods on an html5ever TreeBuilder, which is itself parameterized with a further sink parameter that implements the TreeSink trait. For the very low-level manipulation of the input, there is a BufferQueue. How do bytes get to the tokenizer? That happens on the sending side, much higher in the call stack, in the Servo code.
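A rough sketch of driving the html5ever tokenizer with a custom sink, written against the pre-0.26 API (signatures have changed across versions, so treat this as illustrative rather than definitive):

```rust
use html5ever::tendril::StrTendril;
use html5ever::tokenizer::{
    BufferQueue, Token, TokenSink, TokenSinkResult, Tokenizer, TokenizerOpts,
};

// In the real pipeline the sink is a TreeBuilder (itself a TokenSink
// over a TreeSink); here we just print the tokens.
struct PrintSink;

impl TokenSink for PrintSink {
    type Handle = ();

    fn process_token(&mut self, token: Token, _line: u64) -> TokenSinkResult<()> {
        match token {
            Token::TagToken(tag) => println!("tag: {}", tag.name),
            Token::CharacterTokens(text) => println!("text: {text}"),
            other => println!("{other:?}"),
        }
        TokenSinkResult::Continue
    }
}

fn main() {
    // The tokenizer pulls its input from a BufferQueue of string buffers.
    let mut input = BufferQueue::new();
    input.push_back(StrTendril::from("<p class=\"greeting\">hi</p>"));

    let mut tokenizer = Tokenizer::new(PrintSink, TokenizerOpts::default());
    let _ = tokenizer.feed(&mut input);
    tokenizer.end();
}
```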
-
In the case of WebKit, how are network bytes fed into the tokenizer? An example of appending bytes to a document can be seen here.
-
Chromium's rendering is delegated to a subproject called Blink, which handles the rendering pipeline, including the decoding of network bytes, the tokenization, and the parsing of the token stream. So we are interested in what Blink does for input byte stream decoding and tokenization. Blink is a fork of WebKit's WebCore library, and the process is very similar to what WebCore does, with different names for some of the classes involved.
-
The encoding_rs crate is the encoding library used by both Gecko and Servo. By way of Gecko, this crate has been used in Firefox since Firefox 56 (September 2017). Firefox enables an optional feature of the crate. I've noticed that the different browser implementations don't use UTF-8 everywhere, and I had assumed that this was because the existing implementations have been around for a while, but this might not be quite right.
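For reference, a sketch of the two main encoding_rs entry points (real API, simplified usage): the one-shot decode and the incremental decoder a parser front end would use.

```rust
fn main() {
    // One-shot: returns a Cow<str>, the encoding actually used,
    // and whether any malformed sequences were replaced.
    let bytes = b"caf\xE9"; // "café" in windows-1252
    let (text, _encoding, had_errors) = encoding_rs::WINDOWS_1252.decode(bytes);
    assert_eq!(&*text, "café");
    assert!(!had_errors);

    // Incremental: the streaming form, fed chunk by chunk.
    let mut decoder = encoding_rs::WINDOWS_1252.new_decoder();
    let mut out = String::with_capacity(
        decoder.max_utf8_buffer_length(bytes.len()).unwrap(),
    );
    let (_result, _bytes_read, _had_errors) =
        decoder.decode_to_string(bytes, &mut out, true);
    assert_eq!(out, "café");
}
```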
So it seems the choice of UTF-8 for the internal string representation in Gosub is not an automatic one, or perhaps even a wise one.
-
The article encoding_rs: a Web-Compatible Character Encoding Library in Rust goes into some detail around the creation of the library. Henri Sivonen, the author of the Rust library and an important contributor to Firefox, addresses a question I had whose tradeoffs were hard to grasp.
Henri Sivonen also takes the possibilities in the direction of just dealing with the UTF-8 to UTF-16 conversion dynamically.
There is a good discussion on Hacker News of the tradeoffs that Servo considered when they decided to go with a Servo-internal UTF-16 string representation.
But Josh Triplett, one of the core Rust developers, thought it might be feasible to implement a JavaScript engine using UTF-8 that maintained backwards compatibility in the interface.
By this reasoning, we might be able to get away with a UTF-8 string representation and convert to UTF-16 only as needed.
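One hypothetical shape for "UTF-8 internally, UTF-16 on demand" (LazyUtf16 is an illustrative type, not an existing API): the UTF-16 form is materialized lazily and cached, so code paths that never cross a UTF-16 boundary pay nothing for it.

```rust
struct LazyUtf16 {
    utf8: String,
    utf16: Option<Vec<u16>>, // filled on first request
}

impl LazyUtf16 {
    fn new(s: impl Into<String>) -> Self {
        Self { utf8: s.into(), utf16: None }
    }

    fn as_str(&self) -> &str {
        &self.utf8
    }

    fn as_utf16(&mut self) -> &[u16] {
        if self.utf16.is_none() {
            // std's encode_utf16 does the actual conversion.
            self.utf16 = Some(self.utf8.encode_utf16().collect());
        }
        self.utf16.as_deref().unwrap()
    }
}

fn main() {
    let mut s = LazyUtf16::new("héllo");
    println!("utf8 bytes: {}", s.as_str().len());
    println!("utf16 units: {}", s.as_utf16().len());
}
```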
-
The Encoding Standard defines the UTF-8 encoding as well as legacy encodings that user agents will need to handle. I think this document is too low-level for anything Gosub will be doing anytime soon.
-
A question has come up about how to construct an efficient pipeline that starts with a source like a socket, a file, or a string, turns it into a series of `u8` or `u16` bytes, from there into `char` Unicode scalar values, UTF-8 characters, or maybe `&str` slices, and then passes it over to the HTML5 (or CSS3) tokenization stage, where a stream of tokens and tokenization errors will be emitted to downstream parsing code.

What should this pipeline look like? How much buffering will be needed, and at what points? Can we use `Iterator<Item = T>` here? How do string slices fit into this picture? What are the opportunities for a zero-copy approach, and how would attempting that change things? Are there any standard Rust traits or commonly used crates that will help out here? How many stages should there be, and how do they feed into one another? What does the `nom` crate do in its streaming mode? This thread is intended to look into these and related questions.
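To give the discussion something concrete to poke at, here is one possible shape for the front of the pipeline (a sketch, not a Gosub API; `decode_stage` is a hypothetical name, and continuation/error handling is simplified): wrap any Read in a BufReader, decode its bytes to UTF-8 String chunks with encoding_rs, and expose the chunks as an Iterator that a tokenizer stage could consume.

```rust
use std::io::{BufRead, BufReader, Read};

use encoding_rs::Decoder;

fn decode_stage<R: Read>(reader: R, mut decoder: Decoder) -> impl Iterator<Item = String> {
    let mut reader = BufReader::new(reader);
    let mut done = false;
    std::iter::from_fn(move || {
        if done {
            return None;
        }
        let bytes = reader.fill_buf().ok()?;
        let last = bytes.is_empty(); // EOF: flush any pending decoder state
        let mut out = String::with_capacity(
            decoder.max_utf8_buffer_length(bytes.len()).unwrap_or(8),
        );
        let (_result, bytes_read, _had_errors) =
            decoder.decode_to_string(bytes, &mut out, last);
        reader.consume(bytes_read);
        if last {
            done = true;
        }
        Some(out)
    })
}

fn main() {
    let html: &[u8] = b"<html>caf\xC3\xA9</html>"; // UTF-8 input
    for chunk in decode_stage(html, encoding_rs::UTF_8.new_decoder()) {
        print!("{chunk}"); // a real pipeline would feed the tokenizer here
    }
    println!();
}
```

One design note: keeping the decode stage behind a plain `Iterator<Item = String>` boundary is the simplest composition, but it forces an owned String per chunk; a zero-copy variant would have to hand out borrows tied to the buffer's lifetime, which is where the design gets harder.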