-
The std::io::BufRead trait exposes two low-level methods, fill_buf and consume. The std::io::BufReader struct can be used to wrap any type that implements Read, adding buffering on top of it.
It sounds like the std::io::Cursor struct allows seeking to an arbitrary location in an in-memory buffer. I can see this potentially being useful at some point, but the relevance isn't immediately obvious.
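To make these pieces concrete, here is a small sketch of how they fit together (hypothetical usage, not code from any of the projects discussed here):

```rust
use std::io::{BufRead, BufReader, Cursor, Read, Seek, SeekFrom};

fn main() -> std::io::Result<()> {
    // BufReader implements BufRead on top of any Read implementation.
    let mut reader = BufReader::new(Cursor::new(b"<html><body>".to_vec()));

    // The two required BufRead methods: fill_buf exposes the internal
    // buffer without copying; consume reports how much of it we used.
    let buf = reader.fill_buf()?;
    println!("peeked {} bytes", buf.len());
    reader.consume(6); // pretend we tokenized "<html>"

    // Cursor adds Seek on top of an in-memory buffer, so a later stage
    // could jump back to an arbitrary position.
    let mut cursor = reader.into_inner();
    cursor.seek(SeekFrom::Start(0))?;
    let mut text = String::new();
    cursor.read_to_string(&mut text)?;
    println!("{text}");
    Ok(())
}
```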
-
Is this relevant to a tokenization pipeline? Maybe.
-
The Servo tokenization code largely delegates to html5ever for the lower-level operations. By the time we get to the Servo code, there is little left to be done. Possible next steps:
I would like to explore the other approaches before looking more closely at either of these possible next steps.
-
In the Gecko HTML5 parser, UTF-8 and UTF-16 are both used internally. It seems that UTF-8 is used by the streaming parser, and UTF-16 is used for inner-loop code, although this is only an impression. One buffer class that is passed around appears to be the nsHtml5UTF16Buffer class. This call site makes me wonder whether UTF-16 is the internal charset used by the Gecko HTML5 parsing code once you get past the streaming parser. By contrast, this method uses a UTF-8 buffer. This line, where a UTF-8 buffer passed as an argument is being read into a UTF-16 buffer, provides additional reason to think that the internal buffer used for HTML parsing is indeed working with UTF-16.

This is one of the places where the encoding is selected. It seems that UTF-8 is an option, at least at the start. There are various possible sources of the charset to be used, including the "channel," and heuristics are used to determine which charset to go with.

There is no token builder class, as far as I can tell. Instead, the way the tokenizer works is to tokenize a buffer, ultimately from an array of pending buffers, and then to call methods directly on the nsHtml5TreeBuilder.

One thing I noticed about the Gecko code base is that on classes with a lot of member fields, such as the HTML parser, the fields are set one by one in a long method that initializes everything. One thought here is that Gecko might not be using enough helper classes, so there are a large number of member fields and classes have a lot of responsibilities. Another thought is that it might be good to use the builder pattern in this situation, with a builder whose setters you chain before constructing the object (see the sketch below).

In the streaming parser, a lot of buffer handling can be seen in calling code, which makes me think the class abstraction being used is too low-level. Because of this, the streaming parser code is a little hard to read.
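A minimal sketch of that builder idea, with hypothetical names (ParserConfig is not a real Gecko or Gosub type):

```rust
// Setters are chained instead of assigning many member fields in
// one long initialization method.
#[derive(Debug, Default)]
struct ParserConfig {
    charset: Option<String>,
    scripting_enabled: bool,
}

#[derive(Default)]
struct ParserConfigBuilder {
    config: ParserConfig,
}

impl ParserConfigBuilder {
    fn charset(mut self, charset: &str) -> Self {
        self.config.charset = Some(charset.to_string());
        self
    }

    fn scripting_enabled(mut self, on: bool) -> Self {
        self.config.scripting_enabled = on;
        self
    }

    fn build(self) -> ParserConfig {
        self.config
    }
}

fn main() {
    let config = ParserConfigBuilder::default()
        .charset("UTF-8")
        .scripting_enabled(true)
        .build();
    println!("{config:?}");
}
```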
Possible next steps:
I probably won't be looking at any of these next steps until everything else has been reviewed.
-
In html5ever, the step method has a lot of the tokenization state machine. Some of the states are differentiated using enum variants instead of getting their own top-level state. So with a small set of methods, the tokenizer calls methods on an html5ever TreeBuilder, which is itself parameterized with a further sink parameter that implements the TreeSink trait. For the very low-level manipulation of the input, there is a BufferQueue. How do bytes get to the tokenizer? That happens on the sending side, much higher in the call stack, in the Servo code.
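A rough sketch of driving the html5ever tokenizer with a custom sink, written against the pre-0.26 API (signatures have changed across versions, so treat this as illustrative rather than definitive):

```rust
use html5ever::tendril::StrTendril;
use html5ever::tokenizer::{
    BufferQueue, Token, TokenSink, TokenSinkResult, Tokenizer, TokenizerOpts,
};

// In the real pipeline the sink is a TreeBuilder (itself a TokenSink
// over a TreeSink); here we just print the tokens.
struct PrintSink;

impl TokenSink for PrintSink {
    type Handle = ();

    fn process_token(&mut self, token: Token, _line: u64) -> TokenSinkResult<()> {
        match token {
            Token::TagToken(tag) => println!("tag: {}", tag.name),
            Token::CharacterTokens(text) => println!("text: {text}"),
            other => println!("{other:?}"),
        }
        TokenSinkResult::Continue
    }
}

fn main() {
    // The tokenizer pulls its input from a BufferQueue of string buffers.
    let mut input = BufferQueue::new();
    input.push_back(StrTendril::from("<p class=\"greeting\">hi</p>"));

    let mut tokenizer = Tokenizer::new(PrintSink, TokenizerOpts::default());
    let _ = tokenizer.feed(&mut input);
    tokenizer.end();
}
```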
-
In the case of WebKit, how are network bytes fed into the tokenizer? An example of appending bytes to a document can be seen here.
-
Chromium's rendering is delegated to a subproject called Blink, which handles the rendering pipeline, including the decoding of network bytes, the tokenization, and the parsing of the token stream. So we are interested in what Blink does for input byte stream decoding and tokenization. Blink is a fork of WebKit's WebCore library, and the process is very similar to what WebCore does, with different names for some of the classes involved.
-
The encoding_rs crate is the encoding library used by both Gecko and Servo. By way of Gecko, this crate has been used in Firefox since Firefox 56 (September 2017). Firefox enables an optional feature of the crate. I've noticed that the different browser implementations don't use UTF-8 everywhere, and I had assumed that this was because the existing implementations have been around for a while, but this might not be quite right.
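For reference, a sketch of the two main encoding_rs entry points (real API, simplified usage): the one-shot decode and the incremental decoder a parser front end would use.

```rust
fn main() {
    // One-shot: returns a Cow<str>, the encoding actually used,
    // and whether any malformed sequences were replaced.
    let bytes = b"caf\xE9"; // "café" in windows-1252
    let (text, _encoding, had_errors) = encoding_rs::WINDOWS_1252.decode(bytes);
    assert_eq!(&*text, "café");
    assert!(!had_errors);

    // Incremental: the streaming form, fed chunk by chunk.
    let mut decoder = encoding_rs::WINDOWS_1252.new_decoder();
    let mut out = String::with_capacity(
        decoder.max_utf8_buffer_length(bytes.len()).unwrap(),
    );
    let (_result, _bytes_read, _had_errors) =
        decoder.decode_to_string(bytes, &mut out, true);
    assert_eq!(out, "café");
}
```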
So it seems the choice of UTF-8 for the internal string representation in Gosub is not an automatic one, or perhaps even a wise one.
-
The article encoding_rs: a Web-Compatible Character Encoding Library in Rust goes into some detail around the creation of the library. Henri Sivonen, the author of the Rust library and an important contributor to Firefox, addresses a question I had whose tradeoffs were hard to grasp.
Henri Sivonen also takes the possibilities in the direction of just dealing with the UTF-8 to UTF-16 conversion dynamically.
There is a good discussion on Hacker News of the tradeoffs that Servo considered when they decided to go with a Servo-internal UTF-16 string representation.
But Josh Triplett, one of the core Rust developers, thought it might be feasible to implement a JavaScript engine using UTF-8 that maintained backwards compatibility in the interface.
By this reasoning, we might be able to get away with a UTF-8 string representation and convert to UTF-16 only as needed.
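One hypothetical shape for "UTF-8 internally, UTF-16 on demand" (LazyUtf16 is an illustrative type, not an existing API): the UTF-16 form is materialized lazily and cached, so code paths that never cross a UTF-16 boundary pay nothing for it.

```rust
struct LazyUtf16 {
    utf8: String,
    utf16: Option<Vec<u16>>, // filled on first request
}

impl LazyUtf16 {
    fn new(s: impl Into<String>) -> Self {
        Self { utf8: s.into(), utf16: None }
    }

    fn as_str(&self) -> &str {
        &self.utf8
    }

    fn as_utf16(&mut self) -> &[u16] {
        if self.utf16.is_none() {
            // std's encode_utf16 does the actual conversion.
            self.utf16 = Some(self.utf8.encode_utf16().collect());
        }
        self.utf16.as_deref().unwrap()
    }
}

fn main() {
    let mut s = LazyUtf16::new("héllo");
    println!("utf8 bytes: {}", s.as_str().len());
    println!("utf16 units: {}", s.as_utf16().len());
}
```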
-
The Encoding Standard defines the UTF-8 encoding as well as legacy encodings that user agents will need to handle. I think this document is too low-level for anything Gosub will be doing anytime soon.
-
A question has come up about how to construct an efficient pipeline that starts with a source like a socket, a file, or a string, turns it into a series of `u8` or `u16` bytes, from there into `char` Unicode scalar values, UTF-8 characters, or maybe `&str` slices, and then passes it over to the HTML5 (or CSS3) tokenization stage, where a stream of tokens and tokenization errors will be emitted to downstream parsing code.

What should this pipeline look like? How much buffering will be needed, and at what points? Can we use `Iterator<Item = T>` here? How do string slices fit into this picture? What are the opportunities for a zero-copy approach, and how would attempting that change things? Are there any standard Rust traits or commonly used crates that will help out here? How many stages should there be, and how do they feed into one another? What does the `nom` crate do in its streaming mode? This thread is intended to look into these and related questions.
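To give the discussion something concrete to poke at, here is one possible shape for the front of the pipeline (a sketch, not a Gosub API; `decode_stage` is a hypothetical name, and continuation/error handling is simplified): wrap any Read in a BufReader, decode its bytes to UTF-8 String chunks with encoding_rs, and expose the chunks as an Iterator that a tokenizer stage could consume.

```rust
use std::io::{BufRead, BufReader, Read};

use encoding_rs::Decoder;

fn decode_stage<R: Read>(reader: R, mut decoder: Decoder) -> impl Iterator<Item = String> {
    let mut reader = BufReader::new(reader);
    let mut done = false;
    std::iter::from_fn(move || {
        if done {
            return None;
        }
        let bytes = reader.fill_buf().ok()?;
        let last = bytes.is_empty(); // EOF: flush any pending decoder state
        let mut out = String::with_capacity(
            decoder.max_utf8_buffer_length(bytes.len()).unwrap_or(8),
        );
        let (_result, bytes_read, _had_errors) =
            decoder.decode_to_string(bytes, &mut out, last);
        reader.consume(bytes_read);
        if last {
            done = true;
        }
        Some(out)
    })
}

fn main() {
    let html: &[u8] = b"<html>caf\xC3\xA9</html>"; // UTF-8 input
    for chunk in decode_stage(html, encoding_rs::UTF_8.new_decoder()) {
        print!("{chunk}"); // a real pipeline would feed the tokenizer here
    }
    println!();
}
```

One design note: keeping the decode stage behind a plain `Iterator<Item = String>` boundary is the simplest composition, but it forces an owned String per chunk; a zero-copy variant would have to hand out borrows tied to the buffer's lifetime, which is where the design gets harder.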