Abort is not a great error handling strategy #50

Open
stellaraccident opened this issue Nov 26, 2024 · 0 comments
@stellaraccident
Calling these two functions with invalid data causes a stack trace and abort in the Rust code (when it tries to unwrap a failed Result):

::tokenizers::Tokenizer::FromBlobJSON(json_blob)
::tokenizers::Tokenizer::FromBlobByteLevelBPE(vocab_blob, merges_blob, added_tokens)
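For concreteness, a minimal repro sketch (assuming this repo's tokenizers_cpp.h header; the exact panic/abort output depends on the Rust build):

#include <tokenizers_cpp.h>
#include <string>

int main() {
  // Malformed JSON: rather than returning an error, the Rust layer
  // panics on an unwrap() and the whole process aborts.
  std::string bad_json = "{\"not\": valid json";
  auto tok = tokenizers::Tokenizer::FromBlobJSON(bad_json);
  return 0;
}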

The SentencePiece initializer doesn't fail, but it is probably just silently failing to parse the protobuf:

::tokenizers::Tokenizer::FromBlobSentencePiece(model_blob)

Such functions aren't a great design for a library: it would be better to at least have a simple way to marshal the creation failure back to the user. For code that is trying to be lightweight and avoid exceptions, I would probably use a signature like:

static std::unique_ptr<Tokenizer> FromBlobJSON(const std::string& json_blob, std::string& error_message);

And define it such that returning nullptr indicates an error, with the library making some attempt to populate error_message in that case. Since you likely already have callers of these functions that aren't expecting a nullptr return, maybe rename the new variants with a WithError suffix or something.
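As a sketch of what the caller side could look like (the FromBlobJSONWithError name is hypothetical, following the suffix idea above):

#include <tokenizers_cpp.h>
#include <iostream>
#include <memory>
#include <string>

std::unique_ptr<tokenizers::Tokenizer> LoadTokenizer(const std::string& json_blob) {
  std::string error_message;
  // Hypothetical fallible variant: nullptr signals failure and error_message
  // carries whatever detail the Rust side can marshal across the FFI boundary.
  auto tok = tokenizers::Tokenizer::FromBlobJSONWithError(json_blob, error_message);
  if (!tok) {
    std::cerr << "Tokenizer load failed: " << error_message << "\n";
    return nullptr;
  }
  return tok;
}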

@stellaraccident stellaraccident changed the title Abort is not a fantastic error handling strategy Abort is not a great error handling strategy Nov 26, 2024
stellaraccident added a commit to nod-ai/shark-ai that referenced this issue Nov 26, 2024
* This is gated by SHORTFIN_ENABLE_TOKENIZERS (presently off).
* I'd like to either take over the wrapper or get mlc-ai/tokenizers-cpp#50 resolved before putting much weight on this.
* There is no great C++ option for this component, so we go to the trouble of integrating a Rust component. We will need to do a bit of prep on our CI systems to enable this by default.
* Python API will be added in a subsequent commit. This should be more efficient than the tokenizers Python API since we will allow direct access to the tokens vs doing a lot of conversions.
* Obligatory language flame bait: Use Rust, they said. It's super efficient. Prior to this patch, libshortfin was 1.8MB, which gave us an entire GPU and CPU runtime stack. After this patch (stripped) it is 8.4MB. Given how important the use case is, I'm willing to tolerate this for the moment. It seems like there is room for something better here, which is why I did not expose the underlying vendored API directly.
stellaraccident added a commit to nod-ai/shark-ai that referenced this issue Nov 27, 2024
* This is gated by SHORTFIN_ENABLE_TOKENIZERS (presently off).
* I'd like to either take over the wrapper or get mlc-ai/tokenizers-cpp#50 resolved before putting much weight on this.
* There is no great C++ option for this component, so we go to the trouble of integrating a Rust component. We will need to do a bit of prep on our CI systems to enable this by default.
* Python API will be added in a subsequent commit. This should be more efficient than the tokenizers Python API since we will allow direct access to the tokens vs doing a lot of conversions.
* Size analysis: Prior to this patch, libshortfin was 1.8MB, which gave us an entire GPU and CPU runtime stack. After this patch (stripped) it is 8.4MB. Given how important the use case is, I'm willing to tolerate this for the moment. It seems like there is room for something better here, which is why I did not expose the underlying vendored API directly (edit: by switching to a nightly Rust and activating a bunch of options from https://github.com/johnthagen/min-sized-rust, I was able to produce a binary that is 4.2MB, which is more reasonable).
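For reference, these are typical of the size-oriented release-profile options the linked min-sized-rust repo documents (a sketch; the commit message doesn't record the exact set used, and some tricks such as rebuilding std also require a nightly toolchain):

# Cargo.toml release profile, per min-sized-rust (illustrative, not the
# exact configuration used in the commit above).
[profile.release]
opt-level = "z"     # optimize for size rather than speed
lto = true          # whole-program link-time optimization
codegen-units = 1   # better optimization at the cost of build parallelism
panic = "abort"     # drop unwinding machinery
strip = true        # strip symbols from the binary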