UnicodeDecodeError When Using Korean Tokenizer #277
Comments
A virus. Do not click that URL.
🤔 To me it seems like a problem with the tokenizer itself, since it can't handle arbitrary UTF-8, which I would expect it to do. If possible, I think this should be resolved in the tokenizer library itself. However, in the meantime (or if they don't want to fix it), we could create a wrapper which handles this case, which seems to me like the cleanest choice.
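A minimal sketch of what I mean by a wrapper (the `word_tokenize` method name and the sanitization strategy are my assumptions, not the actual tokenizer interface):

```python
# Hypothetical wrapper: catch the UnicodeDecodeError raised by the underlying
# tokenizer and retry once with sanitized text. The word_tokenize method name
# is assumed here, not taken from the real interface.
class SafeTokenizerWrapper:
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    def word_tokenize(self, text: str) -> list[str]:
        try:
            return self.tokenizer.word_tokenize(text)
        except UnicodeDecodeError:
            # Replace anything that cannot be encoded as UTF-8 (e.g. lone
            # surrogates) and retry with the cleaned text.
            cleaned = text.encode("utf-8", errors="replace").decode("utf-8")
            return self.tokenizer.word_tokenize(cleaned)
```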
I completely agree with you. It seems possible to add an option to bypass this issue through such a wrapper. However, rather than at the tokenizing stage of the pipeline, shouldn't the option to bypass errors when processing a doc be handled at the executor level, roughly like the sketch below? What do you think?
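To make the idea concrete (the step and loop names here are purely illustrative, not the project's actual internals):

```python
import logging

logger = logging.getLogger(__name__)

# Purely illustrative: a processing loop that drops documents raising
# UnicodeDecodeError instead of letting the whole worker crash.
def run_step_skipping_errors(pipeline_step, docs):
    for doc in docs:
        try:
            yield pipeline_step(doc)
        except UnicodeDecodeError:
            logger.warning("Skipping document that raised UnicodeDecodeError")
```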
By bypassing, you mean silently ignoring the error and skipping the document?
Yes. |
As seen in #279, it seems that this library might indeed not be stable enough. spaCy has a Korean tokenizer; would you be willing to look into whether it might be an alternative solution? If so, we can just switch the tokenizer we defined for Korean to the spaCy one.
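Untested, but the switch could be as small as something like this (note that spaCy's Korean tokenizer may additionally require the mecab-ko backend to be installed):

```python
import spacy

# Tokenizer-only Korean pipeline; no trained model download is needed.
nlp = spacy.blank("ko")

def ko_word_tokenize(text: str) -> list[str]:
    return [token.text for token in nlp(text)]

print(ko_word_tokenize("안녕하세요, 세계!"))
```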
I encountered a UnicodeDecodeError while using a Korean tokenizer integrated into our data processing pipeline. This issue seems to occur specifically when processing certain types of input data with the tokenizer, as detailed in the error log below:
This error arises within the imap_unordered function of a multiprocessing pool, suggesting an issue in handling encoding during parallel processing of text data. Below is the relevant portion of the traceback:
To temporarily work around this issue, I implemented a pre-check filter (using a lambda filter) that assesses the suitability of the input for the tokenizer; this prevents the process from crashing but does not solve the underlying problem.
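A rough sketch of such a pre-check (the `LambdaFilter` import path and the `filter_function` argument name are assumptions here, not confirmed against the library):

```python
from datatrove.pipeline.filters import LambdaFilter

def is_tokenizer_safe(doc) -> bool:
    # Keep only documents whose text encodes cleanly as UTF-8; lone
    # surrogates, for example, fail here and would crash the tokenizer.
    try:
        doc.text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False

# Assumed signature; placed in the pipeline right before the tokenizer step.
precheck = LambdaFilter(filter_function=is_tokenizer_safe)
```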
This issue seems to originate from the Kiwi library used by the Korean tokenizer. It affects not only my project but potentially other teams as well (it has already been reported by another internal team processing CommonCrawl data).