UnicodeDecodeError When Using Korean Tokenizer #277

Open
justHungryMan opened this issue Aug 28, 2024 · 6 comments

@justHungryMan
Contributor

I encountered a UnicodeDecodeError while using a Korean tokenizer integrated into our data processing pipeline. This issue seems to occur specifically when processing certain types of input data with the tokenizer, as detailed in the error log below:

UnicodeDecodeError: 'utf-16-le' codec can't decode bytes in position 0-1: illegal encoding

This error arises within the imap_unordered function of a multiprocessing pool, suggesting an issue in handling encoding during parallel processing of text data. Below is the relevant portion of the traceback:

│ /home/ubuntu/.local/lib/python3.12/site-packages/datatrove/executor/local.py:133 in run          │
│                                                                                                  │
│   130 │   │   │   completed_lock = mg.Lock()                                                     │
│   131 │   │   │   ctx = multiprocess.get_context(self.start_method)                              │
│   132 │   │   │   with ctx.Pool(self.workers) as pool:                                           │
│ ❱ 133 │   │   │   │   stats = list(                                                              │
│   134 │   │   │   │   │   pool.imap_unordered(                                                   │
│   135 │   │   │   │   │   │   partial(                                                           │
│   136 │   │   │   │   │   │   │   self._launch_run_for_rank,                                     │
│                                                                                                  │
│ ╭─────────────────────────────────────────── locals ───────────────────────────────────────────╮ │
│ │ completed_counter = <ValueProxy object, typeid 'Value' at 0x7a7e8b103f50>                    │ │
│ │    completed_lock = <AcquirerProxy object, typeid 'Lock' at 0x7a7e8db923c0>                  │ │
│ │               ctx = <multiprocess.context.ForkServerContext object at 0x7a7e9f75d7f0>        │ │
│ │                 i = 31                                                                       │ │
│ │                mg = <multiprocess.managers.SyncManager object at 0x7a7e8db223c0>             │ │
│ │              pool = <multiprocess.pool.Pool state=TERMINATE pool_size=32>                    │ │
│ │           ranks_q = <AutoProxy[Queue] object, typeid 'Queue' at 0x7a7e905fc380>              │ │
│ │      ranks_to_run = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, ... +1014]                                │ │
│ │              self = <datatrove.executor.local.LocalPipelineExecutor object at                │ │
│ │                     0x7a7e8dbacf80>                                                          │ │
│ │           skipped = 0                                                                        │ │
│ ╰──────────────────────────────────────────────────────────────────────────────────────────────╯ │
│                                                                                                  │
│ /home/ubuntu/.local/lib/python3.12/site-packages/multiprocess/pool.py:873 in next                │
│                                                                                                  │
│   870 │   │   success, value = item                                                              │
│   871 │   │   if success:                                                                        │
│   872 │   │   │   return value                                                                   │
│ ❱ 873 │   │   raise value                                                                        │
│   874 │                                                                                          │
│   875 │   __next__ = next                    # XXX                                               │
│   876                                                                                            │
│                                                                                                  │
│ ╭───────────────────────────────────────── locals ──────────────────────────────────────────╮    │
│ │    item = (False, UnicodeDecodeError('utf-16-le', b'\x00\xdc', 0, 2, 'illegal encoding')) │    │
│ │    self = <multiprocess.pool.IMapUnorderedIterator object at 0x7a7e8b175fa0>              │    │
│ │ success = False                                                                           │    │
│ │ timeout = None                                                                            │    │
│ │   value = UnicodeDecodeError('utf-16-le', b'\x00\xdc', 0, 2, 'illegal encoding')          │    │
│ ╰───────────────────────────────────────────────────────────────────────────────────────────╯    │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
UnicodeDecodeError: 'utf-16-le' codec can't decode bytes in position 0-1: illegal encoding

To temporarily work around this issue, I implemented a pre-check filter (using a lambda filter) that tests whether the input is safe for the tokenizer. This prevents the process from crashing but does not solve the underlying problem:

from datatrove.utils.typeshelper import Languages
from datatrove.utils.word_tokenizers import load_word_tokenizer


def check_korean_tokenizer_pass(doc):
    # Load the Korean word tokenizer (backed by the Kiwi library).
    tokenizer = load_word_tokenizer(Languages.korean)
    try:
        tokenizer.word_tokenize(doc.text)
        return True
    except UnicodeDecodeError:
        # Kiwi raises this on some inputs (see the traceback above).
        return False
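
For reference, this check can be wired into the pipeline roughly as follows (a sketch assuming datatrove's LambdaFilter and its filter_function parameter; names may differ across versions):

from datatrove.pipeline.filters import LambdaFilter

pipeline = [
    # ... reader steps ...
    # Drop documents the Korean tokenizer cannot handle before they
    # reach the tokenizing step (sketch; parameter name assumed).
    LambdaFilter(filter_function=check_korean_tokenizer_pass),
    # ... tokenizing / downstream steps ...
]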

This issue seems to originate from the Kiwi library used by the Korean tokenizer. It affects not only my project but potentially other teams as well (this issue has already been reported by another internal team processing CommonCrawl).

@hynky1999
Contributor

🤔 To me it seems like a problem with the tokenizer itself: it can't handle arbitrary UTF-8, which I would expect it to do. If possible, I think this should be resolved in the tokenizer library itself.

However, in the meantime (or if they don't want to fix it), we could create a wrapper that handles this case; that seems like the cleanest choice to me.
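
For what it's worth, the failing bytes (b'\x00\xdc' read as UTF-16-LE) are U+DC00, an unpaired low surrogate, which is invalid on its own, so sanitizing the text before calling Kiwi should avoid the crash. A minimal sketch of such a wrapper (the class and method names here are illustrative, not the actual datatrove API):

class SafeWordTokenizer:
    # Illustrative wrapper: strip characters the underlying tokenizer
    # cannot handle, and degrade gracefully if it still fails.

    def __init__(self, inner):
        self.inner = inner  # e.g. the Kiwi-backed Korean tokenizer

    @staticmethod
    def _strip_surrogates(text: str) -> str:
        # Lone surrogates such as U+DC00 cannot be encoded as UTF-8,
        # so a round-trip with errors="ignore" silently drops them.
        return text.encode("utf-8", errors="ignore").decode("utf-8")

    def word_tokenize(self, text: str) -> list[str]:
        clean = self._strip_surrogates(text)
        try:
            return self.inner.word_tokenize(clean)
        except UnicodeDecodeError:
            # Last resort: a plain whitespace split so the pipeline keeps going.
            return clean.split()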

@justHungryMan
Contributor Author

🤔 To me it seems like a problem with the tokenizer itself: it can't handle arbitrary UTF-8, which I would expect it to do. If possible, I think this should be resolved in the tokenizer library itself.

However, in the meantime (or if they don't want to fix it), we could create a wrapper that handles this case; that seems like the cleanest choice to me.

I completely agree with your opinion. It seems possible to bypass this issue through such a wrapper. However, shouldn't the ability to skip errors when processing a doc be handled at the executor level, rather than at the tokenizing stage in the pipeline? What do you think? Something like the sketch below is what I have in mind.
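
Purely illustrative (none of these names exist in datatrove today; the only assumption is the pipeline-step interface of run(data, rank, world_size)):

from loguru import logger

def run_step_skipping_errors(step, docs, rank=0, world_size=1):
    # Hypothetical executor-level helper: feed documents to a pipeline
    # step one at a time so a failing document is logged and dropped
    # instead of crashing the whole worker. Note this changes the
    # step's batching behavior, so it is only a sketch.
    for doc in docs:
        try:
            yield from step.run([doc], rank, world_size)
        except Exception as err:
            logger.warning(f"Skipping document {doc.id}: {err!r}")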

@hynky1999
Contributor

By bypassing, do you mean silently ignoring the error and skipping the document?

@justHungryMan
Contributor Author

By bypassing, do you mean silently ignoring the error and skipping the document?

Yes.

@guipenedo
Collaborator

As seen in #279, it seems this library might indeed not be stable enough. spaCy has a Korean tokenizer; would you be willing to look into whether it could be an alternative solution? If so, we can just switch the tokenizer we defined for Korean to the spaCy one. A quick check along the lines of the snippet below would be enough.
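
A minimal way to try it (sketch; depending on the spaCy version, the default Korean tokenizer may require an external MeCab dependency, and a rule-based tokenizer can be configured instead):

import spacy

# Blank Korean pipeline: tokenization only, no trained components.
nlp = spacy.blank("ko")

doc = nlp("한국어 텍스트를 처리합니다.")
print([token.text for token in doc])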
