
Hangs Due to std::length_error in Data Processing Pipeline #279

Closed
justHungryMan opened this issue Sep 1, 2024 · 10 comments

Comments

@justHungryMan
Contributor

2024-09-01 15:00:25.122 | INFO     | datatrove.executor.local:run:120 - Skipping 4095 already completed tasks
2024-09-01 15:00:25.772 | INFO     | datatrove.utils.logging:add_task_logger:58 - Launching pipeline for rank=3925
2024-09-01 15:00:25.772 | INFO     | datatrove.utils.logging:log_pipeline:90 -
--- 🛠️  PIPELINE 🛠
📖 - READER: 🐿 Jsonl
🔻 - FILTER: 👤 Lambda
🔻 - FILTER: 👯 Gopher Repetition
🔻 - FILTER: 🥇 Gopher Quality
🔻 - FILTER: 👤 Lambda
🔻 - FILTER: ⛰ C4 Quality
🔻 - FILTER: 🍷 FineWeb Quality
💽 - WRITER: 🐿 Jsonl
2024-09-01 15:00:27.077 | INFO     | datatrove.pipeline.readers.base:read_files_shard:191 - Reading input file 03925.jsonl.gz, 1/2
terminate called after throwing an instance of 'std::length_error'
  what():  basic_string::_S_create

The only solution I've found is to abandon the task file and forcibly stop the executor. The error message and location are not specific enough to identify the source of the problem. Any insights or suggestions on how to handle this error more gracefully would be appreciated.

@SinclairCoder

Any better infra that can handle the whole CC processing (i.e., reading, extracting, and writing) than Datatrove lol?

@guipenedo
Collaborator

Can you share the full script? Curious particularly about the lambda blocks

@justHungryMan
Contributor Author

justHungryMan commented Sep 2, 2024

Hi @guipenedo,

I can only share the lambda functions.

The first lambda is a pre-check for the tokenizer issue reported in #277:

# load_word_tokenizer and Languages come from datatrove's tokenizer utilities
from datatrove.utils.typeshelper import Languages
from datatrove.utils.word_tokenizers import load_word_tokenizer


def check_korean_tokenizer_pass(doc):
    # pre-check: keep only documents that the Korean tokenizer can process
    tokenizer = load_word_tokenizer(Languages.korean)
    try:
        tokenizer.word_tokenize(doc.text)
        return True
    except Exception:
        return False

The second lambda substitutes for max_non_alpha_words_ratio in GopherQualityFilter. Korean text often mixes Hangul, Chinese characters (Korean Hanja), and English, so we instead filter out documents whose ratio of Korean words falls below min_non_korean_words_ratio:

import re

# module-level setup assumed here (not shown in the original snippet):
tokenizer = load_word_tokenizer(Languages.korean)
korean_pattern = re.compile(r"[가-힣]")  # matches Hangul syllables; assumed pattern
min_non_korean_words_ratio = 0.5         # example threshold, not the actual value


def filter_non_korean_words_ratio(doc):
    words = tokenizer.word_tokenize(doc.text)
    n_words = len(words)
    if n_words == 0:
        return False  # guard against empty documents (division by zero)

    # words containing at least one Hangul character count as Korean
    n_korean_words = sum(bool(korean_pattern.search(word)) for word in words)

    if min_non_korean_words_ratio and (n_korean_words / n_words) < min_non_korean_words_ratio:
        return False

    return True

@guipenedo
Collaborator

This seems to be an issue with the Korean tokenizer: if you look at the project https://github.com/bab2min/kiwipiepy, a good chunk of it is C++, which would make sense given the C++ error you are getting. I imagine one of your documents is triggering this issue in the tokenizer; can you try just tokenizing all the documents in that specific file directly? I'm not sure whether this error is try...catchable, but if it is, can you then print the document that triggers it?
I tried taking a look at the project's issues, but they're mostly in Korean, so maybe you'll have better luck there.
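A rough standalone sketch of that per-document check, in case it helps (the shard filename comes from your log; the datatrove import paths and reading the file directly with gzip are assumptions on my part):

import gzip
import json

from datatrove.utils.typeshelper import Languages
from datatrove.utils.word_tokenizers import load_word_tokenizer

SHARD_PATH = "03925.jsonl.gz"  # the shard that crashed, per the log above

tokenizer = load_word_tokenizer(Languages.korean)

with gzip.open(SHARD_PATH, "rt", encoding="utf-8") as f:
    for i, line in enumerate(f):
        # print before tokenizing so the last index survives a hard C++ abort
        print(f"tokenizing document {i}", flush=True)
        record = json.loads(line)
        try:
            tokenizer.word_tokenize(record["text"])
        except Exception as e:
            # only reached if the failure surfaces as a catchable Python exception
            print(f"document {i} raised: {e!r}")
            print(record["text"][:500])
            break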

@justHungryMan
Contributor Author

It does seem that the issue stems from the kiwipiepy tokenizer, as you mentioned. However, the main problem is that the message gives no error location. Interestingly, despite having an initial filtering step via a lambda function to preemptively catch such errors, it's unclear whether this error is coming from that lambda or from another filtering function.

I’m working with Common Crawl data, not a private dataset, so I suspect others might encounter similar issues. Given that kiwipiepy is the default tokenizer module in datatrove for Korean, perhaps we should consider implementing a timeout mechanism for filtering, similar to what’s used with trafilatura. Additionally, incorporating a bypass feature for problematic documents could help avoid getting stuck. (#277 (comment)) What do you think?
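For illustration, a minimal sketch of what a timeout-guarded pre-check could look like, assuming the tokenization is pushed into a worker process so that a hang or hard C++ abort cannot take down the whole task (the function names and the 30 s timeout are hypothetical, not existing datatrove options):

import multiprocessing as mp

from datatrove.utils.typeshelper import Languages
from datatrove.utils.word_tokenizers import load_word_tokenizer


def _tokenize_in_worker(text):
    # runs in a separate process; a hang or hard C++ abort only affects this worker
    tokenizer = load_word_tokenizer(Languages.korean)
    tokenizer.word_tokenize(text)


def check_korean_tokenizer_pass_with_timeout(doc, timeout_s=30):
    # hypothetical timeout-guarded variant of the pre-check lambda; spawning a
    # process per document is slow, so this is only a sketch of the idea
    proc = mp.Process(target=_tokenize_in_worker, args=(doc.text,))
    proc.start()
    proc.join(timeout_s)
    if proc.is_alive():
        proc.terminate()  # tokenizer hung: kill the worker and drop the document
        proc.join()
        return False
    # non-zero exit code means the worker raised or aborted (e.g. the C++ terminate)
    return proc.exitcode == 0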

@guipenedo
Collaborator

guipenedo commented Sep 3, 2024

The reason there isn't any useful error message is likely that some external C++ code is being called from that library; you would normally get an error message for other kinds of errors.
If kiwipiepy is not stable enough, we can consider using the spaCy Korean tokenizer instead. I don't speak Korean, but maybe you have some insight into which one might be better?

> Additionally, incorporating a bypass feature for problematic documents could help avoid getting stuck. (#277 (comment)) What do you think?

It's not entirely clear to me how this would work: some blocks rely on order implicitly (dedup, mostly), and you'd basically need to try/catch every single run() method for the others. If you have a possible idea for a working implementation that would generalize well, I'd love to hear it.

@justHungryMan
Contributor Author

> The reason there isn't any useful error message is likely that some external C++ code is being called from that library; you would normally get an error message for other kinds of errors. If kiwipiepy is not stable enough, we can consider using the spaCy Korean tokenizer instead. I don't speak Korean, but maybe you have some insight into which one might be better?
>
> > Additionally, incorporating a bypass feature for problematic documents could help avoid getting stuck. (#277 (comment)) What do you think?
>
> It's not entirely clear to me how this would work: some blocks rely on order implicitly (dedup, mostly), and you'd basically need to try/catch every single run() method for the others. If you have a possible idea for a working implementation that would generalize well, I'd love to hear it.

When you mention that some blocks rely on order implicitly, I understand that to mean there's a dependency on document order within blocks, as with MinHash deduplication. My suggestion is to drop documents at the document level within a block when an issue arises, along the lines of the sketch below, but I'm not entirely sure which part of the datatrove code would need to be modified for this (or whether the current architecture even supports such a change).
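To make the idea concrete, here is a hedged sketch of what I mean at the lambda level (the wrapper name is made up, and this only helps for failures that surface as catchable Python exceptions, not for the hard C++ abort above):

def with_document_bypass(filter_fn):
    # hypothetical helper: wrap a per-document predicate so that any exception
    # drops just that document instead of aborting the whole task
    def safe(doc):
        try:
            return filter_fn(doc)
        except Exception:
            return False
    return safe

# usage idea with the lambdas above, e.g. inside a Lambda filter block:
#   LambdaFilter(filter_function=with_document_bypass(check_korean_tokenizer_pass))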

I will reach out to the kiwipiepy repository to discuss this issue further. From what I understand, kiwipiepy generally performs better on Korean text than spaCy's tokenizer.

@guipenedo
Collaborator

Thank you for clarifying; I'll wait to hear back from the kiwipiepy maintainers then.

@justHungryMan
Contributor Author

The error analysis revealed that the memory issue occurs when processing spam text consisting of more than 26,000 characters without any spaces; that is what triggers the problem.

The author has indicated that they are preparing a patch to resolve this issue. 🤗
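Until a patched release lands, a possible stopgap (my own assumption, not something suggested in this thread) is to drop documents containing such pathological whitespace-free runs before they ever reach the tokenizer:

import re

# threshold is an example; the reported trigger was a single run of 26,000+
# characters without any spaces
MAX_RUN_CHARS = 20000
_long_run = re.compile(r"\S{%d,}" % MAX_RUN_CHARS)


def drop_pathological_spam(doc):
    # returns False (filter out) for documents with an extremely long
    # whitespace-free run, which is what crashed the Korean tokenizer
    return _long_run.search(doc.text) is None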

@justHungryMan
Contributor Author

This error is resolved in bab2min/kiwipiepy#172 (comment)
