
Hangs Due to std::length_error in Data Processing Pipeline #279

Closed
justHungryMan opened this issue Sep 1, 2024 · 10 comments

Comments

@justHungryMan
Contributor

2024-09-01 15:00:25.122 | INFO     | datatrove.executor.local:run:120 - Skipping 4095 already completed tasks
2024-09-01 15:00:25.772 | INFO     | datatrove.utils.logging:add_task_logger:58 - Launching pipeline for rank=3925
2024-09-01 15:00:25.772 | INFO     | datatrove.utils.logging:log_pipeline:90 -
--- 🛠️  PIPELINE 🛠
📖 - READER: 🐿 Jsonl
🔻 - FILTER: 👤 Lambda
🔻 - FILTER: 👯 Gopher Repetition
🔻 - FILTER: 🥇 Gopher Quality
🔻 - FILTER: 👤 Lambda
🔻 - FILTER: ⛰ C4 Quality
🔻 - FILTER: 🍷 FineWeb Quality
💽 - WRITER: 🐿 Jsonl
2024-09-01 15:00:27.077 | INFO     | datatrove.pipeline.readers.base:read_files_shard:191 - Reading input file 03925.jsonl.gz, 1/2
terminate called after throwing an instance of 'std::length_error'
  what():  basic_string::_S_create

The only solution I've found is to abandon the task file and forcibly stop the executor. The error message and location are not specific enough to identify the source of the problem. Any insights or suggestions on how to handle this error more gracefully would be appreciated.

@SinclairCoder

Any better infra that can handle the whole CC processing (i.e., reading, extracting, and writing) than Datatrove lol?

@guipenedo
Collaborator

Can you share the full script? Curious particularly about the lambda blocks

@justHungryMan
Contributor Author

justHungryMan commented Sep 2, 2024

Hi @guipenedo,

I can only share the lambda functions.

The first lambda is a pre-check for the tokenizer issue reported in #277:

# load_word_tokenizer and Languages come from datatrove's tokenizer utilities
from datatrove.utils.typeshelper import Languages
from datatrove.utils.word_tokenizers import load_word_tokenizer


def check_korean_tokenizer_pass(doc):
    # pre-check: keep only documents that the Korean tokenizer can process
    tokenizer = load_word_tokenizer(Languages.korean)
    try:
        tokenizer.word_tokenize(doc.text)
        return True
    except Exception:
        return False

The second lambda substitutes for max_non_alpha_words_ratio in GopherQualityFilter. Korean text often mixes Hangul, Chinese characters (Korean Hanja), and English, so we instead filter out documents whose ratio of Korean words falls below min_non_korean_words_ratio:

import re

# module-level setup assumed here (not shown in the original snippet):
tokenizer = load_word_tokenizer(Languages.korean)
korean_pattern = re.compile(r"[가-힣]")  # matches Hangul syllables; assumed pattern
min_non_korean_words_ratio = 0.5         # example threshold, not the actual value


def filter_non_korean_words_ratio(doc):
    words = tokenizer.word_tokenize(doc.text)
    n_words = len(words)
    if n_words == 0:
        return False  # guard against empty documents (division by zero)

    # words containing at least one Hangul character count as Korean
    n_korean_words = sum(bool(korean_pattern.search(word)) for word in words)

    if min_non_korean_words_ratio and (n_korean_words / n_words) < min_non_korean_words_ratio:
        return False

    return True

@guipenedo
Collaborator

This seems to be an issue with the Korean tokenizer: if you look at the project https://github.com/bab2min/kiwipiepy, a good chunk of it is C++, which would make sense given the C++ error you are getting. I imagine one of your documents is triggering this issue in the tokenizer; can you try just tokenizing all the documents in that specific file directly? I'm not sure whether this error is try...catchable, but if it is, can you then print the document that triggers it?
I tried taking a look at the project's issues, but they're mostly in Korean, so maybe you'll have better luck there.
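A rough standalone sketch of that per-document check, in case it helps (the shard filename comes from your log; the datatrove import paths and reading the file directly with gzip are assumptions on my part):

import gzip
import json

from datatrove.utils.typeshelper import Languages
from datatrove.utils.word_tokenizers import load_word_tokenizer

SHARD_PATH = "03925.jsonl.gz"  # the shard that crashed, per the log above

tokenizer = load_word_tokenizer(Languages.korean)

with gzip.open(SHARD_PATH, "rt", encoding="utf-8") as f:
    for i, line in enumerate(f):
        # print before tokenizing so the last index survives a hard C++ abort
        print(f"tokenizing document {i}", flush=True)
        record = json.loads(line)
        try:
            tokenizer.word_tokenize(record["text"])
        except Exception as e:
            # only reached if the failure surfaces as a catchable Python exception
            print(f"document {i} raised: {e!r}")
            print(record["text"][:500])
            break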

@justHungryMan
Contributor Author

It does seem that the issue stems from the kiwipiepy tokenizer, as you mentioned. However, the main problem is that the message gives no error location. Interestingly, despite having an initial filtering step via a lambda function to preemptively catch such errors, it's unclear whether this error is coming from that lambda or from another filtering function.

I’m working with Common Crawl data, not a private dataset, so I suspect others might encounter similar issues. Given that kiwipiepy is the default tokenizer module in datatrove for Korean, perhaps we should consider implementing a timeout mechanism for filtering, similar to what’s used with trafilatura. Additionally, incorporating a bypass feature for problematic documents could help avoid getting stuck. (#277 (comment)) What do you think?
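For illustration, a minimal sketch of what a timeout-guarded pre-check could look like, assuming the tokenization is pushed into a worker process so that a hang or hard C++ abort cannot take down the whole task (the function names and the 30 s timeout are hypothetical, not existing datatrove options):

import multiprocessing as mp

from datatrove.utils.typeshelper import Languages
from datatrove.utils.word_tokenizers import load_word_tokenizer


def _tokenize_in_worker(text):
    # runs in a separate process; a hang or hard C++ abort only affects this worker
    tokenizer = load_word_tokenizer(Languages.korean)
    tokenizer.word_tokenize(text)


def check_korean_tokenizer_pass_with_timeout(doc, timeout_s=30):
    # hypothetical timeout-guarded variant of the pre-check lambda; spawning a
    # process per document is slow, so this is only a sketch of the idea
    proc = mp.Process(target=_tokenize_in_worker, args=(doc.text,))
    proc.start()
    proc.join(timeout_s)
    if proc.is_alive():
        proc.terminate()  # tokenizer hung: kill the worker and drop the document
        proc.join()
        return False
    # non-zero exit code means the worker raised or aborted (e.g. the C++ terminate)
    return proc.exitcode == 0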

@guipenedo
Collaborator

guipenedo commented Sep 3, 2024

The reason there isn't any useful error message is likely that some external C++ code is being called from that library; you would normally get an error message for other kinds of errors.
If kiwipiepy is not stable enough, we can consider using the spaCy Korean tokenizer instead. I don't speak Korean, but maybe you have some insight into which one might be better?

> Additionally, incorporating a bypass feature for problematic documents could help avoid getting stuck. (#277 (comment)) What do you think?

It's not entirely clear to me how this would work: some blocks rely on order implicitly (dedup, mostly), and you'd basically need to try/catch every single run() method for the others. If you have a possible idea for a working implementation that would generalize well, I'd love to hear it.

@justHungryMan
Contributor Author

> The reason there isn't any useful error message is likely that some external C++ code is being called from that library; you would normally get an error message for other kinds of errors. If kiwipiepy is not stable enough, we can consider using the spaCy Korean tokenizer instead. I don't speak Korean, but maybe you have some insight into which one might be better?
>
> > Additionally, incorporating a bypass feature for problematic documents could help avoid getting stuck. (#277 (comment)) What do you think?
>
> It's not entirely clear to me how this would work: some blocks rely on order implicitly (dedup, mostly), and you'd basically need to try/catch every single run() method for the others. If you have a possible idea for a working implementation that would generalize well, I'd love to hear it.

When you mention that some blocks rely on order implicitly, I understand that to mean there's a dependency on document order within blocks, as with MinHash deduplication. My suggestion is to drop documents at the document level within a block when an issue arises, along the lines of the sketch below, but I'm not entirely sure which part of the datatrove code would need to be modified for this (or whether the current architecture even supports such a change).
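To make the idea concrete, here is a hedged sketch of what I mean at the lambda level (the wrapper name is made up, and this only helps for failures that surface as catchable Python exceptions, not for the hard C++ abort above):

def with_document_bypass(filter_fn):
    # hypothetical helper: wrap a per-document predicate so that any exception
    # drops just that document instead of aborting the whole task
    def safe(doc):
        try:
            return filter_fn(doc)
        except Exception:
            return False
    return safe

# usage idea with the lambdas above, e.g. inside a Lambda filter block:
#   LambdaFilter(filter_function=with_document_bypass(check_korean_tokenizer_pass))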

I will reach out to the kiwipiepy repository to discuss this issue further. From what I understand, kiwipiepy generally performs better on Korean text than spaCy's tokenizer.

@guipenedo
Collaborator

Thank you for clarifying; I'll wait to hear back from the kiwipiepy maintainers then.

@justHungryMan
Contributor Author

The error analysis revealed that the memory issue occurs when processing spam text consisting of more than 26,000 characters without any spaces; that is what triggers the problem.

The author has indicated that they are preparing a patch to resolve this issue. 🤗
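Until a patched release lands, a possible stopgap (my own assumption, not something suggested in this thread) is to drop documents containing such pathological whitespace-free runs before they ever reach the tokenizer:

import re

# threshold is an example; the reported trigger was a single run of 26,000+
# characters without any spaces
MAX_RUN_CHARS = 20000
_long_run = re.compile(r"\S{%d,}" % MAX_RUN_CHARS)


def drop_pathological_spam(doc):
    # returns False (filter out) for documents with an extremely long
    # whitespace-free run, which is what crashed the Korean tokenizer
    return _long_run.search(doc.text) is None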

@justHungryMan
Contributor Author

This error is resolved in bab2min/kiwipiepy#172 (comment)
