Hangs Due to std::length_error in Data Processing Pipeline #279
Comments
Is there any better infra than Datatrove that can handle the whole CC processing (i.e., reading, extracting, and writing), lol?
Can you share the full script? I'm particularly curious about the lambda blocks.
Hi @guipenedo, I can only share the lambda functions. The first lambda does a precheck for the issues described in #277; the second substitutes max_non_alpha_words_ratio in GopherQualityFilter.
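For illustration, here is a minimal sketch of how two such blocks might be wired in a datatrove pipeline. The precheck condition and the 0.9 ratio are placeholder assumptions rather than the exact values from my script, and the import paths and parameter names follow datatrove's filter API as I understand it:

```python
from datatrove.pipeline.filters import GopherQualityFilter, LambdaFilter

pipeline = [
    # ... reader / extractor steps ...
    # First lambda: precheck documents before quality filtering
    # (the condition here is a placeholder; see #277 for the actual precheck)
    LambdaFilter(filter_function=lambda doc: bool(doc.text and doc.text.strip())),
    # Second block: GopherQualityFilter with a relaxed max_non_alpha_words_ratio
    # (0.9 is an illustrative value, not necessarily the one I use)
    GopherQualityFilter(max_non_alpha_words_ratio=0.9),
    # ... writer steps ...
]
```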
This seems to be an issue with the Korean tokenizer. If you look at the project https://github.com/bab2min/kiwipiepy, a good chunk of it is C++, which would make sense given the C++ error you are getting. I imagine one of your documents is causing this issue with the tokenizer; can you try just tokenizing all the documents in that specific file directly? Not sure if this error is try...catchable, but if it is, can you then try to print the document that triggers it?
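In case it helps, a minimal sketch of that kind of isolation run, assuming the task file is JSONL with a `text` field (note that a C++ `std::length_error` may not surface as a catchable Python exception, so the loop might still abort outright):

```python
import json
from kiwipiepy import Kiwi

kiwi = Kiwi()

# Path and field name are assumptions about the task file layout
with open("problematic_task_file.jsonl", encoding="utf-8") as f:
    for line_no, line in enumerate(f, 1):
        text = json.loads(line).get("text", "")
        try:
            kiwi.tokenize(text)
        except Exception as exc:
            # If the C++ error is surfaced to Python at all, this points at the
            # offending document; otherwise the process may still crash here
            print(f"line {line_no} failed ({exc!r}): {text[:200]!r}")
```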
It does seem that the issue stems from the kiwipiepy tokenizer, as you mentioned. However, the main problem is that no error location is provided in the message. Interestingly, despite having an initial filtering step via a lambda function to preemptively catch such errors, it's unclear whether this error comes from that lambda or from another filtering function. I'm working with Common Crawl data, not a private dataset, so I suspect others might encounter similar issues. Given that kiwipiepy is the default tokenizer module in datatrove, perhaps we should consider implementing a timeout mechanism similar to what's used in trafilatura for filtering. Additionally, incorporating a bypass feature for problematic documents could help avoid getting stuck. (#277 (comment)) What do you think?
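To make the timeout idea concrete, one common approach (not necessarily how the trafilatura step implements it internally) is a signal-based alarm around the tokenization call. This is only a sketch: it works on Unix in the main thread, and it cannot help if the process dies inside C++ rather than hanging:

```python
import signal
from contextlib import contextmanager

@contextmanager
def time_limit(seconds: int):
    """Raise TimeoutError if the wrapped block runs longer than `seconds`.
    Unix-only and main-thread-only; a hard crash in C++ is not caught."""
    def _handler(signum, frame):
        raise TimeoutError(f"tokenization exceeded {seconds}s")
    old_handler = signal.signal(signal.SIGALRM, _handler)
    signal.alarm(seconds)
    try:
        yield
    finally:
        signal.alarm(0)
        signal.signal(signal.SIGALRM, old_handler)

# Hypothetical usage inside a filter: drop or flag documents that take too long
def safe_token_count(tokenizer, text: str, timeout_s: int = 30):
    try:
        with time_limit(timeout_s):
            return len(tokenizer.tokenize(text))
    except TimeoutError:
        return None  # caller can bypass the document instead of hanging
```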
The reason why there isn't any useful error message is likely that some external C++ code is being called from that library; you would normally get an error message for other types of errors.
It's not entirely clear to me how this would work: some blocks rely on order implicitly (dedup, mostly), and you'd basically need to try/catch every single run() method for the others. If you have a possible idea for a working implementation that would generalize well, I'd love to hear it.
When you mention that some blocks rely on order implicitly, I understand it to mean that there's a dependency on document order within blocks, as with min-hash deduplication. My suggestion is to drop documents at the document level within a block if an issue arises (see the sketch below), but I'm not entirely sure which part of the datatrove code would need to be modified for this approach, or whether the current architecture even supports such modifications. I will reach out to the kiwipiepy repository to discuss this issue further. From what I understand, kiwipiepy generally performs better with Korean text than spaCy's tokenizer.
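As a rough illustration of the per-document semantics I have in mind (this is not an existing datatrove feature, just a generic generator wrapper; integrating it would presumably mean wrapping the per-document work inside each block rather than every run() method):

```python
from typing import Callable, Iterable, TypeVar

Doc = TypeVar("Doc")

def skip_failing_docs(docs: Iterable[Doc], check: Callable[[Doc], bool]):
    """Yield documents for which `check` succeeds, dropping any document that
    makes `check` raise (e.g. a tokenizer blowing up). Note: this cannot help
    if the underlying C++ error aborts the whole process instead of raising."""
    for doc in docs:
        try:
            if check(doc):
                yield doc
        except Exception as exc:
            print(f"dropping document {getattr(doc, 'id', '?')}: {exc!r}")
```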
Thank you for clarifying; I'll wait to hear back from the kiwipiepy maintainers then.
The error analysis revealed that the memory issue occurs when processing spam texts consisting of over 26,000 characters without any spaces; this appears to be what triggers the problem. The author has indicated that they are preparing a patch to resolve this issue. 🤗
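Until the patch lands, a possible stopgap (my suggestion, not something datatrove ships) is to drop documents containing extremely long whitespace-free runs before they reach the tokenizer; the 10,000-character cutoff below is an arbitrary assumption, chosen well under the ~26,000-character spam texts:

```python
import re

# Any run of 10,000+ non-whitespace characters; the threshold is an assumption
LONG_RUN = re.compile(r"\S{10000,}")

def is_tokenizer_safe(text: str) -> bool:
    """Return False for documents with pathological space-free runs."""
    return LONG_RUN.search(text) is None

# e.g. as a lambda block placed before GopherQualityFilter:
# LambdaFilter(filter_function=lambda doc: is_tokenizer_safe(doc.text))
```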
This error is resolved in bab2min/kiwipiepy#172 (comment) |
The only solution I've found is to abandon the task file and forcibly stop the executor. The error message and location are not specific enough to identify the source of the problem. Any insights or suggestions on how to handle this error more gracefully would be appreciated.