Faster json parsing #1
Comments
Moved to orjson.
For now I've reverted to ujson instead of orjson. orjson is not a drop-in replacement because its dumps() returns a byte string instead of a string, so the code that prints the result to stdout has to be modified. We would then need to make that change and test again on 1% of all data, which does not seem worth the potential speedup.
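The stdout change described above can be sketched roughly as follows. This is a minimal illustration, not the project's code: `dumps_bytes` and `write_record` are hypothetical helper names, and the fallback to the standard `json` module is only there so the sketch runs when orjson is not installed.

```python
import json
import sys

try:
    import orjson  # third-party; its dumps() returns bytes
except ImportError:
    orjson = None


def dumps_bytes(obj):
    """Serialize obj to UTF-8 encoded JSON bytes (hypothetical helper)."""
    if orjson is not None:
        return orjson.dumps(obj)
    # Fallback: compact output, non-ASCII left unescaped, then encoded.
    text = json.dumps(obj, ensure_ascii=False, separators=(",", ":"))
    return text.encode("utf-8")


def write_record(obj, out=None):
    """Write one JSON record per line to a binary stream (hypothetical helper)."""
    # sys.stdout.buffer accepts bytes, so no decode to str is needed.
    out = sys.stdout.buffer if out is None else out
    out.write(dumps_bytes(obj) + b"\n")
```

Calling `write_record(doc)` in place of `print(ujson.dumps(doc))` would be the only change at each call site, assuming nothing else writes text to `sys.stdout` in between without flushing.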
Yes, my bad, I forgot about the bytes thing. So it may not be a drop-in replacement because of that, and because serializing the numpy floats from fasttext requires enabling an option:
But serializing to bytes is, in my opinion, just fine; it simply skips the additional string decoding step that the others are doing:
It also has another advantage: non-ASCII characters are not escaped, which allows reading all non-English, non-Latin languages directly without having to use
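To illustrate the escaping point, here is a small sketch using only the standard `json` module, whose `ensure_ascii` flag switches between the two behaviours. The sample record is made up for the example.

```python
import json

# Made-up sample record with non-Latin text.
record = {"lang": "ru", "text": "Привет"}

# Default behaviour of the standard library: non-ASCII is escaped as \uXXXX.
escaped = json.dumps(record)

# With ensure_ascii=False the text stays readable in the output.
raw = json.dumps(record, ensure_ascii=False)

print(escaped)
print(raw)
```

Both strings decode to the same object; only the readability of the serialized line differs. As for the numpy point, orjson documents an `OPT_SERIALIZE_NUMPY` option for numpy types; that this is the option referred to above is an assumption.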
orjson is several times faster than ujson and json from the standard library, and it is a drop-in replacement.
warc2text-runner/two/trafilatura/traf.py
Line 4 in f95ff23
warc2text-runner/two/fastertext_lid/proto_langid.py
Line 7 in f95ff23