Faster json parsing #1

Open
ZJaume opened this issue Mar 15, 2024 · 5 comments
Labels
enhancement New feature or request

Comments

@ZJaume
Contributor

ZJaume commented Mar 15, 2024

orjson is several times faster than ujson or json from the standard library, and it is a drop-in replacement.

import ujson as json

@nvanva
Collaborator

nvanva commented Mar 15, 2024

Moved to orjson.

@nvanva nvanva closed this as completed Mar 15, 2024
@nvanva nvanva reopened this Mar 20, 2024
@nvanva
Collaborator

nvanva commented Mar 20, 2024

The problem is that orjson.dumps() is not a drop-in replacement. Unlike json and ujson, its dumps() method returns a byte string:
image
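A minimal sketch of the difference, assuming orjson may or may not be installed. The helper name dumps_text is hypothetical and is not part of traf.py; it only shows the extra decode step needed to get a str back from orjson:

```python
import json

try:
    import orjson  # third-party package; treated as optional in this sketch
except ImportError:
    orjson = None


def dumps_text(obj) -> str:
    """Serialize obj to a str with whichever backend is available."""
    if orjson is not None:
        # orjson.dumps() returns bytes, so an explicit decode is needed
        return orjson.dumps(obj).decode("utf-8")
    # json.dumps() (like ujson.dumps()) already returns a str
    return json.dumps(obj)


record = {"lang": "en", "prob": 0.9876}
line = dumps_text(record)
print(type(line).__name__, line)
```

Code that calls print(json.dumps(...)) keeps working with a wrapper like this, at the cost of one decode per record.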

I compared the speed: orjson really is faster, especially its dumps() method.
image

image

orjson.loads() is 20% faster than ujson.loads(), but orjson.dumps() is more than 4x faster than ujson.dumps().
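A comparison along these lines can be sketched with timeit. This is not the benchmark behind the numbers above: the sample document, the repeat count, and the bench helper are assumptions made for illustration, and any library that is not installed is skipped:

```python
import importlib
import json
import timeit

# Hypothetical sample record; real traf.py documents will look different.
sample = {"id": 1, "text": "hello " * 50, "scores": [0.1 * i for i in range(100)]}
encoded = json.dumps(sample)


def bench(module_name, number=2000):
    """Time loads() and dumps() for one JSON module, or return None if missing."""
    try:
        mod = importlib.import_module(module_name)
    except ImportError:
        return None
    return {
        "loads": timeit.timeit(lambda: mod.loads(encoded), number=number),
        "dumps": timeit.timeit(lambda: mod.dumps(sample), number=number),
    }


results = {name: bench(name) for name in ("json", "ujson", "orjson")}
for name, timing in results.items():
    print(name, timing if timing else "not installed")
```

Ratios depend heavily on the shape of the data, so timings from real pipeline records are more meaningful than those from a synthetic sample.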

@nvanva
Collaborator

nvanva commented Mar 20, 2024

The overall performance of traf.py (and thus of stage 2) differs by less than 5%. The first run uses import orjson as json, the second uses import ujson as json in traf.py:
image

@nvanva
Collaborator

nvanva commented Mar 20, 2024

For now I've reverted to ujson. orjson is not a drop-in replacement because its dumps() returns a byte string instead of a string, so the code that prints to stdout has to be modified. We would then need to make that change and test again on 1% of all data, which does not seem worth the potential speedup.

@ZJaume
Contributor Author

ZJaume commented Mar 21, 2024

Yes, my bad, I had forgotten about the bytes issue. It is indeed not a drop-in replacement, both because of that and because serializing the numpy float returned by fasttext requires enabling an option:

json_bytes = orjson.dumps(
    {'lang': self._postprocess_prediction(prediction), 'prob': round(prediction[1][0], 4)},
    option=orjson.OPT_SERIALIZE_NUMPY,
)

Serializing to bytes is fine in my opinion: writing the bytes directly just skips the extra string decoding step that the other libraries perform:

sys.stdout.buffer.write(json_bytes)
sys.stdout.buffer.write(b'\n')
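The same idea as a small self-contained sketch. The helper name write_json_line is hypothetical, and the payload is produced with the standard json module here only so that the example runs without orjson installed:

```python
import io
import json
import sys


def write_json_line(payload: bytes, out=None) -> None:
    """Write one already-serialized JSON record plus a newline to a binary stream."""
    # Default to the raw byte layer underneath sys.stdout
    stream = out if out is not None else sys.stdout.buffer
    stream.write(payload)
    stream.write(b"\n")


# Stand-in for the bytes that orjson.dumps() would return
payload = json.dumps({"lang": "en", "prob": 0.9876}).encode("utf-8")

buf = io.BytesIO()  # in-memory binary stream used in place of stdout
write_json_line(payload, buf)
print(buf.getvalue())
```

If the same script also calls print(), flushing sys.stdout before writing to sys.stdout.buffer is a reasonable precaution, since the text layer keeps its own buffer and output could otherwise appear out of order.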

It also has another advantage: non-ASCII characters are not escaped, which makes it possible to read non-English, non-Latin text directly in the output without having to use jq.
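The escaping behaviour can be illustrated with the standard library alone, where ensure_ascii=False gives the unescaped output described above. The sample record is made up for illustration:

```python
import json

record = {"lang": "ja", "text": "日本語"}

# Default behaviour: non-ASCII characters become \uXXXX escape sequences
escaped = json.dumps(record)

# ensure_ascii=False keeps the characters as they are
readable = json.dumps(record, ensure_ascii=False)

print(escaped)
print(readable)
```

Both strings decode to the same object, so the difference only affects how readable the raw output is.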

@nvanva nvanva added the enhancement New feature or request label May 15, 2024