Faster json parsing #1

ZJaume · 2024-03-15T09:30:46Z

orjson is several times faster than ujson or json from the standard library, and it is a drop-in replacement.

warc2text-runner/two/trafilatura/traf.py

Line 4 in f95ff23

import ujson as json

warc2text-runner/two/fastertext_lid/proto_langid.py

Line 7 in f95ff23

import json

nvanva · 2024-03-15T19:05:04Z

Moved to orjson.

nvanva · 2024-03-20T20:09:53Z

The problem is that orjson.dumps() is not a drop-in replacement. Unlike json and ujson, its dumps() method returns a byte string rather than a str.

I've compared their speed, and orjson really is faster, especially its dumps() method.

orjson.loads() is 20% faster than ujson.loads(), but orjson.dumps() is more than 4x faster than ujson.dumps().
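A comparison like this can be reproduced with a small timing harness. The following is only a sketch: the bench() helper and the sample record are hypothetical, not the script used for the numbers above, and any backend that is not installed is skipped.

```python
import importlib
import json
import timeit

# The standard library is always available; ujson and orjson are optional.
backends = {"json": json}
for name in ("ujson", "orjson"):
    try:
        backends[name] = importlib.import_module(name)
    except ImportError:
        pass  # backend not installed, leave it out of the comparison


def bench(payload, number=1000):
    """Return {backend: (dumps_seconds, loads_seconds)} for `number` calls each."""
    results = {}
    for name, mod in backends.items():
        encoded = mod.dumps(payload)  # str for json/ujson, bytes for orjson
        dumps_t = timeit.timeit(lambda: mod.dumps(payload), number=number)
        loads_t = timeit.timeit(lambda: mod.loads(encoded), number=number)
        results[name] = (dumps_t, loads_t)
    return results


if __name__ == "__main__":
    # Hypothetical record, roughly shaped like one extracted document.
    sample = {"text": "lorem ipsum " * 200, "lang": "en", "prob": 0.9876}
    for name, (dumps_t, loads_t) in bench(sample).items():
        print(f"{name:>7}: dumps {dumps_t:.4f}s  loads {loads_t:.4f}s")
```

Timings depend heavily on the payload shape, so a realistic sample of the pipeline's own records would give a more meaningful ratio than a synthetic one.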

nvanva · 2024-03-20T20:23:15Z

The final performance of traf.py (and thus stage 2) will differ by less than 5%. I compared two runs of traf.py, the first with import orjson as json and the second with import ujson as json.

nvanva · 2024-03-20T20:27:34Z

For now I've reverted to ujson. orjson is not a drop-in replacement because its dumps() returns a byte string instead of a string, so the code that prints the result to stdout would have to be modified. We would then need to make that change and test again on 1% of all data, which does not seem worth the potential speedup.
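One way to keep the existing print-to-stdout code unchanged would be a thin wrapper that always returns a str. This is a minimal sketch, not what traf.py does: the dumps() and loads() wrapper names are hypothetical, and the extra decode() gives back part of the speedup.

```python
import json as _stdlib_json

try:
    import orjson  # optional, used only when installed
except ImportError:
    orjson = None


def dumps(obj) -> str:
    """Serialize to a str whichever backend is in use."""
    if orjson is not None:
        # orjson.dumps() returns UTF-8 bytes, so decode to match json/ujson callers.
        return orjson.dumps(obj).decode("utf-8")
    # Compact, unescaped output to mirror what orjson produces.
    return _stdlib_json.dumps(obj, ensure_ascii=False, separators=(",", ":"))


def loads(data):
    """Parse JSON from str or bytes with the available backend."""
    if orjson is not None:
        return orjson.loads(data)
    return _stdlib_json.loads(data)
```

With this in place, a call such as print(dumps({"lang": "en"})) behaves the same whether or not orjson is installed.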

ZJaume · 2024-03-21T10:00:50Z

Yes, my bad, I had forgotten about the bytes return type. It may not be a drop-in replacement because of that, and because serializing the numpy float from fasttext requires enabling an option:

json_bytes = orjson.dumps(
        {'lang': self._postprocess_prediction(prediction), 'prob': round(prediction[1][0], 4)},
        option=orjson.OPT_SERIALIZE_NUMPY)

But serializing to bytes is just fine in my opinion; it simply skips the additional string decoding step that the other libraries perform:

sys.stdout.buffer.write(json_bytes)
sys.stdout.buffer.write(b'\n')
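The same pattern can be wrapped in a small helper that accepts any binary stream, which makes it easy to test without touching stdout. This is a sketch only: write_jsonl() is a hypothetical name, and the standard library stands in for orjson so the example runs without extra dependencies.

```python
import io
import json
import sys


def write_jsonl(records, out=None):
    """Write each record as one UTF-8 encoded JSON line to a binary stream."""
    out = sys.stdout.buffer if out is None else out
    for rec in records:
        # With orjson this would be orjson.dumps(rec), which already returns
        # bytes; the stdlib needs an explicit encode step instead.
        line = json.dumps(rec, ensure_ascii=False).encode("utf-8")
        out.write(line)
        out.write(b"\n")


if __name__ == "__main__":
    buf = io.BytesIO()
    write_jsonl([{"lang": "en", "prob": 0.5}], buf)
    print(buf.getvalue())
```

Writing to sys.stdout.buffer bypasses the text layer of sys.stdout, so bytes and str output should not be mixed on the same stream without flushing in between.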

It also has another advantage: non-ASCII characters are not escaped, which allows reading text in non-English, non-Latin-script languages directly without having to use jq.
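The escaping difference is easy to see with the standard library alone, which escapes non-ASCII characters by default and only emits them as-is when ensure_ascii=False is passed. A small illustration, using a made-up record:

```python
import json

# Hypothetical record containing Cyrillic text.
record = {"text": "Привет, мир"}

escaped = json.dumps(record)                       # default: \uXXXX escapes
readable = json.dumps(record, ensure_ascii=False)  # characters kept as-is

print(escaped)
print(readable)

# Both forms decode to the same object; only the on-disk readability differs.
assert json.loads(escaped) == json.loads(readable)
```

orjson always writes unescaped UTF-8, so its output corresponds to the second form.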

nvanva closed this as completed Mar 15, 2024

nvanva reopened this Mar 20, 2024

nvanva added the enhancement New feature or request label May 15, 2024
