gh-126024: optimize UTF-8 decoder for short non-ASCII string #126025

Open · wants to merge 15 commits into main from opt_decode_utf8_nonascii_numchars
Conversation

@methane (Member) commented Oct 27, 2024

  • Test whether the input UTF-8 is ASCII before allocating the ASCII buffer.
  • If the error handler is strict:
    • If the input is not ASCII, estimate the kind from the first non-ASCII code unit.
    • Count the number of code points before allocating the string buffer.

This optimization works only for the strict error handler, because other error handlers may remove or replace invalid UTF-8 sequences.
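As a rough illustration (plain Python, not the C implementation; `estimate_kind` and `count_codepoints` are hypothetical helper names), the strategy above can be sketched as:

```python
def estimate_kind(lead: int) -> int:
    """Estimate the PEP 393 kind from the first non-ASCII lead byte.

    Only an estimate: a later character may still be wider, in which
    case the real decoder has to widen the buffer.
    """
    if lead < 0xC4:        # lead bytes 0xC2/0xC3 encode U+0080..U+00FF
        return 1           # Latin-1, 1 byte per character
    if lead < 0xF0:        # 2- and 3-byte sequences stay below U+10000
        return 2           # UCS-2
    return 4               # UCS-4

def count_codepoints(data: bytes) -> int:
    # UTF-8 continuation bytes look like 0b10xxxxxx; every other byte
    # starts a new code point, so counting them gives the string length.
    # This is only valid when the input is known to be well-formed,
    # i.e. with the strict error handler.
    return sum((b & 0xC0) != 0x80 for b in data)

data = "こんにちは".encode("utf-8")      # 15 bytes, 5 code points
assert count_codepoints(data) == 5
assert estimate_kind(data[0]) == 2       # first lead byte is 0xE3

# Why this is strict-only: other error handlers can change the length.
bad = b"ab\xff\xffcd"
assert len(bad.decode("utf-8", "replace")) == 6  # two U+FFFD inserted
assert len(bad.decode("utf-8", "ignore")) == 4   # invalid bytes dropped
```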

Benchmark

```python
import pyperf
import _testlimitedcapi

ascii10 = "hellohello".encode()
latin1_10 = "hello\u00e0\u00e1\u00e2\u00e3\u00e4".encode()
ucs2_10 = "こんにちはこんにちは".encode()
ucs4_10 = ("こんにちは" + "".join([chr(i) for i in range(0x1F0A0, 0x1F0A0+5)])).encode()

runner = pyperf.Runner()

def add_funcs(name, arg):
    assert len(arg.decode()) == 10
    runner.bench_func(f"{name}   10", _testlimitedcapi.unicode_decodeutf8, arg)
    runner.bench_func(f"{name}  100", _testlimitedcapi.unicode_decodeutf8, arg*10)
    runner.bench_func(f"{name} 1000", _testlimitedcapi.unicode_decodeutf8, arg*100)

for i in [0, 1, 2, 5, 8]:
    runner.bench_func(f"ASCII    {i}", _testlimitedcapi.unicode_decodeutf8, ascii10[:i])

add_funcs("ASCII", ascii10)
add_funcs("latin1", latin1_10)
add_funcs("ucs2", ucs2_10)
add_funcs("ucs4", ucs4_10)
```

Result (with `--enable-optimizations --with-lto`):

| Benchmark | main-opt | patched-5o |
|-----------|----------|------------|
| ASCII 0 | 87.1 ns | 89.8 ns: 1.03x slower |
| ASCII 1 | 88.5 ns | 89.8 ns: 1.01x slower |
| ASCII 2 | 100 ns | 103 ns: 1.02x slower |
| ASCII 5 | 104 ns | 103 ns: 1.01x faster |
| ASCII 8 | 100.0 ns | 105 ns: 1.05x slower |
| ASCII 10 | 101 ns | 104 ns: 1.02x slower |
| ASCII 100 | 110 ns | 110 ns: 1.01x faster |
| ASCII 1000 | 239 ns | 245 ns: 1.03x slower |
| latin1 10 | 220 ns | 170 ns: 1.29x faster |
| latin1 100 | 385 ns | 320 ns: 1.21x faster |
| latin1 1000 | 2.13 us | 1.92 us: 1.11x faster |
| ucs2 10 | 217 ns | 178 ns: 1.22x faster |
| ucs2 100 | 615 ns | 473 ns: 1.30x faster |
| ucs2 1000 | 3.15 us | 3.21 us: 1.02x slower |
| ucs4 10 | 268 ns | 241 ns: 1.11x faster |
| ucs4 100 | 725 ns | 581 ns: 1.25x faster |
| ucs4 1000 | 3.79 us | 3.85 us: 1.02x slower |
| Geometric mean | (ref) | 1.07x faster |

@methane linked an issue on Oct 27, 2024 that may be closed by this pull request
@methane force-pushed the opt_decode_utf8_nonascii_numchars branch from 5344340 to 9b47c2b on October 27, 2024 01:41
@methane (Member Author) commented Oct 27, 2024

orjson.loads() called 1000 times with twitter.json from the orjson benchmark suite:

  • original orjson: 2.028s
  • orjson patched to use PyUnicode_FromStringAndSize(): 2.397s
  • patched orjson + this branch: 2.142s

orjson's own decoder is still faster, but this PR reduces the temptation to maintain a custom UTF-8 decoder and to use the PEP 393 APIs.
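For context, PEP 393 stores strings with 1, 2, or 4 bytes per character depending on the widest code point in the string, which is why estimating the kind up front matters. A quick illustration from pure Python (a rough sketch; `sys.getsizeof` includes a fixed per-object header on top of the character payload):

```python
import sys

# Under PEP 393, per-character storage depends on the widest code point:
# Latin-1 -> 1 byte, UCS-2 -> 2 bytes, UCS-4 -> 4 bytes per character.
latin1_s = "\u00e0" * 100        # all code points <= U+00FF
ucs2_s = "\u3053" * 100          # code points <= U+FFFF
ucs4_s = "\U0001F0A0" * 100      # code points above U+FFFF

# Each step up in kind roughly doubles the character payload.
assert sys.getsizeof(latin1_s) < sys.getsizeof(ucs2_s) < sys.getsizeof(ucs4_s)
```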

@methane (Member Author) commented Oct 27, 2024

Comparison with DuckDB's UTF-8 decoder:

  • code
  • Decoding short UTF-8 (10 codepoints & 30 bytes)
    • with duckdb: 82ns
    • with main branch: 115ns
    • with this branch: 92ns
  • Decoding long ASCII (1000 bytes)
    • with duckdb: 605ns
    • with main branch: 157ns
    • with this branch: 151ns

When benchmarking short ASCII input, performance is unstable because unicode_dealloc is slower than the decoding itself; speed varies depending on where the object is allocated.

@methane force-pushed the opt_decode_utf8_nonascii_numchars branch from 800452a to b0ce85c on October 29, 2024 04:31
@methane force-pushed the opt_decode_utf8_nonascii_numchars branch from b0ce85c to 37715b6 on October 29, 2024 04:33
This reverts commit c47d574.
@methane (Member Author) commented Oct 29, 2024

This is the tree where I ran these microbenchmarks:
https://github.com/methane/notes/tree/master/c/first_nonascii

@methane (Member Author) commented Oct 29, 2024

orjson's benchmark_load result:

0001: Python 3.13 (python-build-standalone) + orjson (customized to use PyUnicode_FromStringAndSize).
0002: Python 3.13 + orjson (original)
0003: Python 3.14 (this PR) + orjson (original)
0004: Python 3.14 (this PR) + orjson (customized)
0005: Python 3.14 (main) + orjson (customized)

```
---------------------- benchmark 'canada.json deserialization': 5 tests ---------------------
Name (time in ms)                        Min               Mean                 OPS
---------------------------------------------------------------------------------------------
loads[orjson-canada.json] (0001)     12.6357 (1.37)     12.9228 (1.31)      77.3825 (0.77)
loads[orjson-canada.json] (0002)     12.6432 (1.37)     13.1927 (1.33)      75.7996 (0.75)
loads[orjson-canada.json] (0003)      9.2428 (1.00)      9.9514 (1.00)     100.4887 (1.00)
loads[orjson-canada.json] (0004)      9.2235 (1.0)       9.9023 (1.0)      100.9865 (1.0)
loads[orjson-canada.json] (0005)      9.2686 (1.00)      9.9312 (1.00)     100.6926 (1.00)
---------------------------------------------------------------------------------------------

--------------------- benchmark 'citm_catalog.json deserialization': 5 tests --------------------
Name (time in ms)                             Min              Mean                 OPS
-------------------------------------------------------------------------------------------------
loads[orjson-citm_catalog.json] (0001)     4.3377 (1.18)     4.3424 (1.18)     230.2881 (0.85)
loads[orjson-citm_catalog.json] (0002)     4.2391 (1.16)     4.2407 (1.15)     235.8106 (0.87)
loads[orjson-citm_catalog.json] (0003)     3.6644 (1.0)      3.6756 (1.0)      272.0645 (1.0)
loads[orjson-citm_catalog.json] (0004)     3.7146 (1.01)     3.7166 (1.01)     269.0665 (0.99)
loads[orjson-citm_catalog.json] (0005)     3.7173 (1.01)     3.7198 (1.01)     268.8317 (0.99)
-------------------------------------------------------------------------------------------------

------------------------- benchmark 'github.json deserialization': 5 tests ------------------------
Name (time in us)                         Min                Mean            OPS (Kops/s)
---------------------------------------------------------------------------------------------------
loads[orjson-github.json] (0001)     133.5345 (1.12)     133.6741 (1.12)           7.4809 (0.89)
loads[orjson-github.json] (0002)     133.9510 (1.12)     134.0037 (1.12)           7.4625 (0.89)
loads[orjson-github.json] (0003)     119.2316 (1.0)      119.3161 (1.0)            8.3811 (1.0)
loads[orjson-github.json] (0004)     134.1154 (1.12)     134.3008 (1.13)           7.4460 (0.89)
loads[orjson-github.json] (0005)     134.7232 (1.13)     135.2986 (1.13)           7.3911 (0.88)
---------------------------------------------------------------------------------------------------

-------------------- benchmark 'twitter.json deserialization': 5 tests ---------------------
Name (time in ms)                        Min              Mean                 OPS
--------------------------------------------------------------------------------------------
loads[orjson-twitter.json] (0001)     2.0658 (1.26)     2.0670 (1.26)     483.7979 (0.79)
loads[orjson-twitter.json] (0002)     1.7103 (1.04)     1.7108 (1.04)     584.5319 (0.96)
loads[orjson-twitter.json] (0003)     1.6404 (1.0)      1.6411 (1.0)      609.3393 (1.0)
loads[orjson-twitter.json] (0004)     1.8574 (1.13)     1.8588 (1.13)     537.9842 (0.88)
loads[orjson-twitter.json] (0005)     2.0286 (1.24)     2.0289 (1.24)     492.8792 (0.81)
--------------------------------------------------------------------------------------------
```

Comparing 0003 vs 0004 vs 0005 on the twitter.json benchmark, this PR narrows the PyUnicode_FromStringAndSize() slowdown from 19% to 12%.

Successfully merging this pull request may close the issue: Improve UTF-8 decode speed

3 participants