gh-126024: optimize UTF-8 decoder for short non-ASCII string #126025

Open · wants to merge 15 commits into main from opt_decode_utf8_nonascii_numchars
Conversation

@methane (Member) commented Oct 27, 2024

  • Test whether the input UTF-8 is ASCII before allocating the ASCII buffer.
  • If the error handler is strict:
    • If the input is not ASCII, estimate the kind from the first non-ASCII code unit.
    • Count the number of code points before allocating the string buffer.

This optimization works only for the strict error handler, because other error handlers may remove or replace invalid UTF-8 sequences.
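As a rough illustration (plain Python, not the C implementation; `estimate_kind` and `count_codepoints` are hypothetical helper names), the strategy above can be sketched as:

```python
def estimate_kind(lead: int) -> int:
    """Estimate the PEP 393 kind from the first non-ASCII lead byte.

    Only an estimate: a later character may still be wider, in which
    case the real decoder has to widen the buffer.
    """
    if lead < 0xC4:        # lead bytes 0xC2/0xC3 encode U+0080..U+00FF
        return 1           # Latin-1, 1 byte per character
    if lead < 0xF0:        # 2- and 3-byte sequences stay below U+10000
        return 2           # UCS-2
    return 4               # UCS-4

def count_codepoints(data: bytes) -> int:
    # UTF-8 continuation bytes look like 0b10xxxxxx; every other byte
    # starts a new code point, so counting them gives the string length.
    # This is only valid when the input is known to be well-formed,
    # i.e. with the strict error handler.
    return sum((b & 0xC0) != 0x80 for b in data)

data = "こんにちは".encode("utf-8")      # 15 bytes, 5 code points
assert count_codepoints(data) == 5
assert estimate_kind(data[0]) == 2       # first lead byte is 0xE3

# Why this is strict-only: other error handlers can change the length.
bad = b"ab\xff\xffcd"
assert len(bad.decode("utf-8", "replace")) == 6  # two U+FFFD inserted
assert len(bad.decode("utf-8", "ignore")) == 4   # invalid bytes dropped
```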

Benchmark

```python
import pyperf
import _testlimitedcapi

ascii10 = "hellohello".encode()
latin1_10 = "hello\u00e0\u00e1\u00e2\u00e3\u00e4".encode()
ucs2_10 = "こんにちはこんにちは".encode()
ucs4_10 = ("こんにちは" + "".join([chr(i) for i in range(0x1F0A0, 0x1F0A0+5)])).encode()

runner = pyperf.Runner()

def add_funcs(name, arg):
    assert len(arg.decode()) == 10
    runner.bench_func(f"{name}   10", _testlimitedcapi.unicode_decodeutf8, arg)
    runner.bench_func(f"{name}  100", _testlimitedcapi.unicode_decodeutf8, arg*10)
    runner.bench_func(f"{name} 1000", _testlimitedcapi.unicode_decodeutf8, arg*100)

for i in [0, 1, 2, 5, 8]:
    runner.bench_func(f"ASCII    {i}", _testlimitedcapi.unicode_decodeutf8, ascii10[:i])

add_funcs("ASCII", ascii10)
add_funcs("latin1", latin1_10)
add_funcs("ucs2", ucs2_10)
add_funcs("ucs4", ucs4_10)
```

Result (with `--enable-optimizations --with-lto`):

| Benchmark | main-opt | patched-5o |
|-----------|----------|------------|
| ASCII 0 | 87.1 ns | 89.8 ns: 1.03x slower |
| ASCII 1 | 88.5 ns | 89.8 ns: 1.01x slower |
| ASCII 2 | 100 ns | 103 ns: 1.02x slower |
| ASCII 5 | 104 ns | 103 ns: 1.01x faster |
| ASCII 8 | 100.0 ns | 105 ns: 1.05x slower |
| ASCII 10 | 101 ns | 104 ns: 1.02x slower |
| ASCII 100 | 110 ns | 110 ns: 1.01x faster |
| ASCII 1000 | 239 ns | 245 ns: 1.03x slower |
| latin1 10 | 220 ns | 170 ns: 1.29x faster |
| latin1 100 | 385 ns | 320 ns: 1.21x faster |
| latin1 1000 | 2.13 us | 1.92 us: 1.11x faster |
| ucs2 10 | 217 ns | 178 ns: 1.22x faster |
| ucs2 100 | 615 ns | 473 ns: 1.30x faster |
| ucs2 1000 | 3.15 us | 3.21 us: 1.02x slower |
| ucs4 10 | 268 ns | 241 ns: 1.11x faster |
| ucs4 100 | 725 ns | 581 ns: 1.25x faster |
| ucs4 1000 | 3.79 us | 3.85 us: 1.02x slower |
| Geometric mean | (ref) | 1.07x faster |

@methane linked an issue on Oct 27, 2024 that may be closed by this pull request
@methane force-pushed the opt_decode_utf8_nonascii_numchars branch from 5344340 to 9b47c2b on October 27, 2024 01:41
@methane (Member Author) commented Oct 27, 2024

orjson.loads() called 1000 times with twitter.json from the orjson benchmark suite:

  • original orjson: 2.028s
  • orjson patched to use PyUnicode_FromStringAndSize(): 2.397s
  • patched orjson + this branch: 2.142s

orjson's own decoder is still faster, but this PR reduces the temptation to maintain a custom UTF-8 decoder and to use the PEP 393 APIs.
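For context, PEP 393 stores strings with 1, 2, or 4 bytes per character depending on the widest code point in the string, which is why estimating the kind up front matters. A quick illustration from pure Python (a rough sketch; `sys.getsizeof` includes a fixed per-object header on top of the character payload):

```python
import sys

# Under PEP 393, per-character storage depends on the widest code point:
# Latin-1 -> 1 byte, UCS-2 -> 2 bytes, UCS-4 -> 4 bytes per character.
latin1_s = "\u00e0" * 100        # all code points <= U+00FF
ucs2_s = "\u3053" * 100          # code points <= U+FFFF
ucs4_s = "\U0001F0A0" * 100      # code points above U+FFFF

# Each step up in kind roughly doubles the character payload.
assert sys.getsizeof(latin1_s) < sys.getsizeof(ucs2_s) < sys.getsizeof(ucs4_s)
```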

@methane (Member Author) commented Oct 27, 2024

Comparison with DuckDB's UTF-8 decoder:

  • code
  • Decoding short UTF-8 (10 codepoints & 30 bytes)
    • with duckdb: 82ns
    • with main branch: 115ns
    • with this branch: 92ns
  • Decoding long ASCII (1000 bytes)
    • with duckdb: 605ns
    • with main branch: 157ns
    • with this branch: 151ns

When benchmarking short ASCII input, performance is unstable because unicode_dealloc is slower than the decoding itself; speed varies depending on where the object is allocated.

@methane force-pushed the opt_decode_utf8_nonascii_numchars branch from 800452a to b0ce85c on October 29, 2024 04:31
@methane force-pushed the opt_decode_utf8_nonascii_numchars branch from b0ce85c to 37715b6 on October 29, 2024 04:33
This reverts commit c47d574.
@methane (Member Author) commented Oct 29, 2024

This is the tree where I ran these microbenchmarks:
https://github.com/methane/notes/tree/master/c/first_nonascii

@methane (Member Author) commented Oct 29, 2024

orjson's benchmark_load result:

0001: Python 3.13 (python-build-standalone) + orjson (customized to use PyUnicode_FromStringAndSize).
0002: Python 3.13 + orjson (original)
0003: Python 3.14 (this PR) + orjson (original)
0004: Python 3.14 (this PR) + orjson (customized)
0005: Python 3.14 (main) + orjson (customized)

```
---------------------- benchmark 'canada.json deserialization': 5 tests ---------------------
Name (time in ms)                        Min               Mean                 OPS
---------------------------------------------------------------------------------------------
loads[orjson-canada.json] (0001)     12.6357 (1.37)     12.9228 (1.31)      77.3825 (0.77)
loads[orjson-canada.json] (0002)     12.6432 (1.37)     13.1927 (1.33)      75.7996 (0.75)
loads[orjson-canada.json] (0003)      9.2428 (1.00)      9.9514 (1.00)     100.4887 (1.00)
loads[orjson-canada.json] (0004)      9.2235 (1.0)       9.9023 (1.0)      100.9865 (1.0)
loads[orjson-canada.json] (0005)      9.2686 (1.00)      9.9312 (1.00)     100.6926 (1.00)
---------------------------------------------------------------------------------------------

--------------------- benchmark 'citm_catalog.json deserialization': 5 tests --------------------
Name (time in ms)                             Min              Mean                 OPS
-------------------------------------------------------------------------------------------------
loads[orjson-citm_catalog.json] (0001)     4.3377 (1.18)     4.3424 (1.18)     230.2881 (0.85)
loads[orjson-citm_catalog.json] (0002)     4.2391 (1.16)     4.2407 (1.15)     235.8106 (0.87)
loads[orjson-citm_catalog.json] (0003)     3.6644 (1.0)      3.6756 (1.0)      272.0645 (1.0)
loads[orjson-citm_catalog.json] (0004)     3.7146 (1.01)     3.7166 (1.01)     269.0665 (0.99)
loads[orjson-citm_catalog.json] (0005)     3.7173 (1.01)     3.7198 (1.01)     268.8317 (0.99)
-------------------------------------------------------------------------------------------------

------------------------- benchmark 'github.json deserialization': 5 tests ------------------------
Name (time in us)                         Min                Mean            OPS (Kops/s)
---------------------------------------------------------------------------------------------------
loads[orjson-github.json] (0001)     133.5345 (1.12)     133.6741 (1.12)           7.4809 (0.89)
loads[orjson-github.json] (0002)     133.9510 (1.12)     134.0037 (1.12)           7.4625 (0.89)
loads[orjson-github.json] (0003)     119.2316 (1.0)      119.3161 (1.0)            8.3811 (1.0)
loads[orjson-github.json] (0004)     134.1154 (1.12)     134.3008 (1.13)           7.4460 (0.89)
loads[orjson-github.json] (0005)     134.7232 (1.13)     135.2986 (1.13)           7.3911 (0.88)
---------------------------------------------------------------------------------------------------

-------------------- benchmark 'twitter.json deserialization': 5 tests ---------------------
Name (time in ms)                        Min              Mean                 OPS
--------------------------------------------------------------------------------------------
loads[orjson-twitter.json] (0001)     2.0658 (1.26)     2.0670 (1.26)     483.7979 (0.79)
loads[orjson-twitter.json] (0002)     1.7103 (1.04)     1.7108 (1.04)     584.5319 (0.96)
loads[orjson-twitter.json] (0003)     1.6404 (1.0)      1.6411 (1.0)      609.3393 (1.0)
loads[orjson-twitter.json] (0004)     1.8574 (1.13)     1.8588 (1.13)     537.9842 (0.88)
loads[orjson-twitter.json] (0005)     2.0286 (1.24)     2.0289 (1.24)     492.8792 (0.81)
--------------------------------------------------------------------------------------------
```

Comparing 0003 vs 0004 vs 0005 on the twitter.json benchmark, this PR narrows the PyUnicode_FromStringAndSize() slowdown from 19% to 12%.

Successfully merging this pull request may close the issue: Improve UTF-8 decode speed

3 participants