gh-107137: Add _PyTupleBuilder API to the internal C API #107139

vstinner · 2023-07-23T17:10:10Z

Add _PyTupleBuilder structure and functions:

_PyTupleBuilder_Init()
_PyTupleBuilder_Alloc()
_PyTupleBuilder_Append()
_PyTupleBuilder_AppendUnsafe()
_PyTupleBuilder_Finish()
_PyTupleBuilder_Dealloc()

The builder tracks the size of the tuple and resizes it in _PyTupleBuilder_Finish() if needed. It doesn't allocate an empty tuple. It allocates an array of 16 objects on the stack to avoid allocating a small tuple. _PyTupleBuilder_Append() overallocates the tuple by 25% to reduce the number of _PyTuple_Resize() calls.
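The growth policy described above can be modelled in a few lines of Python. This is only a sketch under stated assumptions: the real _PyTupleBuilder is C code in the PR, the exact overallocation formula is not shown in this thread, and the names SMALL_SIZE, grow() and count_reallocations() are hypothetical.

```python
# Illustrative model only; not the PR's C implementation.
SMALL_SIZE = 16  # size of the on-stack array mentioned in the PR description


def grow(capacity: int) -> int:
    """Return a capacity at least one larger, overallocated by about 25%.

    Assumption: the PR's exact formula is not given in this thread.
    """
    needed = capacity + 1
    return needed + needed // 4


def count_reallocations(n_items: int) -> int:
    """Count how often the storage must grow while appending n_items."""
    capacity = SMALL_SIZE
    reallocations = 0
    for size in range(n_items):
        if size == capacity:
            # Storage is full: grow it before storing the next item.
            capacity = grow(capacity)
            reallocations += 1
    return reallocations
```

Under this model, results of up to 16 items never leave the initial array, and longer results grow geometrically, so the number of resize calls increases much more slowly than the number of appended items.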

Do not track the temporary internal tuple in the GC before _PyTupleBuilder_Finish() creates the final complete and consistent tuple object.

Use _PyTupleBuilder API in itertools batched_traverse(), PySequence_Tuple() and initialize_structseq_dict().

Also add helper functions:

_PyTuple_ResizeNoTrack()
_PyTuple_NewNoTrack()

Issue: C API: Add internal C API to build a tuple: _PyTupleBuilder #107137

vstinner · 2023-07-23T17:11:06Z

I chose to use the unsigned size_t type in the structure and for the size argument of _PyTupleBuilder_Alloc(). I'm not sure if it's better or worse than the classic signed Py_ssize_t type.

cc @serhiy-storchaka @methane @pablogsal @erlend-aasland @corona10

vstinner · 2023-07-23T18:27:07Z

HPy API to build tuples and lists: https://docs.hpyproject.org/en/latest/api-reference/builder.html The HPy API seems to only support creating a tuple of a size known in advance: it's not possible to extend or shrink the tuple.

vstinner · 2023-07-23T18:28:32Z

I'm considering adding a "Set" function later, and/or maybe a "GetUnsafe" function. But I prefer to start with a minimal API :-)

gvanrossum · 2023-07-23T21:53:09Z

-1. This feels unnecessarily elaborate.

vstinner · 2023-07-23T22:14:12Z

This API is a fix for this old issue: #59313 "Incomplete tuple created by PyTuple_New() and accessed via the GC can trigged a crash" which was closed as "not a bug" in 2021 by @pablogsal.

vstinner · 2023-07-23T22:16:18Z

@gvanrossum:

> -1. This feels unnecessarily elaborate.

Do you mean that capi-workgroup/problems#56 is not an issue, or that this API is too complicated to use?

gvanrossum · 2023-07-23T22:33:08Z

I believe we should review the problem before jumping to a solution. And if this is the solution, I'm not sure that it's worth fixing the problem. So please hold your horses here.

vstinner · 2023-07-23T23:19:40Z

The root issue is that PyTuple_New() immediately tracks the tuple in the GC. Moreover, _PyTuple_Resize() also tracks the tuple in the GC. My PR adds _PyTuple_NewNoTrack() and _PyTuple_ResizeNoTrack() to avoid this issue.

The _PyTupleBuilder API is built on top of them to wrap the memory allocations.

vstinner · 2023-07-23T23:21:30Z

PR rebased to fix a merge conflict.

gvanrossum · 2023-07-24T00:55:32Z

All I am asking is that you hold off for now.

vstinner · 2023-07-24T00:58:26Z

I marked this PR as a draft.

corona10 · 2023-07-24T10:10:30Z

Guido, even if there is a better way to solve this issue, adding an internal private API doesn't look harmful.
If we decide to adopt a better solution in the future, we can replace the API with a better one at any time.

gvanrossum · 2023-07-24T15:51:41Z

@corona10 I worry that APIs are forever, even internal ones. So I recommend that we have a discussion somewhere so we're all on the same page about the problem we're trying to solve, and then we can decide how to solve it, rather than jumping the gun. (I don't require unanimity, just more people than Victor alone having thought about it and come to roughly the same conclusion.)

corona10 · 2023-07-24T16:35:43Z

> So I recommend that we have a discussion somewhere so we're all on the same page about the problem we're trying to solve, and then we can decide how to solve it, rather than jumping the gun

Thank you for the answer. CPython seems to be on the brink of change in many areas.
All of these topics are very controversial right now, and it looks like a time when some core team members have to pause everything for a while and watch. But I hope that it won't last long :)

vstinner · 2023-07-24T16:36:50Z

In the main Python branch, you can still crash PySequence_Tuple() because it creates a temporary tuple with PyTuple_New() which is immediately tracked by the GC:

import gc
TAG = object()

def monitor():
    lst = [x for x in gc.get_referrers(TAG)
           if isinstance(x, tuple)]
    t = lst[0]   # this *is* the result tuple
    print(t)     # full of nulls !
    return t     # Keep it alive for some time

def my_iter():
    yield TAG    # 'tag' gets stored in the result tuple
    t = monitor()
    for x in range(10):
        yield x  # SystemError when the tuple needs to be resized

tuple(my_iter())

code from: #59313 (comment)

Program in gdb:

vstinner@mona$ gdb -args ./python x.py 
(...)

(<object object at 0x7fffea574af0>, <NULL>, <NULL>, <NULL>, <NULL>, <NULL>, <NULL>, <NULL>, <NULL>, <NULL>)

Breakpoint 1, _PyTuple_Resize (pv=0x7fffffffb8b0, newsize=25)
    at Objects/tupleobject.c:910
910	        PyErr_BadInternalCall();

(gdb) up
#1  0x00000000004cd05e in PySequence_Tuple (
    v=<generator at remote 0x7fffea585a90>) at Objects/abstract.c:2135
2135	            if (_PyTuple_Resize(&result, n) != 0) {

The crash occurs in _PyTuple_Resize(): that's why I added not only _PyTuple_NewNoTrack(), but also _PyTuple_ResizeNoTrack().

The problem is that there is a second strong reference to the tuple (created by monitor()): the (Py_SIZE(v) != 0 && Py_REFCNT(v) != 1) check in _PyTuple_Resize() rejects the tuple, which leads to the PyErr_BadInternalCall() shown above.
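The mechanism can be observed safely from Python on a finished tuple instead of one that is still being filled. The sketch below is only an illustration: the names TAG, t and found are hypothetical, and the tuple holds a list so that it is certain to stay tracked by the GC.

```python
import gc
import sys

TAG = object()
t = (TAG, [])  # holds a list, so the tuple stays tracked by the GC

# A tracked tuple is visible to gc.get_referrers().
assert gc.is_tracked(t)

before = sys.getrefcount(t)
# Ask the GC which tracked objects refer to TAG, and keep the tuple if found.
found = [x for x in gc.get_referrers(TAG) if x is t]
after = sys.getrefcount(t)

# 'found' now holds one more strong reference to the tuple.
print(len(found), after > before)
```

Any code that obtains a tuple this way while it is still being built adds a strong reference to it, which is the situation that the reference count check in _PyTuple_Resize() refuses to handle.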

vstinner · 2023-07-24T16:38:56Z

> In the main Python branch, you can still crash PySequence_Tuple() because it creates a temporary tuple with PyTuple_New() which is immediately tracked by the GC

This PR fixes this bug.

With this PR, monitor() cannot find the tuple anymore, since it's no longer tracked by the GC while the tuple is being filled, and so it's no longer possible to crash Python in PySequence_Tuple().

monitor() now fails with:

Traceback (most recent call last):
  File "/home/vstinner/python/main/x.py", line 17, in <module>
    tuple(my_iter())
  File "/home/vstinner/python/main/x.py", line 13, in my_iter
    t = monitor()
        ^^^^^^^^^
  File "/home/vstinner/python/main/x.py", line 7, in monitor
    t = lst[0]   # this *is* the result tuple
        ~~~^^^
IndexError: list index out of range

gvanrossum · 2023-07-24T16:49:15Z

@vstinner Please stop pushing. We've lived with this forever. It doesn't have to be fixed today.

vstinner · 2023-07-24T17:06:27Z

I extracted the non-controversial part of this PR, only _PyTuple_NewNoTrack() and _PyTuple_ResizeNoTrack(), in a new PR fixing the PySequence_Tuple() crash: PR #107183.

Modules/itertoolsmodule.c

rhettinger · 2023-07-28T18:53:55Z

Can tuples be made robust enough to survive being called while partially filled? The tuple_dealloc code already uses Py_XDECREF to support NULL elements. Perhaps the other tuple methods could be similarly fortified.

If not, tools like itertools.batched still have another reasonable defense against the likes of gc.get_referrers(). When PyTuple_New(n) is called, it can be immediately filled with Py_None objects. Then as new data arrives, it can be swapped in. That way the tuple is always in a consistent state even if accessed by GC before completion.
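The pre-filling idea can be sketched in Python, with a list standing in for the C-level tuple storage. This is only a model under stated assumptions: batched_prefilled() is a hypothetical name, and it is not the code of itertools.batched, which is written in C.

```python
from itertools import islice


def batched_prefilled(iterable, n):
    """Yield tuples of up to n items, building each one in pre-filled storage."""
    it = iter(iterable)
    while True:
        # The storage holds valid objects (None) from the start, so it is
        # never observable with empty slots.
        slots = [None] * n
        filled = 0
        for item in islice(it, n):
            slots[filled] = item  # swap the real data in, one slot at a time
            filled += 1
        if filled == 0:
            return
        # Only the filled part becomes the result (a shorter final batch).
        yield tuple(slots[:filled])


print(list(batched_prefilled(range(5), 2)))
```

In this model the trade-off is that an observer sees None placeholders rather than missing items; whether existing C code depends on unfilled items being NULL is a separate question that the sketch does not address.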

gvanrossum · 2023-07-28T23:55:17Z

Raymond's idea of filling with None is more viable than it used to be since None is now immortal (as of 3.12) so we won't need to worry about its refcount. Still, I worry that there might be C code that creates a new tuple and somehow relies on the items being NULL.

vstinner · 2023-08-26T02:58:08Z

It seems that the issues solved by the proposed _PyTupleBuilder API have other solutions, discussed in PR #107183. This API doesn't seem to be the preferred option, so I prefer to close my PR and investigate other options first.

vstinner · 2023-10-30T11:34:41Z

Follow-up: I created issue #111489 to make _PyTuple_FromArraySteal() and _PyList_FromArraySteal() functions public.

vstinner added the skip news label Jul 23, 2023

vstinner requested a review from rhettinger as a code owner July 23, 2023 17:10

bedevere-bot mentioned this pull request Jul 23, 2023

C API: Add internal C API to build a tuple: _PyTupleBuilder #107137

Closed

bedevere-bot added the awaiting core review label Jul 23, 2023

vstinner mentioned this pull request Jul 23, 2023

C API: Remove private C API functions (move them to the internal C API) #106320

Closed

vstinner mentioned this pull request Jul 23, 2023

Disallow creation of incomplete/inconsistent objects capi-workgroup/problems#56

Open

vstinner force-pushed the tuple_builder branch from 38a4c0c to e55f80a Compare July 23, 2023 23:21

vstinner marked this pull request as draft July 24, 2023 00:58

bedevere-bot removed the awaiting core review label Jul 24, 2023

vstinner mentioned this pull request Jul 24, 2023

gh-107137: Add _PyTuple_NewNoTrack() internal C API #107183

Closed

rhettinger removed their request for review July 25, 2023 20:53

rhettinger reviewed Jul 25, 2023

vstinner closed this Aug 26, 2023

vstinner deleted the tuple_builder branch August 26, 2023 02:58

This was referenced Oct 30, 2023

gh-106168: Check allocated instead of size index bounds in PyList_SET_ITEM() #111480

Merged

[C API] Make _PyList_FromArraySteal() function public #111489

Closed
