Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

memoryview is a Sequence but does not implement the full Sequence API #125420

Closed
JelleZijlstra opened this issue Oct 13, 2024 · 23 comments
Closed
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) type-feature A feature request or enhancement

Comments

@JelleZijlstra
Copy link
Member

JelleZijlstra commented Oct 13, 2024

Bug report

Bug description:

The memoryview builtin is registered as a Sequence, but doesn't implement the full API, because it doesn't inherit from Sequence at runtime and doesn't implement the mixin methods.

>>> import collections.abc
>>> issubclass(memoryview, collections.abc.Sequence)
True
>>> collections.abc.Sequence.index
<function Sequence.index at 0x10148d250>
>>> memoryview.index
Traceback (most recent call last):
  File "<python-input-4>", line 1, in <module>
    memoryview.index
AttributeError: type object 'memoryview' has no attribute 'index'

Among the methods listed for Sequence in the documentation, memoryview is missing __contains__, __reversed__, index, and count. This is causing problems for typing; in the typeshed stubs we have to either lie one way by claiming that memoryview has methods it doesn't have, or lie another way by claiming it is not a Sequence when issubclass() says otherwise at runtime (python/typeshed#12800). To fix this, we should either make memoryview not a Sequence, or add the missing methods. The former has compatibility implications, so I propose to add the methods in 3.14.

CPython versions tested on:

CPython main branch

Operating systems tested on:

No response

Linked PRs

@JelleZijlstra JelleZijlstra added the type-bug An unexpected behavior, bug, or error label Oct 13, 2024
@JelleZijlstra
Copy link
Member Author

cc @rhettinger for ABCs and @AlexWaygood for prior discussion on typeshed.

@picnixz picnixz added the interpreter-core (Objects, Python, Grammar, and Parser dirs) label Oct 14, 2024
@picnixz
Copy link
Member

picnixz commented Oct 14, 2024

I think it would be nice to have such addition. If we were to add them in 3.14, I'd like to help! (at least with one method; I think it's better if we don't add them all in one go).

@JelleZijlstra
Copy link
Member Author

Go ahead, I haven't started coding yet.

@ZeroIntensity
Copy link
Member

IIRC, the sequence protocol on memoryview works in the C API. You guys should be able to just use those functions to implement the methods.

@picnixz
Copy link
Member

picnixz commented Oct 14, 2024

By the way:

>>> reversed(memoryview(b'a'))
<reversed object at 0x7f6ece3fd8b0>

works and we actually have tests for that. However, I think there is somewhere an implicit cast or a generic reversed that is being used via indices + length maybe? So I don't know whether we should have a dedicated reversed method or just dispatch to the generic implementation.

@JelleZijlstra
Copy link
Member Author

Yes, reversed() works through len and getitem lookup. It may be worth adding that method if it makes reversing more efficient, but I'd also be OK with skipping that one and using some hacks in the stub; after all, it's unlikely people are calling __reversed__ directly.

__contains__ is in a similar situation; in already works on memoryviews. However, there it's likely that a contains implementation written directly for memoryview is significantly more efficient.

@picnixz
Copy link
Member

picnixz commented Oct 14, 2024

Well... the __contains__ I used is just by iterating over the object for now. I actually haven't checked that __contains__ worked beforehand. I can definitely make it more efficient though if performances matter (it's just that the __contains__ does not exist per se).

@ZeroIntensity
Copy link
Member

I took a quick glance at the PRs and they seem fine to me, but I would be very cautious about making sure that e.g. memoryview.index does the same thing as PySequence_Index, otherwise you're bound to break untold amounts of code relying on the buffer protocol.

@JelleZijlstra
Copy link
Member Author

That's a good point (though I'm not sure there are untold amounts of code using PySequence_Index on memoryviews). PySequence_Index delegates to _PySequence_IterSearch in abstract.c, which works similar to the implementation in Bénédikt's PRs.

@picnixz
Copy link
Member

picnixz commented Oct 14, 2024

... I forgot to finish the multi-dimensional case. And the fact that it's not only basic memoryviews that should be supported but the entire protocol... I'll need to check the multi-dimensional case carefully.

you're bound to break untold amounts of code relying on the buffer protocol.

(I'll try passing on that achievement)

@JelleZijlstra
Copy link
Member Author

It appears iteration on multidimensional memoryviews currently isn't implemented at all, so we don't need to support it for these new methods either.

@picnixz
Copy link
Member

picnixz commented Oct 15, 2024

I've implemented the missing (naive) interface. The performances should be better than just the generic implementations since we have less checks but some methods can be improved by directly accessing the underlying buffer instead of creating iterators.

If such performance gain is needed, we'll address those concerns in a separate issue.

@ZeroIntensity
Copy link
Member

I can look closer at the PRs sometime today. I have a few general comments at a quick glance:

  • You can probably get some micro optimizations by using the thread state and the private API (if we even do that in the standard library, I'm not sure).
  • There any new assertions--those are good sanity checks, and especially useful when debugging.
  • Be careful to make sure that the buffer has everything it should. If you just copied over from the existing implementation, this should be fine (though, assertions are your friend here).

@picnixz
Copy link
Member

picnixz commented Oct 15, 2024

Performances should really be addressed separately. We try to avoid using the private API if this is not performance critical or if it is exposed "almost directly" to the user. Before using the private API, we need benchmarks but I think it's first better not to do it in those PRs. We will definitely gain performances by directly accessing the buffers but those should be investigated.

For assertions, which place do you have in in mind? I can find some in the __reversed__ implementation since I'm making assumption on the inputs, so I can update it later but otherwise I can't find other places where assertions need to be required.

@ZeroIntensity
Copy link
Member

ZeroIntensity commented Oct 15, 2024

When writing C code, I like to put assertions everywhere and anywhere. They add no overhead at runtime (they get compiled out on release builds) and make debugging a lot easier. Generally, if you make an assumption, add an assertion.

@JelleZijlstra
Copy link
Member Author

The performances should be better

That's a dangerous statement; it would be good to run some microbenchmarks on operations like x in memoryview(), reversed(memoryview()).

I am going to wait a bit before merging any of the PRs here in case other core devs want to voice their opinion on whether this is a good idea.

@rhettinger
Copy link
Contributor

I'm conflicted about this one.

On the plus side, this makes the APIs more consistent and will simplify the typeshed stubs.

On the minus side, it feels a bit off to implement (and have to test and maintain) an actual concrete implementation for functionality that no user has ever needed. It seems a little like the tail wagging the dog. In the past, we've usually applied a principle of parsimony and resisted API expansions until good use cases have arisen. That has helped avoid cruft accumulating for features than never get used.

@picnixz
Copy link
Member

picnixz commented Oct 18, 2024

On the plus side, we could also benefit from possible improvements though that's just a gut feeling.

For instance materializing the reverse iterator is now slower but iterating over it is faster if the buffer is large enough.

This could also serve for the multidimensional case but I have no idea if this is a feature that we plan to have or not. The APIs were simple enough to implement so it didn't bother me btw.

@JelleZijlstra
Copy link
Member Author

I feel that we should add count and index. memoryview is registered as a Sequence, and Sequence is documented as having those methods, so to me it seems like a bug that the methods don't exist on memoryview.

I don't feel as strongly about the missing dunders, __contains__ and __reversed__. Adding them makes the typing a bit more tidy, but we could manage without them. Adding those methods may produce performance wins, but there probably isn't a lot of real code that relies on reversing memoryviews or doing containment checks on them.

@picnixz
Copy link
Member

picnixz commented Oct 18, 2024

Yeah index and count seem important since they are public and documented. For contains and reversed I think we can just rely on the default dunders.

I'm actually wondering but does "being a sequence" means "having a __contains__ method" or does it mean "x in seq" works? or is it not detailed on purpose?

@picnixz
Copy link
Member

picnixz commented Dec 15, 2024

@JelleZijlstra @rhettinger Some questions:

  • Do you want the count and index being backported or not? They are half a feature, half a bug fix.

  • Do you think we should add the dunder methods. The PRs are ready but there is room to improvements at the cost of PR's readability (see the following discussion: gh-125420: implement Sequence.__contains__ API on memoryview objects #125441 (comment)).

  • Do you think we should strive for an implementation using indices instead of materializing iterators? The two PRs that were merged are using the "simplest" (but not necessarily the fastest) approach since we do not support multi-dimensional slicing. Considering the lack of development for multi-dimensional slicing for the past ten years, I assumed that implementing them is either a rarer use-case or has been (possibly) delegated to third-party libs (maybe numpy or perhaps tensorflow, I don't know if they implement this).

  • Finally, could someone enlighten me (formally) about the following:

    I'm actually wondering but does "being a sequence" means "having a __contains__ method" or does it mean "x in seq" works? or is it not detailed on purpose?

@JelleZijlstra
Copy link
Member Author

* Do you want the `count` and `index` being backported or not?

We should not backport these changes. Adding a new method is a feature.

* Do you think we should add the dunder methods. The PRs are ready but there is room to improvements at the cost of PR's readability (see the following discussion: [gh-125420: implement `Sequence.__contains__` API on `memoryview` objects #125441 (comment)](https://github.com/python/cpython/pull/125441#discussion_r1885415227)).

I think it's probably not necessary, though we could add these methods if they significantly improve performance. For what it's worth, there are various other builtin classes that implement Sequence but don't have the dunders (results on main):

>>> [x for x in object.__subclasses__() if issubclass(x, collections.abc.Sequence) and not hasattr(x, "__contains__")]
[<class 'memoryview'>, <class 'sqlite3.Row'>]
>>> [x for x in object.__subclasses__() if issubclass(x, collections.abc.Sequence) and not hasattr(x, "__reversed__")]
[<class 'bytearray'>, <class 'bytes'>, <class 'memoryview'>, <class 'tuple'>, <class 'str'>, <class 'sqlite3.Row'>]
* Do you think we should strive for an implementation using indices instead of materializing iterators? The two PRs that were merged are using the "simplest" (but not necessarily the fastest) approach since we do not support multi-dimensional slicing. Considering the lack of development for multi-dimensional slicing for the past ten years, I assumed that implementing them is either a rarer use-case or has been (possibly) delegated to third-party libs (maybe numpy or perhaps tensorflow, I don't know if they implement this).

I think it would be nice as a performance optimization, but it doesn't seem that urgent to me.

* Finally, could someone enlighten me (formally) about the following:
  > I'm actually wondering but does "being a sequence" means "having a `__contains__` method" or does it mean "`x in seq`" works? or is it not detailed on purpose?

I think in practice, it's the latter: x in seq works. What matters is the user-visible operation; the dunders are an implementation detail. This makes life harder for static type checking, but we'll survive. I wouldn't be opposed to an effort to add those missing dunders consistently to builtin classes, but that seems like a risky project.

@picnixz picnixz added type-feature A feature request or enhancement and removed type-bug An unexpected behavior, bug, or error 3.14 new features, bugs and security fixes labels Dec 15, 2024
@picnixz
Copy link
Member

picnixz commented Dec 15, 2024

We should not backport these changes. Adding a new method is a feature.

I've changed the labels to reflect this.

For what it's worth, there are various other builtin classes that implement Sequence

Considering this, I'll close this issue as being completed and prefer re-opening later if we need the dunders since the corresponding implementation could be discussed more in details. WDYT?

it's the latter: x in seq works

Thanks for the explanation.

@picnixz picnixz closed this as completed Dec 15, 2024
srinivasreddy pushed a commit to srinivasreddy/cpython that referenced this issue Jan 8, 2025
srinivasreddy pushed a commit to srinivasreddy/cpython that referenced this issue Jan 8, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) type-feature A feature request or enhancement
Projects
None yet
Development

No branches or pull requests

4 participants