Replies: 2 comments 8 replies
-
Not that it's not worth trying to speed this up, but wouldn't you be covered by a …
-
I think with such fine-grained operations, you are at the limits of what can currently be achieved with PyO3, i.e. the work we do to provide type-safe access to generic Rust code does imply a certain overhead at the Python–Rust boundary, which requires a certain minimum amount of work on the Rust side to still result in speed-ups. That said, we are continuously working to decrease that overhead. For example, we will remove the …

But then again, accesses to thread-local storage figure prominently in your profiles, which I think is actually one of our coping mechanisms: we keep a private flag in thread-local storage to indicate that the GIL is held, so that we do not have to call into the Python interpreter to verify this for nested calls to …

But for example in your …

As for …

Finally, you should be able to avoid some GIL-wrangling overhead using unsafe code, e.g. provide both

```rust
#[repr(transparent)]
struct MyPyObject(PyObject);
```

and

```rust
#[repr(transparent)]
struct MyBoundAny<'py>(Bound<'py, PyAny>);
```

and in your … This should work because … However, this will not reduce the constant overhead of the outermost setup we do to make …
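The thread-local flag mentioned above can be sketched in plain Rust. This is a std-only illustration of the mechanism described, with invented names (`GIL_HELD`, `with_gil`) that are not PyO3's actual internals:

```rust
use std::cell::Cell;

// Per-thread flag standing in for "is the GIL held on this thread?".
thread_local! {
    static GIL_HELD: Cell<bool> = Cell::new(false);
}

// Hypothetical sketch: nested calls see the flag already set and skip the
// expensive acquisition path entirely; only the outermost call pays for it.
fn with_gil<R>(f: impl FnOnce() -> R) -> R {
    let already_held = GIL_HELD.with(|h| h.get());
    if already_held {
        // Nested call: the thread-local flag tells us we are covered,
        // without calling into the interpreter to check.
        f()
    } else {
        GIL_HELD.with(|h| h.set(true)); // a real binding would acquire the GIL here
        let result = f();
        GIL_HELD.with(|h| h.set(false)); // ...and release it here
        result
    }
}
```

The thread-local reads themselves are what show up in the profile: cheap, but not free at this call granularity.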
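To illustrate why the `#[repr(transparent)]` suggestion is sound, here is a std-only sketch using a stand-in `RawHandle` type in place of `PyObject` (all names here are invented for illustration): the attribute guarantees the wrapper has exactly the layout of its single field, which is what makes the reference reinterpretation defensible.

```rust
// Stand-in for an FFI handle type such as PyObject.
struct RawHandle(u64);

// `#[repr(transparent)]` guarantees MyHandle has the same size, alignment,
// and ABI as its single field, so &RawHandle and &MyHandle are layout-compatible.
#[repr(transparent)]
struct MyHandle(RawHandle);

impl MyHandle {
    // Zero-cost reinterpretation of a borrowed handle as the newtype.
    fn from_ref(raw: &RawHandle) -> &MyHandle {
        // SAFETY: MyHandle is #[repr(transparent)] over RawHandle, so the
        // two reference types have identical layout and validity invariants.
        unsafe { &*(raw as *const RawHandle as *const MyHandle) }
    }
}
```

The same pattern applied to `PyObject`/`Bound<'py, PyAny>` is what lets a crate expose its own methods on borrowed Python objects without extra conversions at the boundary.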
-
Hi all,
First off, thank you for making and maintaining PyO3! It's been a joy to use in general, despite being a total Rust newbie.
For ... reasons, I have a need for a set type that remembers its insertion order, just like Python's built-in `dict`. I came across https://github.com/indexmap-rs/indexmap and thought, well, let's just wrap that into Python, how hard could it be? The results of my efforts are here: https://github.com/inducer/indexset/.

Unfortunately, that wrapper gets demolished a bit in terms of performance by Python's `set`. This is what I get (with `maturin develop -r`) for a simple benchmark. Here's that benchmark script:
https://github.com/inducer/indexset/blob/0c765a23a2dbe679156560c63a364431ef2a3019/examples/benchmark.py
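For readers unfamiliar with `indexmap`, here is a rough std-only sketch of the data structure an insertion-order-preserving set needs: a hash map for O(1) membership plus a `Vec` that remembers insertion order. This is an illustration with invented names, not the crate's actual (more compact) implementation:

```rust
use std::collections::HashMap;
use std::hash::Hash;

// Minimal insertion-order set: membership via `index`, order via `items`.
struct OrderedSet<T: Hash + Eq + Clone> {
    index: HashMap<T, usize>, // value -> position in `items`
    items: Vec<T>,            // values in insertion order
}

impl<T: Hash + Eq + Clone> OrderedSet<T> {
    fn new() -> Self {
        Self { index: HashMap::new(), items: Vec::new() }
    }

    /// Returns true if the value was newly inserted.
    fn insert(&mut self, value: T) -> bool {
        if self.index.contains_key(&value) {
            return false; // already present; order is unchanged
        }
        self.index.insert(value.clone(), self.items.len());
        self.items.push(value);
        true
    }

    fn contains(&self, value: &T) -> bool {
        self.index.contains_key(value)
    }

    /// Iterate in insertion order.
    fn iter(&self) -> impl Iterator<Item = &T> {
        self.items.iter()
    }
}
```

Each `insert`/`contains` does one hash plus one equality check on a hit, which is why per-element overhead at the Python boundary dominates so easily in the benchmark.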
Benchmarking `create`

For the `create` operation, this is where `perf` says the time is being spent. It looks like GIL wrangling is a major culprit here. At the same time, the object-creation codepath looks pretty textbook:
https://github.com/inducer/indexset/blob/0c765a23a2dbe679156560c63a364431ef2a3019/src/lib.rs#L55-L57
I don't know what I could/should be doing differently.
Benchmarking `add`

For the `add` operation, this is where `perf` says the time is being spent. So the underlying data type appears to be responsible for some of the damage here, but the equality comparison also features prominently. There's some GIL wrangling in there, too.
I had a hunch that storing the hash along with the object might be profitable (by avoiding some of the GIL acquisitions here), but that did not pan out: it made things even slower. See this PR.
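The idea behind that experiment can be sketched in plain Rust with stand-in types (names here are hypothetical): hash once at construction, replay the cached hash in `Hash`, and let equality short-circuit when the cached hashes differ. As noted above, for wrapped Python objects this turned out slower in practice; whether it pays off depends on how expensive the real hash/equality calls are.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// A value paired with its hash, computed exactly once.
struct Prehashed<T> {
    hash: u64,
    value: T,
}

impl<T: Hash> Prehashed<T> {
    fn new(value: T) -> Self {
        let mut h = DefaultHasher::new();
        value.hash(&mut h);
        Self { hash: h.finish(), value }
    }
}

impl<T> Hash for Prehashed<T> {
    fn hash<H: Hasher>(&self, state: &mut H) {
        // Replay the cached hash instead of re-hashing the value.
        state.write_u64(self.hash);
    }
}

impl<T: PartialEq> PartialEq for Prehashed<T> {
    fn eq(&self, other: &Self) -> bool {
        // Different cached hashes => definitely unequal; skip the real compare.
        self.hash == other.hash && self.value == other.value
    }
}

impl<T: Eq> Eq for Prehashed<T> {}
```

For Python objects the comparison still has to call back into the interpreter on a hash match, so the cached hash only helps when mismatches are common and the real hash call is the expensive part.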
Discussion
The wrapper code is not obnoxiously long and is pretty simple:
https://github.com/inducer/indexset/blob/0c765a23a2dbe679156560c63a364431ef2a3019/src/lib.rs
At one point, I guessed that memory allocation would feature prominently, but it doesn't even show up in the profile.
I'm a bit out of ideas what I could try to improve matters, and I'd be grateful for any advice.
Edit: I just realized I should have specified PyO3 versions and other environment specifics. Most of that is in the lock file. Otherwise, I'm using Python 3.12 on a Raptor Lake laptop running Debian testing/unstable with …
cc @matthiasdiener