
Make some PyMutex functions public #117511

Closed
colesbury opened this issue Apr 3, 2024 · 13 comments
Labels
topic-free-threading type-feature A feature request or enhancement

Comments

@colesbury
Contributor

colesbury commented Apr 3, 2024

Feature or enhancement

Overview

The PyMutex APIs are currently internal-only in pycore_lock.h. This proposes making the type and two functions public as part of the general, non-limited API in Include/cpython/lock.h.

The APIs to be included in the public header are:

typedef struct PyMutex {
    uint8_t v;
} PyMutex;

static inline void PyMutex_Lock(PyMutex *m) { ... }
static inline void PyMutex_Unlock(PyMutex *m) { ... }

// (private) slow path for locking the mutex
PyAPI_FUNC(void) _PyMutex_LockSlow(PyMutex *m);

// (private) slow path for unlocking the mutex
PyAPI_FUNC(void) _PyMutex_UnlockSlow(PyMutex *m);

Motivation

  • With the free-threaded build, C API extensions are more likely to require locking for thread-safety
  • The PyMutex API is easier to use correctly than the existing PyThread_type_lock. PyMutex releases the GIL before blocking, whereas with the PyThread_type_lock APIs you often need a two-step approach to locking so that the GIL is released before blocking (see the sketch after this list).
  • The PyMutex API enables a more efficient implementation: it's faster to acquire when the lock is not contended and does not require allocation.
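
As a rough sketch of that difference (the function and variable names below are illustrative, not part of the proposal): with PyThread_type_lock you typically attempt a non-blocking acquire first and release the GIL only if you have to block, whereas PyMutex_Lock() handles that on its slow path.

#include "Python.h"

/* Existing idiom: two-step acquire so the GIL is released while blocked. */
static PyThread_type_lock old_lock;   /* created elsewhere with PyThread_allocate_lock() */

static void
with_thread_lock(void)
{
    if (!PyThread_acquire_lock(old_lock, NOWAIT_LOCK)) {
        /* Contended: release the GIL before blocking. */
        Py_BEGIN_ALLOW_THREADS
        PyThread_acquire_lock(old_lock, WAIT_LOCK);
        Py_END_ALLOW_THREADS
    }
    /* ... critical section ... */
    PyThread_release_lock(old_lock);
}

/* Proposed API: the fast path is a single compare-exchange; the slow path
   releases the GIL for us if it needs to block. */
static PyMutex new_lock;              /* zero-initialized, no allocation needed */

static void
with_pymutex(void)
{
    PyMutex_Lock(&new_lock);
    /* ... critical section ... */
    PyMutex_Unlock(&new_lock);
}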

C-API WG issue

cc @ngoldbaum

Linked PRs

@colesbury added the type-feature and topic-free-threading labels on Apr 3, 2024
@da-woods
Contributor

da-woods commented Apr 3, 2024

What's the private slow path for? I'm a bit puzzled why you'd ever choose that?

@colesbury
Contributor Author

PyMutex_Lock and PyMutex_Unlock are static inline functions, but we really only want to inline the simple fast-path case where the mutex is not contended. The slow path functions exist for PyMutex_Lock and PyMutex_Unlock to call when the fast path fails.

static inline void
PyMutex_Lock(PyMutex *m)
{
    uint8_t expected = _Py_UNLOCKED;
    if (!_Py_atomic_compare_exchange_uint8(&m->v, &expected, _Py_LOCKED)) {
        _PyMutex_LockSlow(m);
    }
}

// Unlocks the mutex.
static inline void
PyMutex_Unlock(PyMutex *m)
{
    uint8_t expected = _Py_LOCKED;
    if (!_Py_atomic_compare_exchange_uint8(&m->v, &expected, _Py_UNLOCKED)) {
        _PyMutex_UnlockSlow(m);
    }
}

@ngoldbaum
Contributor

ngoldbaum commented Apr 3, 2024

There was a discussion about this late last year on discourse: https://discuss.python.org/t/exposing-atomic-operations-tooling-in-the-public-c-api/39973

I'm now looking at adding support for free-threaded Python in NumPy. My current plan is to use PyThread_type_lock or, when that is not performant enough, look at using platform-specific synchronization primitives or e.g. C11 threads.h.

I think just using PyMutex instead would be better since it’s a simple API and is fast enough for many use cases. I understand there’s a desire to avoid maintaining a suboptimal API forever, but I think CPython providing a migration story for C extension authors would save a lot of time distributed across the community.

@vstinner
Member

vstinner commented Apr 3, 2024

typedef struct _PyMutex { uint8_t v; } PyMutex;

Would it be possible to allocate it on the heap memory, and make the structure members opaque?

@colesbury
Contributor Author

colesbury commented Apr 3, 2024

A heap allocated mutex is less useful and harder to use. For example:

  • Sometimes you need a global lock. You can't do a heap allocation when you declare a global variable in C. If you do the allocation lazily before the PyMutex_Lock() it's too late -- now you need a lock to protect your lock initialization.
  • Embedding a PyMutex is cheap, but embedding a heap allocated mutex in a small object can be prohibitive.

I don't think making it opaque and heap allocated provides any concrete advantages in this case. Oftentimes, opaqueness gives us more flexibility in adjusting the implementation, but in this case it restricts the implementation in ways that matter.
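
To illustrate the global-lock case (a minimal sketch with hypothetical names, not code from this proposal): a PyMutex can simply be declared as a static variable and is immediately usable, with no allocation or initialization call.

#include "Python.h"

static PyMutex registry_mutex;   /* static storage: zero-initialized, i.e. unlocked */
static PyObject *registry;       /* shared state guarded by the mutex */

static int
register_item(PyObject *item)
{
    int rc;
    PyMutex_Lock(&registry_mutex);
    rc = PyList_Append(registry, item);
    PyMutex_Unlock(&registry_mutex);
    return rc;
}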

@vstinner
Member

vstinner commented Apr 4, 2024

My concern is that tomorrow we might find a more efficient mutex implementation or need new, more specific kinds of mutex (recursive, "adaptive", "error check", etc.), and we might be blocked by a structure that can no longer be modified.

There are two aspects: API compatibility (we can change the structure) and ABI compatibility (stable ABI, we cannot change the structure if it's exposed).

Maybe we can support the stable ABI by not exposing the structure in the limited C API, and only support heap allocation for the limited C API.


The Linux pthread mutex implementation supports heap allocation and static initialization, and the structure is opaque. Extract:

#ifdef __x86_64__
# if __WORDSIZE == 64
#  define __SIZEOF_PTHREAD_MUTEX_T 40
# else
#  define __SIZEOF_PTHREAD_MUTEX_T 32
# endif
#else
# define __SIZEOF_PTHREAD_MUTEX_T 24
#endif

typedef union
{
  struct __pthread_mutex_s __data;
  char __size[__SIZEOF_PTHREAD_MUTEX_T];
  long int __align;
} pthread_mutex_t;

#define PTHREAD_MUTEX_INITIALIZER \
 { {  __PTHREAD_MUTEX_INITIALIZER (PTHREAD_MUTEX_TIMED_NP) } }
#ifdef __USE_GNU
# define PTHREAD_RECURSIVE_MUTEX_INITIALIZER_NP \
 { {  __PTHREAD_MUTEX_INITIALIZER (PTHREAD_MUTEX_RECURSIVE_NP) } }
# define PTHREAD_ERRORCHECK_MUTEX_INITIALIZER_NP \
 { {  __PTHREAD_MUTEX_INITIALIZER (PTHREAD_MUTEX_ERRORCHECK_NP) } }
# define PTHREAD_ADAPTIVE_MUTEX_INITIALIZER_NP \
 { {  __PTHREAD_MUTEX_INITIALIZER (PTHREAD_MUTEX_ADAPTIVE_NP) } }
#endif

// ---- Internal ---------------------

struct __pthread_mutex_s
{
  int __lock;
  unsigned int __count;
  int __owner;
#ifdef __x86_64__
  unsigned int __nusers;
#endif
  /* KIND must stay at this position in the structure to maintain
     binary compatibility with static initializers.  */
  int __kind;
#ifdef __x86_64__
  short __spins;
  short __elision;
  __pthread_list_t __list;
# define __PTHREAD_MUTEX_HAVE_PREV      1
#else
  unsigned int __nusers;
  __extension__ union
  {
    struct
    {
      short __espins;
      short __eelision;
# define __spins __elision_data.__espins
# define __elision __elision_data.__eelision
    } __elision_data;
    __pthread_slist_t __list;
  };
# define __PTHREAD_MUTEX_HAVE_PREV      0
#endif
};

#ifdef __x86_64__
# define __PTHREAD_MUTEX_INITIALIZER(__kind) \
  0, 0, 0, 0, __kind, 0, 0, { 0, 0 }
#else
# define __PTHREAD_MUTEX_INITIALIZER(__kind) \
  0, 0, 0, __kind, 0, { { 0, 0 } }
#endif

I see that pthread_mutex contains far more information: lock, owner, nusers, spins, elision, and a singly linked list.

Maybe the list is used to solve the hard problem of keeping locks consistent at fork: see issue gh-50970. Do you plan to solve this problem or leave it aside?

@colesbury
Contributor Author

colesbury commented Apr 4, 2024

The mutex is already adaptive, both in the sense that it dynamically switches from spinning to blocking and from unfair (barging) to fair (handoff). We should not shoehorn a recursive mutex into PyMutex -- a recursive mutex should have its own data type -- but that would be feasible too. The primary means of extensibility is the wait entry, which is private, so we can add or remove fields as desired.

Linux's pthread mutex isn't really any more opaque than this [1]: note that struct __pthread_mutex_s is embedded in pthread_mutex_t. It's not an opaque pointer. We could do the same union, but it just adds unnecessary complexity without any flexibility. Linux (glibc) is stuck with 40-byte mutexes on x86-64. If we made this part of the stable ABI, which is not part of this proposal, then we'd be stuck with a 1-byte PyMutex. If you can do the same things with 1 byte as with 40 bytes, then 1 byte is better.

We already clear the wait entries at fork, which is better than the pthreads implementations. With pthreads, if you unlock a mutex you own after a fork, you might still deadlock due to trying to wake up dead threads. We don't have that problem with the PyMutex implementation because the wait entries are kept separately. But there's still no magic way to solve locks + forks because it's not enough to keep the locks consistent -- you really want to keep the data protected by the locks consistent too.

Footnotes

  1. If you #include <pthread.h> you get the full definition of struct __pthread_mutex_s. It's private (not documented), but not "internal" in the way that our pycore_xxx.h headers are internal.

@erlend-aasland
Contributor

Would it be possible to allocate it on the heap memory [...]

This issue is about exposing the existing mutex API, not about designing a new kind of mutex API.

Sam, you have a big +1 from me for exposing the existing PyMutex API.

@vstinner
Member

vstinner commented Apr 5, 2024

I'm fine with adding the proposed API, but not with adding it to the limited C API.

I suggest renaming struct _PyMutex to struct PyMutex.


The limited C API (and stable ABI) would need an alloc/free API similar to PyThread_allocate_lock() / PyThread_free_lock(), hide the structure, and expose PyMutex_Lock() and PyMutex_Unlock() in the limited C API (but still implement them as static inline functions in the public C API). Pseudo-API for a hypothetical limited C API:

typedef struct PyMutex PyMutex;
PyMutex *PyMutex_Alloc(void);
void PyMutex_Lock(PyMutex *mutex);
void PyMutex_Unlock(PyMutex *mutex);
void PyMutex_Free(PyMutex *mutex);

  • No public structure members.
  • No static inline functions or macros, only opaque function calls.
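
A sketch of how an extension built against this hypothetical limited C API might use it (PyMutex_Alloc and PyMutex_Free do not exist today; they are only part of the pseudo-API above):

#include "Python.h"

static PyMutex *module_lock;     /* opaque, heap-allocated */

static int
module_exec(PyObject *module)
{
    module_lock = PyMutex_Alloc();
    if (module_lock == NULL) {
        PyErr_NoMemory();
        return -1;
    }
    return 0;
}

static void
module_free(void *module)
{
    PyMutex_Free(module_lock);
    module_lock = NULL;
}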

An alternative is to reuse the PyThread API to expose PyMutex there in a stable ABI.

@dpdani
Contributor

dpdani commented Apr 19, 2024

What about adding an initialization API (albeit trivial)?

Right now, this is how I'm initializing a PyMutex, and here is how it's currently done in main for dict.

Of course it's trivial to set one field to 0, but it sounds like an implementation detail a programmer shouldn't need to worry about.
Would it make sense to also add a public init routine?

@colesbury
Contributor Author

colesbury commented Apr 19, 2024

No, I think an init routine just adds complexity. There are standard ways to zero (or default-value) initialize a structure in C and C++ that do not require knowledge of the fields:

PyMutex m = {};  // C23 and C++11
static PyMutex m;  // static storage duration is zero-initialized
PyMutex m = {0};  // C99 and C11 idiom. Note: zero-initializes all fields
self->mutex = (PyMutex){0};

@dpdani
Contributor

dpdani commented Apr 19, 2024

I can't really see how it adds complexity, but that might be my own short-sightedness.
Do you mean that making the call would be computationally more costly for a routine that will be called once per object allocation?

If not for that, it seems to me more future-proof to have an init routine to call, in case at some point whatever different value is needed at initialization time for the mutex to function.
(e.g. some flag bits or whatever may be found to be useful)

@colesbury
Contributor Author

colesbury commented Apr 19, 2024

It adds API complexity (not computational complexity). It is redundant with standard idioms for zero initialization. It doesn't add real flexibility: if you allow zero-initialization, people will rely on it, so you can't change the representation without breaking backwards compatibility even if you also provide an initialization API that does the same thing.

You would also need at least two APIs: one for static initialization and one for dynamically allocated objects.

colesbury added a commit to colesbury/cpython that referenced this issue Jun 17, 2024
colesbury added a commit to colesbury/cpython that referenced this issue Jun 20, 2024
colesbury added a commit to colesbury/cpython that referenced this issue Jun 20, 2024
…ythonGH-117731)

(cherry picked from commit 3af7263)

Co-authored-by: Sam Gross <colesbury@gmail.com>
colesbury added a commit to colesbury/cpython that referenced this issue Jun 20, 2024
…ythonGH-117731)

(cherry picked from commit 3af7263)

Co-authored-by: Sam Gross <colesbury@gmail.com>
colesbury added a commit that referenced this issue Jun 20, 2024