Python Accelerations

Here's a whiteboard diagram we came up with which seemed to communicate our current thinking on how we are proposing doing Python performance acceleration options on large vector operations.

[Whiteboard diagram: acceleration options from NumPy, to Numba, to a ctypes/C extension, to OpenMP, to SIMD]

It's kind of "choose your own adventure". For any given vector computation, one can go as far to the right as one is capable. One can write a NumPy implementation and call it good for now. One can go further and do a Numba implementation. Or go further and do a ctypes compiled C extension. Or go a bit further and make this faster with OpenMP. Or go even further and make it faster with a SIMD implementation.

Some notes:

  • At the end of the day, from NumPy to SIMD, these are all just ways of operating on contiguous arrays of numbers. All these solutions operate on the same datatypes really.
  • Numba works on for loops, not NumPy operations.
  • In theory, if you have GPU hardware, you can go from Numba to Numba+CUDA with just the cuda extension to Numba, a decorator, and "a little setup code" (a rough sketch appears after this list). We have yet to truly demonstrate this for ourselves though.
  • Numba doesn't really start to show its worth unless you use prange().
  • Going from NumPy to Numba or ctypes requires starting to think about datatypes.
  • Developing a ctypes extension requires a compiler. (Numba has a built-in compiler.)
  • Speeding up a C implementation with OpenMP is truly simple.
  • OpenMP automatically uses all the cores it finds, without recompiling, which is very useful for making sure one is leveraging all available hardware; but it also means that the performance gains tested and recorded will depend on what machine the code is run on.
  • Note that the benefits of parallelization can be mostly eliminated if there is a critical section (e.g., updating a common counter) somewhere in the for loop being parallelized. This can usually be fixed by using thread-local storage and then adding an accumulator section at the end (also sketched after this list). GPT can help with this.
  • SIMD acceleration requires a certain low-level programming comfort. Working code can all be provided by GPT/Copilot, but it can take some trial and error.
  • We expect to provide SIMD implementations for AVX2, AVX512, and SSE. One can compile all of these together, but one can only test on the hardware one has. Testing all 3 implementations requires getting a bit fancy, though presumably GitHub Actions (GHA) can support this.
  • We are actively working on a reference stress test which demonstrates performance numbers on each of these implementations.
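
Here is a rough, untested sketch of what the Numba+CUDA route for the aging function might look like; the kernel, launch configuration, and helper names below are illustrative assumptions, not something we have benchmarked ourselves:

from numba import cuda
import numpy as np

@cuda.jit
def age_kernel(ages, one_day):
    i = cuda.grid(1)                           # global thread index
    if i < ages.shape[0] and ages[i] >= 0:     # bounds guard plus the "not yet born" check
        ages[i] += one_day

def cuda_aging(ages: np.ndarray) -> None:
    d_ages = cuda.to_device(ages)                  # copy to GPU
    threads = 256
    blocks = (ages.shape[0] + threads - 1) // threads
    age_kernel[blocks, threads](d_ages, np.float32(1.0 / 365))
    d_ages.copy_to_host(ages)                      # copy result back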
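
And here is a minimal Numba sketch of the thread-local-accumulator point: a scalar "count += 1" inside a prange loop is treated by Numba as a reduction, so each thread keeps its own partial count and the partials are summed at the end, with no critical section. The counting task and function name are just for illustration:

import numba as nb
import numpy as np

@nb.njit(parallel=True, nogil=True)
def count_born(ages: np.ndarray) -> int:
    count = 0
    for i in nb.prange(ages.shape[0]):
        if ages[i] >= 0:
            count += 1     # recognized by Numba as a parallel reduction, not a shared counter
    return count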

Examples

Let's look at some sample code for a function that ages everyone in our model. We have some expansion slots for the "not yet born" with negative age, so we do a condition check in addition to the addition, which makes our code at least not totally trivial. We'll also show some sample wall-clock times for each one, though a lot can vary depending on the hardware one runs on. For now we use a 16-core DeepNote VM with 128GB of RAM and run with an array of 1e7 integer-valued ages (stored as float32) uniformly distributed between -8365 and +24365.
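
As a rough sketch of that setup (the array generation and the timing helper below are assumptions about the harness, not its actual code):

import numpy as np
from datetime import datetime

# 1e7 integer-valued ages, uniformly distributed between -8365 and +24365,
# stored as float32 so the same array can be fed to every implementation below.
ages = np.random.randint(-8365, 24365, size=10_000_000).astype(np.float32)

def time_it(fn, data):
    work = data.copy()      # give each implementation a fresh copy
    start = datetime.now()
    fn(work)
    print(f"{fn.__name__}: {datetime.now() - start}")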

Naive Python

def naive_one(data: np.ndarray) -> None:
    one_day = np.float32(1.0 / 365)
    for i in range(data.shape[0]):
        if data[i] >= 0:
            data[i] += one_day
    return

Duration: 15.91715s (15,917ms)

NumPy

def vectorized(data: np.ndarray) -> None:
    one_day = np.float32(1.0 / 365)
    data[data >= 0] += one_day
    return

Duration: 0.09175s (92ms)

Numba

import numba as nb

@nb.njit((nb.float32[:],), parallel=True, nogil=True)
def numbaized(data: np.ndarray) -> None:
    one_day = np.float32(1.0 / 365)
    for i in nb.prange(data.shape[0]):
        if data[i] >= 0:
            data[i] += one_day
    return

Duration: 0.00314s (3ms) 30x faster than NumPy!
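
(Note that because an explicit signature, (nb.float32[:],), is passed to njit, Numba compiles the function eagerly when the decorator runs, so the timing above should not include JIT compilation overhead.)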

Compiled C (ctypes extension)

Python part:

import ctypes

# 'import' the module (the compiled shared library)
update_ages_lib = ctypes.CDLL('./update_ages.so')
# define the interface to the function (datatypes)
update_ages_lib.update_ages_vanilla.argtypes = [
    ctypes.c_size_t,  # start_idx
    ctypes.c_size_t,  # stop_idx
    np.ctypeslib.ndpointer(dtype=np.float32, flags='C_CONTIGUOUS')
]
# Invoke
def compiled_c(data: np.ndarray) -> None:
    update_ages_lib.update_ages_vanilla(
        0,
        len( data ),
        data
    )

C Part:

#include <stddef.h>

const float one_day = 1.0f/365.0f;
void update_ages_vanilla(unsigned long int start_idx, unsigned long int stop_idx, float *ages) {
    // stop_idx is exclusive: the Python side passes len(data)
    for (size_t i = start_idx; i < stop_idx; i++) {
        if( ages[i] >= 0 )
        {
            ages[i] += one_day;
        }
    }
}
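
To produce update_ages.so, the C file is compiled into a shared library first, e.g. something along the lines of gcc -O3 -fPIC -shared update_ages.c -o update_ages.so (the exact flags are illustrative here, not a record of the project's build; add -fopenmp for the OpenMP version below).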

Duration: 0.028020s (28ms)

This version has no parallelization, so it is slower than (parallel) Numba, but still faster than NumPy.

C with OpenMP

#include <omp.h>

const float one_day = 1.0f/365.0f;
void update_ages_vanilla(unsigned long int start_idx, unsigned long int stop_idx, float *ages) {
    #pragma omp parallel for simd
    for (size_t i = start_idx; i < stop_idx; i++) {
        if( ages[i] >= 0 )
        {
            ages[i] += one_day;
        }
    }
}

Duration: 0.00268s (2.7ms)

15% faster than Numba.

With -march=native and -ffast-math ... Duration: 0.000648s! (0.65ms)
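
(That is, adding those flags to the compile command, e.g. gcc -O3 -fopenmp -fPIC -shared -march=native -ffast-math update_ages.c -o update_ages.so; again, the full flag set shown is illustrative.)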

SIMD (AVX2)

// AVX2 (one_day as defined above)
#include <immintrin.h>

void update_ages(unsigned long int start_idx, unsigned long int stop_idx, float *ages) {
    // Process 8 floats per iteration; assumes the range length is a multiple of 8,
    // otherwise a scalar tail loop is needed for the remainder.
    #pragma omp parallel for
    for (size_t i = start_idx; i < stop_idx; i += 8) {
        // Load age values into SIMD registers
        __m256 ages_vec = _mm256_loadu_ps(&ages[i]);

        // Create a vector with the one_day value replicated 8 times
        __m256 one_day_vec = _mm256_set1_ps(one_day);

        // Mask for elements greater than or equal to zero
        __m256 mask = _mm256_cmp_ps(ages_vec, _mm256_setzero_ps(), _CMP_GE_OQ);

        // Increment age values by one_day for elements greater than or equal to zero
        ages_vec = _mm256_blendv_ps(ages_vec, _mm256_add_ps(ages_vec, one_day_vec), mask);

        // Store the result back to memory
        _mm256_storeu_ps(&ages[i], ages_vec);
    }
}

Duration: 0.000543s (0.54ms), 5-7x faster than Numba

AVX512
Code coming soon.

Duration: 0.000346s (0.35ms)

Other Machines

All timings below are in seconds.

k8s 16 core VM

naive =     15.740068
numpy =      0.090909
numba =      0.003329
c =          0.001735
simd (512) = 0.000998

Codespaces (16-core)

naive =     17.416576
numpy =      0.078130
numba =      0.004040
c =          0.001324
simd (512) = 0.000883

12-core Ubuntu laptop

naive =     13.695447
numpy =      0.072277
numba =      0.003344
c =          0.002402
avx2 =       0.002355

SIMD (SSE & AVX2 & AVX512 with run-time instruction set detection)

https://gist.github.com/clorton/16d8ca411c83058b31c6275a4e54f352

Testing a different function

I started a comparison of naive/numpy/numba/c on a new function that is a little more interesting than "conditionally add a float to floats". This function, progress_immunities, implements:

for each person in the population:
    if person is immune and their immunity timer is greater than 0:
        decrement their immunity timer by 1
        if their immunity timer has reached 0:
            set their immunity status to False
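
A NumPy version of this is a masked update across two vectors. As a rough sketch (the array names are assumptions, not the model's actual field names):

import numpy as np

def progress_immunities_np(immune: np.ndarray, immunity_timer: np.ndarray) -> None:
    # immune people who still have time left on their timer
    active = immune & (immunity_timer > 0)
    immunity_timer[active] -= 1
    # anyone whose timer has just reached 0 is no longer immune
    immune[active & (immunity_timer == 0)] = False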

On 12-core laptop:

naive = 0:00:28.693374
numpy = 0:00:00.057744
numba = 0:00:00.428175
c = 0:00:00.003224

NumPy outperforming Numba by roughly 10x here is a reversal of what we were seeing before. Note that this function operates on 2 vectors, not just one, and does 1 conditional math operation and 1 conditional boolean flip, so there's a little more going on.


Some PRELIMINARY Conclusions

  • NumPy looks very fast compared to naive for loops, but not compared to much else.
  • Numba is typically 20x faster than NumPy on 16 cores.
  • NumPy presumably uses SIMD, according to its documentation, but it is not multi-core.
  • Naive, Numba and C all involve some variation of "universally understandable" for loops, for what that's worth.
  • Numba should probably be preferred over NumPy in all cases.
  • Compiled C (with OpenMP) is often between 2x and 5x faster than Numba and should be preferred if someone has a C compiler available.
  • Compiled C (with OpenMP) performance can be majorly impacted by gcc flags, including -ffast-math: on some platforms the gain is 5x, on others just 15%. But there are serious concerns that the corner-cutting -ffast-math does is unacceptable in terms of accuracy.
  • AVX2 intrinsics seem to be about 25% faster than compiled C with OpenMP and the best gcc flags.
  • AVX512 intrinsics seem to be about 50% faster than compiled C with OpenMP and the best gcc flags.
  • We have yet to see the SIMD extensions to the OpenMP API provide any benefit.

Caveats

  • Most of the data so far is for just one function, which is simple though not entirely trivial. We should really collect these metrics for about 4 different types of vector math functions.

Next Steps

  • Do perf comparison on more "types" of functions, e.g., reductions, critical section updates, etc.
  • Understand more about -ffast-math and if there are sub-flags in there that are responsible for the gain without unacceptable inaccuracies.
  • See if there's some way we can get SIMD code on the cheap.