Exploring how to develop a fast genetic algorithm implementation in Python.
I really love genetic algorithms, always have.
When I start using a new programming language, I typically develop a genetic algorithm to prove that I understand the language well enough.
Optimization algorithms have to be fast and I had a thought:
Can I get Python genetic algorithm to run about as fast as C?
This project explores how we might develop an implementation of the "simple genetic algorithm" in Python that is about as fast as an implementation in ANSI C. It is interesting and to see how close can we get, then how far we can go.
I explored this question years ago, I've just decided to make it public (someone asked me about it). I've since added some more versions and pushed the time closer to parity.
A naive port from C to Python is about 10x slower, work is required to get the execution time back down again to something sane. I expect numpy masters and code-golfers could cut more milliseconds and lines (if so, let me know).
Version | Time (sec) | Speedup (c) | Speedup (v01) |
---|---|---|---|
ANSI C | 0.539 | n/a | n/a |
Version 01 | 4.874 | 0.111x | n/a |
Version 02 | 3.339 | 0.161x | 1.460x |
Version 03 | 20.590 | 0.026x | 0.237x |
Version 04 | 1.211 | 0.445x | 4.025x |
Version 05 | 0.787 | 0.685x | 6.193x |
Version 06 | 0.633 | 0.852x | 7.700x |
Version 07 | 0.625 | 0.862x | 7.798x |
Version 08 | 0.602 | 0.895x | 8.096x |
Version 09 | 0.452 | 1.192x | 10.783x |
Version 10 | 0.448 | 1.203x | 10.879x |
Version 11 | 0.380 | 1.418x | 12.826x |
Version 12 | 0.377 | 1.430x | 12.928x |
Version 13 | 0.343 | 1.571x | 14.210x |
Version 14 | 0.328 | 1.643x | 14.860x |
Version 15 | 0.347 | 1.553x | 14.046x |
Version 16 | 0.308 | 1.750x | 15.825x |
- Execution time is taken from the best of 3 sequential runs on my workstation.
- Speedup (c) is the speedup factor of the python version over the ANSI-C version.
- Speedup (v01) is the speedup factor of the Python version over version 01 Python version.
- A speedup factor below 1 means the implementation is slower, above means it's faster.
- The fastest version if highlighted in bold.
We will define the "simple genetic algorithm" as follows:
- Uses a binary string or "bitstring" representation.
- Uses tournament selection.
- Uses one-point crossover.
- Uses point mutation.
- Solves the one max (sum of bits) problem.
- No early stopping as we pretend we don't know the solution.
All implementations use the same configuration for direct comparison:
- 1 random seed.
- 1,000 bits in each string.
- 100 sized population.
- 3 round tournament selection.
- 500 epochs.
- 1/n mutation probability.
- 95% crossover.
Each implementation should solve the problem before the maximum number of epochs. If not, a bug may have been introduced.
The reference implementation is provided in C.
We are tying to develop a Python version that is about as fast as the C version.
My ANSI C is a little rusty so the implementation is naive (at best) and not highly optimized.
- Uses a struct for each candidate solution to track bits, length, and fitness.
- Mallocs each candidate on demand and candidates are not reused.
- Does not free memory.
- Lots of for-loop iteration
- No fancy memory tricks or bitshifts.
- Reports best result each epoch on stdout.
That being said, if you want to develop an optimized version of this implementation, I'd love to see it.
You can review the code here:
It can be compiled as follows:
gcc simple_ga.c -Wall -o simple_ga
It can be benchmarked as follows:
time ./simple_ga
A sample of results is provided below. This is the target for the Python version.
...
>495, fitness=1000
>496, fitness=1000
>497, fitness=1000
>498, fitness=1000
>499, fitness=1000
Done!
real 0m0.539s
user 0m0.489s
sys 0m0.046s
Developing a Python implementation is incremental.
This is a port of the C version to the Python standard library with no optimization.
- Uses an integer representation for bitstrings
- Uses a dict structure for each candidate solution.
The source code is available here:
time python ./version01.py
A sample of results is provided below.
...
>495 fitness=1000
>496 fitness=1000
>497 fitness=1000
>498 fitness=1000
>499 fitness=1000
Done
real 0m4.874s
user 0m4.785s
sys 0m0.034s
This version seeks to simplify the Python version.
- It uses fewer functions.
- It uses multiple assignment.
The source code is available here:
time python ./version02.py
A sample of results is provided below.
...
>495 fitness=1000
>496 fitness=1000
>497 fitness=1000
>498 fitness=1000
>499 fitness=1000
Done
real 0m3.339s
user 0m3.285s
sys 0m0.026s
This version ports the Python version to use the NumPy API.
- Use numpy arrays for bitstrings.
- Use functions on numpy arrays, e.g. sum().
- Uses numpy random functions.
The source code is available here:
time python ./version03.py
A sample of results is provided below.
...
>495 fitness=1000
>496 fitness=1000
>497 fitness=1000
>498 fitness=1000
>499 fitness=1000
Done
real 0m20.590s
user 0m19.569s
sys 0m0.273s
This version optimizes the NumPy version.
- Vectorized version of point mutation operation.
- Deleted bitstring copy function moved into crossover.
- Simplified slice in crossover.
The source code is available here:
time python ./version04.py
A sample of results is provided below.
...
>495 fitness=1000
>496 fitness=1000
>497 fitness=1000
>498 fitness=1000
>499 fitness=1000
Done
real 0m1.211s
user 0m1.162s
sys 0m0.037s
This version further optimizes the NumPy version.
- Populations are arrays of dicts.
- Preallocate parent and child memory and reuse each iteration.
- Specify bitstring array data type as int64 (seems fastest in testing).
- Vectorized tournament selection for entire population.
- Vectorized crossover point selection for entire population.
The source code is available here:
time python ./version05.py
A sample of results is provided below.
...
>495 fitness=1000
>496 fitness=1000
>497 fitness=1000
>498 fitness=1000
>499 fitness=1000
Done
real 0m0.787s
user 0m0.744s
sys 0m0.035s
This version further optimizes the NumPy version.
- Vectorized choice to crossover or not.
- Vectorized choice to mutate all bits per generation.
- Preallocate arrays for random choices.
- Use float32 for probabilistic decisions (crossover and mutate)
The source code is available here:
time python ./version06.py
A sample of results is provided below.
...
>495 fitness=1000
>496 fitness=1000
>497 fitness=1000
>498 fitness=1000
>499 fitness=1000
Done
real 0m0.633s
user 0m0.592s
sys 0m0.033s
This version further optimizes the NumPy version.
- Preallocate array for selected parents each iteration.
- Removed fitness function, made it inline.
- Simplify crossover, remove redundant copying and return.
- Use positional over named arguments where possible.
- Move crossover and mutation to be inline.
The source code is available here:
time python ./version07.py
A sample of results is provided below.
...
>495 fitness=1000
>496 fitness=1000
>497 fitness=1000
>498 fitness=1000
>499 fitness=1000
Done
real 0m0.625s
user 0m0.583s
sys 0m0.034s
This version further optimizes the NumPy version.
- Moved tournament selection to be inline.
- Simplified iteration for tournament selection.
The source code is available here:
time python ./version08.py
A sample of results is provided below.
...
>495 fitness=1000
>496 fitness=1000
>497 fitness=1000
>498 fitness=1000
>499 fitness=1000
Done
real 0m0.602s
user 0m0.563s
sys 0m0.034s
This version explores using a matrix to hold all bitstrings in each generation.
- One matrix for current and one for next generation.
- Separate array for fitness scores.
- Vectorized onemax (fitness) and best in population (argmax).
- Population-wide vectorization of mutation.
The source code is available here:
time python ./version09.py
A sample of results is provided below.
...
>495 fitness=1000
>496 fitness=1000
>497 fitness=1000
>498 fitness=1000
>499 fitness=1000
Done
real 0m0.452s
user 0m0.415s
sys 0m0.031s
This version explores optimizing use of the matrix used to hold all bitstrings.
- Copy all selected parents over child then only copy second parents bits if crossed over.
The source code is available here:
time python ./version10.py
A sample of results is provided below.
...
>495 fitness=1000
>496 fitness=1000
>497 fitness=1000
>498 fitness=1000
>499 fitness=1000
Done
real 0m0.448s
user 0m0.410s
sys 0m0.032s
This version explores further numpy optimization via vectorization of loops.
- Vectorized version of tournament selection with argmax.
The source code is available here:
time python ./version11.py
A sample of results is provided below.
...
>495 fitness=1000
>496 fitness=1000
>497 fitness=1000
>498 fitness=1000
>499 fitness=1000
Done
real 0m0.380s
user 0m0.340s
sys 0m0.033s
This version explores minor tweaks.
- Use // operator instead of casting floats.
- Preallocate array for best solution found so far.
- Preallocate array for arrange used in tournament selection.
- Added more comments.
The source code is available here:
time python ./version12.py
A sample of results is provided below.
...
>495 fitness=1000
>496 fitness=1000
>497 fitness=1000
>498 fitness=1000
>499 fitness=1000
Done
real 0m0.377s
user 0m0.337s
sys 0m0.033s
This version explores minor tweaks.
- Use xor operator for bit flipping.
- Simplify some calls to rng.integers().
- Use numpy.ushort and numpy.uintc data types for ints.
The source code is available here:
time python ./version13.py
A sample of results is provided below.
...
>495 fitness=1000
>496 fitness=1000
>497 fitness=1000
>498 fitness=1000
>499 fitness=1000
Done
real 0m0.343s
user 0m0.305s
sys 0m0.031s
This version explores minor tweaks.
- Represent bitstrings with ubyte.
- Specific imports rather than import all.
- Replace copyto() with broadcast.
- Simplified fitness sum().
- Simplified array init to be all zeros()
- Disable Python garbage collector.
- Use logical xor function to apply mutations with mask.
The source code is available here:
time python ./version14.py
A sample of results is provided below.
...
>495 fitness=1000
>496 fitness=1000
>497 fitness=1000
>498 fitness=1000
>499 fitness=1000
Done
real 0m0.328s
user 0m0.292s
sys 0m0.030s
This version explores the fastest version of vectorized one-point crossover I could come up with.
It is slower than the prior version, but might lead to better ideas.
- Vectorized version of one-point crossover (slower)
The source code is available here:
time python ./version15.py
A sample of results is provided below.
...
>495 fitness=1000
>496 fitness=1000
>497 fitness=1000
>498 fitness=1000
>499 fitness=1000
Done
real 0m0.347s
user 0m0.310s
sys 0m0.031s
This version explores minor optimizations and tweaks.
- Use empty() over zeros(), faster in testing.
- Pre-allocate all random ints for tournament selection draws across all epochs.
- Pre-allocate all random ints for point mutations across all epochs.
- Use bool type for bitstring arrays.
- Use bitwise_xor() to apply point mutations.
- Don't assign parent indexes to variables during crossover.
The source code is available here:
time python ./version16.py
A sample of results is provided below.
...
>495 fitness=1000
>496 fitness=1000
>497 fitness=1000
>498 fitness=1000
>499 fitness=1000
Done
real 0m0.308s
user 0m0.271s
sys 0m0.031s
- Do not re-evaluate if child is a copy of parent and not mutated.
- More vectorization somehow?
- Benchmark more consistently (repeat 3+ times and take mean execution time).
- Optimize array data types (bitstrings, random numbers), see if speed improvement for small/different types.
- Precompute all random numbers for all epochs (madness!)
- Use threadpool to prepare the next generation in parallel (numpy calls will release the gil)
- Starting and closing the thread pool adds a massive overhead
- Find a randint function that populates array instead of create+populate (slower).
- Can we create one big mask for crossover and apply it to the entire population matrices? Surely!
- Tried and it was 100ms slower, worked though...
- Explore JIT with numba, jax, or similar.
- Explore Cython.
I have genetic algorithm implementations in a bunch of other languages on disk, I may add them to this project if there's interest.
Do you have more ideas? Do you see a bug? Let's chat!
Email me: Jason.Brownlee05@gmail.com