Skip to content

Latest commit

 

History

History
72 lines (59 loc) · 2.57 KB

parallelization.md

File metadata and controls

72 lines (59 loc) · 2.57 KB
jupytext kernelspec
text_representation
extension format_name format_version jupytext_version
.md
myst
0.12
1.9.1
display_name language name
Python 3
python
python3
:tags: [remove-cell]
import msprime
import numpy as np
import tskit

def create_notebook_data():
    pass

# create_notebook_data()  # uncomment to recreate the tree seqs used in this notebook

(sec_parallelization)=

Parallelization

% remove underscores in title when tutorial is complete or near-complete

When performing large calculations it's often useful to split the work over multiple processes or threads. The tskit API can be used without issues across multiple processes, and the Python {mod}multiprocessing module often provides a very effective way to work with many replicate simulations in parallel.

When we wish to work with a single very large dataset, however, threads can offer better resource usage because of the shared memory space. The Python {mod}threading library gives a very simple interface to lightweight CPU threads and allows us to perform several CPU intensive tasks in parallel. The tskit API is designed to allow multiple threads to work in parallel when CPU intensive tasks are being undertaken.

:::{note} In the CPython implementation the Global Interpreter Lock ensures that only one thread executes Python bytecode at one time. This means that Python code does not parallelise well across threads, but avoids a large number of nasty pitfalls associated with multiple threads updating data structures in parallel. Native C extensions like numpy and tskit release the GIL while expensive tasks are being performed, therefore allowing these calculations to proceed in parallel. :::

:::{todo} This tutorial previously used code with an old interface, and hence has been removed. We must recreate an example of parallel processing, giving examples of both threads and processes (but see this stackoverflow post for why it may be difficult to get {mod}multiprocessing working in this notebook). A reasonable example might be to calculate many pairwise statistics between sample sets in parallel.

We should also show how, for large tree sequences that it is better to pass the filenames to each subprocess, and load the tree sequence, rather than transferring the entire tree sequence (via pickle) to the subprocesses. :::