Merge pull request #258 from argonne-lcf/parsl
Update to Parsl docs for Polaris
felker authored Aug 23, 2023
2 parents 98ce520 + 76c0ec7 commit 7ddac4f
Showing 1 changed file with 23 additions and 24 deletions.
47 changes: 23 additions & 24 deletions docs/polaris/workflows/parsl.md

## Getting Parsl on Polaris

You can install ``parsl`` building on top of the ``conda`` modules. You have some flexibility in how you extend the ``conda`` module to include ``parsl``; here is one example:

```bash
# Load the conda module and activate the base environment
module load conda
conda activate

# Create a virtual environment that can also see the conda packages (only once)
python -m venv --system-site-packages /path/to/your/virtualenv
source /path/to/your/virtualenv/bin/activate

# Install parsl (only once)
pip install parsl

```

## Using Parsl on Polaris

Parsl has a variety of possible configuration settings. As an example, we provide the configuration below, which will run one task per GPU:

```python
from parsl.config import Config
from parsl.providers import PBSProProvider
from parsl.executors import HighThroughputExecutor
# You can use the MPI launcher, but may want the Gnu Parallel launcher, see below
from parsl.launchers import MpiExecLauncher, GnuParallelLauncher
# address_by_interface is needed for the HighThroughputExecutor:
from parsl.addresses import address_by_interface
# For checkpointing:
from parsl.utils import get_all_checkpoints

# Adjust your user-specific options here:
run_dir="/lus/grand/projects/yourproject/yourrundir/"

user_opts = {
"worker_init": "module load conda; conda activate; module load cray-hdf5; source /path/to/your/virtualenv/bin/activate",
"scheduler_options":"" ,
"account": "YOURACCOUNT",
"worker_init": f"source /path/to/your/virtualenv/bin/activate; cd {run_dir}", # load the environment where parsl is installed
"scheduler_options":"#PBS -l filesystems=home:eagle:grand" , # specify any PBS options here, like filesystems
"account": "YOURPROJECT",
"queue": "debug-scaling",
"walltime": "1:00:00",
"run_dir": "/lus/grand/projects/yourproject/yourrundir/"
"nodes_per_block": 3, # think of a block as one job on polaris, so to run on the main queues, set this >= 10
"cpus_per_node": 32, # Up to 64 with multithreading
"strategy": simple,
"available_accelerators": 4, # Each Polaris node has 4 GPUs, setting this ensures one worker per GPU
"cores_per_worker": 8, # this will set the number of cpu hardware threads per worker.
}

checkpoints = get_all_checkpoints(run_dir)
print("Found the following checkpoints: ", checkpoints)

config = Config(
executors=[
HighThroughputExecutor(
heartbeat_period=15,
heartbeat_threshold=120,
worker_debug=True,
available_accelerators=user_opts["available_accelerators"], # if set, this overrides max_workers so there is one worker per GPU
cores_per_worker=user_opts["cores_per_worker"],
address=address_by_interface("bond0"),
cpu_affinity="block-reverse",
prefetch_capacity=0,
start_method="spawn", # Needed to avoid interactions between MPI and os.fork
provider=PBSProProvider(
launcher=MpiExecLauncher(bind_cmd="--cpu-bind", overrides="--depth=64 --ppn 1"),
# Which launcher to use? Check out the note below for some details. Try MPI first!
# launcher=GnuParallelLauncher(),
account=user_opts["account"],
# Remaining scheduler settings are taken from the user_opts defined above:
queue=user_opts["queue"],
walltime=user_opts["walltime"],
nodes_per_block=user_opts["nodes_per_block"],
cpus_per_node=user_opts["cpus_per_node"],
scheduler_options=user_opts["scheduler_options"],
worker_init=user_opts["worker_init"],
),
),
],
checkpoint_files = checkpoints,
run_dir=run_dir,
checkpoint_mode = 'task_exit',
retries=2,
app_cache=True,
)
```
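
With a configuration like this in hand, running work follows the standard Parsl pattern: load the configuration, decorate functions as apps, and collect futures. The sketch below is illustrative only; the module name ``polaris_config`` is hypothetical and is assumed to contain the ``config`` object defined above:

```python
import parsl
from parsl import python_app

from polaris_config import config  # hypothetical module holding the Config object above

@python_app
def gpu_task(task_id: int):
    import os
    # With available_accelerators set, each worker is pinned to a single GPU and
    # the assignment is exposed to the task through CUDA_VISIBLE_DEVICES.
    return f"task {task_id} saw CUDA_VISIBLE_DEVICES={os.environ.get('CUDA_VISIBLE_DEVICES')}"

if __name__ == "__main__":
    parsl.load(config)
    futures = [gpu_task(i) for i in range(8)]
    for f in futures:
        print(f.result())
```

Run from a login node, this requests the PBS block(s) described by the provider and fans the tasks out across the allocated GPUs.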

## Special notes for Polaris

On Polaris, there is a known issue where Python applications that are launched with `mpi` and use ``fork`` to spawn processes can hang unexpectedly. For this reason, it is recommended to use ``start_method="spawn"`` on Polaris when using the ``MpiExecLauncher``, as shown in the example config above. Alternatively, you can use the ``GnuParallelLauncher``, which uses ``GNU Parallel`` to spawn processes; ``GNU Parallel`` can be loaded into your environment with ``module load gnu-parallel`` (see the sketch below). Either approach avoids the hang caused by ``fork``.
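
For reference, here is one way the GNU Parallel alternative might be wired up, reusing the ``user_opts`` and ``run_dir`` defined in the configuration above; the exact set of provider options you carry over is up to you:

```python
from parsl.launchers import GnuParallelLauncher
from parsl.providers import PBSProProvider

# Make GNU Parallel available on the compute nodes before the workers start
user_opts["worker_init"] = (
    "module load gnu-parallel; "
    "source /path/to/your/virtualenv/bin/activate; "
    f"cd {run_dir}"
)

provider = PBSProProvider(
    launcher=GnuParallelLauncher(),  # replaces MpiExecLauncher(...)
    account=user_opts["account"],
    queue=user_opts["queue"],
    walltime=user_opts["walltime"],
    nodes_per_block=user_opts["nodes_per_block"],
    cpus_per_node=user_opts["cpus_per_node"],
    scheduler_options=user_opts["scheduler_options"],
    worker_init=user_opts["worker_init"],
)
```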

## Updates

For ``parsl`` versions after July 2023, the ``address`` passed to the ``HighThroughputExecutor`` needs to be set to ``address = address_by_interface("bond0")``. With ``parsl`` versions prior to July 2023, it was recommended to use ``address = address_by_hostname()`` on Polaris, but with later versions this no longer works on Polaris (or any other machine).
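
For quick reference, the two forms look like this; only the newer interface-based form should be used going forward:

```python
from parsl.addresses import address_by_interface

# parsl releases after July 2023: bind the executor to the bond0 interface explicitly
address = address_by_interface("bond0")

# Older releases used a hostname lookup instead; this no longer works with newer parsl:
# from parsl.addresses import address_by_hostname
# address = address_by_hostname()
```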
