Prototypes and functionality tests for GPU and MPI programming models relevant to plasma modeling codes. Uses the HIP programming model to target both NVIDIA and AMD GPUs, and uses GPU pointers with MPI one-sided methods.
Note: These commands have been tested on NERSC Perlmutter's default module environment except as follows:
module load hip
Note: These commands have been tested on OLCF Frontier's default module environment except as follows:
module load PrgEnv-amd
module load craype-accel-amd-gfx90a
The Makefile builds three variants of the microapp driver, which uses MPI one-sided "PUT" messaging to mimic an unstructured ghost face exchange:
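- Canonical single-source build: host and kernel code compiled at the same time (microapp.mpi_hip_compile_time.exe)
- Runtime loading of kernels from a shared-library module (microapp.mpi_hip_runtime_module.exe)
- Runtime compilation of kernels from source code strings (microapp.mpi_hip_rtc.exe)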
srun --gpu-bind=none --time-min=2 -n16 --ntasks-per-node=4 -C gpu --gpus-per-task=1 ./microapp.mpi_hip_compile_time.exe <numElements> initialize_element_kernel
srun --gpu-bind=none --time-min=2 -n16 --ntasks-per-node=4 -C gpu --gpus-per-task=1 ./microapp.mpi_hip_rtc.exe <numElements> ./ initialize_element_kernel
# Note: the runtime shared-library module version requires an additional argument for the shared library path. An absolute path may be best.
srun --gpu-bind=none --time-min=2 -n16 --ntasks-per-node=4 -C gpu --gpus-per-task=1 ./microapp.mpi_hip_runtime_module.exe <numElements> ./dir.mpi_hip_runtime_module/ initialize_element_kernel
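The three launch lines above differ only in the executable name and in whether a path argument precedes the kernel name. A minimal sketch of a wrapper that prints the matching command is shown below. The function name `microapp_cmd`, its argument order, and the default paths are assumptions for illustration and are not part of the repository; the srun flags are copied from the commands above.

```sh
# Hypothetical helper (not part of the repository): prints the srun command
# for one microapp variant. Flags and executable names follow the commands above.
microapp_cmd() {
    variant=$1          # compile_time | rtc | runtime_module
    num_elements=$2     # value for <numElements>
    path_arg=$3         # optional path argument (rtc and runtime_module only)
    launch="srun --gpu-bind=none --time-min=2 -n16 --ntasks-per-node=4 -C gpu --gpus-per-task=1"
    kernel="initialize_element_kernel"
    case "$variant" in
        compile_time)
            # No path argument for the compile-time variant.
            echo "$launch ./microapp.mpi_hip_compile_time.exe $num_elements $kernel" ;;
        rtc)
            # Default path mirrors the example above (assumed default).
            echo "$launch ./microapp.mpi_hip_rtc.exe $num_elements ${path_arg:-./} $kernel" ;;
        runtime_module)
            # Shared library path; an absolute path may be best.
            echo "$launch ./microapp.mpi_hip_runtime_module.exe $num_elements ${path_arg:-./dir.mpi_hip_runtime_module/} $kernel" ;;
        *)
            echo "unknown variant: $variant" >&2
            return 1 ;;
    esac
}
```

For example, `microapp_cmd runtime_module 1000 "$PWD/dir.mpi_hip_runtime_module/"` prints the launch line with an absolute shared library path, which can then be inspected before it is run.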
Note: on Perlmutter, a bug in cray-mpich (through at least version 8.1.24) requires disabling GPU binding with --gpu-bind=none; otherwise you will see an error signature like the following:
MPICH ERROR [Rank 1] [job id 6040415.0] [Mon Mar 13 08:52:40 2023] [nid004013] - Abort(673826050) (rank 1 in comm 0): Fatal error in PMPI_Put: Invalid count, error stack:
PMPI_Put(188).........................: MPI_Put(origin_addr=0x1513c3200020, origin_count=2, MPI_DOUBLE, target_rank=0, target_disp=4, target_count=2, MPI_DOUBLE, win=0xa0000000) failed
(unknown)(): Invalid count
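When a job fails, it can help to check the captured output for this signature before investigating further. The sketch below is an illustration only: the function name `has_put_invalid_count` and the idea of redirecting job output to a log file are assumptions, and the search string is taken from the error text above.

```sh
# Hypothetical helper (not part of the repository): succeeds (exit status 0)
# if the given log file contains the PMPI_Put "Invalid count" signature.
has_put_invalid_count() {
    grep -q 'Fatal error in PMPI_Put: Invalid count' "$1"
}
```

Assuming the job output was redirected to a file such as run.log, `has_put_invalid_count run.log && echo "retry with --gpu-bind=none"` flags the known failure mode.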