
Example scripts for Slurm prolog / epilog

This is a collection of scripts that could be added to a Slurm prolog / epilog. They are not intended to be drop-in solutions and will need to be modified to fit your system configuration.

DCGM Stats

  • prolog-dcgmstats
  • epilog-dcgmstats

DCGM utilities must be installed. https://developer.nvidia.com/data-center-gpu-manager-dcgm

These two scripts collect GPU stats during a job. nv-hostengine and dcgmi are executed as the user running the job.

The collected stats are written to the job output directory, one file per host.
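A minimal sketch of the general approach is shown below. It is illustrative only: the output directory variable, the su invocation, and the lack of error handling are assumptions, and the shipped scripts may use different flags.

# prolog-dcgmstats (sketch) - run the DCGM pieces as the job user
su - "$SLURM_JOB_USER" -c "nv-hostengine"                  # start the DCGM host engine
su - "$SLURM_JOB_USER" -c "dcgmi stats -e"                 # enable per-job stats collection
su - "$SLURM_JOB_USER" -c "dcgmi stats -s $SLURM_JOB_ID"   # start recording for this job

# epilog-dcgmstats (sketch) - stop recording and write one file per host
OUTDIR=/path/to/job/output/dir                             # illustrative; derive from the job in practice
su - "$SLURM_JOB_USER" -c "dcgmi stats -x $SLURM_JOB_ID"   # stop recording
su - "$SLURM_JOB_USER" -c "dcgmi stats -v -j $SLURM_JOB_ID" > "$OUTDIR/dcgm-stats.$(hostname).$SLURM_JOB_ID.out"
su - "$SLURM_JOB_USER" -c "nv-hostengine -t"               # terminate the host engine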

ECC

  • prolog-ecc
  • epilog-ecc

These two scripts will disable ECC if requested by the user. The user makes this request by adding --comment=ecc to their job submission.
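A minimal sketch of the prolog side, assuming the comment is read back with scontrol (an epilog would re-enable ECC the same way with nvidia-smi -e 1):

# prolog-ecc (sketch)
COMMENT=$(scontrol show job "$SLURM_JOB_ID" | tr ' ' '\n' | grep '^Comment=' | cut -d= -f2)
if [ "$COMMENT" = "ecc" ]; then
    nvidia-smi -e 0    # disable ECC; only takes effect after the GPU is reset or the node rebooted
fi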

MPS

  • prolog-mps
  • prolog-mps-per-gpu
  • epilog-mps

These scripts start (and stop) the MPS server if requested by the user. If the user passes --comment=mps, the prolog-mps script starts a single MPS daemon on the node. If the user passes --comment=mps-per-gpu, the prolog-mps-per-gpu script starts one MPS daemon and MPS server per GPU, with each MPS server bound to the appropriate CPU cores. A sketch of the single-daemon case follows.
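This sketch assumes the job comment is read via scontrol and uses illustrative pipe/log directories; the per-GPU variant additionally sets CUDA_VISIBLE_DEVICES and pins each daemon to its GPU's CPU cores.

# prolog-mps (sketch) - start one MPS control daemon for the whole node
COMMENT=$(scontrol show job "$SLURM_JOB_ID" | tr ' ' '\n' | grep '^Comment=' | cut -d= -f2)
if [ "$COMMENT" = "mps" ]; then
    export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps       # illustrative paths
    export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-mps-log
    nvidia-cuda-mps-control -d                           # start the MPS control daemon in the background
fi

# epilog-mps (sketch) - shut the daemon down after the job
echo quit | CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps nvidia-cuda-mps-control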

DCGM Health Checks

  • prolog-dcgmhealth

DCGM utilities must be installed. https://developer.nvidia.com/data-center-gpu-manager-dcgm

This script runs a quick (a few seconds) health check of the GPUs on a node. If the health check fails, the node is put into a drain state. Output from the health check is written to /tmp/dcgm.out on the compute node. The job will fail and, depending upon the Slurm configuration, it may or may not be requeued. An interactive job will fail and will not be requeued.

The contents of /tmp/dcgm.out should look like:

Successfully ran diagnostic for group.
+---------------------------+------------------------------------------------+
| Diagnostic                | Result                                         |
+===========================+================================================+
|-----  Deployment  --------+------------------------------------------------|
| Blacklist                 | Pass                                           |
| NVML Library              | Pass                                           |
| CUDA Main Library         | Pass                                           |
| Permissions and OS Blocks | Pass                                           |
| Persistence Mode          | Pass                                           |
| Environment Variables     | Pass                                           |
| Page Retirement           | Pass                                           |
| Graphics Processes        | Pass                                           |
| Inforom                   | Pass                                           |

A failed health check might look like:

Successfully ran diagnostic for group.
+---------------------------+------------------------------------------------+
| Diagnostic                | Result                                         |
+===========================+================================================+
|-----  Deployment  --------+------------------------------------------------|
| Blacklist                 | Pass                                           |
| NVML Library              | Pass                                           |
| CUDA Main Library         | Pass                                           |
| Permissions and OS Blocks | Pass                                           |
| Persistence Mode          | Fail                                           |
| Environment Variables     | Fail                                           |
| Page Retirement           | Fail                                           |
| Graphics Processes        | Fail                                           |
| Inforom                   | Fail                                           |
+-----  Hardware  ----------+------------------------------------------------+
+-----  Integration  -------+------------------------------------------------+
+-----  Stress  ------------+------------------------------------------------+
+---------------------------+------------------------------------------------+
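A minimal sketch of the prolog logic described above, assuming dcgmi and scontrol are on the PATH and the prolog runs with sufficient privileges (the run level and drain reason are illustrative):

# prolog-dcgmhealth (sketch)
dcgmi diag -r 1 > /tmp/dcgm.out 2>&1    # quick (level 1) diagnostic, takes a few seconds
if grep -qi fail /tmp/dcgm.out; then
    scontrol update nodename="$(hostname)" state=drain reason="DCGM health check failed"
    exit 1                              # a non-zero prolog exit fails the job
fi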

GPU Reset

  • prolog-gpureset

This script will reset the application clocks on the GPUs, activate accounting, and clear the current logs. Inside this script there is a section that is commented out; that section would execute a reset of the GPU. That action is not always guaranteed to succeed and should be done with caution.
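A minimal sketch of these actions using nvidia-smi (whether "clear the current logs" maps exactly to clearing the accounting data is an assumption; the GPU reset is left commented out, as in the script):

# prolog-gpureset (sketch)
nvidia-smi -rac        # reset application clocks to their defaults
nvidia-smi -am 1       # enable accounting mode
nvidia-smi -caa        # clear accounted-apps data (assumed to be the "logs" mentioned above)
# nvidia-smi -r -i 0   # full reset of GPU 0 - not always guaranteed to succeed, use with caution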

GPU Check

  • prolog-lspci

This script checks that lspci sees all the GPUs that a node should have according to Slurm. It requires that Slurm be configured with GPUs as a consumable resource (gres). If this check fails, the node is put into a drain state. The job will fail and, depending upon the Slurm configuration, it may or may not be requeued. An interactive job will fail and will not be requeued.
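A minimal sketch of the comparison, assuming the node's gres entry has the simple form gpu:<count> (the parsing and the lspci filter are illustrative and would need adjusting for typed gres or non-standard PCI class strings):

# prolog-lspci (sketch)
EXPECTED=$(scontrol show node "$(hostname)" | grep -o 'Gres=gpu:[0-9]*' | cut -d: -f2)
[ -n "$EXPECTED" ] || exit 0                         # no gres configured, nothing to check
FOUND=$(lspci | grep -ci 'controller: nvidia')       # count VGA / 3D controllers only
if [ "$FOUND" -lt "$EXPECTED" ]; then
    scontrol update nodename="$(hostname)" state=drain reason="lspci found $FOUND of $EXPECTED GPUs"
    exit 1
fi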

General cleanup

  • epilog-cleanup

This script runs some non-GPU-specific cleanup tasks: killing leftover user processes, syncing cached writes, and dropping caches. It also checks for processes still running on the GPUs; any found are sent SIGKILL, and if they are still running after 5 seconds the node is drained.
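A minimal sketch of the sequence, assuming the epilog runs as root and SLURM_JOB_USER is set (the drain reason wording is illustrative):

# epilog-cleanup (sketch)
pkill -KILL -u "$SLURM_JOB_USER"        # kill any leftover user processes
sync                                    # flush cached writes to disk
echo 3 > /proc/sys/vm/drop_caches       # drop page, dentry and inode caches

GPU_PIDS=$(nvidia-smi --query-compute-apps=pid --format=csv,noheader)
if [ -n "$GPU_PIDS" ]; then
    kill -9 $GPU_PIDS                   # SIGKILL anything still running on the GPUs
    sleep 5
    if [ -n "$(nvidia-smi --query-compute-apps=pid --format=csv,noheader)" ]; then
        scontrol update nodename="$(hostname)" state=drain reason="unkillable GPU processes"
    fi
fi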