Job Status and Reason Codes

The squeue command reports an active job’s status through state and reason codes. Job state codes describe a job’s current state in the queue (e.g. pending, completed). Job reason codes describe why the job is in its current state.

The following tables outline a variety of job state and reason codes you may encounter when using squeue to check on your jobs.

Job State Codes

| Status | Code | Explanation |
|---|---|---|
| CANCELLED | CA | The job was explicitly cancelled by the user or system administrator. |
| COMPLETED | CD | The job has completed successfully. |
| COMPLETING | CG | The job is finishing but some processes are still active. |
| DEADLINE | DL | The job terminated on deadline. |
| FAILED | F | The job terminated with a non-zero exit code or other failure condition. |
| NODE_FAIL | NF | The job terminated due to failure of one or more allocated nodes. |
| OUT_OF_MEMORY | OOM | The job experienced an out-of-memory error. |
| PENDING | PD | The job is waiting for resource allocation. It will eventually run. |
| PREEMPTED | PR | The job was terminated because of preemption by another job. |
| RUNNING | R | The job is currently allocated to a node and is running. |
| SUSPENDED | S | A running job has been stopped with its cores released to other jobs. |
| STOPPED | ST | A running job has been stopped with its cores retained. |
| TIMEOUT | TO | The job terminated upon reaching its time limit. |

A full list of these Job State codes can be found in the squeue documentation or the sacct documentation.
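To get a quick overview of how your jobs are spread across these states, the compact codes printed by squeue can be tallied. The following is a minimal sketch: `count_states` is a hypothetical helper name (not part of Slurm), and it only assumes that one compact state code arrives per input line.

```shell
# count_states: hypothetical helper (not a Slurm command).
# Reads one compact state code per line on stdin and prints "<code> <count>".
count_states() {
    awk 'NF' | sort | uniq -c | awk '{ printf "%s %d\n", $2, $1 }'
}

# Typical use on a cluster (requires Slurm):
#   -h drops the header line, -o "%t" prints only the compact state code.
#   squeue -u "$USER" -h -o "%t" | count_states
```

On a login node, the commented squeue pipeline would print one line per state, for example the number of PD and R jobs you currently have.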

Job Reason Codes

| Reason Code | Explanation |
|---|---|
| Priority | One or more higher-priority jobs are in the queue. Your job will eventually run. |
| Dependency | This job is waiting for a dependent job to complete and will run afterwards. |
| Resources | The job is waiting for resources to become available and will eventually run. |
| InvalidAccount | The job’s account is invalid. Cancel the job and rerun with the correct account. |
| InvalidQOS | The job’s QoS is invalid. Cancel the job and rerun with the correct QoS. |
| QOSGrpCpuLimit | All CPUs assigned to your job’s specified QoS are in use; job will run eventually. |
| QOSGrpMaxJobsLimit | The maximum number of jobs for your job’s QoS has been reached; job will run eventually. |
| QOSGrpNodeLimit | All nodes assigned to your job’s specified QoS are in use; job will run eventually. |
| PartitionCpuLimit | All CPUs assigned to your job’s specified partition are in use; job will run eventually. |
| PartitionMaxJobsLimit | The maximum number of jobs for your job’s partition has been reached; job will run eventually. |
| PartitionNodeLimit | All nodes assigned to your job’s specified partition are in use; job will run eventually. |
| AssociationCpuLimit | All CPUs assigned to your job’s specified association are in use; job will run eventually. |
| AssociationMaxJobsLimit | The maximum number of jobs for your job’s association has been reached; job will run eventually. |
| AssociationNodeLimit | All nodes assigned to your job’s specified association are in use; job will run eventually. |

A full list of these Job Reason Codes can be found in Slurm’s documentation.
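The reason codes above fall into two groups: those where the job will eventually run on its own, and the Invalid* reasons where the job has to be cancelled and resubmitted. A minimal sketch of that triage follows; `reason_advice` and its output strings are hypothetical names chosen for illustration, and any reason not listed in the table is reported as unknown rather than guessed.

```shell
# reason_advice: hypothetical helper (not a Slurm command).
# Maps a pending reason code to the action suggested by the table above.
reason_advice() {
    case "$1" in
        Invalid*)
            echo "cancel-and-resubmit" ;;   # e.g. InvalidAccount
        Priority|Resources|Dependency|*Limit)
            echo "wait" ;;                  # the job will eventually run
        *)
            echo "unknown" ;;               # check Slurm's full reason list
    esac
}

# Typical use on a cluster (requires Slurm):
#   %i prints the job ID and %r the reason; -t PENDING keeps pending jobs only.
#   squeue -u "$USER" -h -t PENDING -o "%i %r" |
#       while read -r jobid reason; do
#           echo "$jobid $reason $(reason_advice "$reason")"
#       done
```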

Running Job Statistics Metrics

The sstat command allows users to easily pull up status information about their currently running jobs. This includes information about CPU usage, task information, node information, resident set size (RSS), and virtual memory (VM). We can invoke the sstat command as follows:

# /!\ ADAPT <jobid> accordingly
$ sstat --jobs=<jobid>

By default, sstat prints significantly more information than is usually needed. To narrow the output, use the --format flag to choose the fields you want. Some of the available variables are listed in the table below:

| Variable | Description |
|---|---|
| avecpu | Average CPU time of all tasks in the job. |
| averss | Average resident set size of all tasks in the job. |
| avevmsize | Average virtual memory size of all tasks in the job. |
| jobid | The ID of the job. |
| maxrss | Maximum resident set size of all tasks in the job. |
| maxvmsize | Maximum virtual memory size of all tasks in the job. |
| ntasks | Number of tasks in the job. |

For example, let's print a job's ID, average CPU time, maximum RSS, and number of tasks. We can do this with the command:

# /!\ ADAPT <jobid> accordingly
$ sstat --jobs=<jobid> --format=jobid,avecpu,maxrss,ntasks

A full list of variables that specify data handled by sstat can be found with the --helpformat flag or by visiting the Slurm page on sstat.
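Memory fields such as maxrss are usually printed with a unit suffix (for example a value ending in K), which makes them awkward to compare. The sketch below converts such a value to MiB. `rss_to_mib` is a hypothetical helper, and it assumes that the suffixes K, M, G and T are powers of 1024 and that a value without a suffix is in bytes; check this against your site's Slurm configuration before relying on it.

```shell
# rss_to_mib: hypothetical helper (not a Slurm command).
# Converts a memory value such as "204800K" or "2G" to MiB.
# Assumption: K/M/G/T are powers of 1024; no suffix means bytes.
rss_to_mib() {
    awk -v v="$1" 'BEGIN {
        unit = substr(v, length(v), 1)
        num = v + 0                     # numeric prefix; the suffix is ignored
        if (unit == "K")      mib = num / 1024
        else if (unit == "M") mib = num
        else if (unit == "G") mib = num * 1024
        else if (unit == "T") mib = num * 1024 * 1024
        else                  mib = num / (1024 * 1024)
        printf "%.1f\n", mib
    }'
}

# Typical use on a cluster (requires Slurm):
#   -n drops the header and -P prints pipe-delimited output without padding.
#   rss_to_mib "$(sstat -n -P --jobs=<jobid> --format=maxrss | head -n 1)"
```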

Past Job Statistics Metrics

You can use the custom susage function in /etc/profile.d/slurm.sh to collect statistics information.

$ susage -h
Usage: susage [-m] [-Y] [-S YYYY-MM-DD] [-E YYYY-MM-DD]
  For a specific user (if accounting rights granted):    susage [...] -u <user>
  For a specific account (if accounting rights granted): susage [...] -A <account>
Display past job usage summary

By default, however, you should use the sacct command, which lets users pull up status information about past jobs. It is very similar to sstat, but works on jobs that have previously run on the system rather than on currently running jobs.

# /!\ ADAPT <jobid> accordingly
$ sacct [-X] --jobs=<jobid> [--format=metric1,...]
# OR, for a user, optionally between a Start and End date
$ sacct [-X] -u $USER  [-S YYYY-MM-DD] [-E YYYY-MM-DD] [--format=metric1,...]
# OR, for an account - ADAPT <account> accordingly
$ sacct [-X] -A <account> [--format=metric1,...]

Use -X to aggregate the statistics relevant to the job allocation itself, not taking job steps into consideration.
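A common use of these invocations is to find past jobs that did not finish cleanly. The sketch below filters pipe-delimited sacct output for that purpose. `failed_jobs` is a hypothetical helper, the state field is a sacct format field not listed in this page's table, and the start date in the usage comment is only a placeholder.

```shell
# failed_jobs: hypothetical helper (not a Slurm command).
# Reads "JobID|State|ExitCode" lines on stdin and prints every job whose
# state is anything other than COMPLETED, RUNNING or PENDING.
failed_jobs() {
    awk -F'|' '$2 != "COMPLETED" && $2 != "RUNNING" && $2 != "PENDING" {
        print $1, $2, $3
    }'
}

# Typical use on a cluster (requires Slurm); adapt the start date:
#   -n drops the header, -P prints pipe-delimited fields, -X keeps allocations only.
#   sacct -X -n -P -u "$USER" -S 2024-01-01 --format=jobid,state,exitcode | failed_jobs
```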

The main metrics you may want to review are listed below.

| Variable | Description |
|---|---|
| account | Account the job ran under. |
| avecpu | Average CPU time of all tasks in the job. |
| averss | Average resident set size of all tasks in the job. |
| cputime | Time used (elapsed time * CPU count) by a job or step. |
| elapsed | The job's elapsed time, formatted as DD-HH:MM:SS. |
| exitcode | The exit code returned by the job script or salloc. |
| jobid | The ID of the job. |
| jobname | The name of the job. |
| maxdiskread | Maximum number of bytes read by all tasks in the job. |
| maxdiskwrite | Maximum number of bytes written by all tasks in the job. |
| maxrss | Maximum resident set size of all tasks in the job. |
| ncpus | Number of allocated CPUs. |
| nnodes | The number of nodes used in the job. |
| ntasks | Number of tasks in the job. |
| priority | Slurm priority. |
| qos | Quality of service. |
| reqcpus | Required number of CPUs. |
| reqmem | Required amount of memory for the job. |
| reqtres | Required Trackable RESources (TRES). |
| user | User name of the user who ran the job. |

A full list of variables that specify data handled by sacct can be found with the --helpformat flag or by visiting the Slurm page on sacct.
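Time fields such as elapsed are easier to compare or sum once converted to seconds. The following sketch does that conversion. `elapsed_to_seconds` is a hypothetical helper, and it assumes the value has the form DD-HH:MM:SS with the day and hour parts optional.

```shell
# elapsed_to_seconds: hypothetical helper (not a Slurm command).
# Converts an elapsed time such as "1-02:03:04", "02:03:04" or "05:30"
# to a number of seconds.
elapsed_to_seconds() {
    echo "$1" | awk '{
        days = 0; rest = $0
        if (index(rest, "-") > 0) {     # optional leading "DD-"
            split(rest, d, "-"); days = d[1]; rest = d[2]
        }
        n = split(rest, p, ":")
        if (n == 3)      secs = p[1] * 3600 + p[2] * 60 + p[3]
        else if (n == 2) secs = p[1] * 60 + p[2]
        else             secs = p[1]
        print days * 86400 + secs
    }'
}

# Typical use on a cluster (requires Slurm):
#   elapsed_to_seconds "$(sacct -X -n -P --jobs=<jobid> --format=elapsed)"
```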