Testing cluster

  1. Make sure to upload the post_install.bash script to S3:
aws s3 cp setup/playground/post_install.bash s3://aws-parallel-cluster-slurm/playground/post_install.bash
  2. Create cluster:
pcluster create aws-playground-cluster -c configs/playground

pcluster status aws-playground-cluster -c configs/playground

If cluster creation fails, retry with the "no rollback" option:

pcluster create aws-playground-cluster -c configs/playground --norollback

Then you can connect to the head node over SSH and check /var/log/chef-client.log, which shows where the creation got stuck, or /var/log/parallelcluster/clustermgtd, which contains the reason why capacity could not be provisioned.
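
For example, once connected to the head node over SSH (see 2.1 below), the relevant logs can be inspected as follows (a sketch; sudo may not be needed depending on log permissions):

sudo tail -n 100 /var/log/chef-client.log
sudo tail -n 100 /var/log/parallelcluster/clustermgtd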

2.1 Check connection to the head node

pcluster ssh aws-playground-cluster -i ~/.ssh/aws-playground-cluster.pem

2.2 Get the head node public IP:

aws ec2 describe-instances --filters 'Name=instance-state-name,Values=running' --query 'Reservations[*].Instances[*].[Tags[?Key==`Name`]|[0].Value,PublicDnsName]' --output text
  3. Check cluster

Connect to the cluster and execute:

which sinfo
>
/opt/slurm/bin/sinfo

sinfo
>
PARTITION             AVAIL  TIMELIMIT  NODES  STATE NODELIST
cpu-compute-ondemand*    up   infinite      4  idle~ cpu-compute-ondemand-dy-t3micro-[1-4]
cpu-compute-spot         up   infinite      4  idle~ cpu-compute-spot-dy-t3micro-[1-4]
gpu-compute-ondemand     up   infinite      8  idle~ gpu-compute-ondemand-dy-g4dn12xlarge-[1-4],gpu-compute-ondemand-dy-g4dnxlarge-[1-4]
gpu-compute-spot         up   infinite      8  idle~ gpu-compute-spot-dy-g4dn12xlarge-[1-4],gpu-compute-spot-dy-g4dnxlarge-[1-4]

hostname
>
ip-10-0-0-16

srun -N 2 -n 2 -l --partition=cpu-compute-spot hostname
>
0: cpu-compute-spot-dy-t3micro-1
1: cpu-compute-spot-dy-t3micro-2

3.1 Module Environment

List available modules (the command below may be unavailable):

module av
>
------------------------------------------------------------------ /usr/share/modules/modulefiles ------------------------------------------------------------------
dot  libfabric-aws/1.13.0amzn1.0  module-git  module-info  modules  null  openmpi/4.1.1  use.own

---------------------------------------------------------- /opt/intel/impi/2019.8.254/intel64/modulefiles ----------------------------------------------------------
intelmpi

3.2 NFS Shares

List mounted volumes:

showmount -e localhost
>
Export list for localhost:
/opt/slurm 10.0.0.0/16
/opt/intel 10.0.0.0/16
/home      10.0.0.0/16
/shared    10.0.0.0/16
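
To confirm that the shares are also visible from the compute nodes, a quick check can look like this (a sketch using the partitions from the sinfo output above):

srun -N 1 --partition=cpu-compute-spot df -h /shared /home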
  4. Set up the conda environment

Connect to the cluster. To enable access to the aws-parallel-cluster-slurm repository, add your id_rsa.pub to the project's deploy keys: https://github.com/pytorch-ignite/aws-parallel-cluster-slurm/settings/keys

cat ~/.ssh/id_rsa.pub
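
If the key pair does not exist yet, it can be generated first (a minimal sketch; keep the default location so the path above matches):

ssh-keygen -t rsa -b 4096 -f ~/.ssh/id_rsa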

Clone the repository:

git clone git@github.com:pytorch-ignite/aws-parallel-cluster-slurm.git
cd aws-parallel-cluster-slurm/setup/playground/
sh install_miniconda.sh

source ~/.bashrc
conda env list
>
# conda environments:
#
base                  *  /shared/conda
conda create -y -n test
conda activate test
conda install -y  pytorch cpuonly -c pytorch
srun -N 2 -l --partition=cpu-compute-spot conda env list
>
0: # conda environments:
0: #
0: base                     /shared/conda
0: test                  *  /shared/conda/envs/test
0:
1: # conda environments:
1: #
1: base                     /shared/conda
1: test                  *  /shared/conda/envs/test
1:
  5. Examples

Connect to the cluster and activate the "test" environment:

conda activate test
cd aws-parallel-cluster-slurm/slurm-examples/pytorch
sbatch script1.sbatch
squeue
head -100 slurm_<i>.out
>
Mon Sep 20 21:26:38 UTC 2021
cpu-compute-spot-dy-t3micro-1
/home/ubuntu/aws-parallel-cluster-slurm/slurm-examples/pytorch
# conda environments:
#
base                     /shared/conda
test                  *  /shared/conda/envs/test

/shared/conda/envs/test/bin/python
Python 3.9.7
# packages in environment at /shared/conda/envs/test:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main
_openmp_mutex             4.5                       1_gnu
blas                      1.0                         mkl
ca-certificates           2021.7.5             h06a4308_1
certifi                   2021.5.30        py39h06a4308_0
cpuonly                   1.0                           0    pytorch
...
1.9.0
Mon Sep 20 21:26:40 UTC 2021
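
For reference, a minimal sbatch script that produces output of this shape could look like the following (a sketch only, not necessarily the actual script1.sbatch; the conda activation lines and the output file name are assumptions):

#!/bin/bash
#SBATCH --job-name=pytorch-cpu-check
#SBATCH --partition=cpu-compute-spot
#SBATCH --nodes=1
#SBATCH --output=slurm_%j.out

date
hostname
pwd
# Assumption: the shared conda install lives at /shared/conda (see "conda env list" above)
source /shared/conda/etc/profile.d/conda.sh
conda activate test
conda env list
which python
python --version
conda list
python -c "import torch; print(torch.__version__)"
date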

Cluster AWS dashboard

Update cluster

To modify configuration and apply it to the existing cluster:

pcluster stop aws-playground-cluster -c configs/playground

pcluster update aws-playground-cluster -c configs/playground

...

pcluster start aws-playground-cluster -c configs/playground

pcluster status aws-playground-cluster -c configs/playground

GPU examples

  1. Set up the conda environment

Connect to the cluster and execute the following (a GPU-node sanity check is sketched after these commands):

cd aws-parallel-cluster-slurm/setup/conda_envs
conda env list

conda env create -f pytorch_ignite_vision.yml -n pytorch_ignite_vision
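
Before submitting GPU jobs, it can help to confirm that a GPU node comes up and sees the shared environments (a sketch; assumes the NVIDIA driver is installed on the GPU nodes):

srun -N 1 --partition=gpu-compute-ondemand nvidia-smi
srun -N 1 --partition=gpu-compute-ondemand conda env list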
  2. Submit a GPU job (PyTorch); a hypothetical sbatch sketch follows the commands below
conda activate pytorch_ignite_vision
cd aws-parallel-cluster-slurm/slurm-examples/pytorch
sbatch script5.sbatch
squeue
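
script5.sbatch is not reproduced here; a hypothetical GPU sbatch script for this cluster might look like the following (a sketch only; the GRES request and the conda activation lines are assumptions, and the actual script may differ):

#!/bin/bash
#SBATCH --job-name=pytorch-gpu-check
#SBATCH --partition=gpu-compute-ondemand
#SBATCH --nodes=1
#SBATCH --gres=gpu:1
#SBATCH --output=slurm_%j.out

source /shared/conda/etc/profile.d/conda.sh
conda activate pytorch_ignite_vision
nvidia-smi
python -c "import torch; print(torch.cuda.is_available())"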
  3. Submit a CPU job with PyTorch-Ignite
conda activate pytorch_ignite_vision
cd aws-parallel-cluster-slurm/slurm-examples/ignite
sbatch script1.sbatch
squeue

Output:

2021-09-20 22:06:36,240 ignite.distributed.launcher.Parallel INFO: Initialized processing group with backend: 'gloo'
2021-09-20 22:06:36,240 ignite.distributed.launcher.Parallel INFO: - Run '<function main_fn at 0x7f0a65206160>' in 2 processes
2021-09-20 22:06:36,441 ignite.distributed.launcher.Parallel INFO: End of run
2021-09-20 22:06:36,441 ignite.distributed.launcher.Parallel INFO: Finalized processing group with backend: 'gloo'
1 ignite version: 0.4.6
[cpu-compute-spot-dy-t3micro-1:15007], [gloo], process 1/2
0 ignite version: 0.4.6
[cpu-compute-spot-dy-t3micro-1:15007], [gloo], process 0/2
  4. Submit a GPU job with PyTorch-Ignite
conda activate pytorch_ignite_vision
cd aws-parallel-cluster-slurm/slurm-examples/ignite
sbatch -v script2.sbatch
squeue

Output:

[W ProcessGroupNCCL.cpp:1569] Rank 0 using best-guess GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[W ProcessGroupNCCL.cpp:1569] Rank 1 using best-guess GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
2021-09-20 22:35:03,406 ignite.distributed.launcher.Parallel INFO: Initialized processing group with backend: 'nccl'
2021-09-20 22:35:03,406 ignite.distributed.launcher.Parallel INFO: - Run '<function main_fn at 0x7f1919f2b160>' in 2 processes
2021-09-20 22:35:03,616 ignite.distributed.launcher.Parallel INFO: End of run
2021-09-20 22:35:03,616 ignite.distributed.launcher.Parallel INFO: Finalized processing group with backend: 'nccl'
1 ignite version: 0.4.6
[gpu-compute-ondemand-dy-g4dnxlarge-1:15011], [nccl], process 1/2
0 ignite version: 0.4.6
[gpu-compute-ondemand-dy-g4dnxlarge-1:15011], [nccl], process 0/2
  5. Train resnet18 on CIFAR10 with PyTorch-Ignite

Using Docker images

  1. Submit a CPU job using docker images:
srun --partition=cpu-compute-spot --container-image=ubuntu:latest grep PRETTY /etc/os-release
>
pyxis: importing docker image ...
PRETTY_NAME="Ubuntu 20.04.3 LTS"

or

srun -N 2 -l --partition=cpu-compute-spot --container-image=ubuntu:latest grep PRETTY /etc/os-release
>
1: pyxis: importing docker image ...
0: pyxis: importing docker image ...
1: PRETTY_NAME="Ubuntu 20.04.3 LTS"
0: PRETTY_NAME="Ubuntu 20.04.3 LTS"
  2. Submit a CPU job using the pytorchignite/vision:latest Docker image

2.1 Import docker image in sqsh format:

enroot import -o /shared/enroot_data/pytorchignite+vision+latest.sqsh docker://pytorchignite/vision:latest

2.2 Execute a command from the container:

  • Current working directory: --container-workdir=$PWD
  • Shared folder is still at /shared
NVIDIA_VISIBLE_DEVICES="" srun --partition=cpu-compute-spot \
    --container-name=ignite-vision \
    --container-image=/shared/enroot_data/pytorchignite+vision+latest.sqsh \
    --container-workdir=$PWD \
    --no-container-remap-root \
    bash -c 'echo "Current working directory: $PWD, $(ls $PWD)" && echo && echo "Shared directory at /shared : $(ls /shared)" && echo && pip list | grep torch && echo && echo "Enroot pytorch hook applied: $WORLD_SIZE:$RANK:$LOCAL_RANK:$MASTER_ADDR:$MASTER_PORT"'

Remove existing cluster

pcluster delete aws-playground-cluster -c configs/playground

Enroot and srun commands:

enroot import docker://python:latest

enroot start --mount $PWD:/ws python+latest.sqsh /bin/bash

Debug enroot configuration from a worker node:

srun --partition=cpu-compute-spot --pty bash
srun --partition=cpu-compute-spot --container-image=ubuntu:latest --pty bash
srun --partition=cpu-compute-spot --container-image=python:3.9-alpine --pty ash


SLURM_DEBUG=2 srun --partition=gpu-compute-spot -w gpu-compute-spot-dy-g4dnxlarge-1 --pty bash

enroot start /shared/enroot_data/pytorchignite+vision+latest.sqsh /bin/bash

SLURM_DEBUG=2 NVIDIA_VISIBLE_DEVICES= srun --partition=cpu-compute-spot --container-name=ignite-vision --container-image=/shared/enroot_data/pytorchignite+vision+latest.sqsh pip list | grep torch

SLURM_DEBUG=2 NVIDIA_VISIBLE_DEVICES="" srun --partition=cpu-compute-spot --container-name=ignite-vision --container-image=/shared/enroot_data/pytorchignite+vision+latest.sqsh --pty bash

srun -N 2 -n 2 -p cpu-compute-spot -l bash -c 'echo "$SLURM_JOB_ID,$SLURM_NTASKS,$SLURM_PROCID,$SLURM_LOCALID,$SLURM_STEP_TASKS_PER_NODE"'

Enable PyTorch hook:

sudo cp /usr/local/share/enroot/hooks.d/50-slurm-pytorch.sh /usr/local/etc/enroot/hooks.d/
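
After the hook is in place, a container started through pyxis should see the Slurm-derived variables used in the command above (a quick sketch; whether the hook fires can depend on the image and the hook logic):

srun --partition=cpu-compute-spot --container-image=ubuntu:latest bash -c 'echo "$WORLD_SIZE:$RANK:$LOCAL_RANK:$MASTER_ADDR:$MASTER_PORT"'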

Weird spot instance error messages

@ingestionTime: 1632138698570
@log: 201193730185:/aws/parallelcluster/aws-playground-cluster
@logStream: ip-10-0-0-179.i-073d7e1c2820a64e5.slurm_resume
@message: 2021-09-20 11:51:36,615 - [slurm_plugin.common:add_instances_for_nodes] - ERROR - Encountered exception when launching instances for nodes (x1) ['gpu-compute-spot-dy-g4dnxlarge-1']: An error occurred (InsufficientInstanceCapacity) when calling the RunInstances operation (reached max retries: 1): There is no Spot capacity available that matches your request.
@timestamp: 1632138696615


@ingestionTime: 1632209819818
@log: 201193730185:/aws/parallelcluster/aws-playground-cluster
@logStream: ip-10-0-0-16.i-0e994a2110d84941d.slurm_resume
@message: 2021-09-21 07:36:57,162 - [slurm_plugin.common:add_instances_for_nodes] - ERROR - Encountered exception when launching instances for nodes (x2) ['gpu-compute-spot-dy-g4dnxlarge-1', 'gpu-compute-spot-dy-g4dnxlarge-2']: An error occurred (InsufficientInstanceCapacity) when calling the RunInstances operation (reached max retries: 1): We currently do not have sufficient g4dn.xlarge capacity in the Availability Zone you requested (us-east-2c). Our system will be working on provisioning additional capacity. You can currently get g4dn.xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-2a, us-east-2b.
@timestamp: 1632209817162
Pyxis / enroot errors seen when starting containers (related to the 50-slurm-pytorch.sh hook above):

1: slurmstepd: error: pyxis: container start failed with error code: 1
1: slurmstepd: error: pyxis: printing contents of log file ...
1: slurmstepd: error: pyxis:     /usr/local/etc/enroot/hooks.d/50-slurm-pytorch.sh: line 39: [: 1(x2): integer expression expected
1: slurmstepd: error: pyxis:     /etc/profile.d/01-locale-fix.sh: line 2: /usr/bin/locale-check: No such file or directory
1: slurmstepd: error: pyxis:     /etc/rc: line 4: cd: /workspace: No such file or directory
1: slurmstepd: error: pyxis: couldn't start container
1: slurmstepd: error: pyxis: if the image has an unusual entrypoint, try using --no-container-entrypoint
1: slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
1: slurmstepd: error: Failed to invoke spank plugin stack
srun: error: gpu-compute-ondemand-dy-g4dnxlarge-2: task 1: Exited with exit code 1
To check which Availability Zones report spot prices for the instance type:

export REGION="us-east-2"
export INSTANCE="g4dn.xlarge"
export NOW=$(date -Iseconds)

aws --region $REGION ec2 describe-spot-price-history --instance-types $INSTANCE --product-description "Linux/UNIX" --start-time "$NOW" | sort -n -k5 | awk '{print $2}' | sort | uniq

aws --region $REGION ec2 describe-subnets --filters "Name=defaultForAz,Values=true"
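
A variant of the command above that prints only the Availability Zone and subnet ID of each default subnet (a sketch; useful when picking a different AZ for the compute subnet):

aws --region $REGION ec2 describe-subnets --filters "Name=defaultForAz,Values=true" --query 'Subnets[*].[AvailabilityZone,SubnetId]' --output text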

aws --region $REGION autoscaling describe-auto-scaling-groups | egrep autoScalingGroupName | egrep playground-ComputeFleet | awk '{print $3}'