This tutorial is based on the Alvis tutorial.
An example of how to log in to Alvis:
If you are not connected to Chalmers internet, you need to use a Chalmers VPN.
The command:
will show you the amount of storage you have left. Be aware of your storage quota!
For downloading and uploading files, you can use scp.
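The login and file-transfer steps above could be wrapped roughly as follows. This is a minimal sketch, not an official recipe. The helper names (alvis_login, alvis_upload, alvis_download) and the ALVIS_USER placeholder are made up for illustration, and the login host alvis1.c3se.chalmers.se is an assumption that should be checked against the current C3SE documentation. Defining the functions does not open any connection.

```shell
#!/usr/bin/env bash
# Placeholder: replace with your own CID (Chalmers ID).
ALVIS_USER="your_cid"
# Assumed login node; verify against the C3SE documentation.
ALVIS_HOST="alvis1.c3se.chalmers.se"

# Open an interactive shell on the login node.
alvis_login() {
    ssh "${ALVIS_USER}@${ALVIS_HOST}"
}

# Upload a local file or directory: alvis_upload <local_path> <remote_path>
alvis_upload() {
    scp -r "$1" "${ALVIS_USER}@${ALVIS_HOST}:$2"
}

# Download a remote file or directory: alvis_download <remote_path> <local_path>
alvis_download() {
    scp -r "${ALVIS_USER}@${ALVIS_HOST}:$1" "$2"
}
```

With these defined, something like alvis_upload ./data /cephyr/users/your_cid/Alvis/ would copy a local directory to the cluster. The remote path here is a hypothetical example, not a prescribed location.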
- System status
jobinfo -u lovhag
- Connect with VS Code
- "Remote Explorer"
- Capacity of GPUs
- Memory overload: "CUDA out of memory" (OOM)
- Modules
python3 -m venv <path>
- Use symbolic links!
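One possible reading of the symbolic-link tip, assuming the goal is to keep large directories (data, virtual environments) in a storage area with more space while still reaching them from the project directory. All paths below are hypothetical stand-ins created under a temporary directory so the sketch runs anywhere; on the cluster you would use your own locations.

```shell
#!/usr/bin/env bash
set -eu

# Stand-ins for real locations, created under a temporary directory.
WORK_DIR="$(mktemp -d)"
STORAGE_DIR="${WORK_DIR}/project_storage"   # hypothetical large storage area
PROJECT_DIR="${WORK_DIR}/my_project"        # hypothetical code checkout

mkdir -p "${STORAGE_DIR}/data" "${PROJECT_DIR}"
echo "example" > "${STORAGE_DIR}/data/sample.txt"

# Link the storage directory into the project instead of copying it.
ln -s "${STORAGE_DIR}/data" "${PROJECT_DIR}/data"

# The file is reachable through the link, but is stored only once.
cat "${PROJECT_DIR}/data/sample.txt"
```

The same pattern would apply to a virtual environment: create it with python3 -m venv in the storage area and link it into the project.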
#!/usr/bin/env bash
#SBATCH -A SNIC2022-22-1040 -p alvis
#SBATCH -n 1
#SBATCH --gpus-per-node=A100:1
#SBATCH --job-name=FiD-NQ-eval
#SBATCH -o /cephyr/users/lovhag/Alvis/projects/SKR-project/FiD/NQ-eval.out
#SBATCH -t 4:00:00
set -eo pipefail
module load GCC/10.2.0 CUDA/11.1.1 OpenMPI/4.0.5 PyTorch/1.7.1-fosscuda-2020b
source venv/bin/activate
python \
--model_path "${DATA_ROOT}/models/nq_reader_base" \
--eval_data "${DATA_ROOT}/NQ/test.json" \
--per_gpu_batch_size 8 \
--n_context 100 \
--name NQ_test_base_batch_8 \
--checkpoint_dir "${DATA_ROOT}/test_checkpoints" \
--write_results
Make sure that it is runnable!
Monitor with scruffy: "This job can be monitored from:"
Make sure that all allocated resources are used
sbatch jobscript
To run multiple jobs at once:
sbatch --array=25,50,75,100 {{model_name_script}}
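Inside a job script submitted this way, Slurm exposes the value of each array task through the SLURM_ARRAY_TASK_ID environment variable. The sketch below shows one way a script might pick it up. Using the array value as a context count, and the TASK_ID, N_CONTEXT and RUN_NAME variable names, are assumptions for illustration and not taken from the original script.

```shell
#!/usr/bin/env bash
# Slurm sets SLURM_ARRAY_TASK_ID for each array task. The fallback value
# of 25 is only there so the sketch also runs outside Slurm.
TASK_ID="${SLURM_ARRAY_TASK_ID:-25}"

# Hypothetical use: treat the array value as the number of contexts.
N_CONTEXT="${TASK_ID}"
RUN_NAME="NQ_test_n_context_${N_CONTEXT}"

echo "Task ${TASK_ID}: running ${RUN_NAME}"
```

With --array=25,50,75,100, four tasks would be started, each seeing one of the four values.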
- sbatch: submit batch jobs
- srun: submit interactive jobs
- jobinfo (squeue): view the job queue and the state of jobs in the queue; shows the amount of idling resources
- scontrol show job <jobid>: show details about a job, including reasons why it's pending
- sprio: show all your pending jobs and their priority
- scancel: cancel a running or pending job
- sinfo: show status for the partitions (queues): how many nodes are free, how many are down, busy, etc.
- projinfo: show the projects you belong to, including monthly allocation and usage

For details, refer to the -h flag, man pages, or Google!
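As a small illustration of using scontrol show job to find out why a job is pending, the sketch below filters its output down to the Reason field. The helper name pending_reason is hypothetical, and the sketch assumes the output contains a field of the form Reason=..., as Slurm normally prints next to JobState.

```shell
#!/usr/bin/env bash
# Read `scontrol show job` output on stdin and print the first Reason field.
pending_reason() {
    grep -o 'Reason=[^ ]*' | head -n 1
}

# Usage on the cluster (not run here):
#   scontrol show job <jobid> | pending_reason
```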
srun -p alvis -A SNIC2022-22-1040 -N 1 --gpus-per-node=T4:1 --job-name=demo -t 4:00:00 --pty bash
module load GCC/10.2.0 CUDA/11.1.1 OpenMPI/4.0.5 PyTorch/1.8.1-fosscuda-2020b torchvision/0.9.1-fosscuda-2020b-PyTorch-1.8.1
source ../venv/bin/activate
jupyter notebook ../data/larger-visual-commonsense-eval-experiments/normdata-evaluation-results.ipynb
Read more here:
- Logfile
- Environment
- Storage full
- What to do if something doesn't work
- Read docs
Contact Lovisa if anything is unclear or if you have any questions!