The naplab has several GPU resources connected to it. This page gives a short introduction to working remotely with them.
Resources:
- Nap01 Server
- HPC IDUN Cluster (hpc.ntnu.no)
****
For the nap01 server, please follow these rules:
- Use nvidia-docker to run jobs
- When starting a docker container, name the container with {ntnu_username}_...
- When creating a docker image, name the image {ntnu_username}/image_name
- ALWAYS check nvidia-smi to be certain that nobody is using the GPU you want to use
You can connect to the server by using ssh:
ssh ntnu_username@nap01.idi.ntnu.no
Nap01 has two NVIDIA V100-32GB GPUs and 2x Intel Xeon Gold 6132 CPUs.
There are two places to store data on the server:
- /lhome/ntnu_username: This is a 1TB disk where you should launch your programs from. However, you should NOT store large amounts of data on this disk! This disk is also backed up.
- /work/ntnu_username: This disk is for storing larger datasets. If you don't have a directory there, contact Frank or Håkon.
Useful commands for checking resource usage:
- df -h: View disk space on the server
- htop: View cpu/RAM usage on the server
- nvidia-smi: View VRAM/GPU usage on the V100 cards.
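The rule about checking nvidia-smi before starting a job can also be scripted. The sketch below is an illustrative helper (not an existing tool on the server) that parses the CSV output of nvidia-smi and lists GPUs with little memory in use:

```python
def parse_free_gpus(csv_text, max_used_mib=100):
    """Return indices of GPUs using less than max_used_mib MiB of memory.

    Expects the output of:
    nvidia-smi --query-gpu=index,memory.used --format=csv,noheader,nounits
    """
    free = []
    for line in csv_text.strip().splitlines():
        index, used = (field.strip() for field in line.split(","))
        if int(used) < max_used_mib:
            free.append(int(index))
    return free

# Example with canned nvidia-smi output (GPU 0 busy, GPU 1 idle):
sample = "0, 30500\n1, 11"
print(parse_free_gpus(sample))  # [1]
```

On the server you would feed it the real output, e.g. via `subprocess.check_output(["nvidia-smi", "--query-gpu=index,memory.used", "--format=csv,noheader,nounits"], text=True)`.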
To get a short introduction to docker, we recommend reading through the Open AI Lab's tutorial on docker.
Docker commands can become very long, with several static settings. To make your life easier, you can create a simple python script to start a docker container. For example:
{% code title="run_docker" %}
#!/usr/bin/env python3
import sys
import os
import random

gpu_id = sys.argv[1]
python_args = " ".join(sys.argv[2:])
docker_name = str(random.randint(0, 10000))
docker_container = "haakohu_{}".format(docker_name)  # Replace haakohu with your ntnu username
pwd = os.path.dirname(os.path.abspath(__file__))
cmd = [
    "nvidia-docker",
    "run",
    "-u 1123514",  # Set your user ID (find it with: id -u ntnu_username)
    f"--name {docker_container}",  # Set name of docker container
    "--ipc=host",  # --ipc=host is recommended by nvidia
    "--rm",  # remove container when exited / killed
    f"-v {pwd}:/workspace",  # mount directories. This mounts the current directory to /workspace in the container
    f"-e CUDA_VISIBLE_DEVICES={gpu_id}",  # Set GPU ID
    "--log-opt max-size=50m",  # Reduce memory usage from logs
    "-it",  # Interactive
    "haakohu/pytorch",  # Docker image
    python_args,  # python command
]
command = " ".join(cmd)
print(command)
os.system(command)
{% endcode %}
There are a couple of important settings to change here:
- Change docker_container so the name starts with your ntnu username: {NTNU-USERNAME}_...
- Change the -u argument in the cmd list. You can find your ID by logging onto the server and running id -u ntnu_username, for example id -u haakohu. This prevents the docker container from saving files as administrator, which can easily mess up your project files.
- The -v argument mounts folders. In the script, we only mount your current directory to /workspace in the docker container. If you need to mount something else, you can add several -v arguments.
- The docker image.
Save this with the filename run_docker and make it executable by running:
chmod +x run_docker
Then, you can start the training script on GPU 0:
./run_docker 0 python -m deep_privacy.train
If you want to start a job without GPU, you can run:
./run_docker "" python -m deep_privacy.train
For the first example, this will execute the following docker command:
nvidia-docker run -u 1123514 --name haakohu_5556 --ipc=host --rm -v /home/haakohu/DeepPrivacy:/workspace -e CUDA_VISIBLE_DEVICES=0 --log-opt max-size=50m -it haakohu/pytorch python -m deep_privacy.train
Nvidia GPU Cloud has several pre-built docker images for Nvidia systems:
{% embed url="https://ngc.nvidia.com/catalog/containers/nvidia:tensorflow" %}
Working remotely can be a hassle without mounting the remote filesystem. If you mount a remote folder on your local computer, you can use your favorite text editor to work on it.
We recommend using sshfs:
{% embed url="https://github.com/libfuse/sshfs" %}
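For example, to mount your home directory on nap01 locally (assuming sshfs is installed on your machine; replace ntnu_username with your own):

```shell
# Create a local mount point and mount your nap01 home directory there
mkdir -p ~/nap01
sshfs ntnu_username@nap01.idi.ntnu.no:/lhome/ntnu_username ~/nap01

# When done, unmount with:
#   fusermount -u ~/nap01   (Linux)
#   umount ~/nap01          (macOS)
```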
For larger datasets, we recommend storing your data on a different disk than the main SSD. This can be found under /work/ntnu_username. To get a directory for your username, contact Håkon.
The V100 cards are extremely powerful and require optimized code to realize their full computing potential.
You can see the utilization of the GPUs by running watch -n 0.5 nvidia-smi. Your code should be running at 90%+ utilization most of the time.
The V100 cards have 640 tensor cores, which come with some strict requirements. Most DL libraries run your operations on tensor cores automatically if you satisfy the following requirements:
- The number of filters in your CNN is divisible by 8
- Your batch size is divisible by 8
- Your parameters/input data are 16-bit floating point (FP16)
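As a quick sanity check, you can verify these conditions programmatically. This is an illustrative helper, not part of any library:

```python
def can_use_tensor_cores(num_filters, batch_size, dtype):
    """Check the three tensor-core conditions listed above."""
    return (num_filters % 8 == 0
            and batch_size % 8 == 0
            and dtype == "float16")

print(can_use_tensor_cores(64, 32, "float16"))  # True
print(can_use_tensor_cores(64, 30, "float16"))  # False: batch size not divisible by 8
print(can_use_tensor_cores(64, 32, "float32"))  # False: full precision
```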
The first two requirements are rather easy to satisfy; however, training a CNN in 16-bit floating point is hard. To train your network properly with FP16, you need mixed precision training. We recommend the following two resources to get started:
- https://devblogs.nvidia.com/video-mixed-precision-techniques-tensor-cores-deep-learning/#part2
- https://github.com/NVIDIA/apex - Highly recommended for Pytorch users!
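The central trick in mixed precision training is loss scaling: multiply the loss by a large factor before backpropagation so small FP16 gradients do not underflow, unscale the gradients before the weight update, and skip steps whose gradients overflowed. Below is a framework-free sketch of the dynamic variant; in practice apex or your DL library handles this on the GPU:

```python
import math

class DynamicLossScaler:
    """Minimal sketch of dynamic loss scaling for mixed precision training.

    Assumption: gradients arrive as plain Python floats; real implementations
    operate on FP16 tensors on the GPU.
    """
    def __init__(self, scale=2.0 ** 15, growth_interval=2000):
        self.scale = scale
        self.growth_interval = growth_interval
        self.good_steps = 0

    def scale_loss(self, loss):
        # Scale the loss up so small FP16 gradients do not underflow to zero.
        return loss * self.scale

    def step(self, scaled_grads):
        # Skip the update if any gradient overflowed to inf/nan.
        if any(math.isinf(g) or math.isnan(g) for g in scaled_grads):
            self.scale /= 2.0          # back off the scale
            self.good_steps = 0
            return None                # signal: skip this weight update
        grads = [g / self.scale for g in scaled_grads]  # unscale for the update
        self.good_steps += 1
        if self.good_steps >= self.growth_interval:
            self.scale *= 2.0          # try a larger scale again
            self.good_steps = 0
        return grads

scaler = DynamicLossScaler(scale=4.0)
print(scaler.step([8.0, 2.0]))      # [2.0, 0.5] -> unscaled gradients
print(scaler.step([float("inf")]))  # None -> overflow, step skipped, scale halved
```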
With my code, I got a 220% speed up without losing any performance.
If your code is running slow and you can't find the bottleneck, profiling is your best friend.
You can use tools like nvprof, but there are also profiling tools for the different DL libraries. For PyTorch, there is the module torch.utils.bottleneck: https://pytorch.org/docs/stable/bottleneck.html.
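If you just want a quick, framework-independent first look at where time goes, Python's built-in cProfile works too. The functions below are hypothetical stand-ins for a real training step:

```python
import cProfile
import io
import pstats

def slow_part(n):
    # Deliberately slow: Python-level loop over n squares.
    return sum(i * i for i in range(n))

def fast_part(n):
    # Closed-form sum of 0..n-1, essentially free.
    return n * (n - 1) // 2

def train_step():
    slow_part(200_000)
    fast_part(200_000)

profiler = cProfile.Profile()
profiler.enable()
train_step()
profiler.disable()

# Print the 5 most expensive calls by cumulative time;
# slow_part should dominate the listing.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```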